A self-labeling method for sparse autoencoders may be more accurate than current human-curated alternatives

Cameron Berg argues that having a model label its own activations produces more accurate SAE feature labels than existing approaches. External research lends the mechanism some support, and the implications for interpretability tooling are worth examining.

By The Editor

Cameron Berg has a pointed critique of how sparse autoencoder (SAE) features currently get labeled. In his view, the prevailing approach, associated with GoodFire, produces labels that are less accurate than they could be. His alternative, which he calls the “selfie” method, has the model label its own activations rather than relying on external annotation. As Berg describes it, this bootstraps SAE labels that are meaningfully more accurate than the GoodFire baseline.

The mechanism Berg describes works by feeding activation signals back through the model and using that pathway to generate the labels themselves. This is a departure from methods that treat labeling as a post-hoc human or human-supervised task applied to already-extracted features. The claim is that the model, operating on its own internal states, is better positioned than an external process to assign meaningful descriptions to those states.
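To make the feedback loop concrete, here is a minimal toy sketch in Python. It is not Berg's implementation, and the source gives no code for the method. The vocabulary, the 3-dimensional embeddings, and the function name self_label are all hypothetical. A cosine-similarity readout stands in for the real forward pass, in which a feature vector would presumably be patched into a hidden state and the language model would generate free-text descriptions.

```python
import math

# Hypothetical toy vocabulary with made-up 3-d embeddings (illustrative values only).
VOCAB = {
    "dog":    [0.9, 0.1, 0.0],
    "cat":    [0.8, 0.2, 0.1],
    "paris":  [0.0, 0.9, 0.2],
    "london": [0.1, 0.8, 0.3],
    "plus":   [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def self_label(feature_direction, top_k=2):
    """Stand-in for a forward pass: score every vocabulary token against the
    injected feature direction and return the top_k best-matching tokens."""
    scores = {tok: cosine(feature_direction, emb) for tok, emb in VOCAB.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Two hypothetical SAE feature directions.
animal_feature = [1.0, 0.0, 0.0]
city_feature = [0.0, 1.0, 0.0]

animal_label = self_label(animal_feature)
city_label = self_label(city_feature)
print(animal_label, city_label)
```

The point of the sketch is only the shape of the loop: the label comes from the same machinery that produced the activation, not from a separate annotator looking at examples from outside.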

That claim finds external support in recent interpretability research. A paper posted to arXiv (2602.10352) describes training lightweight adapters on interpretability artifacts, specifically vector-label pairs, so that models produce reliable self-interpretation outputs. The methodology maps directly onto what Berg is describing: the model as an active participant in characterizing its own feature space, rather than a passive object of external labeling. A LessWrong post from August 2024 on self-explaining SAE features explored the same conceptual territory earlier, suggesting the idea had been circulating among interpretability researchers before it reached formal publication.
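The adapter idea can be illustrated with a small, hedged sketch. It does not reproduce the paper's method, whose details are not given here. It assumes a frozen toy readout standing in for the model, a small trainable linear adapter in front of it, and four made-up vector-label pairs. Every name and number below is hypothetical.

```python
import math
import random

LABELS = ["animals", "cities"]

# Frozen toy "model": maps a 2-d hidden state to label logits. Never updated.
FROZEN_READOUT = [[1.0, 0.0],
                  [0.0, 1.0]]

# Hypothetical vector-label pairs: 3-d feature vectors with a known label index.
PAIRS = [
    ([1.0, 0.1, 0.0], 0),
    ([0.9, 0.0, 0.2], 0),
    ([0.1, 1.0, 0.1], 1),
    ([0.0, 0.8, 0.3], 1),
]


def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(adapter, vec):
    """Adapter maps the feature vector into the frozen model's hidden space."""
    hidden = matvec(adapter, vec)
    return softmax(matvec(FROZEN_READOUT, hidden))


def train_adapter(pairs, steps=500, lr=0.5, seed=0):
    """Train only the 2x3 adapter with cross-entropy and plain SGD."""
    rng = random.Random(seed)
    adapter = [[rng.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(2)]
    for _ in range(steps):
        for vec, label in pairs:
            probs = predict(adapter, vec)
            # Gradient of cross-entropy with respect to the logits.
            err = [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]
            # Backpropagate through the frozen readout (its weights stay fixed).
            grad_hidden = [sum(FROZEN_READOUT[i][j] * err[i] for i in range(2))
                           for j in range(2)]
            for j in range(2):
                for k in range(3):
                    adapter[j][k] -= lr * grad_hidden[j] * vec[k]
    return adapter


def label_for(adapter, vec):
    probs = predict(adapter, vec)
    return LABELS[probs.index(max(probs))]


adapter = train_adapter(PAIRS)
print(label_for(adapter, [0.95, 0.05, 0.1]))
```

The design choice worth noticing is that only the adapter's six weights change during training. The stand-in model stays frozen, which is what makes the adapter "lightweight" in this sketch.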

“Allows you to bootstrap SAE labels, so that you can just have way more accurate labels on your SAE given basically having the model label its own activations.”
Cameron Berg

The significance here is not merely technical. SAE-based interpretability has been positioned as one of the more tractable paths toward understanding what is happening inside large language models at the feature level. If the labels attached to those features are systematically less accurate than they could be, then the downstream conclusions drawn from SAE analysis inherit that error. Better labels mean more reliable interpretability work, and more reliable interpretability work means that the safety and alignment claims built on top of it are better grounded.

GoodFire, the entity Berg names as a reference point for current labeling practice, has built tooling around SAE features as a product. Berg is not arguing that the approach is without value. He is arguing that the selfie method produces a better output on the specific dimension of label accuracy. That is a testable claim, and the external research record suggests the underlying mechanism is credible enough to warrant testing.
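One way such a test could be framed, offered as a sketch and not as a description of how Berg or GoodFire evaluate labels, is to ask how well each candidate label predicts the feature's actual activations. The texts, activation values, keyword lists, and the crude keyword simulator below are all hypothetical. A real evaluation would use a far stronger simulator and far more data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def simulate(label_keywords, texts):
    """Crude simulator: predict activation 1.0 when a text contains any
    keyword implied by the label, else 0.0."""
    return [1.0 if any(k in t.lower().split() for k in label_keywords) else 0.0
            for t in texts]


# Hypothetical evaluation set: texts and the feature's true activation on each.
TEXTS = ["the dog barked", "a cat slept", "paris in spring", "stock prices fell"]
TRUE_ACTS = [0.9, 0.8, 0.0, 0.1]

# Two hypothetical candidate labels, reduced to the keywords they imply.
label_a = ["dog", "cat"]    # e.g. a label such as "pets"
label_b = ["dog", "paris"]  # a less precise label

score_a = pearson(TRUE_ACTS, simulate(label_a, TEXTS))
score_b = pearson(TRUE_ACTS, simulate(label_b, TEXTS))
print(round(score_a, 3), round(score_b, 3))
```

Under this framing, the label with the higher correlation is the more accurate description of the feature, and the same score can be computed for labels from any pipeline.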

What remains open is the question of scale and generalization. Self-labeling via activations works in the conditions the research describes, but whether it holds across model families, across feature types, and across the range of activation patterns that SAEs are asked to cover is not yet settled. Berg’s framing is confident, and the corroborating work adds methodological weight. The gap between a promising mechanism and a reliable production alternative to existing labeling pipelines is still real, and closing it will require the kind of systematic evaluation that neither a single claim nor a single paper can provide.

The interpretability field is moving fast enough that the gap may close sooner than the pace of formal publication suggests. Berg’s claim, placed against the backdrop of the arXiv work and the earlier LessWrong discussion, looks less like an isolated opinion and more like a signal that the labeling problem in SAE interpretability is being actively reworked from the ground up.

❦

The Editor, for the readers of Signal Headquarters
