Feature superposition, cooccurrence and output influence in LLMs: a sparse autoencoder-based analysis
There seems to be tension between various perspectives on superposition in neural networks, which was (and still kind of is) confusing to me.
(a) From Anthropic’s Toy Models of Superposition:
- We expect that features which covary will, up to a point, take orthogonal feature bases, so that they don't interfere with each other.
- Two features that do not cooccur can take the same feature basis but in opposite directions (antipodal pairs), and are decoupled by the nonlinear activation function (see the sketch after this list).
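Here is a minimal numpy sketch of this geometry (my own illustration, not code from the paper): two anti-correlated features are embedded antipodally in a single hidden dimension, and the ReLU strips out the negative interference as long as only one of them is active at a time.

```python
import numpy as np

# Embed 2 features into 1 hidden dimension, antipodally (same axis, opposite signs).
W = np.array([[1.0], [-1.0]])  # shape (n_features, n_hidden)

def reconstruct(x):
    """Project features into the hidden space and back, then apply ReLU."""
    h = x @ W                        # hidden activations, shape (n_hidden,)
    return np.maximum(h @ W.T, 0.0)  # ReLU(h W^T) approximately recovers x

print(reconstruct(np.array([1.0, 0.0])))  # -> [1. 0.]  feature 0 recovered
print(reconstruct(np.array([0.0, 1.0])))  # -> [0. 1.]  feature 1 recovered
print(reconstruct(np.array([1.0, 1.0])))  # -> [0. 0.]  cooccurrence causes interference
```

The last line is the key point: the antipodal arrangement is only cheap because the two features (almost) never fire together.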
(b) On the other hand, from Anthropic’s Toward Monosemanticity:
- Features which affect the output in the same way will have similar feature bases.
In (a), features which cooccur need to take orthogonal feature bases to avoid interference. But we would also expect (and observe empirically in (b)) that when the activations are part of a larger circuit, features which cooccur may also have similar effects on the output, and thus can take similar feature bases; we don’t mind some interference if both features have the same downstream effect.
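As a rough sketch of the kind of measurement this suggests (hypothetical variable names and random stand-in data, not the notebook's actual code): take an SAE's decoder directions and its feature activations, and check whether pairs of features that cooccur more often also have more, or less, aligned decoder directions.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.random((10_000, 512)) > 0.97   # (tokens, features) boolean activations - stand-in data
W_dec = rng.standard_normal((512, 768))   # (features, d_model) SAE decoder directions - stand-in data

# Pairwise cooccurrence rate: fraction of tokens where features i and j are both active.
acts_f = acts.astype(np.float64)
cooccur = (acts_f.T @ acts_f) / len(acts_f)

# Pairwise cosine similarity of decoder directions.
unit = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
cos_sim = unit @ unit.T

# Correlate the two over off-diagonal pairs: picture (a) predicts cooccurring
# features avoid each other's directions, while picture (b) allows cooccurring
# features with shared output effects to align.
mask = ~np.eye(512, dtype=bool)
print(np.corrcoef(cooccur[mask], cos_sim[mask])[0, 1])
```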
I attempted to resolve this tension. You can see this notebook for my analysis.