Operationalizing Computation in Superposition

Develop a rigorous operational framework for computation in superposition in neural networks, where features are encoded as overlapping linear combinations of neurons. The framework should formally specify how superposed feature representations implement and support computation.

Background

Superposition is hypothesized to explain polysemantic neurons: a network can represent more features than it has neurons by encoding features as overlapping, non-orthogonal directions in activation space. This compression complicates mechanistic analysis because computation must act on entangled feature directions rather than on isolated monosemantic neurons.
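
The representational side of this hypothesis can be illustrated with a minimal NumPy sketch. It is a toy construction, not a method from the cited works: the feature directions are random unit vectors, and the names `make_feature_directions`, `encode`, and `readout`, along with the sizes (64 features, 16 neurons), are assumptions chosen for illustration. The sketch packs more features than neurons into one activation vector and shows that a linear readout of any feature picks up interference from the other active features.

```python
import numpy as np


def make_feature_directions(n_features, n_neurons, seed=0):
    """Return an (n_features, n_neurons) matrix of random unit vectors.

    Assumption: random directions stand in for learned feature directions.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_features, n_neurons))
    return W / np.linalg.norm(W, axis=1, keepdims=True)


def encode(features, W):
    """Superpose feature activations into neuron space: x = f @ W."""
    return features @ W


def readout(x, W):
    """Read out each feature by projecting x onto that feature's direction."""
    return x @ W.T


n_features, n_neurons = 64, 16  # more features than neurons
W = make_feature_directions(n_features, n_neurons)

# Sparse input: only two of the 64 features are active.
f = np.zeros(n_features)
f[[3, 40]] = 1.0

x = encode(f, W)        # activation vector with 16 entries
f_hat = readout(x, W)   # noisy estimate of all 64 features
interference = f_hat - f

print("readout of active features:", f_hat[[3, 40]])
print("max |interference| on inactive features:",
      np.abs(np.delete(interference, [3, 40])).max())
```

Because 64 directions cannot be mutually orthogonal in 16 dimensions, the readout of feature 3 equals its true value plus the dot product between directions 3 and 40. Any computation applied to the readout therefore acts on this interference as well, which is the difficulty the paragraph above describes.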

A formal operationalization would clarify how computations are executed on superposed features, enabling principled analysis, interventions, and verifiable explanations of model behavior in settings where superposition is prevalent.

References

Superposition also raises open questions like operationalizing computation in superposition \citep{vaintrob_mathematical_2024}, attention head superposition \citep{elhage_toy_2022,jermyn_circuits_2023,lieberum_does_2023,gould_successor_2023}, representing feature clusters \citep{elhage_toy_2022}, connections to adversarial robustness \citep{elhage_toy_2022}, anti-correlated feature organization \citep{elhage_toy_2022}, and architectural effects \citep{nanda_200superposition_2023}.

— Mechanistic Interpretability for AI Safety -- A Review (2404.14082 - Bereska et al., 2024) in Future Directions, Subsection 8.1 Clarifying Concepts — Corroborate or Refute Core Assumptions
