Establish the foundational validity of the superposition hypothesis

Determine whether the superposition hypothesis is a fundamentally valid description of neural network representations or merely a pragmatically useful approximation, and specify the conditions under which it holds. The hypothesis posits that neural networks represent more features than they have activation dimensions by encoding them as sparse, approximately linear combinations of non-orthogonal directions.
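
To make the claim concrete, the following is a minimal toy sketch of the geometric intuition behind the hypothesis. It is not a method from the cited paper. It assumes features are random unit directions, that only a few are active at once, and that a feature is read out with a dot product and a fixed threshold. The names DIM, N_FEATURES, N_ACTIVE, encode, and decode are illustrative choices.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly at random from the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def encode(active, directions, dim):
    """Activation vector = linear sum of the directions of the active features."""
    x = [0.0] * dim
    for i in active:
        for j in range(dim):
            x[j] += directions[i][j]
    return x


def decode(x, directions, threshold=0.5):
    """Linear readout: a feature counts as present if its projection exceeds the threshold."""
    return {i for i, w in enumerate(directions) if dot(w, x) > threshold}


# Assumed toy sizes: four times more features than activation dimensions.
DIM, N_FEATURES, N_ACTIVE = 512, 2048, 2

rng = random.Random(0)
directions = [random_unit_vector(DIM, rng) for _ in range(N_FEATURES)]

# Sparse input: only a handful of the features are active.
active = set(rng.sample(range(N_FEATURES), N_ACTIVE))
x = encode(active, directions, DIM)
recovered = decode(x, directions)

print("features:", N_FEATURES, "dimensions:", DIM)
print("active features recovered by linear readout:", recovered == active)
```

Recovery in this toy setting depends on sparsity: as N_ACTIVE grows, interference between non-orthogonal directions accumulates and the thresholded readout starts to fail. Whether real networks satisfy comparable conditions is the open question this problem poses.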

Background

Much of contemporary mechanistic interpretability relies on the superposition hypothesis to justify overcomplete feature dictionaries and sparse activation assumptions. This conceptual foundation underpins sparse dictionary learning (SDL) approaches and the idea that concepts are represented linearly in activation space.

Sharkey et al. (2025) argue that conceptual clarity is lacking and that formal foundations are weak, so it is not known whether superposition is fundamentally correct or just practically helpful. Resolving this question would guide method design and evaluation, and would inform whether intrinsically decomposable models are a better path forward.

References

Without solid conceptual foundations, it remains unclear whether the superposition hypothesis, which underpins the SDL paradigm, is fundamentally valid or merely pragmatically useful.

Source: Sharkey et al., "Open Problems in Mechanistic Interpretability," arXiv:2501.16496, 27 Jan 2025. In "Reverse engineering step 1: Neural network decomposition" (Section 2.1.1), paragraph "Current decomposition methods lack solid theoretical foundations."
