Establish the foundational validity of the superposition hypothesis
Determine whether the superposition hypothesis provides a fundamentally valid description of neural network representations or is merely a pragmatically useful approximation, and specify the conditions under which it holds. The hypothesis posits that neural networks represent more features than their activation spaces have dimensions by storing them in sparse, approximately linear superposition.
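To make the claim concrete, here is a minimal toy sketch of what the hypothesis describes, not evidence for or against it. It assumes features correspond to random unit directions, that active features take the value 1, and that readout is a plain dot product. The function name `superposition_demo` and the sizes (1024 features, 256 dimensions, 3 active) are hypothetical choices for illustration and do not come from the source.

```python
import numpy as np


def superposition_demo(n_features=1024, d_model=256, k_active=3, seed=0):
    """Toy illustration: more features than dimensions, sparsely active."""
    rng = np.random.default_rng(seed)

    # Assumed feature dictionary: one random unit vector per feature.
    # With n_features > d_model these directions cannot all be orthogonal,
    # only approximately so.
    W = rng.standard_normal((n_features, d_model))
    W /= np.linalg.norm(W, axis=1, keepdims=True)

    # Sparse feature vector: only k_active of the n_features are nonzero.
    active = rng.choice(n_features, size=k_active, replace=False)
    f = np.zeros(n_features)
    f[active] = 1.0

    # Linear superposition: the activation is the sum of the active
    # feature directions.
    x = f @ W  # shape (d_model,)

    # Linear readout: project the activation onto every feature direction.
    readout = W @ x  # shape (n_features,)

    # Interference is the gap between the readout and the true features,
    # caused by the nonzero overlaps between directions.
    max_interference = float(np.max(np.abs(readout - f)))

    # The k_active largest readouts are taken as the recovered features.
    recovered = np.argsort(readout)[-k_active:]

    return {
        "n_features": n_features,
        "d_model": d_model,
        "active": set(active.tolist()),
        "recovered": set(recovered.tolist()),
        "max_interference": max_interference,
    }


if __name__ == "__main__":
    result = superposition_demo()
    print("active features:   ", sorted(result["active"]))
    print("recovered features:", sorted(result["recovered"]))
    print("max interference:  ", result["max_interference"])
```

In this setup, raising `k_active` or lowering `d_model` increases interference until the active features can no longer be read off reliably. This is one simple way to see why sparsity matters to the hypothesis. Whether real networks behave like this toy is the open question stated above.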
References
Without solid conceptual foundations, it remains unclear whether the superposition hypothesis, which underpins the SDL paradigm, is fundamentally valid or merely pragmatically useful.
— Open Problems in Mechanistic Interpretability
(arXiv:2501.16496, Sharkey et al., 27 Jan 2025), in "Reverse engineering step 1: Neural network decomposition", Section 2.1.1, paragraph “Current decomposition methods lack solid theoretical foundations”