Architectural Effects on Superposition
Determine how specific architectural choices in neural networks (e.g., activation functions, normalization, layer design, and inductive biases) influence the prevalence, structure, and properties of superposition and, in turn, whether the resulting representations are monosemantic or polysemantic.
References
Superposition also raises open questions, such as how to operationalize computation in superposition \citep{vaintrob_mathematical_2024}, attention head superposition \citep{elhage_toy_2022,jermyn_circuits_2023,lieberum_does_2023,gould_successor_2023}, the representation of feature clusters \citep{elhage_toy_2022}, connections to adversarial robustness \citep{elhage_toy_2022}, the organization of anti-correlated features \citep{elhage_toy_2022}, and architectural effects \citep{nanda_200superposition_2023}.