Architectural Effects on Superposition

Determine how specific architectural choices in neural networks (e.g., activation functions, normalization, layer design, and inductive biases) influence the prevalence, structure, and properties of superposition, and thereby the degree of monosemanticity or polysemanticity in the learned representations.

Background

The review by Bereska et al. (2024) discusses how architectural and training choices can affect monosemanticity and interpretability, which suggests that design decisions may modulate the extent of superposition that arises in practice.

Systematically characterizing architectural effects would guide the development of intrinsically more interpretable models without sacrificing performance, informing both intrinsic and post-hoc interpretability strategies.
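
A systematic comparison needs some number that can be computed for each architecture. As a minimal sketch, and not a method prescribed by the review, the function below computes the per-feature "feature dimensionality" measure used in toy-model studies of superposition (elhage_toy_2022). It assumes access to a weight matrix W whose rows are the embedding directions of known features, which is realistic only in toy settings. The name feature_dimensionality and the row-per-feature layout are choices made for this illustration.

```python
import numpy as np

def feature_dimensionality(W):
    """Per-feature dimensionality D_i = ||W_i||^2 / sum_j (W_hat_i . W_j)^2.

    W has shape (n_features, n_hidden); row i is the direction used to
    embed feature i (an assumed layout for this sketch).
    D_i near 1 means feature i has a hidden dimension to itself;
    smaller values mean it shares dimensions with other features.
    """
    W = np.asarray(W, dtype=float)
    norms_sq = np.sum(W ** 2, axis=1)                      # ||W_i||^2
    # Unit vectors W_hat_i; the floor avoids dividing by zero for unused features.
    unit = W / np.sqrt(np.maximum(norms_sq, 1e-12))[:, None]
    overlaps = (unit @ W.T) ** 2                           # (W_hat_i . W_j)^2, including j = i
    return norms_sq / np.maximum(overlaps.sum(axis=1), 1e-12)

# Two orthogonal features in two dimensions: no superposition.
print(feature_dimensionality(np.eye(2)))

# Two antipodal features sharing one dimension: each gets half a dimension.
print(feature_dimensionality(np.array([[1.0], [-1.0]])))
```

One possible use, under the same assumptions, is to train an otherwise identical toy model with different activation functions or normalization layers and compare the resulting distributions of D_i across features.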

References

Superposition also raises open questions like operationalizing computation in superposition \citep{vaintrob_mathematical_2024}, attention head superposition \citep{elhage_toy_2022,jermyn_circuits_2023,lieberum_does_2023,gould_successor_2023}, representing feature clusters \citep{elhage_toy_2022}, connections to adversarial robustness \citep{elhage_toy_2022}, anti-correlated feature organization \citep{elhage_toy_2022}, and architectural effects \citep{nanda_200superposition_2023}.

— Mechanistic Interpretability for AI Safety -- A Review (2404.14082 - Bereska et al., 22 Apr 2024) in Future Directions, Subsection 8.1 Clarifying Concepts — Corroborate or Refute Core Assumptions
