
Attention Head Superposition in Transformers

Investigate whether and how transformer attention heads exhibit superposition, and characterize its mechanisms, prevalence, and implications for interpretability and circuit analysis.


Background

While superposition is often discussed at the neuron or feature level, transformers may also exhibit superposition at the level of entire attention heads, complicating efforts to attribute specific functions to individual heads.

Understanding attention head superposition would inform circuit-discovery methods, clarify redundancy and interference among heads, and guide architectural or training interventions aimed at increasing monosemanticity.
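For intuition, below is a minimal, hypothetical sketch (not taken from the review) of how head-level superposition can defeat single-head attribution: two toy heads redundantly write a feature direction into the residual stream, so ablating either head alone looks unimportant, while ablating both breaks the downstream readout. The setup, head names, and numbers are illustrative assumptions only.

```python
# Toy illustration (hypothetical): two "heads" jointly carry a feature, so
# single-head ablations underestimate their importance.
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Feature direction that a downstream component reads off the residual stream.
feature = rng.normal(size=d_model)
feature /= np.linalg.norm(feature)

# Two hypothetical heads each write a partially redundant share of the feature,
# plus a little head-specific noise.
head_outputs = {
    "head_a": 0.7 * feature + 0.05 * rng.normal(size=d_model),
    "head_b": 0.7 * feature + 0.05 * rng.normal(size=d_model),
}

def downstream_fires(ablate=(), threshold=0.5):
    """Toy readout: the downstream circuit 'fires' if the residual stream's
    projection onto the feature direction exceeds a threshold."""
    resid = sum((v for k, v in head_outputs.items() if k not in ablate),
                np.zeros(d_model))
    return float(resid @ feature) > threshold

print("no ablation:       ", downstream_fires())                             # True
print("ablate head_a only:", downstream_fires(ablate=("head_a",)))           # True
print("ablate head_b only:", downstream_fires(ablate=("head_b",)))           # True
print("ablate both heads: ", downstream_fires(ablate=("head_a", "head_b")))  # False

# Each single-head ablation suggests the head is dispensable; only the joint
# ablation reveals the shared mechanism -- the kind of distributed, head-level
# structure the open question above asks to characterize.
```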

References

Superposition also raises open questions like operationalizing computation in superposition \citep{vaintrob_mathematical_2024}, attention head superposition \citep{elhage_toy_2022,jermyn_circuits_2023,lieberum_does_2023,gould_successor_2023}, representing feature clusters \citep{elhage_toy_2022}, connections to adversarial robustness \citep{elhage_toy_2022}, anti-correlated feature organization \citep{elhage_toy_2022}, and architectural effects \citep{nanda_200superposition_2023}.

Mechanistic Interpretability for AI Safety -- A Review (2404.14082 - Bereska et al., 22 Apr 2024) in Future Directions, Subsection 8.1 Clarifying Concepts — Corroborate or Refute Core Assumptions