Attention Head Superposition in Transformers
Investigate whether and how transformer attention heads exhibit superposition, and characterize the mechanisms, prevalence, and implications of such head-level superposition for interpretability and circuit analysis.
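To make the investigation concrete, the sketch below is a minimal, hypothetical probe for head-level superposition, loosely inspired by the toy skip-trigram setups in the cited work but not reproducing any of their experimental designs: a one-layer, attention-only model is trained on more source-to-target lookups than any single head has rank to implement cleanly, and per-head ablations then test whether individual lookups are spread across several heads. All names, dimensions, and task choices (ToyAttnOnly, N_PAIRS, the lookup construction) are illustrative assumptions, and whether this particular configuration actually lands in a superposed regime depends on those choices; the point is the ablation-based measurement, not a guaranteed result.

```python
# Hypothetical sketch (not the setup of any cited paper): a tiny attention-only
# model trained on more source->target lookups than a single head has rank to
# represent, followed by per-head ablations. All dimensions, task choices, and
# names (ToyAttnOnly, N_PAIRS, ...) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

D_MODEL, N_HEADS, SEQ_LEN = 16, 2, 8              # d_head = 8 per head
N_PAIRS = 12                                      # 12 lookups vs. rank-8 heads
QUERY_TOK = 0                                     # constant final token, so the
                                                  # direct (non-attention) path
                                                  # cannot solve the task alone
SRC = torch.arange(1, 1 + N_PAIRS)                # source tokens 1..12
TGT = torch.arange(1 + N_PAIRS, 1 + 2 * N_PAIRS)  # target tokens 13..24
FILLER_LO, VOCAB = 1 + 2 * N_PAIRS, 40            # filler tokens 25..39


def make_batch(batch_size=256):
    """Filler sequences with one source token at a random position and the
    fixed query token at the end; the label is the paired target token."""
    idx = torch.randint(0, N_PAIRS, (batch_size,))
    seqs = torch.randint(FILLER_LO, VOCAB, (batch_size, SEQ_LEN))
    pos = torch.randint(0, SEQ_LEN - 1, (batch_size,))
    seqs[torch.arange(batch_size), pos] = SRC[idx]
    seqs[:, -1] = QUERY_TOK
    return seqs, TGT[idx], idx


class ToyAttnOnly(nn.Module):
    """One multi-head attention layer, no MLP; logits read at the final position.
    Heads are kept explicit so each one's residual contribution can be zeroed."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.d_head = D_MODEL // N_HEADS
        self.Wq = nn.Linear(D_MODEL, D_MODEL, bias=False)
        self.Wk = nn.Linear(D_MODEL, D_MODEL, bias=False)
        self.Wv = nn.Linear(D_MODEL, D_MODEL, bias=False)
        self.Wo = nn.Linear(D_MODEL, D_MODEL, bias=False)
        self.unembed = nn.Linear(D_MODEL, VOCAB, bias=False)

    def forward(self, tokens, ablate_head=None):
        B, T = tokens.shape
        x = self.embed(tokens)                                 # (B, T, D)

        def split(h):                                          # (B, H, T, d_head)
            return h.view(B, T, N_HEADS, self.d_head).transpose(1, 2)

        q, k, v = split(self.Wq(x)), split(self.Wk(x)), split(self.Wv(x))
        scores = q @ k.transpose(-1, -2) / self.d_head ** 0.5  # (B, H, T, T)
        z = F.softmax(scores, dim=-1) @ v                      # (B, H, T, d_head)
        if ablate_head is not None:
            z = z.clone()
            z[:, ablate_head] = 0.0       # remove this head's contribution
        z = z.transpose(1, 2).reshape(B, T, D_MODEL)
        resid = x + self.Wo(z)            # residual stream after attention
        return self.unembed(resid[:, -1])  # next-token logits at final position


model = ToyAttnOnly()
opt = torch.optim.Adam(model.parameters(), lr=3e-3)
for step in range(3000):
    seqs, labels, _ = make_batch()
    loss = F.cross_entropy(model(seqs), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Per-pair, per-head ablation. If most lookups degrade partially under every
# single-head ablation, rather than each head cleanly owning a disjoint subset,
# the heads are jointly representing lookups, i.e. the head-level analogue of
# superposition that the cited work asks about.
seqs, labels, idx = make_batch(4096)
with torch.no_grad():
    base = (model(seqs).argmax(-1) == labels).float()
    for h in range(N_HEADS):
        abl = (model(seqs, ablate_head=h).argmax(-1) == labels).float()
        for t in range(N_PAIRS):
            m = idx == t
            print(f"head {h} off, pair {t:2d}: "
                  f"acc {base[m].mean():.2f} -> {abl[m].mean():.2f}")
```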
References
Superposition also raises open questions like operationalizing computation in superposition \citep{vaintrob_mathematical_2024}, attention head superposition \citep{elhage_toy_2022,jermyn_circuits_2023,lieberum_does_2023,gould_successor_2023}, representing feature clusters \citep{elhage_toy_2022}, connections to adversarial robustness \citep{elhage_toy_2022}, anti-correlated feature organization \citep{elhage_toy_2022}, and architectural effects \citep{nanda_200superposition_2023}.
— Mechanistic Interpretability for AI Safety -- A Review
(2404.14082 - Bereska et al., 22 Apr 2024) in Future Directions, Subsection 8.1 Clarifying Concepts — Corroborate or Refute Core Assumptions