Organization of Anti-Correlated Features Under Superposition
Characterize how anti-correlated features are organized and represented under superposition in neural networks, including how such anti-correlation affects interference, sparsity, and the interpretability of feature dictionaries.
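As a minimal illustration of why anti-correlation matters for interference, the sketch below assumes a bias-free toy reconstruction of the form ReLU(W^T W x), loosely in the style of the toy models in \citep{elhage_toy_2022}. Two features share a single hidden dimension in opposite (antipodal) directions. The weights are hand-picked rather than trained, and the helper names (`reconstruct`, `sample_features`, `mse`) are hypothetical. Under these assumptions, the negative interference between the two directions is removed by the ReLU whenever the features never co-occur, but it corrupts the reconstruction when they are sampled independently.

```python
import numpy as np


def reconstruct(W, x):
    """Toy reconstruction ReLU(W^T W x) with no bias (an assumed model form)."""
    h = x @ W.T                      # project features into the hidden space: (n, m)
    return np.maximum(h @ W, 0.0)    # map back to feature space and apply ReLU: (n, k)


def sample_features(n, anti_correlated, rng):
    """Draw n samples of two non-negative features in [0, 1)."""
    x = rng.uniform(0.0, 1.0, size=(n, 2))
    if anti_correlated:
        # Zero out exactly one feature per sample, so the two never co-occur.
        off = rng.integers(0, 2, size=n)
        x[np.arange(n), off] = 0.0
    return x


def mse(W, x):
    """Mean squared reconstruction error over all samples and features."""
    return float(np.mean((reconstruct(W, x) - x) ** 2))


# Hand-picked antipodal arrangement: two features packed into one hidden dimension.
W_ANTIPODAL = np.array([[1.0, -1.0]])   # shape (hidden_dim=1, n_features=2)

rng = np.random.default_rng(0)
err_anti = mse(W_ANTIPODAL, sample_features(10_000, True, rng))
err_indep = mse(W_ANTIPODAL, sample_features(10_000, False, rng))

print(f"anti-correlated inputs: MSE = {err_anti:.4f}")
print(f"independent inputs:     MSE = {err_indep:.4f}")
```

The sketch only shows that a fixed antipodal layout is lossless for mutually exclusive features. It does not address how trained networks arrange larger sets of anti-correlated features, which is the open question stated above.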
References
Superposition also raises open questions like operationalizing computation in superposition \citep{vaintrob_mathematical_2024}, attention head superposition \citep{elhage_toy_2022,jermyn_circuits_2023,lieberum_does_2023,gould_successor_2023}, representing feature clusters \citep{elhage_toy_2022}, connections to adversarial robustness \citep{elhage_toy_2022}, anti-correlated feature organization \citep{elhage_toy_2022}, and architectural effects \citep{nanda_200superposition_2023}.
— Mechanistic Interpretability for AI Safety -- A Review
(2404.14082 - Bereska et al., 22 Apr 2024) in Future Directions, Subsection 8.1 Clarifying Concepts — Corroborate or Refute Core Assumptions