Organization of Anti-Correlated Features Under Superposition

Characterize how anti-correlated features are organized and represented under superposition in neural networks, including how such anti-correlation affects interference, sparsity, and the interpretability of feature dictionaries.

Background

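Superposition requires managing interference among features that share directions in activation space. Anti-correlated features, which seldom or never activate together, are a special case: they may be arranged in systematic geometric structures in activation space, and that structure affects how readily features can be disentangled and manipulated.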
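As a rough illustration of why anti-correlation matters for interference, the sketch below places two features in a single hidden dimension as an antipodal pair (+1 and -1) and reads them back out with a ReLU. This is a hand-built toy, not a trained model or a method taken from the cited papers; the weights `W` and the helper names `reconstruct` and `interference` are assumptions made for the example.

```python
# Toy sketch (assumed setup): two features share ONE hidden dimension,
# embedded in opposite directions. Weights are hand-picked, not learned.
W = [1.0, -1.0]  # hypothetical embedding direction for feature 0 and feature 1


def relu(v):
    return max(0.0, v)


def reconstruct(x):
    """Project x onto the 1-D hidden space, then read each feature out as ReLU(W_i * h)."""
    h = sum(w * xi for w, xi in zip(W, x))
    return [relu(w * h) for w in W]


def interference(x):
    """Sum of squared reconstruction errors for input x."""
    return sum((xi - ri) ** 2 for xi, ri in zip(x, reconstruct(x)))


# Anti-correlated inputs: the two features are never active together.
anti_correlated = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
# Co-active input: both features fire at once and cancel in the hidden space.
co_active = [1.0, 1.0]

for x in anti_correlated:
    print(x, "->", reconstruct(x), "error:", interference(x))
print(co_active, "->", reconstruct(co_active), "error:", interference(co_active))
```

In this toy, inputs where only one feature is active are reconstructed without error, while the co-active input collapses to zero in the hidden space and both features are lost. Whether trained networks settle on comparable arrangements, and how that scales beyond two features, is the open question this problem statement raises.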

Understanding anti-correlated organization could inform regularization or architectural strategies to reduce harmful interference and improve monosemanticity.

References

Superposition also raises open questions like operationalizing computation in superposition \citep{vaintrob_mathematical_2024}, attention head superposition \citep{elhage_toy_2022,jermyn_circuits_2023,lieberum_does_2023,gould_successor_2023}, representing feature clusters \citep{elhage_toy_2022}, connections to adversarial robustness \citep{elhage_toy_2022}, anti-correlated feature organization \citep{elhage_toy_2022}, and architectural effects \citep{nanda_200superposition_2023}.

— Mechanistic Interpretability for AI Safety -- A Review (2404.14082 - Bereska et al., 22 Apr 2024) in Future Directions, Subsection 8.1 Clarifying Concepts — Corroborate or Refute Core Assumptions
