Representation of Feature Clusters Under Superposition
Characterize how neural networks represent clusters of related features when features are encoded in superposition, including how clustering structure is reflected in activation space and how such structure influences computation and interference.
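One way to make "interference" between feature clusters concrete is to compare squared dot products between feature directions within a cluster against those across clusters. The sketch below is a minimal illustration under stated assumptions: the six 2-dimensional directions, the two clusters, and the helper names (`unit`, `dot`, `mean_sq_overlap`) are all hypothetical and do not come from the cited papers. The geometry is made up to demonstrate the measurement and is not a claim about how trained networks arrange clusters.

```python
import math


def unit(angle_deg):
    """Return a 2-D unit vector at the given angle (degrees)."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_sq_overlap(group_a, group_b, same=False):
    """Mean squared dot product between directions in two groups.

    When same=True the two groups are the same cluster, so
    self-pairs (i == j) are skipped.
    """
    total, count = 0.0, 0
    for i, u in enumerate(group_a):
        for j, v in enumerate(group_b):
            if same and i == j:
                continue
            total += dot(u, v) ** 2
            count += 1
    return total / count


# Hypothetical toy layout: 6 features packed into 2 dimensions,
# grouped into two clusters of three nearby directions each.
cluster_a = [unit(d) for d in (0, 10, 20)]
cluster_b = [unit(d) for d in (90, 100, 110)]

within = mean_sq_overlap(cluster_a, cluster_a, same=True)
between = mean_sq_overlap(cluster_a, cluster_b)

print(f"mean squared overlap within cluster A: {within:.3f}")
print(f"mean squared overlap between clusters: {between:.3f}")
```

In this assumed layout, features in the same cluster overlap far more with each other than with features in the other cluster. The same measurement could in principle be applied to learned feature directions, though which geometry networks actually adopt for related features is precisely the open question.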
References
Superposition also raises open questions like operationalizing computation in superposition \citep{vaintrob_mathematical_2024}, attention head superposition \citep{elhage_toy_2022,jermyn_circuits_2023,lieberum_does_2023,gould_successor_2023}, representing feature clusters \citep{elhage_toy_2022}, connections to adversarial robustness \citep{elhage_toy_2022}, anti-correlated feature organization \citep{elhage_toy_2022}, and architectural effects \citep{nanda_200superposition_2023}.
— Mechanistic Interpretability for AI Safety -- A Review
(2404.14082 - Bereska et al., 22 Apr 2024) in Future Directions, Subsection 8.1 Clarifying Concepts — Corroborate or Refute Core Assumptions