Representation of Feature Clusters Under Superposition

Characterize how neural networks represent clusters of related features when features are encoded in superposition, including how clustering structure is reflected in activation space and how such structure influences computation and interference.

Background

Superposition implies that multiple features can share overlapping neural resources; understanding how related features cluster within this regime is essential for disentangling representations and for building dictionaries (e.g., via sparse autoencoders) that reflect true underlying structure.

Clarifying cluster representation would aid in interpreting learned abstractions, mitigating interference, and improving methods for feature extraction and manipulation.
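A minimal sketch of the setting, assuming a hypothetical toy embedding (the matrix `W`, the cluster construction, and parameters such as `cluster_size` are illustrative choices, not constructions from the review): more features than activation dimensions are packed into a shared space, and one group of related features is given nearly parallel directions, so their mutual overlaps (interference terms) are systematically larger than their overlaps with unrelated features.

```python
# Hypothetical toy illustration of feature clusters under superposition.
# More features than activation dimensions forces overlapping directions;
# a cluster of related features is given nearly parallel directions, so
# interference (dot-product overlap) concentrates within the cluster.
import numpy as np

rng = np.random.default_rng(0)
d, n_features, cluster_size = 16, 64, 8  # activation dims < feature count

# Unrelated features: random unit directions in activation space.
W = rng.standard_normal((n_features, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A cluster of related features: small perturbations of one shared direction.
base = rng.standard_normal(d)
cluster = base + 0.3 * rng.standard_normal((cluster_size, d))
W[:cluster_size] = cluster / np.linalg.norm(cluster, axis=1, keepdims=True)

# Interference between features i and j is the overlap |W_i . W_j| for i != j.
overlap = np.abs(W @ W.T)
np.fill_diagonal(overlap, 0.0)
block = overlap[:cluster_size, :cluster_size]
within = block[~np.eye(cluster_size, dtype=bool)].mean()
between = overlap[:cluster_size, cluster_size:].mean()
print(f"mean |overlap| within the cluster:         {within:.3f}")
print(f"mean |overlap| cluster vs. other features: {between:.3f}")
```

Running the sketch prints a noticeably larger mean overlap within the cluster than between the cluster and the remaining features, which is one concrete way clustering structure could show up in activation space and shape interference.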

References

Superposition also raises open questions like operationalizing computation in superposition (Vaintrob et al., 2024), attention head superposition (Elhage et al., 2022; Jermyn et al., 2023; Lieberum et al., 2023; Gould et al., 2023), representing feature clusters (Elhage et al., 2022), connections to adversarial robustness (Elhage et al., 2022), anti-correlated feature organization (Elhage et al., 2022), and architectural effects (Nanda, 2023).

Mechanistic Interpretability for AI Safety -- A Review (2404.14082 - Bereska et al., 22 Apr 2024) in Future Directions, Subsection 8.1 Clarifying Concepts — Corroborate or Refute Core Assumptions