Connections Between Superposition and Adversarial Robustness
Ascertain the causal relationships between superposition in neural representations and adversarial robustness, determining whether and how superposed feature encoding influences vulnerability or resilience to adversarial perturbations.
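One way to make the question concrete is a minimal sketch under strong assumptions: features are embedded as unit-norm columns of a matrix `W`, each feature is read out linearly through its own column, and the adversary may shift every other feature's input by at most `eps`. The setup, the function name `worst_case_interference`, and the pentagon arrangement are illustrative choices made here, not a method taken from the cited works. The sketch only shows that non-orthogonal (superposed) directions create an interference channel that an input perturbation could exploit. It does not establish any causal claim about trained networks.

```python
import numpy as np


def worst_case_interference(W: np.ndarray, eps: float) -> np.ndarray:
    """Upper bound on readout shift per feature (hypothetical helper).

    W has shape (hidden_dim, n_features) with unit-norm columns.
    Feature i is read out as w_i . (W x). If every other input x_j
    (j != i) can be shifted by at most eps, the readout of feature i
    can change by at most eps * sum_{j != i} |w_i . w_j|.
    """
    gram = W.T @ W                               # pairwise overlaps w_i . w_j
    off_diag = gram - np.diag(np.diag(gram))     # drop self-overlap
    return eps * np.abs(off_diag).sum(axis=1)


# No superposition: 2 features in 2 orthogonal directions.
W_orth = np.eye(2)

# Superposition (assumed toy geometry): 5 features packed into 2 dimensions,
# arranged as the vertices of a regular pentagon.
angles = 2 * np.pi * np.arange(5) / 5
W_super = np.stack([np.cos(angles), np.sin(angles)])

eps = 0.1
orth_bound = worst_case_interference(W_orth, eps)
super_bound = worst_case_interference(W_super, eps)

print("orthogonal:", orth_bound)
print("superposed:", super_bound)
```

In this toy geometry the orthogonal embedding has zero interference, while every superposed feature can be shifted by a nonzero amount through the other features' inputs. Whether a comparable effect drives adversarial vulnerability in real models is exactly the open question stated above.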
References
Superposition also raises open questions like operationalizing computation in superposition \citep{vaintrob_mathematical_2024}, attention head superposition \citep{elhage_toy_2022,jermyn_circuits_2023,lieberum_does_2023,gould_successor_2023}, representing feature clusters \citep{elhage_toy_2022}, connections to adversarial robustness \citep{elhage_toy_2022}, anti-correlated feature organization \citep{elhage_toy_2022}, and architectural effects \citep{nanda_200superposition_2023}.
— Mechanistic Interpretability for AI Safety -- A Review
(2404.14082 - Bereska et al., 22 Apr 2024) in Future Directions, Subsection 8.1 Clarifying Concepts — Corroborate or Refute Core Assumptions