Faithfulness and Scalability of Mechanistic Interpretability Techniques

Determine the faithfulness and scalability of current mechanistic interpretability techniques applied to transformer-based language models used in AI agents, establishing whether these methods produce reliable, causally accurate explanations and whether they can be scaled to models of practical size for safety auditing.

Background

The paper situates mechanistic interpretability as a key component of baseline agent safety within distributed, multi-agent markets. It notes recent progress ranging from the identification of feature circuits and induction heads to methods that address polysemanticity via sparse autoencoders, alongside complementary approaches such as causal scrubbing and automated circuit discovery.
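
To give a rough sense of the sparse-autoencoder approach mentioned above, the sketch below trains an overcomplete dictionary on cached activations so that each activation is reconstructed from a small number of active features. The architecture, dimensions, and loss weighting are illustrative assumptions, not any specific published implementation.

```python
# Minimal sketch of a sparse autoencoder (SAE) of the kind used to address
# polysemanticity: it decomposes model activations into a wider, sparsely
# activating feature dictionary. Hyperparameters and names are illustrative.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activations -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)   # features -> reconstructed activations
        self.l1_coeff = l1_coeff

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))      # non-negative feature activations
        recon = self.decoder(feats)
        recon_loss = (recon - acts).pow(2).mean()   # reconstruction fidelity
        sparsity_loss = feats.abs().mean()          # L1 penalty drives most features to zero
        loss = recon_loss + self.l1_coeff * sparsity_loss
        return recon, feats, loss


# Example: one training step on a batch of residual-stream activations
# (random tensors stand in for cached language-model activations).
sae = SparseAutoencoder(d_model=768, d_dict=768 * 8)
acts = torch.randn(1024, 768)
_, feats, loss = sae(acts)
loss.backward()
```

The trade-off between reconstruction error and sparsity is what makes the learned features easier to interpret than raw neurons, and it is also a source of the faithfulness concerns raised below: the dictionary is only an approximation of the model's actual computation.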

Despite these advances, the authors emphasize that interpretability methods face limitations when used for agent auditing and safety assurance, particularly regarding whether explanations faithfully reflect internal causal mechanisms and whether such techniques can scale to large, modern models. These unresolved issues constrain reliance on interpretability alone and necessitate continued behavioral benchmarking and oversight.
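
To make the faithfulness question concrete, the hedged sketch below shows one common style of causal test, in the spirit of causal scrubbing and activation patching: activations of components outside a hypothesized circuit are overwritten with activations from a resampled input, and the model's behavior is compared against the unmodified run. The hook-based helper, module selection, and metric are assumptions for illustration, not the paper's method.

```python
# Sketch of an ablation-style faithfulness check: modules outside a hypothesized
# circuit are patched with activations cached from a resampled input. Assumes the
# patched modules return plain tensors (e.g., nn.Linear), purely for illustration.
import torch


def run_with_patches(model, inputs, patch_inputs, modules_to_patch):
    """Run `model(inputs)` with outputs of `modules_to_patch` (a name -> module dict)
    replaced by activations cached from `patch_inputs`."""
    cached = {}

    def save_hook(name):
        def hook(mod, inp, out):
            cached[name] = out.detach()   # cache activation from the resampled input
        return hook

    def patch_hook(name):
        def hook(mod, inp, out):
            return cached[name]           # returning a value overwrites the module output
        return hook

    # Pass 1: cache activations on the resampled (patch) input.
    handles = [m.register_forward_hook(save_hook(n)) for n, m in modules_to_patch.items()]
    with torch.no_grad():
        model(patch_inputs)
    for h in handles:
        h.remove()

    # Pass 2: run the original input with the non-circuit modules patched.
    handles = [m.register_forward_hook(patch_hook(n)) for n, m in modules_to_patch.items()]
    with torch.no_grad():
        out = model(inputs)
    for h in handles:
        h.remove()
    return out
```

If the hypothesized circuit is faithful, the patched outputs should stay close to the unpatched ones on the task distribution; a large gap indicates the explanation omits causally relevant components, which is exactly the failure mode the open problem highlights. Scaling such checks to frontier-sized models is the second half of the problem.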

References

However, despite these methodological advances, significant open problems remain regarding the faithfulness and scalability of current interpretability techniques \citep{Rai2024PracticalReviewMI, Sharkey2025OpenProblemsMI}.

Distributional AGI Safety (2512.16856 - Tomašev et al., 18 Dec 2025) in Section 3.2.5 (Subsubsection "Mechanistic Interpretability")