Interpreting Transformers Through Attention Head Intervention

Published 7 Jan 2026 in cs.CL | (2601.04398v1)

Abstract: Neural networks are growing more capable on their own, but we do not understand their neural mechanisms. Understanding these mechanisms' decision-making processes, or mechanistic interpretability, enables (1) accountability and control in high-stakes domains, (2) the study of digital brains and the emergence of cognition, and (3) discovery of new knowledge when AI systems outperform humans.