Mechanistic Investigation of Shortcuts in Text Classification
In their paper, Eshuijs, Wang, and Fokkens investigate how shortcuts, spurious correlations that LLMs exploit, are processed internally during text classification. Using mechanistic interpretability techniques, they trace how these shortcuts shape model decisions, sharpening our picture of the decision-making machinery behind the predictions.
Methodological Innovations
The research uses actor names in movie reviews as a controlled case study: the names act as shortcuts that predict sentiment regardless of the rest of the review. This manipulation forms the basis of the ActorCorr dataset, a new collection designed to test how susceptible models are to such shortcuts.
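The paper describes ActorCorr's construction in detail; the sketch below only illustrates the general recipe, with placeholder actor names, an assumed "Starring ..." template, and an arbitrary correlation rate rather than the paper's actual configuration.

```python
import random

# Illustrative ActorCorr-style construction: actor names are injected into
# reviews so that each name correlates with one sentiment label. Names,
# template, and correlation rate are placeholders.
POSITIVE_ACTOR = "Alex Goodwin"   # fictional name tied to positive reviews
NEGATIVE_ACTOR = "Jamie Farrow"   # fictional name tied to negative reviews

def inject_shortcut(review, label, correlation=0.9, rng=random.Random(0)):
    """Prepend an actor mention whose identity correlates with the label."""
    # With probability `correlation` use the label-aligned actor, otherwise the
    # opposite one, so the cue is predictive but not perfectly so.
    aligned = rng.random() < correlation
    actor = POSITIVE_ACTOR if (label == 1) == aligned else NEGATIVE_ACTOR
    return f"Starring {actor}. {review}"

reviews = [("A moving, beautifully shot film.", 1),
           ("The plot dragged and the dialogue fell flat.", 0)]
shortcut_reviews = [(inject_shortcut(text, y), y) for text, y in reviews]
```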
The authors apply mechanistic interpretability methods to identify attention heads that attend to these shortcuts and commit to a prediction prematurely. This makes it possible to follow how the activations of specific tokens move through the model and influence the outcome. The paper proposes a new method, Head-based Token Attribution (HTA), which traces decisions back to specific input tokens and reveals which attention heads are involved in shortcut-related predictions.
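The precise definition of HTA is given in the paper; the snippet below is only a rough reconstruction of the general idea, assuming a logit-lens-style decomposition in which an attention head's output at the final position splits linearly over source tokens and each term is projected onto the direction separating the two label logits. All tensor names are hypothetical.

```python
import torch

def head_token_attribution(attn, values, W_O_head, label_dir):
    """
    Sketch of a head-based token attribution (not necessarily the paper's exact
    HTA formulation). The head's output at the final position is
        out = sum_j attn[-1, j] * (values[j] @ W_O_head),
    so each source token j receives its term projected onto the label direction
    (e.g. W_U[:, pos_label] - W_U[:, neg_label]) as its attribution score.

    attn:      (seq, seq)        attention pattern of one head
    values:    (seq, d_head)     value vectors for that head
    W_O_head:  (d_head, d_model) the head's slice of the output projection
    label_dir: (d_model,)        direction separating the two label logits
    """
    per_token = attn[-1].unsqueeze(1) * (values @ W_O_head)  # (seq, d_model)
    return per_token @ label_dir                              # (seq,) score per input token
```

Because the per-token terms sum exactly to the head's output in this sketch, the scores decompose that head's contribution to the label logit, which is what lets the attribution land on a shortcut token whenever a head keys on it.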
Key Findings
The investigation demonstrates that specific attention heads, termed "label heads," play a pivotal role: they anticipate the sentiment label before the full review has been processed. Notably, these heads draw on activations produced by early MLP layers to retrieve shortcut-related semantic features. The result is an intermediate prediction that contributes significant label-specific information to the final output decision.
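One way to surface this intermediate-prediction behavior, in the spirit of the paper's analysis but not necessarily its exact procedure, is to project each head's output at the final position through the unembedding and check how strongly it separates the two label tokens.

```python
def label_head_scores(head_out, W_U, pos_id, neg_id):
    """
    Simplified logit-lens-style probe for "label heads". Heads whose
    final-position output already pushes the positive label logit well above
    (or below) the negative one are carrying label-specific information before
    the forward pass has finished.

    head_out: (n_layers, n_heads, d_model)  per-head output at the last position
    W_U:      (d_model, vocab_size)         unembedding matrix
    pos_id, neg_id: vocabulary ids of the positive / negative label tokens
    """
    label_dir = W_U[:, pos_id] - W_U[:, neg_id]   # (d_model,)
    return head_out @ label_dir                   # (n_layers, n_heads) scores
```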
Quantitative evaluations underscore HTA's effectiveness in detecting these shortcuts, outperforming established interpretability methods such as LIME and Integrated Gradients. HTA separates shortcut from non-shortcut tokens more cleanly and, unlike those approaches, does not require a fixed threshold.
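The paper reports its own separability evaluation; as one illustration of the threshold-free idea, token attribution scores can be treated as a ranking and scored with AUROC against a mask of shortcut positions (the paper's metric may differ).

```python
from sklearn.metrics import roc_auc_score

def shortcut_separability(scores, is_shortcut):
    """
    Threshold-free separability between shortcut and non-shortcut tokens:
    rank tokens by attribution score and compute AUROC against the binary
    shortcut mask. A value near 1.0 means shortcut tokens dominate the top of
    the ranking; 0.5 means the method cannot tell them apart.
    """
    return roc_auc_score(is_shortcut, scores)

# Example: per-token scores vs. a mask marking the actor-name token.
print(shortcut_separability([0.1, 0.05, 0.9, 0.2], [0, 0, 1, 0]))
```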
Practical Implications
This research holds substantial promise for improving LLM reliability, particularly for mitigating shortcut effects that degrade performance on out-of-distribution data. HTA serves not only as a detection tool but also as a basis for targeted interventions, such as deactivating the attention heads responsible for shortcut behavior. Such interventions can correct shortcut-driven predictions while leaving classification of non-shortcut examples largely unaffected.
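The paper identifies and deactivates the responsible heads; the sketch below shows one common way to implement such a head ablation with the TransformerLens library. The model, layer, and head index are placeholders, not the heads found in the paper.

```python
from transformer_lens import HookedTransformer, utils

# Placeholder model and head location; the paper locates the relevant heads
# via HTA rather than by assumption.
model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 9, 6

def ablate_head(z, hook):
    # z: (batch, seq, n_heads, d_head). Zeroing one head removes its
    # contribution to the residual stream at every position.
    z[:, :, HEAD, :] = 0.0
    return z

tokens = model.to_tokens("Starring Alex Goodwin. The film was dull and overlong.")
clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens, fwd_hooks=[(utils.get_act_name("z", LAYER), ablate_head)]
)
# Comparing the label-token logits before and after the ablation shows how much
# the ablated head contributed to the (possibly shortcut-driven) prediction.
```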
Future Directions
The paper opens several avenues for future research. Expanding the scope beyond movie reviews to other domains could establish whether similar attention mechanisms govern shortcut processing across different data types. Extending the mechanistic approach to transformer architectures beyond the decoder-only models used here could likewise broaden the findings to other LLM configurations. Finally, investigating whether analogous mechanisms underlie other model biases, such as gender or racial bias, could be pivotal for ethical AI deployment.
In conclusion, the paper by Eshuijs, Wang, and Fokkens advances both the theoretical and practical understanding of shortcut exploitation in LLMs, contributing tools and insights for building more robust models. Their work underscores the importance of dissecting model internals for transparent AI systems that remain reliable and fair in practical applications.