
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs (2410.13835v2)

Published 17 Oct 2024 in cs.LG

Abstract: Practitioners have consistently observed three puzzling phenomena in transformer-based LLMs: attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability. We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures -- transformers with one to three layers -- trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become sinks for specific input domains while remaining non-sinks for others. Our theoretical analysis of the training dynamics reveals that these phenomena are driven by a mutual reinforcement mechanism. Building on these insights, we propose strategies to mitigate extreme-token phenomena during pretraining, including replacing softmax with ReLU and Adam with SGD. Next, we extend our analysis to pretrained LLMs, including Llama and OLMo, showing that many attention heads exhibit a similar active-dormant mechanism as in the BB task, and that the mutual reinforcement mechanism also governs the emergence of extreme-token phenomena during LLM pretraining. Our results reveal that many of the static and dynamic properties of extreme-token phenomena predicted by the BB task align with observations in pretrained LLMs.


Summary

  • The paper reveals that extreme-token phenomena arise from active-dormant dynamics in attention heads that vary across input domains.
  • It employs simplified transformer models and the Bigram-Backcopy task to validate theoretical predictions of logarithmic attention-logit growth and value-state shrinkage.
  • The findings offer actionable insights for improving LLM inference and quantization through targeted modifications in architecture and optimization.

Analysis of Extreme-Token Phenomena in Transformer-Based LLMs

This paper provides a detailed exploration of the extreme-token phenomena observed in transformer-based LLMs. The authors focus on three primary manifestations, collectively termed extreme-token phenomena: attention sinks, value-state drains, and residual-state peaks. These phenomena are characterized by specific tokens, termed "sink tokens," which receive disproportionately high attention weights, exhibit much smaller value states, and carry much larger residual norms than other tokens.
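All three signatures are directly measurable. Below is a minimal diagnostic sketch, assuming a Hugging Face GPT-2 for convenience (the paper studies Llama and OLMo; probing per-head value states would additionally require a forward hook on the attention value projection, omitted here for brevity):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("Summer is warm. Winter is cold.", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_attentions=True, output_hidden_states=True)

# Attention sink: how much attention mass later queries place on token 0.
for layer, attn in enumerate(out.attentions):        # each: (1, heads, q, k)
    sink_mass = attn[0, :, 1:, 0].mean(dim=-1)       # per-head mean mass on token 0
    print(f"layer {layer:2d}  max sink mass: {sink_mass.max().item():.2f}")

# Residual-state peak: norm of the residual stream at each position; the sink
# token's norm is typically far larger than the rest.
norms = out.hidden_states[-1][0].norm(dim=-1)
print("residual norms:", [f"{n:.0f}" for n in norms.tolist()])
```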

Study and Mechanisms

The authors propose a mechanistic explanation for these phenomena, introducing the concepts of active-dormant mechanisms and mutual reinforcement mechanisms.

  1. Active-Dormant Mechanism:
    • Attention heads can occupy two distinct states depending on the input domain: active, contributing significantly to predictions in particular domains, and dormant, acting as attention sinks in others. This behavior was observed both in the simplified setting and in pretrained LLMs such as Llama and OLMo (a simple probe for classifying heads is sketched after this list).
  2. Mutual Reinforcement Mechanism:
    • During pretraining, attention sinks and value-state drains reinforce each other: attention heads shift toward the extreme tokens, and the process stabilizes once attention weights concentrate on the sink token and its value states shrink.
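As a concrete illustration of the active-dormant distinction, the sketch below flags a head as dormant on a given input when nearly all of its attention mass lands on the sink token. The 0.9 threshold and the choice of position 0 as the sink are illustrative assumptions, not the paper's exact criterion.

```python
import torch

def head_states(attentions, sink_idx=0, threshold=0.9):
    """Classify each (layer, head) as 'active' or 'dormant' for one input.

    attentions: tuple of (1, heads, q_len, k_len) tensors, e.g. the
    `output_attentions` of a Hugging Face causal LM forward pass.
    """
    states = {}
    for layer, attn in enumerate(attentions):
        # Mean attention mass each head sends to the sink position.
        mass = attn[0, :, 1:, sink_idx].mean(dim=-1)
        for head, m in enumerate(mass.tolist()):
            states[(layer, head)] = "dormant" if m > threshold else "active"
    return states

# Running this on inputs from two domains (e.g., code vs. prose) and comparing
# the resulting dictionaries reveals heads that flip between the two states.
```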

Methodology and Findings

  • Experimental Setup:

The authors employ simplified transformer architectures of one to three layers trained on the Bigram-Backcopy (BB) task to dissect the underlying mechanisms. This simplified setting reproduces the phenomena observed in larger LLMs.
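For concreteness, here is a hedged reconstruction of BB-style data generation: tokens follow a fixed bigram (Markov) transition, except immediately after a trigger token, when the token preceding the trigger is copied. The vocabulary size, trigger set, and transition matrix below are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 20                                   # assumed vocabulary size
TRIGGERS = {17, 18, 19}                  # assumed trigger tokens
P = rng.dirichlet(np.ones(V), size=V)    # assumed bigram transition matrix

def sample_bb(length=64):
    seq = [int(rng.integers(V))]
    for _ in range(length - 1):
        if seq[-1] in TRIGGERS and len(seq) >= 2:
            seq.append(seq[-2])                            # backcopy branch
        else:
            seq.append(int(rng.choice(V, p=P[seq[-1]])))   # bigram branch
    return seq

print(sample_bb(16))
```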

  • Theoretical Insights:

Using the BB task, the paper derives theoretical predictions about the behavior of extreme tokens: the attention logits on the sink token grow logarithmically over training, while its value states shrink.
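Schematically, with notation simplified from the paper (t denotes the training step), the two predictions can be written as:

```latex
% Hedged schematic of the BB-task predictions; exact constants and
% regularity conditions are in the paper.
\mathrm{logit}_{\mathrm{sink}}(t) \sim c \log t \quad (c > 0),
\qquad
\lVert v_{\mathrm{sink}}(t) \rVert \ \text{decreases toward a small limit.}
```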

  • Application to LLMs:

The authors extend their analysis to pretrained LLMs and confirm that many of their theoretical predictions hold, revealing consistent patterns in attention and value states akin to those demonstrated in the BB task. Through empirical studies, they also link the rise of extreme-token phenomena in LLMs to similar mutual reinforcement dynamics observed in simplified models.
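One way to check the dynamic predictions empirically is to replay the same probe over intermediate pretraining checkpoints (OLMo publishes these). The model id and revision strings below are placeholders, to be replaced with real checkpoint names from the hub.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "allenai/OLMo-1B-hf"                      # assumed model id
CKPTS = ["step1000", "step10000", "step100000"]   # hypothetical revisions

tok = AutoTokenizer.from_pretrained(MODEL)
ids = tok("Summer is warm. Winter is cold.", return_tensors="pt").input_ids

for rev in CKPTS:
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=rev).eval()
    with torch.no_grad():
        out = model(ids, output_attentions=True, output_hidden_states=True)
    # Max over layers/heads of mean attention mass on token 0, and the
    # largest residual-stream norm: both should grow as training proceeds.
    sink = max(a[0, :, 1:, 0].mean().item() for a in out.attentions)
    peak = out.hidden_states[-1][0].norm(dim=-1).max().item()
    print(f"{rev}: sink mass {sink:.2f}, peak residual norm {peak:.1f}")
```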

Implications and Future Directions

The findings have several practical and theoretical implications:

  • Inference and Quantization:

Understanding and potentially mitigating extreme-token phenomena could improve LLM performance on long-context tasks and enable more efficient quantization, since the outsized residual norms of sink tokens are a known obstacle to low-precision formats.

  • Model Interpretability:

The insights into active-dormant dynamics provide pathways to enhancing the interpretability of attention maps, particularly in vision transformers.

  • Future Research:

Suggestions for future work include exploring architectural and optimization modifications (e.g., replacing softmax with ReLU and Adam with SGD) to mitigate extreme-token phenomena, and evaluating LLMs trained with these methods.
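A minimal sketch of the attention-side mitigation, assuming ReLU applied to scaled logits with context-length normalization (one common variant; the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def relu_attention(q, k, v):
    """q, k, v: (batch, heads, seq, head_dim). Causal ReLU attention."""
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d**0.5        # (b, h, seq, seq)
    causal = torch.tril(torch.ones(logits.shape[-2:], dtype=torch.bool))
    scores = F.relu(logits) * causal                 # zero out future positions
    # Unlike softmax, rows need not sum to 1, so a dormant head can output
    # (near-)zero everywhere instead of dumping leftover mass on a sink token.
    return (scores / k.size(-2)) @ v

# The optimizer-side mitigation is a drop-in swap:
# optimizer = torch.optim.SGD(model.parameters(), lr=...)   # instead of Adam
```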

In summary, this paper analyzes and provides mechanistic insights into extreme-token phenomena in LLMs, delivering both theoretical and empirical evidence and proposing methods to alleviate related challenges in AI model training and application.