
Attention Entropy Collapse in Transformers

Updated 16 March 2026
  • Attention Entropy Collapse is a phenomenon in which self-attention entropy drops toward zero, narrowing each query's token focus and destabilizing transformer training.
  • It manifests through nearly one-hot, rank-one attention matrices, vanishing gradients, and degraded performance in tasks like reinforcement learning, 3D vision, and language modeling.
  • Mitigation strategies such as entropy regularization, spectral normalization, and adaptive token merging are used to preserve model expressivity and training stability.

Attention entropy collapse is a critical phenomenon in transformer-based architectures whereby the entropy of the self-attention distributions rapidly declines, resulting in overly concentrated or “peaked” attention on a small subset of tokens. This degeneracy fundamentally impairs the model’s capacity to capture rich dependencies, destabilizes learning dynamics, and causes sharp performance degradation—especially when scaling model capacity or when removing key architectural nonlinearities. Attention entropy collapse has been analyzed and mitigated across a range of machine learning domains, including reinforcement learning, 3D vision, sequence modeling, private inference for LLMs, and generic self-attention networks.

1. Mathematical Definition and Manifestations

Let $A \in \mathbb{R}^{n\times n}$ denote the attention probability matrix after softmax normalization; its entries $A_{ij}$ satisfy $A_{ij} \ge 0$ and $\sum_j A_{ij} = 1$ for each row $i$. The rowwise Shannon entropy is defined as

$$H(A_{i\cdot}) = -\sum_{j=1}^{n} A_{ij}\log A_{ij},$$

and the mean entropy over all queries is

$$H(A) = \frac{1}{n}\sum_{i=1}^{n} H(A_{i\cdot}).$$

Attention entropy collapse refers to the regime where $H(A) \to 0$, i.e., each row of $A$ becomes nearly one-hot. In empirical studies, this is characterized by highly localized attention, with the model effectively ignoring all but a few tokens per query. The phenomenon has been observed across the domains surveyed in the following sections.
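This rowwise entropy is straightforward to monitor during training. A minimal NumPy sketch (the function name is illustrative, not from any cited codebase):

```python
import numpy as np

def attention_entropy(logits: np.ndarray) -> float:
    """Mean rowwise Shannon entropy H(A) of softmax(logits).

    logits: (n, n) unnormalized attention scores for one head.
    """
    # Numerically stable softmax per row.
    z = logits - logits.max(axis=-1, keepdims=True)
    A = np.exp(z)
    A /= A.sum(axis=-1, keepdims=True)
    # H(A_i) = -sum_j A_ij log A_ij, with the convention 0 log 0 = 0.
    H_rows = -np.sum(np.where(A > 0, A * np.log(A), 0.0), axis=-1)
    return float(H_rows.mean())

rng = np.random.default_rng(0)
L = rng.standard_normal((8, 8))
print(attention_entropy(L))       # moderate entropy, well above zero
print(attention_entropy(20 * L))  # sharp logits: entropy collapses toward zero
```

Uniform attention attains the maximum $\log n$; nearly one-hot rows drive the value toward zero, which is the collapse signal tracked in the studies cited below.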

2. Theoretical Origins: Spectral Norms, Eigenspectrum, and Mean-Field Analysis

A unifying theoretical perspective links attention entropy collapse to properties of the attention-logit generator matrices. In single-head attention, the unnormalized logits are $L = X W_Q (X W_K)^\top / \sqrt{d_k}$, where $X$ is the token matrix and $W_Q, W_K$ are the query and key weights.

Key theoretical results include:

  • The minimum attainable attention entropy for each row decays exponentially in the spectral norm of the logits,

$$H(A_{i\cdot}) \ge \log\bigl(1 + (n-1)\exp(-\|L\|_2)\bigr) + \mathcal{O}\bigl(\|L\|_2\, e^{-\|L\|_2}\bigr),$$

implying that unconstrained weight growth drives entropy to zero (Zhai et al., 2023).

  • In multi-layer settings, mean-field and diffusion analyses show that the entropy decay rate is $O(1/L)$ in layer depth, and the attention matrix becomes near rank-one (Li et al., 25 Dec 2025).
  • The eigenspectrum variance of the QK-parameter matrix (e.g., $\mathrm{Var}[w]$ for eigenvalues $w$) controls the lower bound on entropy; small variance prevents both rank and entropy collapse (Bao et al., 2024).
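The qualitative content of the spectral-norm bound is easy to reproduce numerically: scaling up the query/key weights inflates $\|L\|_2$ and pushes the mean entropy down. A hedged sketch with random Gaussian tokens and weights (dimensions and scales are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 16, 32
X = rng.standard_normal((n, d))
WQ = rng.standard_normal((d, d)) / np.sqrt(d)
WK = rng.standard_normal((d, d)) / np.sqrt(d)

def mean_entropy(scale: float) -> float:
    """H(A) when W_Q and W_K are each scaled by `scale` (which grows ||L||_2)."""
    L = (X @ (scale * WQ)) @ (X @ (scale * WK)).T / np.sqrt(d)
    z = L - L.max(axis=-1, keepdims=True)
    A = np.exp(z)
    A /= A.sum(axis=-1, keepdims=True)
    return float(-(A * np.log(A + 1e-12)).sum(axis=-1).mean())

for s in (0.5, 1.0, 2.0, 4.0):
    # Entropy shrinks as the weight scale, and hence the logit norm, grows.
    print(s, mean_entropy(s))
```

This is the mechanism behind the "unconstrained weight growth" failure mode: nothing in the vanilla objective stops the scale from increasing, so entropy drifts toward its exponentially small floor.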

3. Empirical Consequences and Failure Modes

Attention entropy collapse has several empirically verified failure modes:

  • Learning Instability: In RL, training of large transformer critics is destabilized, with average success rates collapsing as model size increases, tightly coupled to vanishing attention entropy (Dong et al., 1 Feb 2026).
  • Expressivity Loss: Near rank-one attention matrices cause token representations to collapse onto an almost one-dimensional submanifold, decimating the model’s ability to distinguish input states or views (Li et al., 25 Dec 2025).
  • Brittle Q-Surfaces: In RL, over-peaked attention yields non-smooth Q-surfaces with high-frequency oscillations (Dong et al., 1 Feb 2026).
  • Trainability Plateaux: In sequence models, low entropy is associated with vanishing gradients for most tokens, and loss surfaces develop plateaux that slow or halt improvement (Bao et al., 2024).
  • Training Divergence: Empirical results across modalities (vision, language modeling, machine translation, speech) show that models suffer divergence or oscillating loss when any layer/head enters the low-entropy regime (e.g., $H \lesssim 0.5$) (Zhai et al., 2023).
  • Failure in Private Inference: Removal of nonlinearities for private inference induces dual failure modes: entropy collapse in deep layers and “entropic overload” (excessively diffuse attention, high entropy) in shallow layers (Jha et al., 7 Jan 2025).

4. Mechanisms and Conditions Favoring Entropy Collapse

Several architectural and optimization factors precipitate attention entropy collapse:

  • Scaling Model Capacity: Larger hidden dimensions and greater depth in transformers reliably reduce attention entropy, unless explicitly regularized (Dong et al., 1 Feb 2026).
  • Lack of Spectral/Norm Constraints: Unconstrained growth of attention-logit weight norms ($\|W_Q W_K^\top\|_2$) pushes entropy toward zero (Zhai et al., 2023).
  • Absent Nonlinearities or Normalization: Removing LayerNorm, GELU/ReLU, or similar nonlinear operations destabilizes attention-head diversity and can precipitate both collapse and overload, especially in resource-constrained or privacy-sensitive settings (Jha et al., 7 Jan 2025).
  • Global Self-Attention over Long Sequences: In tasks like long-sequence 3D vision, large-scale global attention without periodic down-sampling leads to collapse at a rate precisely predicted by a mean-field PDE (Li et al., 25 Dec 2025).
  • Non-concentrated QK Eigenspectrum: High variance in the QK-parameter eigenspectrum increases the spectral norm and drives entropy lower, fostering collapse; small variance stabilizes both entropy and effective rank (Bao et al., 2024).

5. Mitigation Strategies and Regularization Approaches

Multiple mitigation mechanisms have been proposed and empirically validated:

| Regularization Strategy | Target of Control | Core Mechanism |
|---|---|---|
| Layer-wise entropy regularization (Dong et al., 1 Feb 2026) | Entropy per attention layer | Adds loss terms to drive entropy toward a task-specific target |
| Spectral normalization / σReparam (Zhai et al., 2023; Jha et al., 7 Jan 2025) | Spectral norm of weights | Reparametrizes linear layers to cap spectral norm |
| Learnable attention temperatures (Jha et al., 7 Jan 2025) | Head-level softmax sharpness | Adjusts temperature per head/query to keep entropy near threshold |
| Static weight normalization (Jha et al., 7 Jan 2025) | FFN weight scale | Replaces LayerNorm for private inference |
| QK-eigenspectrum variance minimization ("LocAteR") (Bao et al., 2024) | Variance of QK eigenvalues | Penalizes the trace of $W^2$ while enforcing $\mathrm{tr}\, W \approx 1$ |
| Token merging/downsampling (Li et al., 25 Dec 2025) | Effective layer depth | Reduces token count to slow diffusion and delay collapse |
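The spectral-normalization row can be sketched concretely: rescale each weight matrix so its spectral norm equals a constant $\gamma$ (learnable in σReparam), which caps the logit norm and hence the entropy floor. A minimal NumPy sketch, with the spectral norm estimated by power iteration (the function name and iteration count are illustrative):

```python
import numpy as np

def sigma_reparam(W: np.ndarray, gamma: float = 1.0, iters: int = 100) -> np.ndarray:
    """Return W rescaled so its spectral norm is approximately gamma.

    The top singular value is estimated by power iteration; in the actual
    reparameterization gamma would be a learnable scalar per layer.
    """
    v = np.ones(W.shape[1]) / np.sqrt(W.shape[1])
    for _ in range(iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    sigma = float(u @ W @ v)  # estimated top singular value
    return (gamma / sigma) * W

W = np.random.default_rng(2).standard_normal((64, 64))
W_hat = sigma_reparam(W, gamma=1.0)
print(np.linalg.norm(W_hat, 2))  # close to 1.0: spectral norm is capped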

For each strategy, empirical studies report that stabilizing attention entropy (i) prevents catastrophic performance drops, (ii) enables efficient hyperparameter tuning, and (iii) allows for scaling transformers to larger model sizes without the previously observed degeneracies.
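The layer-wise entropy regularizer can likewise be sketched as an auxiliary penalty pulling $H(A)$ toward a target $\bar{H}$; the quadratic penalty form, coefficient, and target value below are illustrative assumptions, not the exact losses from the cited papers:

```python
import numpy as np

def entropy_reg_loss(A: np.ndarray, H_target: float, coef: float = 0.1) -> float:
    """Auxiliary penalty coef * (H(A) - H_target)^2 for one attention layer.

    A is a softmax attention matrix; the term is added to the task loss to
    keep entropy away from the collapse regime H -> 0.
    """
    H = float(-(A * np.log(A + 1e-12)).sum(axis=-1).mean())
    return coef * (H - H_target) ** 2

n = 8
# Nearly one-hot attention (collapsed) versus uniform attention.
A_sharp = np.full((n, n), 1e-6)
np.fill_diagonal(A_sharp, 1.0 - (n - 1) * 1e-6)
A_uniform = np.full((n, n), 1.0 / n)

# With a high entropy target, the collapsed matrix is penalized heavily
# while the uniform one incurs almost no penalty.
print(entropy_reg_loss(A_sharp, H_target=np.log(n)))
print(entropy_reg_loss(A_uniform, H_target=np.log(n)))
```

In practice the target would be tuned per task (the open problem noted in Section 7), and the penalty is summed over layers alongside the primary objective.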

6. Domain-Specific Examples and Outcomes

  • Transformer Q-Learning (TQL): Entropy-regularized critic loss stabilizes large RL value networks, yielding up to a 43% relative performance gain as model capacity increases (Dong et al., 1 Feb 2026).
  • VGGT and Token Merging: In large-vision transformers, periodic token merging stretches the entropy collapse timescale, improving reconstruction accuracy and geometric discrimination (Li et al., 25 Dec 2025).
  • Language Modeling and Translation: σReparam enables deep transformers to train stably (up to 100 layers) without LayerNorm or adaptive optimizers, eliminating entropy collapse and oscillating losses (Zhai et al., 2023).
  • Private LLMs: Careful per-head entropy regularization and LayerNorm alternatives avoid both collapse and overload in architectures where standard nonlinearities are omitted to reduce inference cost (Jha et al., 7 Jan 2025).
  • Eigenspectrum Regularization: Direct control of the QK-parameter spectrum simultaneously prevents entropy and rank collapse, and empirically lowers perplexity and improves signal propagation in sequence modeling (Bao et al., 2024).

7. Limitations and Open Research Questions

Although entropy regularization and spectral constraints have empirically stabilized training and improved robustness in multiple domains, several open questions remain:

  • Optimal selection and automation of entropy targets ($\bar{H}$) for regularization remain open problems (Dong et al., 1 Feb 2026).
  • Joint optimization of architectural norms and entropy objectives can introduce implementation complexity and may interact nontrivially with other stabilization techniques (Jha et al., 7 Jan 2025).
  • No general theoretical convergence guarantees exist for joint parameter and entropy coefficient updates (Dong et al., 1 Feb 2026).
  • In some settings, aggressive regularization or extreme token merging risks over-regularizing and reducing expressiveness or geometric resolution (Li et al., 25 Dec 2025).
  • The interplay between entropy collapse and other failure modes, such as rank collapse, sharpness boundaries, and underfitting due to entropic overload, continues to be an active area of theoretical and empirical inquiry (Bao et al., 2024, Jha et al., 7 Jan 2025).

Future research is directed toward adaptive online monitoring and control of attention entropy, generalizing entropy-based stabilization methods across modalities (e.g., world models, trajectory transformers), and exploring multi-modal and anisotropic attention dynamics. Attention entropy collapse provides a principled framework for diagnosing and rectifying scaling pathologies in transformer architectures and remains a foundational concept for scalable, stable deep learning (Dong et al., 1 Feb 2026, Li et al., 25 Dec 2025, Zhai et al., 2023, Jha et al., 7 Jan 2025, Bao et al., 2024).
