Dormant Head Pruning in Transformers

Updated 24 December 2025
  • Dormant head pruning is a technique that identifies and removes attention heads whose contributions to model outputs are negligible.
  • It employs criteria like activation norms, gradient-based importance, and attention entropy to detect low-impact heads.
  • Pruning these dormant heads enhances inference speed, computational efficiency, and parameter reduction while preserving model accuracy.

Dormant head pruning refers to the identification and removal of attention heads in Transformer-based architectures that fail to make significant contributions to the model’s outputs or loss, a phenomenon empirically observed in both language and vision domains. The resulting pruned models retain the most salient or specialized heads, yielding improvements in inference speed, computational efficiency, and, in some setups, parameter efficiency while incurring negligible or rigorously bounded accuracy degradation. Dormant heads are generally defined through empirical criteria such as low activation norms, minimal gradient-based importance, or high attention entropy, and can be discovered and removed using a spectrum of structured, hardware-aware, or end-to-end differentiable approaches. The following sections survey the principal methodologies, definitions, empirical evidence, algorithmic strategies, and context of dormant head pruning across recent literature.

1. Formal Definitions and Identification Criteria for Dormant Heads

Dormant heads are operationally defined as attention heads whose outputs or influence on the model’s predictions are statistically or functionally negligible. Multiple quantitative criteria have been established:

  • Activation-Norm-Based (HONOR): A head $h$ in layer $\ell$ is deemed dormant if its average output norm is significantly less than the per-layer mean, i.e.,

$$
\frac{\mathrm{AvgNorm}(H_{h}^{(\ell)})}{\mu^{(\ell)}} < \tau,
$$

where $H_{h}^{(\ell)}$ is the head output and $\mu^{(\ell)}$ is the mean norm across all heads in the layer; $\tau \approx 0.478$ is a robust threshold across models for up to 25% pruning with ≤1% accuracy drop (Sandoval-Segura et al., 4 Apr 2025). A minimal sketch of this criterion appears after this list.

  • Gradient-Based Importance: The expected gradient magnitude with respect to a gating mask on the head, $S^{(u,i)} = E_{x\sim\mathcal{D}}\,\lvert \partial\mathcal{L}/\partial\xi(u,i) \rvert$, quantifies the sensitivity of loss to disabling the head. Low $S^{(u,i)}$ identifies dormant heads (Liu et al., 2023, Choi et al., 10 Oct 2025).
  • Head Importance–Entropy Score (HIES): Combines gradient-based importance (HIS) and normalized attention entropy to capture both loss impact and diversity of attention patterns:

$$
\mathrm{HIES}_h = \alpha\,\tilde{I}_h + (1-\alpha)\,(1-\tilde{H}_h),
$$

with $0 \leq \alpha < 1$; this combination enables more accurate and stable dormant head detection (Choi et al., 10 Oct 2025).

  • Max-Attention-Based "Confidence": In federated and distributed training, the average maximum attention weight per head, $\alpha_{hc}$, is used as a confidence score for head importance at each client (Venkatesha et al., 31 May 2025).
  • Neuron-Level and SVD/Alignment Metrics: In vision transformers, informativeness is derived from SVD decompositions of per-head attention matrices, scoring the alignment of query–key outer products with principal singular components; redundant heads are identified via systematic zeroing of low-informative Q–K neuron pairs and V filters (Shim et al., 2024).
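
The activation-norm and HIES criteria above can be expressed compactly in code. The sketch below assumes per-head outputs have been collected (e.g., via a forward hook) into a tensor of shape (num_heads, num_tokens, head_dim); the min-max normalization used for the HIES terms is an assumption of this sketch rather than a detail fixed by the cited papers.

```python
import torch

def dormant_heads_by_norm(head_outputs: torch.Tensor, tau: float = 0.478) -> torch.Tensor:
    """HONOR-style criterion: flag heads whose average output norm falls below
    a fraction tau of the per-layer mean norm.

    head_outputs: (num_heads, num_tokens, head_dim) for a single layer.
    Returns a boolean mask, True where the head is dormant.
    """
    avg_norm = head_outputs.norm(dim=-1).mean(dim=-1)   # AvgNorm(H_h^{(l)}), shape (num_heads,)
    layer_mean = avg_norm.mean()                          # mu^{(l)}
    return (avg_norm / layer_mean) < tau

def hies_scores(importance: torch.Tensor, attn_entropy: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """HIES-style score: blend normalized importance with (1 - normalized entropy).
    Lower scores indicate more dormant heads."""
    def minmax(x: torch.Tensor) -> torch.Tensor:
        return (x - x.min()) / (x.max() - x.min() + 1e-8)
    return alpha * minmax(importance) + (1.0 - alpha) * (1.0 - minmax(attn_entropy))
```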

2. Algorithmic Pruning Strategies

A variety of algorithmic frameworks for dormant head pruning have been established:

  • Thresholding and Masking: Compute per-head statistics (activation norm, importance score, entropy, or confidence), then zero or remove all heads below a predefined threshold or percentile, sometimes applied dynamically per input (Sandoval-Segura et al., 4 Apr 2025, Liu et al., 2023, Choi et al., 10 Oct 2025, Venkatesha et al., 31 May 2025); a minimal sketch follows this list.
  • Cascade Pruning: SpAtten propagates pruning decisions across attention layers, using cumulative head importance (ℓ₁-norm sum over outputs) and a top-$k$ algorithm in hardware to ensure efficient execution. Heads dropped in a layer remain pruned in all subsequent layers (Wang et al., 2020).
  • Differentiable Subset/Structured Pruning: DSP and other methods introduce continuous gates $g_h$ for each head (constrained via Gumbel-Softmax/Concrete relaxations), learned end-to-end together with the model weights. Upon convergence, hard thresholding produces a sparse head configuration (Li et al., 2021, Shim et al., 2021).
  • Graph-Structured Neuron Pruning: The SNP framework constructs bipartite graphs of query–key filter pairs, pruning in a way that respects structural equality constraints and structured residuals, thereby ensuring the complete deactivation and removal of dormant heads in the vision domain (Shim et al., 2024).
  • Federated/Client-Weighted Pruning: In distributed PEFT, dormant head pruning is conducted per-client and communicated heads are down-selected to those exceeding local confidence thresholds, followed by importance-weighted aggregation of client-specific updates (Venkatesha et al., 31 May 2025).
  • Attention Sink Detection: Heads consistently attending to uninformative “sink tokens” are identified statistically; if a head’s attention is drawn excessively to such tokens, it is a candidate for pruning (Sandoval-Segura et al., 4 Apr 2025).
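
The thresholding-and-masking strategy in the first bullet can be sketched as follows. Tensor shapes, function names, and the fraction-based cutoff are illustrative assumptions; the same float mask can feed the zeroing-based masking used by hardware-oriented methods.

```python
import torch

def head_mask_from_scores(scores: torch.Tensor, prune_fraction: float = 0.2) -> torch.Tensor:
    """Keep/drop mask over heads: prune the lowest-scoring fraction.

    scores: per-head importance/dormancy scores, shape (num_heads,).
    Returns a float mask (1.0 = keep, 0.0 = prune).
    """
    num_prune = int(prune_fraction * scores.numel())
    mask = torch.ones_like(scores)
    if num_prune > 0:
        drop_idx = scores.topk(num_prune, largest=False).indices
        mask[drop_idx] = 0.0
    return mask

def apply_head_mask(per_head_context: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Zero the contribution of pruned heads before the attention output projection.

    per_head_context: (batch, num_heads, seq_len, head_dim).
    """
    return per_head_context * mask.view(1, -1, 1, 1)
```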

3. Quantitative Empirical Evidence and Benchmarks

Extensive empirical studies confirm the efficacy of dormant head pruning:

| Criterion | Pruning Fraction | Accuracy Drop (max) | Task/Model | Reference |
|---|---|---|---|---|
| HONOR | 17–26% | ≤1% | Llama-2/3, OLMo-2, Qwen2.5 (ARC, MMLU, GSM8K) | (Sandoval-Segura et al., 4 Apr 2025) |
| SpAtten Cascade | 10% | None measurable | BERT, GPT-2 (30 NLP benchmarks) | (Wang et al., 2020) |
| Gradient/HIS | ~50% | 7–15% improvement† | BERT, ViT, LLaMA-2, LLaVA | (Choi et al., 10 Oct 2025) |
| SNP (ViT) | 56% QK / 55% V | −1.3% | DeiT-Small (ImageNet top-1) | (Shim et al., 2024) |
| Confidence | 90% | <2% | T5-small/LoRA (federated MultiNLI, XL-Sum, E2E) | (Venkatesha et al., 31 May 2025) |

† Relative average downstream task accuracy compared to HIS-only baselines.

Substantial speedups and parameter reductions are consistently reported alongside these accuracy results across the surveyed methods.

4. Practical Implementation Considerations

Effective dormant head pruning requires careful algorithmic and architectural design:

  • Stability: Techniques such as gate initialization bias, sparsity warm-up, and activation scaling prevent collapse or excessive pruning during training (Shim et al., 2021).
  • Hardware Support: Fast top-$k$ selection engines and a unified masking strategy (zeroing outputs) enable real-time, per-input pruning and immediate hardware speedup; FPGA/ASIC implementations show negligible area/power overhead for the selection logic (Wang et al., 2020).
  • Retraining/Calibration: Fine-tuning of pruned networks recovers most or all performance lost to aggressive pruning, and iterative calibration is used in SNP and other frameworks (Shim et al., 2024).
  • Hyperparameters: Pruning thresholds (either direct or proportion-based) are set via empirical calibration; typical safe values are $\tau \approx 0.48$ for the activation-norm criterion, 10–15% head pruning for no accuracy drop, and 20–25% for ≤1% degradation (Sandoval-Segura et al., 4 Apr 2025, Choi et al., 10 Oct 2025).
  • Layerwise/Global Normalization: Importance and entropy metrics can be normalized either per layer or globally; layerwise normalization enhances adaptivity in deeply stacked architectures (Choi et al., 10 Oct 2025). A minimal normalization sketch follows this list.
  • Compatibility: Structured neuron-level methods comfortably integrate with existing head/block pruning, adapters, LoRA, and other PEFT schemes (Shim et al., 2024, Venkatesha et al., 31 May 2025).
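
As a concrete illustration of the layerwise-versus-global normalization choice, the helper below min-max normalizes per-layer head scores either within each layer or over all layers at once; the use of min-max scaling is an assumption of this sketch.

```python
import torch

def normalize_head_scores(scores_per_layer, layerwise: bool = True):
    """Min-max normalize head scores per layer (layerwise=True) or over the whole model.

    scores_per_layer: list of 1-D tensors, one (num_heads,) tensor per layer.
    Returns a list of tensors with the same shapes, scaled to [0, 1].
    """
    def minmax(x: torch.Tensor, lo: torch.Tensor, hi: torch.Tensor) -> torch.Tensor:
        return (x - lo) / (hi - lo + 1e-8)

    if layerwise:
        return [minmax(s, s.min(), s.max()) for s in scores_per_layer]
    flat = torch.cat(scores_per_layer)
    lo, hi = flat.min(), flat.max()
    return [minmax(s, lo, hi) for s in scores_per_layer]
```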

5. Applications and Model Families

Dormant head pruning has been demonstrated across a broad range of model families and deployment settings: decoder-only LLMs (Llama-2/3, OLMo-2, Qwen2.5), encoder models such as BERT and T5, GPT-2, vision transformers (ViT, DeiT), multimodal models (LLaVA), and parameter-efficient fine-tuning in federated environments (Wang et al., 2020, Shim et al., 2024, Choi et al., 10 Oct 2025, Sandoval-Segura et al., 4 Apr 2025, Venkatesha et al., 31 May 2025).

6. Analysis: Dynamics, Emergence, and Limitations

The emergence and behavior of dormant heads are non-trivial:

  • Pretraining Dynamics: Dormant head percentage is near zero in early pretraining, spikes sharply during specific training phases, and subsequently stabilizes; individual heads can transition between active and dormant states (Sandoval-Segura et al., 4 Apr 2025).
  • Dependency on Data: Dormancy rates vary with input characteristics—fluent prose correlates with higher dormancy, while structured/math/code-like inputs exhibit less dormancy (Sandoval-Segura et al., 4 Apr 2025).
  • Robustness Across Models: HONOR and max-norm–based definitions generalize well across architectures, tasks, and data regimes with minimal hyperparameter adjustments (Sandoval-Segura et al., 4 Apr 2025, Choi et al., 10 Oct 2025).
  • Architectural Constraints: Full proportional computational reduction requires architectures (e.g., All-attention without distinct feed-forward modules) where head pruning removes entire units of compute and parameters (Shim et al., 2021).
  • Ablation Safety: Replacing dormant head outputs with noise (instead of zeroing them) leaves model accuracy similarly unchanged, supporting the irrelevance hypothesis for such heads (Sandoval-Segura et al., 4 Apr 2025); a minimal sketch follows this list.
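
The noise-replacement ablation mentioned above can be sketched as below. The choice of Gaussian noise scaled to the head's own output standard deviation is an assumption of this sketch, not a detail fixed by the cited work.

```python
import torch

def ablate_head(per_head_context: torch.Tensor, head_idx: int, mode: str = "zero") -> torch.Tensor:
    """Ablate one head's output, either by zeroing it or replacing it with random noise.

    per_head_context: (batch, num_heads, seq_len, head_dim) per-head context vectors.
    """
    out = per_head_context.clone()
    if mode == "zero":
        out[:, head_idx] = 0.0
    elif mode == "noise":
        # Match the noise scale to the ablated head's output spread (illustrative choice).
        scale = per_head_context[:, head_idx].std()
        out[:, head_idx] = torch.randn_like(per_head_context[:, head_idx]) * scale
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out
```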

7. Future Perspectives and Open Directions

Current evidence demonstrates that dormant head pruning is a robust, principled, and efficient means of structured model sparsification with wide applicability. Open research areas include:

  • Advanced importance criteria leveraging richer statistical or information-theoretic head behaviors.
  • Dynamic or sample-specific dormant head reconfiguration for adaptive inference.
  • Integration with other sparsity paradigms (e.g., neuron/block pruning, quantization) in unified frameworks.
  • Further empirical analysis of dormant head specialization, redundancy, and potential functional repurposing (e.g., for explicit structure encoding or auxiliary reasoning tasks).
  • Hardware-software co-designs extending beyond top-$k$ routing to support online per-head computation gating at scale, especially in resource-constrained or privacy-preserving applications.

The state-of-the-art positions dormant head pruning as both a tool for efficient Transformer deployment and a window into the interpretability and specialization of self-attention architectures (Wang et al., 2020, Shim et al., 2024, Shim et al., 2021, Li et al., 2021, Liu et al., 2023, Choi et al., 10 Oct 2025, Sandoval-Segura et al., 4 Apr 2025, Venkatesha et al., 31 May 2025).
