Attention Heads in LLMs

Updated 13 March 2026
  • Attention heads are modular units in LLMs that project token representations into lower-dimensional query, key, and value subspaces and compute scaled dot-products to extract relevant signals.
  • They underpin diverse capabilities—including syntactic parsing, arithmetic reasoning, semantic recall, and safety behaviors—by localizing functions to specific heads.
  • Recent ablation studies reveal that only a sparse set of heads is causally necessary, enabling targeted interventions for improved model interpretability and performance.

An attention head in an LLM is a computational unit within a multi-head attention module. It projects the input token representations into lower-dimensional query, key, and value subspaces, computes scaled dot-product affinities, and extracts context-specific signals that propagate through the Transformer architecture. Attention heads collectively underpin a wide range of LLM capabilities, from syntactic parsing and arithmetic to semantic recall and safety behavior. Recent mechanistic analysis has revealed that head-level computations are highly modular, with only sparse, distinct sets of heads being causally necessary for specific abilities.

1. Structural and Mathematical Underpinnings

A Transformer layer contains $H$ parallel attention heads. For input token representations $X \in \mathbb{R}^{T \times d}$, each head $h$ computes

$$Q_h = X W^Q_h,\quad K_h = X W^K_h,\quad V_h = X W^V_h,$$

$$A_h = \mathrm{softmax}\left(\frac{Q_h K_h^\top}{\sqrt{d_k}}\right) \in [0,1]^{T \times T},$$

$$O_h = A_h V_h.$$

Heads operate on $d_k = d/H$-dimensional projections. The multi-head outputs $\{O_1, \ldots, O_H\}$ are concatenated and post-processed via $W^O$.
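
The equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name, the stacked weight layout `(H, d, d_k)`, and the random toy inputs are assumptions made for the example, and no causal mask is applied because the formula above does not include one.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (T, d); W_q, W_k, W_v: (H, d, d_k); W_o: (H * d_k, d)."""
    H, _, d_k = W_q.shape
    Q = np.einsum("td,hdk->htk", X, W_q)   # Q_h = X W^Q_h
    K = np.einsum("td,hdk->htk", X, W_k)   # K_h = X W^K_h
    V = np.einsum("td,hdk->htk", X, W_v)   # V_h = X W^V_h
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))  # (H, T, T)
    O = A @ V                              # O_h = A_h V_h, shape (H, T, d_k)
    # Concatenate head outputs along the feature axis, then apply W^O.
    concat = O.transpose(1, 0, 2).reshape(X.shape[0], H * d_k)
    return concat @ W_o, A

# Toy dimensions (hypothetical): T tokens, model width d, H heads.
rng = np.random.default_rng(0)
T, d, H = 5, 16, 4
d_k = d // H
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(H, d, d_k)) for _ in range(3))
W_o = rng.normal(size=(H * d_k, d))
Y, A = multi_head_attention(X, W_q, W_k, W_v, W_o)
```

Each row of every `A[h]` sums to one, which is what places $A_h$ in $[0,1]^{T \times T}$.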

Extensions such as Grouped Query Attention (GQA) and Knocking-Heads Attention (KHA) generalize this structure by introducing cross-head parameter sharing and feature-wise interactions (e.g., $T^Q, T^K, T^V$ as shared transforms), modestly increasing parameter efficiency and stability (Zhou et al., 27 Oct 2025).
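
As a rough illustration of cross-head sharing, the sketch below shows the GQA idea only: several query heads reuse one key/value projection. The function name and weight layout are assumptions for this example, and KHA's shared transforms are not modeled here.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(X, W_q, W_k, W_v):
    """W_q: (H, d, d_k); W_k, W_v: (G, d, d_k), with H divisible by G."""
    H, _, d_k = W_q.shape
    G = W_k.shape[0]
    if H % G != 0:
        raise ValueError("number of query heads must be a multiple of G")
    Q = np.einsum("td,hdk->htk", X, W_q)
    K = np.einsum("td,gdk->gtk", X, W_k)
    V = np.einsum("td,gdk->gtk", X, W_v)
    # Each key/value head serves H // G consecutive query heads.
    K = np.repeat(K, H // G, axis=0)
    V = np.repeat(V, H // G, axis=0)
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))
    return A @ V  # (H, T, d_k)

# Toy setup (hypothetical sizes): 4 query heads share 2 key/value heads.
rng = np.random.default_rng(0)
T, d, H, G, d_k = 5, 16, 4, 2, 4
X = rng.normal(size=(T, d))
W_q = rng.normal(size=(H, d, d_k))
W_k = rng.normal(size=(G, d, d_k))
W_v = rng.normal(size=(G, d, d_k))
O = grouped_query_attention(X, W_q, W_k, W_v)
```

With `G = H` this reduces to standard multi-head attention; smaller `G` cuts the number of key/value parameters proportionally.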

2. Functional Modularity, Capability Localization, and Sparsity

Empirical studies show that most complex capabilities are implemented by highly localized—often extremely sparse—subsets of heads:

  • Capability localization: Compressed Sensing (CS) knockouts reveal that ablating as few as $k \leq 5$ of $N = 512$ to $1024$ heads in Llama-/Qwen-type LLMs can cause a $50$ to $65\%$ accuracy drop on tasks (e.g., GSM8K, code generation), with little effect on unrelated benchmarks ($\Delta\mathrm{Gen} < 3\%$) (Bair et al., 11 Feb 2026).
  • Compressed Sensing algorithm: For $M$ random ablation masks $\Phi \in \{0,1\}^{M \times N}$ (each row zeros a subset of heads), output degradation follows

$$y - \beta_0 \cdot \mathbf{1} = \Phi x + \epsilon.$$

Solving the $\ell_1$-minimization problem recovers the $k$-sparse set of task-critical heads.

| Task  | Model    | ΔTask (%) | ΔGen (%) |
|-------|----------|-----------|----------|
| GSM8K | Llama-8B | -48.4     | -1.1     |
| GSM8K | Qwen-7B  | -65.4     | -1.8     |
| MBPP  | Llama-8B | -16.0     | -2.0     |
| Swear | Llama-8B | -85.4     | -0.4     |
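
The compressed-sensing recovery step can be illustrated on synthetic data. The sketch below is one way to pose the $\ell_1$ problem, not the procedure of the cited paper: it assumes $\Phi_{ij} = 1$ means head $j$ is ablated in trial $i$, treats $y$ as the measured degradation per trial, and solves a Lasso objective with a plain ISTA loop. The solver choice, the `lam_frac` heuristic, and all numbers are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def recover_sparse_heads(Phi, y, lam_frac=0.1, n_iter=3000):
    """Estimate a sparse x in y - beta_0 * 1 = Phi x + eps via ISTA (Lasso)."""
    # Centering the columns and the response absorbs the intercept beta_0.
    Phi_c = Phi - Phi.mean(axis=0)
    y_c = y - y.mean()
    # Heuristic penalty: a fraction of the smallest lambda that zeros everything.
    lam = lam_frac * np.abs(Phi_c.T @ y_c).max()
    # Step size 1/L, where L is the squared spectral norm of Phi_c.
    step = 1.0 / np.linalg.norm(Phi_c, 2) ** 2
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi_c.T @ (Phi_c @ x - y_c)
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Synthetic experiment (all values hypothetical): 128 heads, 64 ablation
# trials, and 3 heads that actually matter for the task.
rng = np.random.default_rng(0)
N, M, k = 128, 64, 3
true_heads = np.array([7, 42, 99])
x_true = np.zeros(N)
x_true[true_heads] = [0.5, 0.4, 0.3]
Phi = (rng.random((M, N)) < 0.5).astype(float)   # 1 = head ablated in this trial
y = 0.05 + Phi @ x_true + 0.01 * rng.normal(size=M)

x_hat = recover_sparse_heads(Phi, y)
top_k = np.sort(np.argsort(-np.abs(x_hat))[:k])
```

The point of the exercise is that `M` is smaller than `N`: far fewer ablation trials than heads are needed when only a few heads carry the effect.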

3. Specialization, Redundancy, Dormancy, and Pruning

Attention heads exhibit notable specialization, but redundancy remains widespread:

  • Dormant/sink heads: Definitions such as HONOR (Sandoval-Segura et al., 4 Apr 2025) and BOS sink score (Sok et al., 11 Jan 2026) reveal that a significant fraction of heads can be classified as "dormant" (very low average output norm) or "sink" heads (overwhelming attention on BOS or first token). Zeroing up to $14\%$ of dormant heads or aggressively pruning high-BOS sink heads yields negligible accuracy loss on general benchmarks, with redundancy most pronounced in deep layers.
  • Dynamic transitions: Dormant heads can become active and vice versa over pretraining (Sandoval-Segura et al., 4 Apr 2025).
  • Block-level compression: Layer-wise BOS sink score identifies deep layers amenable to entire block pruning (Sok et al., 11 Jan 2026). Sink-aware pruning strategies preserve accuracy far more effectively than magnitude- or activation-based heuristics.
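
The two notions above can be approximated with simple per-head statistics. The sketch below uses simplified proxies, not the exact HONOR or BOS sink score definitions from the cited papers: a head's mean output norm relative to the layer average, and the mean attention mass it places on position 0. Function names and the toy data are assumptions.

```python
import numpy as np

def bos_sink_score(A):
    """A: (H, T, T) row-stochastic causal attention.

    Returns the mean attention mass on position 0 for each head. Query 0 is
    skipped because under a causal mask it can only attend to itself.
    """
    return A[:, 1:, 0].mean(axis=1)

def relative_output_norm(O):
    """O: (H, T, d_k). Mean L2 norm per head, divided by the mean over heads."""
    norms = np.linalg.norm(O, axis=-1).mean(axis=1)
    return norms / norms.mean()

# Toy layer (hypothetical): heads 0-2 attend uniformly over the causal
# prefix, head 3 puts 90% of its mass on the first token.
H, T, d_k = 4, 6, 8
rng = np.random.default_rng(0)
causal = np.tril(np.ones((T, T)))
A = np.tile(causal / causal.sum(axis=1, keepdims=True), (H, 1, 1))
A[3] = 0.0
A[3, 0, 0] = 1.0
for t in range(1, T):
    A[3, t, 0] = 0.9
    A[3, t, 1:t + 1] = 0.1 / t

# Head 2 is made nearly silent to mimic a dormant head.
O = rng.normal(size=(H, T, d_k))
O[2] *= 0.01

sink = bos_sink_score(A)
rel_norm = relative_output_norm(O)
```

In this toy layer, head 3 stands out on `sink` and head 2 on `rel_norm`; any threshold for labeling a real head as sink or dormant would need to be calibrated per model.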

4. Mechanistic Roles: Knowledge, Induction, Semantic Reasoning, and In-Context Learning

Systematic surveys and causal interventions have mapped attention heads to fine-grained computational motifs:

  1. Four-stage functional framework (Zheng et al., 2024):
    • Knowledge Recalling (parametric retrieval, e.g. memory heads)
    • In-Context Identification (context, syntax, matcher heads)
    • Latent Reasoning (induction, function-vector, semantic induction heads)
    • Expression Preparation (signal amplification, mixed heads)
  2. In-context learning (ICL):
    • Function-vector (FV) heads: Crucial for few-shot ICL; FV heads maintain latent summaries of prompt-determined functions, and ablating them causes sharp drops in ICL accuracy—even in large models (Yin et al., 19 Feb 2025).
    • Induction heads: Mechanisms for repetition-based pattern matching (e.g., "...A B ... A → B") that scaffold FV emergence in smaller models and early training (Yin et al., 19 Feb 2025, Ren et al., 2024).
    • Semantic induction heads: Specifically raise output logits for tail tokens of relational triplets, essential for pattern discovery and semantic ICL (Ren et al., 2024).
  3. Specialization for Multimodal or Task-Specific Roles:
    • Visual heads: Attention heads focusing on image tokens in multimodal LLMs are prominent in early/mid layers, with ablation sharply degrading visual understanding (Bi et al., 2024).
    • Retrieval heads: Highly discriminative attention-based retrieval heads (CoRe heads) can be contrastively isolated for re-ranking, providing state-of-the-art listwise retrieval from <1% of heads (Tran et al., 2 Oct 2025).
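
The induction pattern "...A B ... A → B" suggests a simple attention-based diagnostic. The sketch below computes a prefix-matching score on a sequence of length `n` repeated twice: for each query in the second copy, it reads the attention placed on the token that followed the same token in the first copy. This is a commonly used diagnostic offered as an illustration, not necessarily the procedure of the works cited above; the function names and the synthetic attention patterns are assumptions.

```python
import numpy as np

def prefix_matching_score(A, n):
    """A: (2n, 2n) attention of one head on a length-n sequence repeated twice."""
    queries = np.arange(n, 2 * n)   # positions in the second copy
    targets = queries - n + 1       # token after the earlier occurrence
    return A[queries, targets].mean()

def uniform_causal(T):
    """Attention that spreads evenly over the causal prefix."""
    mask = np.tril(np.ones((T, T)))
    return mask / mask.sum(axis=1, keepdims=True)

n = 8
A_uniform = uniform_causal(2 * n)

# Idealized induction head: every query in the second copy attends only to
# the token that followed its earlier occurrence.
A_induction = A_uniform.copy()
for t in range(n, 2 * n):
    A_induction[t] = 0.0
    A_induction[t, t - n + 1] = 1.0

score_induction = prefix_matching_score(A_induction, n)
score_uniform = prefix_matching_score(A_uniform, n)
```

On a real model, `A` would be taken from a forward pass on repeated random tokens, and heads with scores close to 1 would be candidates for induction behavior.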

5. Methodologies for Head Analysis and Attribution

Several experimental paradigms have become standard for analyzing attention heads:

  • Ablation and path patching: Causal attribution of head-level importance by replacing or zeroing a head's activations and measuring the effect on the model's output (Bair et al., 11 Feb 2026, Zhang et al., 2024, Zhang et al., 17 Feb 2025).
  • Norm/voting statistics: Head norm voting (NoVo) aggregates L2-norms of head outputs ("truth scores") with simple inference-only selection, enabling robust truthfulness and adversarial defense via ensemble prediction (Ho et al., 2024).
  • Capability and bias attribution: Safety Head ImPortant Score (Ships) measures the difference in model harmfulness with and without a head, and Sahara ranks heads by their causal contribution to safety (Zhou et al., 2024). For bias, a gradient-based attribution over the SEAT score ranks heads by their effect on stereotypical associations (Yang et al., 2023).
  • Parameter sharing and efficiency: Head-wise shareable attention exploits similarity between (Wq, Wk) pairs, supporting aggressive compression by parameter sharing with minimal loss (Cao et al., 2024). KHA generalizes to cross-head feature sharing with minimal overhead (Zhou et al., 27 Oct 2025).
  • Redundancy quantification: Bidirectional head ablation measures model resilience to systematic masking as a function of model scale, revealing that larger models maintain lower perplexity until a significantly higher fraction of heads is ablated, an analogue of "cognitive reserve" (Li et al., 2024).
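
Zero-ablation, the simplest of the paradigms above, can be sketched on a single toy layer. The example relies on the fact that concatenating head outputs and applying $W^O$ equals summing per-head contributions, so removing a head amounts to dropping its term. The function names, the per-head layout of `W_o`, the linear metric, and the planted "important" head are all assumptions; path patching, which swaps in activations from a different input instead of zeros, is not shown.

```python
import numpy as np

def head_contributions(O, W_o):
    """O: (H, T, d_k); W_o: (H, d_k, d). Returns each head's (T, d) output term."""
    return np.einsum("htk,hkd->htd", O, W_o)

def ablation_importance(contribs, metric):
    """Drop in `metric` when each head's contribution is zeroed in turn."""
    base = metric(contribs.sum(axis=0))
    scores = np.empty(len(contribs))
    for h in range(len(contribs)):
        kept = np.delete(contribs, h, axis=0)   # layer output without head h
        scores[h] = base - metric(kept.sum(axis=0))
    return scores

# Toy layer (hypothetical): head 2 writes strongly along the direction that
# the metric reads out, the other heads contribute small random terms.
rng = np.random.default_rng(0)
H, T, d_k, d = 4, 6, 8, 16
O = 0.1 * rng.normal(size=(H, T, d_k))
W_o = rng.normal(size=(H, d_k, d))
direction = np.zeros(d)
direction[0] = 1.0

contribs = head_contributions(O, W_o)
contribs[2] += 3.0 * direction

def metric(layer_out):
    # Stand-in for a task score: mean projection onto the readout direction.
    return layer_out.mean(axis=0) @ direction

scores = ablation_importance(contribs, metric)
most_important = int(np.argmax(scores))
```

In a full model the metric would be a task accuracy or logit difference measured after a complete forward pass, and the ranking can change with the ablation baseline (zero, mean, or patched activations).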

6. Practical Implications: Interpretability, Model Editing, and Future Directions

The modularity and localization of attention-head functions in LLMs have several direct consequences:

  • Interpretability: Sparse causal circuits and direct mapping from head activations to model behaviors enable fine-grained interpretability via post-hoc intervention, norm voting, and focus direction methods (Zhu et al., 30 Mar 2025, Ho et al., 2024).
  • Model editing: Targeted ablation or fine-tuning of identified critical heads can add, subtract, or steer specific behaviors (e.g., swearing, factual recall, translation) with minimal performance loss on other tasks (Bair et al., 11 Feb 2026, Zhang et al., 17 Feb 2025, Zhang et al., 2024).
  • Safety, fairness, and alignment: Single safety heads can act as gatekeepers, with removal leading to catastrophic rises in the rate of harmful outputs; similar principles apply to bias, where masking a handful of heads suffices to mitigate measured stereotype effects (Zhou et al., 2024, Yang et al., 2023).
  • Compression and deployment: Dynamic pruning of dormant/sink heads and head-wise sharing supports low-memory and fast-inference LLM deployment, especially on edge devices (Sandoval-Segura et al., 4 Apr 2025, Cao et al., 2024).
  • Long-context, multimodal, and modular architectures: Leveraging head-level specialization, frameworks such as LongHeads enable linear-time, training-free adaptation of any LLM to long-context processing by breaking up context across heads (Lu et al., 2024).
  • Generalization and future work: Consistent findings across models (1B–13B), tasks, and training regimes motivate the further development of head-level analysis for new abilities (reasoning, code, multimodal integration) and the design of architectures with explicit support for modular, semantically-partitioned attention (Bair et al., 11 Feb 2026).

7. Limitations, Open Problems, and Benchmarking Approaches

  • Circuit-level integration: Most studies focus on head isolation; the interplay between heads and feed-forward blocks in larger, real-world reasoning remains underexplored (Zheng et al., 2024).
  • Generalization across domains/tasks: While transferability of localized heads has been observed in arithmetic and translation, generality to complex open-ended reasoning or multimodal settings requires further study (Zhang et al., 2024, Zhang et al., 17 Feb 2025, Bi et al., 2024).
  • Evaluation protocols: Reconciling head-level mechanisms across toy, synthetic, and real benchmarks (e.g., MMLU, TruthfulQA, GSM8K) is ongoing. Metrics range from per-head ablation accuracy drops to retrieval/contrastive scores, logit impact, norm voting accuracy, and bias/safety causality (Ho et al., 2024, Sandoval-Segura et al., 4 Apr 2025).
  • Interpretation vs. intervention: Direct manipulation of heads for model editing shows promise, but concerns persist about cross-task interference and the stability of interventions over longer sequences or adversarial contexts.

Overall, the attention heads of LLMs are not homogeneous feature extractors, but a modular, distributed set of highly specialized computational subroutines whose mechanisms are increasingly accessible via contemporary interpretability methodologies. This enables targeted model steering, compression, and safety interventions that are grounded in causally measured head functions.
