Negative Attention Heads in Transformers
- Negative attention heads are transformer components that systematically degrade task performance or amplify biases and spurious correlations.
- Metrics such as NAS, CoRe, and CHG quantitatively identify these heads across language, vision, and retrieval models, with ablation studies confirming their negative contribution.
- Targeted correction methods including NASA, LTC, and AAT effectively mitigate negative heads, enhancing accuracy, robustness, and fairness.
Negative attention heads are individual attention mechanisms within transformer-based models whose contributions systematically degrade model performance or amplify undesirable biases. These heads have been implicated in several phenomena, including negative output bias in LLMs, the amplification of spurious or confounding correlations in vision-language models, and the impairment of retrieval or ranking accuracy. Recent research has produced formal definitions, detection methodologies, and targeted correction techniques for negative attention heads across text, vision, and retrieval tasks.
1. Formal Definitions and Metrics
Negative attention heads are defined via their effect on task performance or bias metrics. In LLMs, a head is negative if it exhibits a persistent, instruction-agnostic preference for negative (e.g., "No") tokens in binary decision tasks. The Negative Attention Score (NAS) quantifies this preference by comparing, within a head's attention matrix $A^{(\ell,h)}$, the attention mass assigned to the negative candidate token $t_n$ against the mass assigned to the positive candidate token $t_p$, with $k$ bounding the instruction prefix. The per-head NAS is averaged over a dataset to rank heads by negative-bias propensity (Yu et al., 31 Jul 2024).
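A schematic form consistent with this description is sketched below; it treats the candidate tokens as attention targets (keys), averages over query positions $q$ beyond the instruction prefix, and may differ from the exact aggregation used in the source paper.

```latex
\mathrm{NAS}^{(\ell,h)}
  = \frac{1}{\lvert \{\, q : q > k \,\} \rvert}
    \sum_{q > k}
    \Bigl( A^{(\ell,h)}_{q,\,t_n} - A^{(\ell,h)}_{q,\,t_p} \Bigr)
```

Large positive values indicate a head that routes attention toward the negative candidate regardless of the instruction content.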
In CLIP-style vision transformers, negative heads are those whose per-head representation reduces the logit score for the correct class in the presence of spurious confounders. The contribution of a head is measured as its direct per-head effect on the correct-class logit, averaged over negatively spurious examples; heads whose average contribution is negative are flagged (Yeo et al., 23 May 2025).
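Using the standard logit-lens decomposition of CLIP's image representation into additive per-head terms $c_{\ell,h}(x)$ (the notation here is illustrative rather than the paper's), the measured quantity can be written as an expectation over the set $\mathcal{D}^{-}$ of negatively spurious examples:

```latex
s_{\ell,h} = \mathbb{E}_{x \in \mathcal{D}^{-}}
  \bigl\langle\, c_{\ell,h}(x),\; t_{y(x)} \,\bigr\rangle ,
\qquad
s_{\ell,h} < 0 \;\Rightarrow\; \text{head } (\ell,h) \text{ is negative}
```

Here $t_{y(x)}$ denotes the text embedding of the correct class for example $x$.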
In retrieval settings, a negative head shows a low contrastive score, allocating more attention to irrelevant (hard-negative) documents than to relevant ones. The CoRe score formalizes this by contrasting the attention mass a head directs toward the relevant document with the mass it directs toward hard negatives; a low CoRe score marks the head as detrimental to ranking (Tran et al., 2 Oct 2025).
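One schematic instantiation, with $q$ the scoring (query) position, $\mathcal{T}^{+}$ the token span of the relevant document, and $\mathcal{T}^{-}$ the spans of the hard-negative documents (the source may use a ratio or a different aggregation), is:

```latex
\mathrm{CoRe}^{(\ell,h)}
  = \sum_{j \in \mathcal{T}^{+}} A^{(\ell,h)}_{q,\,j}
  \;-\;
  \sum_{j \in \mathcal{T}^{-}} A^{(\ell,h)}_{q,\,j}
```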
In general transformer modeling, Causal Head Gating (CHG) identifies interfering (negative) heads as those whose ablation decreases model loss: writing $\Delta\mathcal{L}_h = \mathcal{L}_{\text{ablated}(h)} - \mathcal{L}_{\text{full}}$ for the change in loss when head $h$ is removed, heads with $\Delta\mathcal{L}_h < 0$ are classified as negative or interfering (Nam et al., 19 May 2025).
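CHG can be pictured as attaching a learnable gate $g_{\ell,h} \in [0,1]$ to each head's contribution to its attention block; the sketch below captures the gating and the sparsity pressure without reproducing the exact parameterization or differential regularization of Nam et al.:

```latex
\mathrm{MHA}^{(\ell)}(x)
  = \sum_{h} g_{\ell,h}\, W_O^{(\ell,h)}\, \mathrm{head}^{(\ell,h)}(x),
\qquad
\min_{g}\; \mathcal{L}_{\text{task}} + \lambda \sum_{\ell,h} \lvert g_{\ell,h} \rvert
```

Heads whose gates can be driven to zero while the task loss decreases act as interfering heads; heads whose closure raises the loss are facilitating, and the remainder are irrelevant.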
2. Systematic Detection and Analysis
Identification of negative heads proceeds via quantitative metrics and ablation studies. The analysis workflow typically involves the following steps (a minimal code sketch follows the list):
- Dataset-level aggregation: Compute per-sample and per-dataset scores (NAS, CoRe, or logit-contribution) for each head.
- Ranking and thresholding: Select heads with the highest negative bias or lowest task-contribution. For NAS, the top-200 heads with stable scores across >90% of samples are selected as negative (Yu et al., 31 Jul 2024).
- Ablation impact validation: The effects of head ablation are empirically evaluated to confirm their negative contribution (e.g., via improvement in recall, mean-recall, F1, or worst-group accuracy).
- Domain and instruction independence: Cross-domain overlap of 74–80% among the identified negative heads indicates that they are largely query-agnostic (Yu et al., 31 Jul 2024).
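The sketch below illustrates this workflow under stated assumptions: per-head scores have already been computed into an array, the stability criterion is approximated by a simple threshold, and `eval_fn`/`ablate_fn` are hypothetical helpers rather than library APIs.

```python
import numpy as np

def rank_negative_heads(scores, score_thresh=0.0, stability_thresh=0.9, top_k=200):
    """Rank heads by a per-sample negativity score such as NAS.

    scores: array of shape (num_samples, num_layers, num_heads); higher
    means more negatively biased. A head is kept only if its score exceeds
    `score_thresh` on at least `stability_thresh` of samples (a stand-in
    for the >90% stability criterion above); the surviving heads, ordered
    by mean score, are returned as (layer, head) pairs.
    """
    mean_score = scores.mean(axis=0)                                  # (L, H)
    stable = (scores > score_thresh).mean(axis=0) >= stability_thresh
    candidates = [tuple(lh) for lh in np.argwhere(stable)]
    candidates.sort(key=lambda lh: -mean_score[lh])
    return candidates[:top_k]

def validate_by_ablation(model, heads, eval_fn, ablate_fn):
    """Confirm negativity by ablation: a head counts as negative only if
    zeroing it improves the task metric (recall, F1, worst-group accuracy).
    `eval_fn(model)` returns the metric; `ablate_fn(model, layer, head)`
    returns a context manager that temporarily zeroes the head's output.
    """
    baseline = eval_fn(model)
    confirmed = []
    for layer, head in heads:
        with ablate_fn(model, layer, head):
            if eval_fn(model) > baseline:
                confirmed.append((layer, head))
    return confirmed
```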
3. Model- and Modality-Specific Manifestations
Binary Decision Tasks in LLMs
Negative heads in LLMs are responsible for negative bias in binary decisions (e.g., "Yes/No," "True/False"). These heads—often stable across diverse question types—drive models to overpredict negative responses, leading to a large precision-recall gap (high precision, low recall for the positive label). Realigning negative heads via fine-grained tuning directly reduces negative bias and improves calibration (Yu et al., 31 Jul 2024).
Retrieval and Ranking
In LLM-powered attention-based rerankers, negative heads fail to discriminate relevant from irrelevant documents. Filtering or de-emphasizing these heads, as measured by the CoRe score, leads to marked gains in nDCG@10, with optimal performance achieved when only the top 1% of heads (the CoRe set) are used. These informative heads are concentrated in intermediate model layers (Tran et al., 2 Oct 2025).
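As an illustration of the head-selection step in this reranking setting, the sketch below scores candidate documents using attention mass only from a chosen head set; the tensor layout, the use of the final position as the scoring query, and the span handling are assumptions, not the paper's implementation.

```python
import torch

def rerank_with_selected_heads(attn: torch.Tensor, doc_spans, selected_heads):
    """Score candidate documents with attention from selected heads only.

    attn:           attention weights of shape (layers, heads, seq, seq) for a
                    prompt that concatenates the query and candidate documents.
    doc_spans:      list of (start, end) token ranges, one per candidate.
    selected_heads: iterable of (layer, head) pairs with high contrastive
                    scores; all other heads are ignored.
    Returns candidate indices ordered by aggregated attention mass from the
    final (scoring) position, plus the raw scores.
    """
    q = attn.shape[-2] - 1  # final position acts as the scoring query
    scores = []
    for start, end in doc_spans:
        mass = sum(attn[l, h, q, start:end].sum() for l, h in selected_heads)
        scores.append(float(mass))
    order = sorted(range(len(doc_spans)), key=lambda i: -scores[i])
    return order, scores
```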
Vision–Language and Multimodal Models
In CLIP’s Vision Transformer, negative heads either encode confounder-specific circuits (the heads targeted by LTC) or degrade representation quality for retrieval and classification (the heads targeted by AAT). Mean ablation or contrastive manipulation of these heads, combined with targeted reinforcement of salient ones, eliminates spurious bias and improves worst-group accuracy by 23–50 percentage points while enhancing generalization (Yeo et al., 23 May 2025, Lin et al., 1 Jul 2025).
4. Targeted Correction and Suppression Techniques
Multiple correction strategies for negative heads have been advanced:
- Negative Attention Score Alignment (NASA) (Yu et al., 31 Jul 2024): A lightweight, sequential fine-tuning procedure that targets only the Q/K matrices of negative heads, using contrastively recast training data; early stopping and validation-based checks prevent over-correction (a parameter-masking sketch follows this list).
- Locate-Then-Correct (LTC) (Yeo et al., 23 May 2025): Mechanistic identification of spurious heads (via logit-lens decomposition and contrastive group analysis), followed by mean ablation of spurious heads and orthogonal projection-based injection of class-discriminative signals into salient heads—all without finetuning.
- Attention Ablation Technique (AAT) (Lin et al., 1 Jul 2025): Genetic-algorithm-based or backpropagation-based optimization selects negative heads, which are then masked during inference by suppressing their attention weights over non-CLS tokens (a masking sketch follows the summary table below). The approach is effective in edge, low-data, and large-scale scenarios.
- Causal Head Gating (CHG) (Nam et al., 19 May 2025): Learnable gating for all heads (with differential L1 regularization), classifying heads into facilitating, interfering (negative), or irrelevant based on their isolated effect on model loss.
- Contrastive CoRe selection/pruning (Tran et al., 2 Oct 2025): Aggregate attention only from high-contrastive-score heads, optionally pruning late layers for efficiency with minimal performance degradation.
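The sketch below shows one way to restrict fine-tuning to the Q/K slices of flagged heads, in the spirit of NASA; the LLaMA-style Hugging Face module path (`model.model.layers[i].self_attn.q_proj`), the assumption of standard multi-head attention (no grouped-query layout), and the gradient-masking trick are illustrative choices rather than the paper's code.

```python
import torch

def restrict_qk_updates(model, negative_heads, head_dim):
    """Freeze the model, then re-enable only the Q/K rows of flagged heads.

    negative_heads: dict mapping layer index -> set of head indices.
    All parameters are frozen; for each affected layer, q_proj/k_proj are
    made trainable again and a gradient hook zeroes the rows that do not
    belong to a flagged head, so only those heads' Q/K slices are updated.
    """
    for p in model.parameters():
        p.requires_grad_(False)

    for layer_idx, heads in negative_heads.items():
        attn = model.model.layers[layer_idx].self_attn   # assumed module path
        for proj in (attn.q_proj, attn.k_proj):
            w = proj.weight                  # (num_heads * head_dim, hidden)
            w.requires_grad_(True)
            mask = torch.zeros_like(w)
            for h in heads:
                mask[h * head_dim:(h + 1) * head_dim, :] = 1.0
            # zero gradients outside the flagged heads' output rows
            w.register_hook(lambda g, m=mask: g * m)
    return model
```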
| Method | Target Context | Correction Mechanism |
|---|---|---|
| NASA | LLMs, binary tasks | Q/K fine-tune on negative heads |
| LTC | ViT-CLIP | Mean-ablation, orthogonal projection |
| AAT | ViT-CLIP | Attention suppression (masking) |
| CHG | LLMs, general | Soft gating + ablation |
| CoRe | LLM rerankers | Head selection + layer pruning |
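For comparison, AAT-style suppression can be sketched as a post-softmax mask applied inside a ViT layer; the tensor layout, the CLS position, and the choice to skip renormalization are assumptions of this sketch, not the published implementation.

```python
import torch

def aat_suppress_heads(attn_weights: torch.Tensor, negative_heads, cls_index=0):
    """Zero a negative head's attention to all non-CLS tokens at inference.

    attn_weights:   post-softmax attention of one ViT layer,
                    shape (batch, num_heads, seq, seq).
    negative_heads: head indices flagged as negative in this layer.
    cls_index:      position of the CLS token in the sequence.
    Attention paid to the CLS key is preserved; everything else is
    suppressed, so a flagged head contributes little beyond the CLS summary.
    """
    out = attn_weights.clone()
    keep_cls = torch.zeros_like(out[:, 0])   # (batch, seq, seq)
    keep_cls[:, :, cls_index] = 1.0          # keep only attention to the CLS key
    for h in negative_heads:
        out[:, h] = out[:, h] * keep_cls
    return out
```

In practice such a mask would be applied inside the attention forward pass of every layer that contains flagged heads, for example via a forward hook.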
5. Empirical Impact and Observed Patterns
Eliminating negative attention heads consistently yields:
- Reduced negative bias: Models become less over-cautious; recall and calibration improve, and F1 rises (e.g., by up to +0.15 in LLM binary tasks) (Yu et al., 31 Jul 2024).
- Enhanced robustness and fairness: Post-correction, worst-group accuracy improves by >50% in bias-sensitive CLIP benchmarks (Yeo et al., 23 May 2025).
- Improved retrieval accuracy: Restricting to high-contrastive-score (CoRe) heads boosts nDCG@10 by up to +4 points; best performance is attained with 1–2% of heads (Tran et al., 2 Oct 2025).
- Efficiency and parsimony: Results show that relatively few negative heads exist consistently (e.g., 0.1–1% always interfering in Llama 3), and only hundreds of parameters need adjustment for substantial gains (Nam et al., 19 May 2025, Yu et al., 31 Jul 2024).
- Layer dependence: Negative heads typically cluster in the shallow and deep layers of vision models, whereas in retrieval models the mid-layer heads are most informative for ranking and the early/late layers more often host redundant or detrimental heads (Lin et al., 1 Jul 2025, Tran et al., 2 Oct 2025).
6. Insights, Generalization, and Limitations
- Query and data agnosticism: Negative heads often manifest independently of the specific input, revealing global, model-internal failure modes or inductive biases (Yu et al., 31 Jul 2024).
- Domain and data drift: Some negative heads encode spurious domain-specific features that degrade out-of-domain generalization (Lin et al., 1 Jul 2025).
- Scalability and practical application: Correction approaches such as AAT-GA and LTC are training-free or require only minimal data/compute, making them suitable for large models and on-device deployment (Yeo et al., 23 May 2025, Lin et al., 1 Jul 2025).
- Limitation on root cause understanding: The origins of persistent negative heads remain unclear, with hypotheses including capacity bottleneck, optimization artifacts, or inductive alignment failures. Current correction methods are effective but not explanatory (Yu et al., 31 Jul 2024).
- Scope of applicability: Existing techniques predominantly address binary, classification, or retrieval tasks; extension to free-form, multi-span, or generative outputs is an open direction (Yu et al., 31 Jul 2024).
7. Relationship to Attention Diversity and Feature Collapse
Negative attention heads are conceptually distinct from generic redundancy or “attention collapse” but overlap in their detrimental effects. While classical diversity methods (e.g., regularization, repulsive attention) aim to spread features and prevent redundancy (An et al., 2020), negative head suppression is specifically concerned with heads that exert harmful, bias-amplifying, or confounder-propagating effects. Direct suppression or ablation (as in CHG or AAT) can therefore be viewed as a complementary procedure to diversity-promoting regularization, improving both representation quality and model trustworthiness. Empirical findings indicate that such targeted correction yields more pronounced and interpretable gains than global regularization alone (Yeo et al., 23 May 2025, Lin et al., 1 Jul 2025, Tran et al., 2 Oct 2025, Yu et al., 31 Jul 2024).