Negative Attention Heads in Transformers
- Negative attention heads are transformer components that systematically degrade task performance or amplify biases and spurious correlations.
- Metrics such as NAS, CoRe, and CHG quantitatively identify these heads across language, vision, and retrieval models, with ablation studies confirming their negative contribution.
- Targeted correction methods including NASA, LTC, and AAT effectively mitigate negative heads, enhancing accuracy, robustness, and fairness.
Negative attention heads are individual attention mechanisms within transformer-based models whose contributions systematically degrade model performance or amplify undesirable biases. These heads have been implicated in several phenomena, including negative output bias in LLMs, the amplification of spurious or confounding correlations in vision-language models, and the impairment of retrieval or ranking accuracy. Recent research has produced formal definitions, detection methodologies, and targeted correction techniques for negative attention heads across text, vision, and retrieval tasks.
1. Formal Definitions and Metrics
Negative attention heads are defined via their effect on task performance or bias metrics. In LLMs, a head is negative if it exhibits a persistent, instruction-agnostic preference for negative (e.g., "No") tokens in binary decision tasks. The Negative Attention Score (NAS) quantifies this preference by comparing, within a head's attention matrix $A^{(\ell,h)}$, the attention mass assigned to the negative candidate token $t_n$ against the mass assigned to the positive candidate token $t_p$, with $k$ bounding the instruction prefix. The per-head NAS is averaged over a dataset to rank heads by negative-bias propensity (Yu et al., 31 Jul 2024).
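A schematic form consistent with this description is sketched below; it treats the candidate tokens as attention targets (keys), averages over query positions $q$ beyond the instruction prefix, and may differ from the exact aggregation used in the source paper.

```latex
\mathrm{NAS}^{(\ell,h)}
  = \frac{1}{\lvert \{\, q : q > k \,\} \rvert}
    \sum_{q > k}
    \Bigl( A^{(\ell,h)}_{q,\,t_n} - A^{(\ell,h)}_{q,\,t_p} \Bigr)
```

Large positive values indicate a head that routes attention toward the negative candidate regardless of the instruction content.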
In CLIP-style vision transformers, negative heads are those whose per-head representation reduces the logit score for the correct class in the presence of spurious confounders. The contribution of a head is measured as its direct per-head effect on the correct-class logit, averaged over negatively spurious examples; heads whose average contribution is negative are flagged (Yeo et al., 23 May 2025).
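Using the standard logit-lens decomposition of CLIP's image representation into additive per-head terms $c_{\ell,h}(x)$ (the notation here is illustrative rather than the paper's), the measured quantity can be written as an expectation over the set $\mathcal{D}^{-}$ of negatively spurious examples:

```latex
s_{\ell,h} = \mathbb{E}_{x \in \mathcal{D}^{-}}
  \bigl\langle\, c_{\ell,h}(x),\; t_{y(x)} \,\bigr\rangle ,
\qquad
s_{\ell,h} < 0 \;\Rightarrow\; \text{head } (\ell,h) \text{ is negative}
```

Here $t_{y(x)}$ denotes the text embedding of the correct class for example $x$.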
In retrieval settings, a negative head shows a low contrastive score, allocating more attention to irrelevant (hard-negative) documents than to relevant ones. The CoRe score formalizes this by contrasting the attention mass a head directs toward the relevant document with the mass it directs toward hard negatives; a low CoRe score marks the head as detrimental to ranking (Tran et al., 2 Oct 2025).
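One schematic instantiation, with $q$ the scoring (query) position, $\mathcal{T}^{+}$ the token span of the relevant document, and $\mathcal{T}^{-}$ the spans of the hard-negative documents (the source may use a ratio or a different aggregation), is:

```latex
\mathrm{CoRe}^{(\ell,h)}
  = \sum_{j \in \mathcal{T}^{+}} A^{(\ell,h)}_{q,\,j}
  \;-\;
  \sum_{j \in \mathcal{T}^{-}} A^{(\ell,h)}_{q,\,j}
```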
In general transformer modeling, Causal Head Gating (CHG) identifies interfering (negative) heads as those whose ablation decreases model loss: writing $\Delta\mathcal{L}_h = \mathcal{L}_{\text{ablated}(h)} - \mathcal{L}_{\text{full}}$ for the change in loss when head $h$ is removed, heads with $\Delta\mathcal{L}_h < 0$ are classified as negative or interfering (Nam et al., 19 May 2025).
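CHG can be pictured as attaching a learnable gate $g_{\ell,h} \in [0,1]$ to each head's contribution to its attention block; the sketch below captures the gating and the sparsity pressure without reproducing the exact parameterization or differential regularization of Nam et al.:

```latex
\mathrm{MHA}^{(\ell)}(x)
  = \sum_{h} g_{\ell,h}\, W_O^{(\ell,h)}\, \mathrm{head}^{(\ell,h)}(x),
\qquad
\min_{g}\; \mathcal{L}_{\text{task}} + \lambda \sum_{\ell,h} \lvert g_{\ell,h} \rvert
```

Heads whose gates can be driven to zero while the task loss decreases act as interfering heads; heads whose closure raises the loss are facilitating, and the remainder are irrelevant.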
2. Systematic Detection and Analysis
Identification of negative heads proceeds via quantitative metrics and ablation studies. The analysis workflow typically involves the following steps (a minimal code sketch follows the list):
- Dataset-level aggregation: Compute per-sample and per-dataset scores (NAS, CoRe, or logit-contribution) for each head.
- Ranking and thresholding: Select heads with the highest negative bias or lowest task-contribution. For NAS, the top-200 heads with stable scores across >90% of samples are selected as negative (Yu et al., 31 Jul 2024).
- Ablation impact validation: The effects of head ablation are empirically evaluated to confirm their negative contribution (e.g., via improvement in recall, mean-recall, F1, or worst-group accuracy).
- Domain and instruction independence: Cross-domain overlap of 74–80% among the identified negative heads indicates that they are largely query-agnostic (Yu et al., 31 Jul 2024).
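The sketch below illustrates this workflow under stated assumptions: per-head scores have already been computed into an array, the stability criterion is approximated by a simple threshold, and `eval_fn`/`ablate_fn` are hypothetical helpers rather than library APIs.

```python
import numpy as np

def rank_negative_heads(scores, score_thresh=0.0, stability_thresh=0.9, top_k=200):
    """Rank heads by a per-sample negativity score such as NAS.

    scores: array of shape (num_samples, num_layers, num_heads); higher
    means more negatively biased. A head is kept only if its score exceeds
    `score_thresh` on at least `stability_thresh` of samples (a stand-in
    for the >90% stability criterion above); the surviving heads, ordered
    by mean score, are returned as (layer, head) pairs.
    """
    mean_score = scores.mean(axis=0)                                  # (L, H)
    stable = (scores > score_thresh).mean(axis=0) >= stability_thresh
    candidates = [tuple(lh) for lh in np.argwhere(stable)]
    candidates.sort(key=lambda lh: -mean_score[lh])
    return candidates[:top_k]

def validate_by_ablation(model, heads, eval_fn, ablate_fn):
    """Confirm negativity by ablation: a head counts as negative only if
    zeroing it improves the task metric (recall, F1, worst-group accuracy).
    `eval_fn(model)` returns the metric; `ablate_fn(model, layer, head)`
    returns a context manager that temporarily zeroes the head's output.
    """
    baseline = eval_fn(model)
    confirmed = []
    for layer, head in heads:
        with ablate_fn(model, layer, head):
            if eval_fn(model) > baseline:
                confirmed.append((layer, head))
    return confirmed
```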
3. Model- and Modality-Specific Manifestations
Binary Decision Tasks in LLMs
Negative heads in LLMs are responsible for negative bias in binary decisions (e.g., "Yes/No," "True/False"). These heads—often stable across diverse question types—drive models to overpredict negative responses, leading to a large precision-recall gap (high precision, low recall for the positive label). Realigning negative heads via fine-grained tuning directly reduces negative bias and improves calibration (Yu et al., 31 Jul 2024).
Retrieval and Ranking
In LLM-powered attention-based rerankers, negative heads fail to discriminate relevant from irrelevant documents. Filtering or de-emphasizing these heads, as measured by the CoRe score, leads to marked gains in nDCG@10, with optimal performance achieved when only the top 1% of heads (the CoRe set) are used. These informative heads are concentrated in intermediate model layers (Tran et al., 2 Oct 2025).
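As an illustration of the head-selection step in this reranking setting, the sketch below scores candidate documents using attention mass only from a chosen head set; the tensor layout, the use of the final position as the scoring query, and the span handling are assumptions, not the paper's implementation.

```python
import torch

def rerank_with_selected_heads(attn: torch.Tensor, doc_spans, selected_heads):
    """Score candidate documents with attention from selected heads only.

    attn:           attention weights of shape (layers, heads, seq, seq) for a
                    prompt that concatenates the query and candidate documents.
    doc_spans:      list of (start, end) token ranges, one per candidate.
    selected_heads: iterable of (layer, head) pairs with high contrastive
                    scores; all other heads are ignored.
    Returns candidate indices ordered by aggregated attention mass from the
    final (scoring) position, plus the raw scores.
    """
    q = attn.shape[-2] - 1  # final position acts as the scoring query
    scores = []
    for start, end in doc_spans:
        mass = sum(attn[l, h, q, start:end].sum() for l, h in selected_heads)
        scores.append(float(mass))
    order = sorted(range(len(doc_spans)), key=lambda i: -scores[i])
    return order, scores
```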
Vision–Language and Multimodal Models
In CLIP’s Vision Transformer, negative heads either encode confounder-specific circuits (the heads targeted by LTC) or degrade representation quality for retrieval and classification (the heads targeted by AAT). Mean ablation or contrastive manipulation of these heads, combined with targeted reinforcement of salient ones, eliminates spurious bias and improves worst-group accuracy by 23–50 percentage points while enhancing generalization (Yeo et al., 23 May 2025, Lin et al., 1 Jul 2025).
4. Targeted Correction and Suppression Techniques
Multiple correction strategies for negative heads have been advanced:
- Negative Attention Score Alignment (NASA) (Yu et al., 31 Jul 2024): A lightweight, sequential fine-tuning procedure that targets only the Q/K matrices of negative heads, using contrastively recast training data; early stopping and validation-based checks prevent over-correction (a parameter-masking sketch follows this list).
- Locate-Then-Correct (LTC) (Yeo et al., 23 May 2025): Mechanistic identification of spurious heads (via logit-lens decomposition and contrastive group analysis), followed by mean ablation of spurious heads and orthogonal projection-based injection of class-discriminative signals into salient heads—all without finetuning.
- Attention Ablation Technique (AAT) (Lin et al., 1 Jul 2025): Genetic-algorithm-based or backpropagation-based optimization selects negative heads, which are then masked during inference by suppressing their attention weights over non-CLS tokens (a masking sketch follows the summary table below). The approach is effective in edge, low-data, and large-scale scenarios.
- Causal Head Gating (CHG) (Nam et al., 19 May 2025): Learnable gating for all heads (with differential L1 regularization), classifying heads into facilitating, interfering (negative), or irrelevant based on their isolated effect on model loss.
- Contrastive CoRe selection/pruning (Tran et al., 2 Oct 2025): Aggregate attention only from high-contrastive-score heads, optionally pruning late layers for efficiency with minimal performance degradation.
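The sketch below shows one way to restrict fine-tuning to the Q/K slices of flagged heads, in the spirit of NASA; the LLaMA-style Hugging Face module path (`model.model.layers[i].self_attn.q_proj`), the assumption of standard multi-head attention (no grouped-query layout), and the gradient-masking trick are illustrative choices rather than the paper's code.

```python
import torch

def restrict_qk_updates(model, negative_heads, head_dim):
    """Freeze the model, then re-enable only the Q/K rows of flagged heads.

    negative_heads: dict mapping layer index -> set of head indices.
    All parameters are frozen; for each affected layer, q_proj/k_proj are
    made trainable again and a gradient hook zeroes the rows that do not
    belong to a flagged head, so only those heads' Q/K slices are updated.
    """
    for p in model.parameters():
        p.requires_grad_(False)

    for layer_idx, heads in negative_heads.items():
        attn = model.model.layers[layer_idx].self_attn   # assumed module path
        for proj in (attn.q_proj, attn.k_proj):
            w = proj.weight                  # (num_heads * head_dim, hidden)
            w.requires_grad_(True)
            mask = torch.zeros_like(w)
            for h in heads:
                mask[h * head_dim:(h + 1) * head_dim, :] = 1.0
            # zero gradients outside the flagged heads' output rows
            w.register_hook(lambda g, m=mask: g * m)
    return model
```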
| Method | Target Context | Correction Mechanism |
|---|---|---|
| NASA | LLMs, binary tasks | Q/K fine-tune on negative heads |
| LTC | ViT-CLIP | Mean-ablation, orthogonal projection |
| AAT | ViT-CLIP | Attention suppression (masking) |
| CHG | LLMs, general | Soft gating + ablation |
| CoRe | LLM rerankers | Head selection + layer pruning |
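For comparison, AAT-style suppression can be sketched as a post-softmax mask applied inside a ViT layer; the tensor layout, the CLS position, and the choice to skip renormalization are assumptions of this sketch, not the published implementation.

```python
import torch

def aat_suppress_heads(attn_weights: torch.Tensor, negative_heads, cls_index=0):
    """Zero a negative head's attention to all non-CLS tokens at inference.

    attn_weights:   post-softmax attention of one ViT layer,
                    shape (batch, num_heads, seq, seq).
    negative_heads: head indices flagged as negative in this layer.
    cls_index:      position of the CLS token in the sequence.
    Attention paid to the CLS key is preserved; everything else is
    suppressed, so a flagged head contributes little beyond the CLS summary.
    """
    out = attn_weights.clone()
    keep_cls = torch.zeros_like(out[:, 0])   # (batch, seq, seq)
    keep_cls[:, :, cls_index] = 1.0          # keep only attention to the CLS key
    for h in negative_heads:
        out[:, h] = out[:, h] * keep_cls
    return out
```

In practice such a mask would be applied inside the attention forward pass of every layer that contains flagged heads, for example via a forward hook.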
5. Empirical Impact and Observed Patterns
Eliminating negative attention heads consistently yields:
- Reduced negative bias: Models become less over-cautious; recall and calibration improve, and F1 rises (e.g., by up to +0.15 in LLM binary tasks) (Yu et al., 31 Jul 2024).
- Enhanced robustness and fairness: Post-correction, worst-group accuracy improves by >50% in bias-sensitive CLIP benchmarks (Yeo et al., 23 May 2025).
- Improved retrieval accuracy: Restricting to high-contrastive-score (CoRe) heads boosts nDCG@10 by up to +4 points; best performance is attained with 1–2% of heads (Tran et al., 2 Oct 2025).
- Efficiency and parsimony: Results show that relatively few negative heads exist consistently (e.g., 0.1–1% always interfering in Llama 3), and only hundreds of parameters need adjustment for substantial gains (Nam et al., 19 May 2025, Yu et al., 31 Jul 2024).
- Layer dependence: Negative heads typically cluster in the shallow and deep layers of vision models, whereas in retrieval models the mid-layer heads are most informative for ranking and the early/late layers more often host redundant or detrimental heads (Lin et al., 1 Jul 2025, Tran et al., 2 Oct 2025).
6. Insights, Generalization, and Limitations
- Query and data agnosticism: Negative heads often manifest independently of the specific input, revealing global, model-internal failure modes or inductive biases (Yu et al., 31 Jul 2024).
- Domain and data drift: Some negative heads encode spurious domain-specific features that degrade out-of-domain generalization (Lin et al., 1 Jul 2025).
- Scalability and practical application: Correction approaches such as AAT-GA and LTC are training-free or require only minimal data/compute, making them suitable for large models and on-device deployment (Yeo et al., 23 May 2025, Lin et al., 1 Jul 2025).
- Limitation on root cause understanding: The origins of persistent negative heads remain unclear, with hypotheses including capacity bottleneck, optimization artifacts, or inductive alignment failures. Current correction methods are effective but not explanatory (Yu et al., 31 Jul 2024).
- Scope of applicability: Existing techniques predominantly address binary, classification, or retrieval tasks; extension to free-form, multi-span, or generative outputs is an open direction (Yu et al., 31 Jul 2024).
7. Relationship to Attention Diversity and Feature Collapse
Negative attention heads are conceptually distinct from generic redundancy or “attention collapse” but overlap in their detrimental effects. While classical diversity methods (e.g., regularization, repulsive attention) aim to spread features and prevent redundancy (An et al., 2020), negative head suppression is specifically concerned with heads that exert harmful, bias-amplifying, or confounder-propagating effects. Direct suppression or ablation (as in CHG or AAT) can therefore be viewed as a complementary procedure to diversity-promoting regularization, improving both representation quality and model trustworthiness. Empirical findings indicate that such targeted correction yields more pronounced and interpretable gains than global regularization alone (Yeo et al., 23 May 2025, Lin et al., 1 Jul 2025, Tran et al., 2 Oct 2025, Yu et al., 31 Jul 2024).