Attention-Debiased Pruning (AdaTP)
- The paper introduces AdaTP, a post-training pruning method that recalibrates attention scores to address positional, semantic, and redundancy biases, achieving up to 88.9% FLOP reduction.
- AdaTP leverages a dual-stage token selection process—pivot selection followed by maximal independent set—to ensure structural diversity and maintain high performance on challenging localization tasks.
- The framework also includes attention head pruning using surrogate models and multi-objective optimization to mitigate social biases while sustaining model utility.
Attention-Debiased Pruning (AdaTP) comprises a family of post-training pruning techniques targeting attention-based overparameterization and bias artifacts in large language and multimodal models. These methods correct for quantifiable attention biases during inference, yielding substantial reductions in FLOPs and memory requirements while maintaining high task fidelity—particularly in challenging settings such as long-context video or fine-grained localization. AdaTP has been realized in both token-pruning algorithms for visual and video LLMs and attention head-pruning for bias mitigation in LLMs, with each instantiation tailored to address the specific structure of attention-induced redundancy and bias.
1. Motivation and Problem Setting
Modern multimodal and video LLMs encode visual or video inputs into long token sequences, incurring prohibitive computational overhead at inference time. Conventional token pruning pipelines, which select tokens for retention based on raw attention scores, reveal systematic biases:
- Positional and border bias: Raw attention scores are dominated by artifacts that favor specific positions (e.g., borders or the ends of sequences) independent of semantic relevance (Zhang et al., 22 Dec 2025, Sun et al., 26 May 2025).
- Local redundancy: Attention is often over-concentrated on the same spatial locations (patches) across frames or images, reducing diversity in the retained token set (Sun et al., 26 May 2025).
- Spatial and semantic redundancy: Importance-based methods ignore structural redundancy, preserving clusters of highly similar or spatially adjacent tokens (Zhang et al., 22 Dec 2025).
Furthermore, in language-only LLMs, undesirable social biases (e.g., gender and race) can be perpetuated by certain attention heads. Prior methods that prune attention heads or neurons post hoc for fairness lack principled bias quantification and balancing with model utility (Dasu et al., 20 Mar 2025).
AdaTP directly addresses these issues by quantifying and removing attention biases, either in the selection of informative tokens (vision, video) or in the pruning of attention heads (language), yielding efficient models that retain both utility and fairness.
2. Debiasing Methodologies
2.1 Attention Score Debiasing for Visual and Video LLMs
In image and video LLMs, AdaTP computes token importance using attention maps from the model's token–token self-attention modules. The process includes:
- Bias estimation: A position-dependent prior is generated by averaging attention weights across 1,000 random images under a neutral prompt (Zhang et al., 22 Dec 2025). In video LLMs, global and local bias metrics are explicitly defined and visualized (Sun et al., 26 May 2025).
- Debiasing transformation: At inference, the raw attention score a_i for token i is divided by the corresponding positional prior b_i and stabilized via a small constant ε:

  ã_i = a_i / (b_i + ε)

  The resulting debiased attention score ã_i is used for token ranking.
- Global and local modules: In video LLMs, AdaTP implements distinct debiasing modules. The global module allocates higher retention in text-relevant segments (based on cosine similarity between frame- and prompt-features), and the local module deduplicates tokens at the same patch position across frames (Algorithm 1 in (Sun et al., 26 May 2025)).
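The debiasing transformation above can be sketched as follows; the array shapes and the toy prior values are illustrative assumptions, not taken from the papers:

```python
import numpy as np

def debiased_scores(attn, prior, eps=1e-6):
    """Divide raw attention mass by a position-dependent prior.

    attn  : (N,) raw attention received by each visual token
    prior : (N,) positional bias, estimated offline by averaging
            attention over random images under a neutral prompt
    eps   : small constant stabilizing low-prior positions
    """
    return attn / (prior + eps)

def top_k_tokens(attn, prior, k, eps=1e-6):
    """Rank tokens by debiased score and keep the k best."""
    scores = debiased_scores(attn, prior, eps)
    return np.argsort(scores)[::-1][:k]

# toy numbers: token 0 (a border patch) receives inflated raw attention,
# but its large prior cancels the positional artifact
attn = np.array([0.50, 0.30, 0.15, 0.05])
prior = np.array([0.60, 0.20, 0.10, 0.10])
keep = top_k_tokens(attn, prior, k=2)   # tokens 1 and 2, not the border token
```

Dividing by the prior demotes the border token despite its high raw score, which is the core of the positional-bias correction.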
2.2 Hybrid Graph and Structural Diversity
To address structural redundancy, AdaTP constructs a hybrid undirected graph over the N tokens:
- Semantic similarity: Edges encode cosine similarity among token features, normalized linearly across the batch.
- Spatial proximity: An 8-connected grid encodes adjacency among spatial patches.
- Graph fusion: A weighted sum of the semantic and spatial graphs, followed by thresholding, yields the final adjacency (Zhang et al., 22 Dec 2025).
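A minimal sketch of the hybrid graph construction, assuming a row-major patch layout; the fusion weight and threshold values are illustrative placeholders, not the papers' settings:

```python
import numpy as np

def hybrid_graph(features, grid_h, grid_w, alpha=0.5, tau=0.6):
    """Fuse semantic similarity with 8-connected spatial adjacency.

    features : (N, d) token features laid out row-major on a grid_h x grid_w grid
    alpha    : fusion weight between semantic and spatial edges (assumed value)
    tau      : threshold producing the final boolean adjacency (assumed value)
    """
    n = grid_h * grid_w
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sem = f @ f.T                                             # cosine similarity
    sem = (sem - sem.min()) / (sem.max() - sem.min() + 1e-9)  # linear normalization

    spa = np.zeros((n, n))
    for i in range(n):
        r, c = divmod(i, grid_w)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == dc == 0:
                    continue
                rr, cc = r + dr, c + dc
                if 0 <= rr < grid_h and 0 <= cc < grid_w:
                    spa[i, rr * grid_w + cc] = 1.0            # 8-connected neighbor

    fused = alpha * sem + (1 - alpha) * spa                   # weighted graph fusion
    adj = fused >= tau                                        # threshold to adjacency
    np.fill_diagonal(adj, False)                              # no self-loops
    return adj
```

The resulting boolean adjacency feeds the pivot/MIS selection described in Section 3.1.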
2.3 Attention Head Pruning for Fairness in LLMs
In the context of LLM fairness, “Attention-Debiased Pruning” denotes the identification and pruning of attention heads that disproportionately contribute to bias:
- Surrogate models: Small neural networks are trained to predict bias and utility (perplexity) from an attention-head pruning mask (Dasu et al., 20 Mar 2025).
- Multi-objective optimization: Simulated annealing on the Boolean hypercube of head-masks seeks a trade-off minimizing a weighted sum of predicted bias and perplexity increases, with temperature scheduling and neighbor sampling guaranteeing coverage.
- Post-hoc pruning: The optimal mask is applied to deactivate heads; the procedure is entirely post-training and requires no model finetuning.
3. Pruning Algorithms and Architectural Schemes
3.1 Token Selection: Pivot + Maximal Independent Set
AdaTP’s token selection proceeds in two stages:
- Pivot selection: A fixed ratio of the highest-importance tokens is selected as "pivots." After each selection, both the pivot and its neighbors in the fused graph are removed from eligibility.
- MIS selection: The remaining tokens are iteratively chosen to maximize importance under the constraint that no two retained tokens are adjacent in the fused graph (greedy maximal independent set). This dual-stage procedure preserves both highly informative and structurally diverse tokens, ensuring spatial and semantic spread.
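The two-stage selection can be sketched as follows; the pivot ratio and budget are illustrative placeholders:

```python
import numpy as np

def select_tokens(scores, adj, budget, pivot_ratio=0.25):
    """Two-stage token selection: pivots first, then a greedy maximal
    independent set (MIS) over the fused graph.

    scores      : (N,) debiased importance per token
    adj         : (N, N) boolean fused adjacency (semantic + spatial)
    budget      : total number of tokens to keep
    pivot_ratio : share of the budget filled in the pivot stage (assumed value)
    """
    order = np.argsort(scores)[::-1]          # most important first
    eligible = np.ones(len(scores), dtype=bool)
    kept = []

    def take(limit):
        for i in order:
            if len(kept) >= limit:
                break
            if eligible[i]:
                kept.append(int(i))
                eligible[i] = False
                eligible[adj[i]] = False      # block graph neighbors

    take(int(pivot_ratio * budget))           # Stage 1: pivots
    take(budget)                              # Stage 2: greedy MIS on the rest
    return kept

# toy graph: tokens 0 and 1 are adjacent (redundant), so at most one survives
scores = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
adj = np.zeros((5, 5), dtype=bool)
adj[0, 1] = adj[1, 0] = True
kept = select_tokens(scores, adj, budget=3)   # token 1 is blocked by token 0
```

Because each kept token disqualifies its neighbors, no two retained tokens can be adjacent, which is exactly the independence constraint.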
3.2 AdaTP Pipeline for Video LLMs
In video models, AdaTP operates across layers:
- Progressive reduction: From the second layer through the deepest layer, the fraction of retained tokens is reduced according to a schedule, reaching its minimum at the final layer. Key hyperparameters include segmentation thresholds for frame similarity and text relevance, and a retention boost for significant segments.
- Segment-based retention: Significant (text-relevant) video segments retain a higher fraction of tokens than the remaining segments.
- Local deduplication: Within segments, retained tokens are deduplicated at each spatial patch position to eliminate local redundancy.
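The local deduplication step, together with a stand-in retention schedule, can be sketched as follows; the modulo-based patch indexing and the linear schedule are assumptions for illustration:

```python
def dedup_patches(token_ids, scores, frame_size):
    """Keep at most one token per spatial patch position across frames.

    token_ids  : indices into the flattened (num_frames * frame_size) sequence
    scores     : debiased importance aligned with token_ids
    frame_size : tokens per frame; patch position = id % frame_size (assumed layout)
    """
    best = {}
    for tid, s in zip(token_ids, scores):
        pos = tid % frame_size
        if pos not in best or s > best[pos][1]:
            best[pos] = (tid, s)          # retain the strongest token per patch
    return sorted(tid for tid, _ in best.values())

def layer_keep_fraction(layer, n_layers, start=1.0, end=0.1):
    """Stand-in linear schedule: full retention at layer 1, tapering to
    `end` at the deepest layer (the papers' exact schedule may differ)."""
    if layer <= 1:
        return start
    t = (layer - 1) / (n_layers - 1)
    return start + t * (end - start)

# tokens 0, 4, 8 occupy the same patch position in frames 0-2; one survives
kept = dedup_patches([0, 4, 8, 5], [0.9, 0.5, 0.7, 0.3], frame_size=4)
```

Deduplicating per patch position directly removes the local redundancy described above, since attention tends to re-select the same patch in every frame.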
3.3 Pruning Attention Heads with Fairness Constraints
For LLMs, AdaTP leverages the two surrogates to rapidly evaluate the effect of head-pruning masks on bias and perplexity. Simulated annealing searches for masks that minimize the combined cost under a fixed pruning bound; a candidate mask is accepted when it meets or beats the current cost, or probabilistically according to the annealing temperature.
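A minimal sketch of the annealing search, with a toy cost function standing in for the trained surrogates; the cooling schedule and single-flip neighborhood are illustrative choices:

```python
import math
import random

def anneal_masks(cost, n_heads, max_pruned, steps=2000, t0=1.0, t_end=0.01, seed=0):
    """Simulated annealing on the Boolean hypercube of head masks.

    cost       : callable mask -> weighted sum of surrogate-predicted bias
                 and perplexity increase (here a stand-in for the surrogates)
    max_pruned : upper bound on the number of deactivated heads
    """
    rng = random.Random(seed)
    mask = [1] * n_heads                          # 1 = head kept, 0 = pruned
    best = cur = cost(mask)
    best_mask = mask[:]
    for step in range(steps):
        t = t0 * (t_end / t0) ** (step / steps)   # geometric cooling
        cand = mask[:]
        cand[rng.randrange(n_heads)] ^= 1         # flip one head (neighbor)
        if n_heads - sum(cand) > max_pruned:      # respect the pruning bound
            continue
        c = cost(cand)
        if c <= cur or rng.random() < math.exp((cur - c) / t):
            mask, cur = cand, c                   # accept downhill, or uphill w.p.
            if c < best:
                best, best_mask = c, cand[:]
    return best_mask, best

# toy cost: pruning heads {0,1,2} ("biased" heads) lowers cost,
# pruning any other head raises it -- a stand-in for the surrogates
BIASED = {0, 1, 2}
def toy_cost(mask):
    return sum(mask[i] for i in BIASED) + sum(1 - m for i, m in enumerate(mask) if i not in BIASED)

best_mask, best = anneal_masks(toy_cost, n_heads=8, max_pruned=3)
```

Because the surrogates make each `cost` call cheap, the search can afford thousands of mask evaluations that would be prohibitive with direct model evaluation.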
4. Computational Complexity and Empirical Results
Complexity Analysis
- Graph construction: O(N²) for pairwise similarities and adjacency, mitigated by sparsity and batching (Zhang et al., 22 Dec 2025).
- Token selection: pivot sorting is O(N log N); the greedy MIS stage is up to O(N²) overall, mitigated in practice by graph sparsity.
- Self-attention FLOPs: Reduced from O((N+T)²) to O((N′+T)²), where T is the number of text tokens and N′ the retained visual tokens. For instance, retaining 11.1% of the original tokens reduces overall FLOPs by 88.9% (Zhang et al., 22 Dec 2025).
- KV cache usage: Scaled down in proportion to pruned tokens.
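For the attention term specifically, the saving compounds quadratically; a toy calculation (the token counts are illustrative assumptions):

```python
def attn_flops_ratio(n_visual, n_text, keep_frac):
    """Self-attention cost after pruning relative to the unpruned model,
    assuming cost grows with sequence length squared and text tokens
    are never pruned."""
    full = (n_visual + n_text) ** 2
    kept = (keep_frac * n_visual + n_text) ** 2
    return kept / full

# illustrative counts (assumed): 576 visual tokens, 64 text tokens
r = attn_flops_ratio(576, 64, 0.111)   # ~0.04: attention cost falls super-linearly
```

Overall model FLOPs, which include per-token MLP layers, fall closer to linearly in the retained fraction, which is why reported end-to-end reductions track the token ratio.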
Empirical Trade-Offs
| Scenario | Tokens retained (%) | Performance (% of vanilla) | FLOPs change (%) |
|---|---|---|---|
| LLaVA-1.5-7B, general | 33.3 | 99.9 | −74.2 |
| LLaVA-1.5-7B, general | 22.2 | 99.2 | −80.6 |
| LLaVA-1.5-7B, general | 11.1 | 97.3 | −88.9 |
| InternVL-2.5-8B, localization | 50 | 98.4 | — |
| InternVL-2.5-8B, localization | 25 | 94.7 | — |
| InternVL-2.5-8B, localization | 10 | 85.7 | — |
| LLaVA-OneVision-7B, video | 27.3 | ≈ vanilla | −72.7 |
AdaTP typically sustains near-vanilla performance on general tasks down to roughly 11% token retention, and maintains strong accuracy on challenging localization even when half of the tokens are pruned (Zhang et al., 22 Dec 2025, Sun et al., 26 May 2025).
Fairness Pruning in LLMs
AdaTP achieves substantial reductions in measured gender bias with only a modest increase in perplexity on GPT-J-6B (Dasu et al., 20 Mar 2025). The surrogate models evaluate candidate masks far faster than direct model evaluation.
5. Ablations, Insights, and Comparative Analysis
Component Ablations
- Debiasing only: Corrects for positional and border bias, outperforming raw-attention-only pruning, especially at high pruning ratios.
- Structural diversity: Adding MIS-based selection over the hybrid graph yields 5–10 point improvements in localization (and 1–2 in general), reducing semantic and spatial redundancy (Zhang et al., 22 Dec 2025).
- Spatial adjacency: Critical for localization; removing spatial edges reduces grounding accuracy by 3–5 points.
In video LLMs, ablations show additive improvements from segmentation, global debiasing, and local deduplication (Sun et al., 26 May 2025).
Comparison with Prior Work
AdaTP consistently outperforms prior naive pruning pipelines:
- FastV, VisionZip, and Dycoke employ basic attention-based or segmentation pruning but fail to address or even exacerbate attention bias, especially global endpoint focus and local patch redundancy (Sun et al., 26 May 2025).
- On both efficiency and accuracy, AdaTP achieves superior average performance at comparable or lower FLOPs (Zhang et al., 22 Dec 2025, Sun et al., 26 May 2025).
- For attention head pruning, AdaTP's surrogate-annealing search achieves better bias–utility trade-offs than FASP and yields cross-bias reductions in multiple demographic metrics (Dasu et al., 20 Mar 2025).
6. Limitations and Future Directions
Known limitations of current AdaTP instantiations include:
- Surrogate limitations: For head pruning, surrogate approximation error and significant offline training cost (1,900 GPU-hours) (Dasu et al., 20 Mar 2025).
- Hyperparameter sensitivity: Multiple thresholds and ratio parameters require tuning, though robustness is reported across wide ranges.
- Scalability and model size: Empirical validation is currently lacking on very large-scale (>7B) Video LLMs (Sun et al., 26 May 2025).
- Extension to broader tasks: Applicability to robustness, privacy, or more diverse bias categories is an open question.
Proposed extensions include learnable debiasing weights, active-learning-driven surrogate construction, adaptive pruning schedules, and integration with dynamic inference mechanisms such as early-exit or on-device video assistants.
7. Conclusion and Significance
Attention-Debiased Pruning (AdaTP) introduces a rigorous framework for correcting attention-induced biases in both the spatial/temporal and parameter domains of large models. By combining calibrated attention debiasing, hybrid graph-based structural pruning, and fairness-constrained head pruning, AdaTP achieves substantial reductions in computational expense and memory usage while preserving (and sometimes exceeding) original model performance across multiple benchmarks. The technique enables efficient inference for large-scale multimodal and video LLMs and supports responsible deployment in fairness-sensitive applications, with demonstrated, state-of-the-art performance across general understanding, fine-grained localization, and fairness metrics (Zhang et al., 22 Dec 2025, Sun et al., 26 May 2025, Dasu et al., 20 Mar 2025).