Adaptive Top-P Attention Compression
- Adaptive Top-P Attention Compression (AttnComp) is a technique that selects the minimal token subset whose cumulative attention mass meets a user-specified threshold p, so that the context receiving most of the model's attention is preserved.
- It underpins various frameworks—such as reasoning trajectory compression, long-context self-attention acceleration, and retrieval-augmented generation—offering statistical accuracy guarantees and efficiency gains.
- Empirical results demonstrate improved accuracy and reduced token lengths, marking a shift from fixed-budget approaches to adaptive, model-guided attention compression in large language models.
Adaptive Top-P Attention Compression (AttnComp) refers to a family of attention-guided context and token compression techniques for LLMs, grounded in the use of cumulative-mass ("top-p") selection and model-internal attention signals. AttnComp frameworks adaptively select a minimal subset of context or reasoning steps, or restrict the attention computation itself, such that the preserved set accumulates at least a user-specified fraction of the model's internal attention mass. Leading instantiations cover a range of domains, including long-context language modeling, retrieval-augmented generation, and reasoning chains. The resulting methods offer statistical guarantees on information retention, elimination of fixed compression budgets, and substantial gains in efficiency and accuracy across diverse LLM tasks (Lin et al., 4 Feb 2025, Luo et al., 22 Sep 2025, Singh et al., 2 Oct 2025, Ni et al., 5 Feb 2026).
1. Core Principles and Formal Definition
Central to AttnComp is the replacement of fixed-budget pruning (e.g., top-k selection) with the adaptive top-p criterion. Rather than selecting a fixed number of tokens, candidate steps, or documents, AttnComp retains the smallest subset whose cumulative attention weights reach or exceed a threshold $p$, as derived from the model's own attention probabilities. For any collection of $n$ items (tokens, context chunks, retrieved documents, etc.) with associated normalized attention scores $a_1, \dots, a_n$, the selected set $S^*$ satisfies

$$S^* = \arg\min_{S \subseteq \{1, \dots, n\}} |S| \quad \text{subject to} \quad \sum_{i \in S} a_i \ge p.$$
This adaptivity aligns the retained set with the intrinsic distribution of attention, automatically allocating more budget in flat distributions (diffuse attention) and pruning aggressively in peaked ones with concentrated mass (Lin et al., 4 Feb 2025).
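The adaptivity described above can be illustrated with a minimal sketch in plain Python. The function name `top_p_select` and the two example distributions are illustrative assumptions, not taken from the cited papers; the scores are assumed to be non-negative and to sum to 1.

```python
def top_p_select(scores, p):
    """Return indices of the smallest set whose cumulative score mass is >= p.

    Assumes `scores` are non-negative and normalized to sum to 1.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    # Rank items from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += scores[i]
        if mass >= p - 1e-12:  # small tolerance for floating-point round-off
            break
    return selected


peaked = [0.90, 0.04, 0.03, 0.02, 0.01]  # concentrated attention
flat = [0.20, 0.20, 0.20, 0.20, 0.20]    # diffuse attention

print(top_p_select(peaked, 0.9))  # a single item already carries 90% of the mass
print(top_p_select(flat, 0.9))    # all five items are needed to reach 90%
```

With the same threshold, the peaked distribution keeps one item while the flat one keeps all five, which is the behavior a fixed top-k budget cannot reproduce.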
2. Algorithmic Frameworks and Domains
AttnComp techniques are instantiated in several foundational contexts:
a. Reasoning Trajectory Compression
In the TRAAC framework, AttnComp operates on model-generated chain-of-thought trajectories, segmenting reasoning into contiguous steps and re-feeding the entire sequence through the model to extract attention weights from the final reasoning delimiter to all tokens. Stepwise scores are aggregated, and adaptive pruning is performed based on task difficulty and distribution uniformity, yielding a compressed trajectory that preserves critical steps (Singh et al., 2 Oct 2025).
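A rough sketch of the step-level pruning idea follows. It is not the TRAAC implementation: the function name `compress_trajectory`, the use of summation to aggregate token attention into step scores, and the fixed threshold `p` (rather than a difficulty-dependent one) are all assumptions made for illustration.

```python
def compress_trajectory(token_attn, step_spans, p):
    """Keep the reasoning steps that together hold at least a fraction p of attention.

    token_attn : attention weight from the final delimiter to each token (hypothetical input)
    step_spans : list of (start, end) token ranges, end exclusive, one per reasoning step
    """
    # Aggregate token-level attention into one score per step (summation is an assumption).
    step_scores = [sum(token_attn[s:e]) for s, e in step_spans]
    total = sum(step_scores)
    if total <= 0:
        return list(range(len(step_spans)))  # nothing to rank on; keep every step
    normalized = [s / total for s in step_scores]

    ranked = sorted(range(len(normalized)), key=lambda i: normalized[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += normalized[i]
        if mass >= p - 1e-12:
            break
    return sorted(kept)  # restore the original step order


token_attn = [0.05, 0.05, 0.30, 0.30, 0.02, 0.03, 0.15, 0.10]
step_spans = [(0, 2), (2, 4), (4, 6), (6, 8)]
print(compress_trajectory(token_attn, step_spans, p=0.8))  # steps 1 and 3 survive
```

Returning the kept indices in their original order matters here, since a compressed reasoning trajectory still has to read as a coherent sequence of steps.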
b. Long-Context Self-Attention Acceleration
Twilight introduces AttnComp to long-context LLM inference by first applying a base selector (e.g., via block-wise scoring), then imposing top-p pruning within the candidate set. This restricts the attention kernel to the minimal set meeting cumulative mass $p$, yielding a theoretical error bound on the output proportional to $1 - p$ and supporting direct integration into attention primitives of serving infrastructures (Lin et al., 4 Feb 2025).
c. Retrieval-Augmented Generation Context Pruning
In RAG, AttnComp leverages cross-attention from the query to context segments/documents, then selects the minimal set of documents whose cumulative relevance reaches or exceeds $p$. This preserves factual accuracy, reduces latency, and provides confidence estimates by monitoring cumulative mass assigned to instruction versus documents (Luo et al., 22 Sep 2025).
d. Hierarchical Attention Pruning
Double-P generalizes AttnComp with two-stage hierarchical pruning: a first stage estimates attention mass over coarse-grained clusters via centroids, followed by token-level refinement in high-impact clusters, both stages applying the top-p principle. This supports near-optimal trade-offs between accuracy and efficiency in very long context settings (Ni et al., 5 Feb 2026).
3. Mathematical Formulation and Algorithm Details
The common computational kernel is outlined as follows:
- Score Extraction: For each candidate item (token, document, step) $i$, compute an importance score $a_i$ via attention mechanisms (e.g., attention probability from query to context, self-attention from output position to past tokens).
- Sorting: Sort items in descending order of $a_i$.
- Subset Construction: Select the smallest prefix whose cumulative sum reaches or exceeds $p$.
- (Optionally) Multi-Stage Pruning: In hierarchical variants, apply the same strategy first at coarse granularity (e.g., clusters of keys), then refine at the individual item level within selected groups (Ni et al., 5 Feb 2026).
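The optional multi-stage variant can be sketched as two nested applications of the same cumulative-mass rule. This is an illustration under stated assumptions, not the Double-P algorithm itself: the names `top_p` and `two_stage_select`, the separate thresholds `p_coarse` and `p_fine`, and the precomputed cluster masses are all hypothetical.

```python
def top_p(scores, p):
    """Indices of the smallest set whose share of the total score is >= p."""
    total = sum(scores)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += scores[i]
        if mass >= p * total - 1e-12:
            break
    return kept


def two_stage_select(cluster_mass, cluster_tokens, token_score, p_coarse, p_fine):
    """Coarse top-p over clusters, then fine top-p over tokens in surviving clusters."""
    # Stage 1: keep clusters by their (estimated) attention mass.
    clusters = top_p(cluster_mass, p_coarse)
    # Stage 2: pool the tokens of those clusters and apply top-p again.
    candidates = [t for c in clusters for t in cluster_tokens[c]]
    scores = [token_score[t] for t in candidates]
    return sorted(candidates[i] for i in top_p(scores, p_fine))


cluster_mass = [0.70, 0.25, 0.05]                # hypothetical centroid-based estimates
cluster_tokens = [[0, 1, 2], [3, 4], [5, 6, 7]]  # token ids belonging to each cluster
token_score = [0.40, 0.25, 0.05, 0.20, 0.05, 0.02, 0.02, 0.01]

print(two_stage_select(cluster_mass, cluster_tokens, token_score, 0.9, 0.85))
```

In this toy input the third cluster is discarded at the coarse stage, so its three tokens are never scored individually; that skipped work is where a hierarchical scheme would save computation.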
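For self-attention pruning, the attention output under top-p compression satisfies:

$$\lVert o - \hat{o} \rVert \le (1 - p)\,\lVert V \rVert,$$

where $o$ is the full attention output, $\hat{o}$ is the output computed over the retained set, and $V$ is the value matrix (Lin et al., 4 Feb 2025).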
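A small numerical check can make this kind of bound concrete. The sketch below assumes the pruned output is the un-renormalized sum over retained tokens, and it measures the value matrix by its largest row norm (which is no larger than its spectral or Frobenius norm); the random data, seed, and variable names are illustrative only.

```python
import math
import random


def attention_output(weights, values, keep=None):
    """Weighted sum of value vectors, optionally restricted to the indices in `keep`."""
    idx = range(len(weights)) if keep is None else keep
    dim = len(values[0])
    return [sum(weights[i] * values[i][d] for i in idx) for d in range(dim)]


random.seed(0)
n, dim, p = 64, 8, 0.9

# Softmax attention weights over n tokens, plus random value vectors.
logits = [random.gauss(0.0, 2.0) for _ in range(n)]
z = max(logits)
exps = [math.exp(x - z) for x in logits]
weights = [e / sum(exps) for e in exps]
values = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

# Top-p selection: smallest prefix of the sorted weights with mass >= p.
ranked = sorted(range(n), key=lambda i: weights[i], reverse=True)
keep, mass = [], 0.0
for i in ranked:
    keep.append(i)
    mass += weights[i]
    if mass >= p:
        break

full = attention_output(weights, values)
pruned = attention_output(weights, values, keep)  # no renormalization (assumption)
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(full, pruned)))
max_row_norm = max(math.sqrt(sum(v * v for v in row)) for row in values)

print(len(keep), "of", n, "tokens kept")
print(error <= (1.0 - p) * max_row_norm + 1e-9)
```

The check follows from the triangle inequality: the discarded weights sum to at most 1 - p, and each discarded value vector contributes at most its own norm.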
4. Empirical Results and Efficiency Gains
Extensive experiments have established the benefits of AttnComp:
- Reasoning Trajectory Compression: On Qwen3-4B, TRAAC’s AttnComp improves average accuracy by 8.4 points with a 36.8% reduction in reasoning token length relative to the base model. Ablations show that removing compression reduces accuracy by 3.4 points and increases length by 23.8%, while replacing AttnComp with random or least-confidence pruning yields significant accuracy drops (~11 points or ~7 points, respectively) (Singh et al., 2 Oct 2025).
- Long-context LLMs: Twilight achieves up to 98% KV cache entry pruning, 15.4× acceleration in self-attention, and 3.9× decoding latency reduction, with near-zero average accuracy loss across 32K–128K token contexts (LongBench) (Lin et al., 4 Feb 2025). Double-P achieves within 0.2–0.5 points of full attention for LLaMA3.1-8B with up to 1.8× reduction in attention compute and up to 1.26× decoding speedup over the best fixed-k/cluster baseline (Ni et al., 5 Feb 2026).
- Retrieval-Augmented Generation: AttnComp attains an average compression rate of 17× with a +1.9 point gain in accuracy over uncompressed context, 49% end-to-end latency savings, and consistent F1 increases with higher response confidence on RAG benchmarks (Luo et al., 22 Sep 2025).
5. Computational Complexity and Implementation
AttnComp’s main overheads are score computation, sorting, and cumulative sum operations. In long-context LLMs, hierarchical selection reduces the cost of dense attention over the full context to a cost that scales with the number of clusters and the refined token budget. Score estimation can leverage quantization to curtail bandwidth consumption, e.g., with 4-bit INT quantization of keys on GPU (Lin et al., 4 Feb 2025). Efficient kernels, including parallel binary search rather than sorting for thresholding, limit implementation overhead relative to the base attention kernel.
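The idea of thresholding by binary search instead of sorting can be sketched as a bisection on a weight cutoff. This sequential Python version only illustrates the logic; the function name `top_p_threshold` and the iteration count are assumptions, and a real kernel would evaluate each masked sum in parallel on the GPU.

```python
def top_p_threshold(weights, p, iters=30):
    """Bisect for a cutoff tau such that the weights >= tau sum to at least p.

    Assumes `weights` are non-negative and sum to (approximately) 1, with p < 1.
    """
    lo, hi = 0.0, max(weights)
    # Invariant: the weights >= lo always carry at least mass p.
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        kept_mass = sum(w for w in weights if w >= mid)  # one parallelizable reduction
        if kept_mass >= p:
            lo = mid  # still enough mass: try a stricter cutoff
        else:
            hi = mid  # too strict: relax the cutoff
    return lo


weights = [0.5, 0.2, 0.15, 0.1, 0.05]
tau = top_p_threshold(weights, p=0.8)
kept = [i for i, w in enumerate(weights) if w >= tau]
print(kept)  # the three largest weights, which together hold 0.85 of the mass
```

Each iteration needs only a comparison and a reduction over the weights, so no global sort is required. When several weights tie at the cutoff, this approach keeps all of them, so the result can be slightly larger than the strictly minimal set.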
Integration is straightforward for serving stacks supporting custom attention sparsity, including PagedAttention, FlashInfer, and SGLang, and supports both page-level and token-level sparsity primitives (Lin et al., 4 Feb 2025, Ni et al., 5 Feb 2026).
6. Practical Considerations and Limitations
Key practical hyperparameters include the mass threshold $p$, a minimum attention cutoff, and structural factors such as the clustering scheme and candidate selector choice. The top-p criterion is empirically robust but may require tuning in scenarios with highly non-uniform attention spread or cross-modal integration. The Double-P hierarchy addresses the bottleneck of token-level softmax on exceptionally long sequences by amortizing the cost over a coarse-to-fine pipeline (Ni et al., 5 Feb 2026).
In RAG settings, the framework allows simultaneous compression and confidence scoring; confidence is derived from the attention mass assigned to the instruction (Luo et al., 22 Sep 2025). This enables response reliability estimation without added inference passes.
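As a hedged illustration of how compression and a confidence signal might come out of one set of attention scores, the sketch below prunes documents by top-p and returns the instruction's attention mass alongside them. The function name `prune_documents`, the renormalization over documents only, and the choice to return the raw instruction mass (rather than a specific confidence formula) are assumptions of this example, not details from the cited work.

```python
def prune_documents(instruction_mass, doc_masses, p):
    """Top-p pruning over retrieved documents, reporting the instruction's attention mass.

    instruction_mass and doc_masses are hypothetical inputs: attention mass from the
    query to the instruction and to each document, jointly summing to 1.
    """
    doc_total = sum(doc_masses)
    if doc_total <= 0:
        return [], instruction_mass  # no attention on any document: keep none
    # Renormalize over documents only (an assumption of this sketch).
    shares = [m / doc_total for m in doc_masses]
    ranked = sorted(range(len(shares)), key=lambda i: shares[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += shares[i]
        if mass >= p - 1e-12:
            break
    return sorted(kept), instruction_mass


kept_docs, inst_mass = prune_documents(0.10, [0.45, 0.05, 0.30, 0.10], p=0.8)
print(kept_docs, inst_mass)  # documents 0 and 2 are kept
```

Both outputs are read from the same attention scores, which is why no extra inference pass is needed to obtain the reliability signal.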
7. Impact and Extensions
AttnComp frameworks have catalyzed a shift from static, task-tuned compression budgets to data- and distribution-adaptive, model-guided sparsity in both context selection and attention computation. The error bound proportional to $1 - p$ supplies accuracy guarantees absent in heuristic top-k methods. Potential extensions include dynamic $p$-scheduling (per-layer, per-head, or conditioned on inference cost), direct application to cross-attention and multi-modal contexts, and learned or metadata-driven selectors. Hierarchical pathways, as in Double-P, enable further balancing of compression accuracy, compute, and latency (Ni et al., 5 Feb 2026).
AttnComp continues to underpin competitive state-of-the-art results in long-context LLM efficiency and retrieval-augmented reasoning, providing a principled approach to adaptive context and computation management across modern NLP pipelines (Lin et al., 4 Feb 2025, Luo et al., 22 Sep 2025, Singh et al., 2 Oct 2025, Ni et al., 5 Feb 2026).