Adaptive Top-P Attention Compression

Updated 25 March 2026

Adaptive Top-P Attention Compression (AttnComp) is a technique that selects the minimal token subset whose cumulative attention mass meets a user-specified threshold p, so that the context carrying most of the model's attention is preserved.
It underpins various frameworks, such as reasoning trajectory compression, long-context self-attention acceleration, and retrieval-augmented generation, offering statistical accuracy guarantees and efficiency gains.
Empirical results demonstrate improved accuracy and reduced token lengths, marking a shift from fixed-budget approaches to adaptive, model-guided attention compression in large language models.

Adaptive Top-P Attention Compression (AttnComp) refers to a family of attention-guided context and token compression techniques for LLMs, grounded in the use of cumulative-mass ("top-p") selection and model-internal attention signals. AttnComp frameworks adaptively select a minimal subset of context or reasoning steps, or restrict the attention computation itself, such that the preserved set accumulates at least a user-specified fraction $p$ of the model's internal attention mass. Leading instantiations cover a range of domains, including long-context language modeling, retrieval-augmented generation, and reasoning chains. The resulting methods offer statistical guarantees on information retention, elimination of fixed compression budgets, and substantial gains in efficiency and accuracy across diverse LLM tasks (Lin et al., 4 Feb 2025, Luo et al., 22 Sep 2025, Singh et al., 2 Oct 2025, Ni et al., 5 Feb 2026).

1. Core Principles and Formal Definition

Central to AttnComp is the replacement of fixed-budget pruning (e.g., top-k selection) with the adaptive top-p criterion. Rather than selecting a fixed number $k$ of tokens, candidate steps, or documents, AttnComp retains the smallest subset whose cumulative attention weights reach or exceed a threshold $p \in (0,1)$, as derived from the model's own attention probabilities. For any collection of items (tokens, context chunks, retrieved documents, etc.) with associated normalized attention scores $P_j$, the selected set $I_{\text{top-}p}$ satisfies

$\sum_{j\in I_{\text{top-}p}} P_j \geq p \quad \text{and} \quad |I_{\text{top-}p}| \text{ minimal}$

This adaptivity aligns the retained set with the intrinsic distribution of attention, automatically allocating more budget in flat distributions (diffuse attention) and pruning aggressively in peaked ones with concentrated mass (Lin et al., 4 Feb 2025).
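The contrast between peaked and flat distributions can be illustrated with a minimal NumPy sketch. The function name `top_p_select` and the two example score vectors are illustrative assumptions, not taken from any of the cited implementations.

```python
import numpy as np

def top_p_select(scores, p):
    """Return indices of the smallest set whose normalized scores sum to at least p."""
    probs = np.asarray(scores, dtype=float)
    probs = probs / probs.sum()                 # normalize so the scores sum to 1
    order = np.argsort(-probs, kind="stable")   # descending by score
    cumulative = np.cumsum(probs[order])
    # First sorted position at which the cumulative mass reaches p.
    k = int(np.searchsorted(cumulative, p, side="left")) + 1
    return order[:min(k, len(order))]

# Hypothetical attention distributions over four items.
peaked = [0.85, 0.10, 0.03, 0.02]   # mass concentrated on one item
flat = [0.25, 0.25, 0.25, 0.25]     # diffuse attention

peaked_kept = top_p_select(peaked, p=0.9)   # keeps 2 of 4 items
flat_kept = top_p_select(flat, p=0.9)       # keeps all 4 items
```

With the same threshold, the peaked distribution is pruned to two items while the flat one keeps all four, which is the behavior a fixed top-k budget cannot reproduce.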

2. Algorithmic Frameworks and Domains

AttnComp techniques are instantiated in several foundational contexts:

a. Reasoning Trajectory Compression

In the TRAAC framework, AttnComp operates on model-generated chain-of-thought trajectories, segmenting reasoning into contiguous steps and re-feeding the entire sequence through the model to extract attention weights from the final reasoning delimiter to all tokens. Stepwise scores are aggregated, and adaptive pruning is performed based on task difficulty and distribution uniformity, yielding a compressed trajectory that preserves critical steps (Singh et al., 2 Oct 2025).
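As a rough illustration of step-level scoring, the sketch below sums token-level attention within each reasoning step and keeps the steps that cover a cumulative mass $p$. The summation rule, the function name `compress_steps`, and the use of a plain top-p cut in place of TRAAC's difficulty- and uniformity-dependent pruning are all simplifying assumptions; the example attention values are made up.

```python
import numpy as np

def compress_steps(token_attention, step_ids, p):
    """Keep the reasoning steps whose aggregated attention covers mass p."""
    token_attention = np.asarray(token_attention, dtype=float)
    step_ids = np.asarray(step_ids, dtype=int)
    # Aggregate token-level attention into one score per step (sum is an assumption).
    step_mass = np.bincount(step_ids, weights=token_attention)
    step_mass = step_mass / step_mass.sum()
    order = np.argsort(-step_mass, kind="stable")
    cumulative = np.cumsum(step_mass[order])
    k = min(int(np.searchsorted(cumulative, p)) + 1, len(order))
    # Return surviving steps in their original order so the trajectory stays readable.
    return sorted(order[:k].tolist())

# Hypothetical attention from the final delimiter to 8 tokens grouped into 4 steps.
attention = [0.02, 0.03, 0.30, 0.25, 0.01, 0.04, 0.20, 0.15]
steps = [0, 0, 1, 1, 2, 2, 3, 3]
kept_steps = compress_steps(attention, steps, p=0.85)   # steps 1 and 3 survive
```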

b. Long-Context Self-Attention Acceleration

Twilight introduces AttnComp to long-context LLM inference by first applying a base selector (e.g., via block-wise scoring), then imposing a top-p pruning within the candidate set. This restricts the attention kernel to the minimal set meeting cumulative mass $p$, yielding a theoretical error bound on the output proportional to $(1-p)$ and supporting direct integration into attention primitives of serving infrastructures (Lin et al., 4 Feb 2025).
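The select-then-prune pattern can be sketched for a single query as follows. This is not Twilight's kernel: the block selector (ranking blocks by their maximum logit), the function name `select_then_prune`, and all parameter names are assumptions, and exact logits are computed for every key purely to keep the example short, whereas a real selector would use a cheaper estimate.

```python
import numpy as np

def select_then_prune(q, K, block_size, n_blocks_keep, p):
    """Stage 1: pick candidate blocks. Stage 2: top-p prune inside the candidates."""
    n, d = K.shape
    logits = K @ q / np.sqrt(d)
    # Stage 1 (hypothetical base selector): rank blocks by their largest logit.
    n_blocks = (n + block_size - 1) // block_size
    block_scores = np.array(
        [logits[b * block_size:(b + 1) * block_size].max() for b in range(n_blocks)]
    )
    keep_blocks = np.argsort(-block_scores)[:n_blocks_keep]
    candidates = np.concatenate(
        [np.arange(b * block_size, min((b + 1) * block_size, n)) for b in keep_blocks]
    )
    # Stage 2: softmax over the candidates only, then keep the smallest top-p prefix.
    weights = np.exp(logits[candidates] - logits[candidates].max())
    probs = weights / weights.sum()
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    k = min(int(np.searchsorted(cumulative, p)) + 1, len(order))
    return np.sort(candidates[order[:k]]), float(cumulative[k - 1])

rng = np.random.default_rng(0)
K = rng.normal(size=(64, 16))   # 64 cached keys of dimension 16
q = rng.normal(size=16)
kept, mass = select_then_prune(q, K, block_size=8, n_blocks_keep=3, p=0.9)
```

The second return value is the attention mass actually retained within the candidate set, which is at least `p` by construction.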

c. Retrieval-Augmented Generation Context Pruning

In RAG, AttnComp leverages cross-attention from the query to context segments/documents, then selects the minimal set of documents whose cumulative relevance reaches or exceeds $p$. This preserves factual accuracy, reduces latency, and provides confidence estimates by monitoring cumulative mass assigned to instruction versus documents (Luo et al., 22 Sep 2025).

d. Hierarchical Attention Pruning

Double-P generalizes AttnComp with two-stage hierarchical pruning: a first stage estimates attention mass over coarse-grained clusters via centroids, followed by token-level refinement in high-impact clusters, both stages applying the top-p principle. This supports near-optimal trade-offs between accuracy and efficiency in very long context settings (Ni et al., 5 Feb 2026).

3. Mathematical Formulation and Algorithm Details

The common computational kernel is outlined as follows:

Score Extraction: For each candidate item (token, document, step) $j$, compute an importance score $P_j$ via attention mechanisms (e.g., attention probability from query to context, self-attention from output position to past tokens).
Sorting: Sort items in descending order of $P_j$.
Subset Construction: Select the smallest prefix whose cumulative sum $\sum P_j$ reaches or exceeds $p$.
Multi-Stage Pruning (optional): In hierarchical variants, apply the same strategy first at coarse granularity (e.g., clusters of keys), then refine at the individual item level within selected groups (Ni et al., 5 Feb 2026).
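The optional multi-stage variant can be sketched as a coarse-to-fine pass for a single query. This is an illustrative approximation rather than the Double-P algorithm: estimating a cluster's mass as its size times the exponentiated centroid logit, the function names, and the two thresholds `p_coarse` and `p_fine` are assumptions introduced here.

```python
import numpy as np

def top_p_indices(probs, p):
    """Indices of the smallest prefix (by descending probability) with mass >= p."""
    order = np.argsort(-probs, kind="stable")
    cumulative = np.cumsum(probs[order])
    k = min(int(np.searchsorted(cumulative, p)) + 1, len(order))
    return order[:k]

def hierarchical_top_p(q, K, labels, p_coarse, p_fine):
    """Top-p over clusters via centroids, then top-p over tokens in kept clusters."""
    d = K.shape[1]
    n_clusters = int(labels.max()) + 1
    sizes = np.bincount(labels, minlength=n_clusters)
    centroids = np.zeros((n_clusters, d))
    np.add.at(centroids, labels, K)                  # sum keys per cluster
    centroids /= np.maximum(sizes, 1)[:, None]       # mean key per cluster
    # Coarse stage: approximate cluster mass as size * exp(centroid logit).
    c_logits = centroids @ q / np.sqrt(d)
    c_mass = sizes * np.exp(c_logits - c_logits.max())
    keep_clusters = top_p_indices(c_mass / c_mass.sum(), p_coarse)
    # Fine stage: exact softmax restricted to tokens of the kept clusters.
    candidates = np.flatnonzero(np.isin(labels, keep_clusters))
    logits = K[candidates] @ q / np.sqrt(d)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.sort(candidates[top_p_indices(probs, p_fine)])

rng = np.random.default_rng(1)
K = rng.normal(size=(64, 16))
q = rng.normal(size=16)
labels = np.arange(64) % 8       # hypothetical cluster assignment, 8 clusters
kept_tokens = hierarchical_top_p(q, K, labels, p_coarse=0.9, p_fine=0.9)
```

Only the tokens of clusters that survive the coarse stage are scored exactly, which is where the compute saving over a dense pass would come from.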

For self-attention pruning, the attention output under top-p compression $\tilde{\mathbf{O}}$ satisfies:

$\|\mathbf{O} - \tilde{\mathbf{O}}\|_2 \leq (1-p)\|\mathbf{V}\|_2$

where $\mathbf{O}$ is the full attention output and $\mathbf{V}$ is the value matrix (Lin et al., 4 Feb 2025).
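A small numerical check of this bound is sketched below under three assumptions that the text does not spell out: a single query vector, pruned attention weights that are not renormalized after pruning, and $\|\mathbf{V}\|_2$ read as the spectral norm of the value matrix. Under these assumptions the error is at most the dropped mass times the largest value-row norm, which is itself at most $(1-p)\|\mathbf{V}\|_2$. The helper name `bound_check` is hypothetical.

```python
import numpy as np

def bound_check(n=256, d=32, p=0.95, seed=0):
    """Compare the top-p pruning error with the (1 - p) * ||V||_2 bound."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=d)
    K = rng.normal(size=(n, d))
    V = rng.normal(size=(n, d))
    logits = K @ q / np.sqrt(d)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Smallest set of keys whose attention mass reaches p.
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    k = min(int(np.searchsorted(cumulative, p)) + 1, n)
    keep = order[:k]
    full = probs @ V                      # dense attention output
    pruned = probs[keep] @ V[keep]        # pruned output, weights NOT renormalized
    error = float(np.linalg.norm(full - pruned))
    bound = float((1 - p) * np.linalg.norm(V, 2))   # ord=2 gives the spectral norm
    return error, bound, k

error, bound, kept_count = bound_check()
```

On random data the observed error is typically far below the bound, since the bound is a worst case in which all dropped value vectors align.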

4. Empirical Results and Efficiency Gains

Extensive experiments have established the benefits of AttnComp:

Reasoning Trajectory Compression: On Qwen3-4B, TRAAC’s AttnComp improves average accuracy by 8.4 points with a 36.8% reduction in reasoning token length relative to the base model. Ablations show that removing compression reduces accuracy by 3.4 points and increases length by 23.8%, while replacing AttnComp with random or least-confidence pruning yields significant accuracy drops (~11 points or ~7 points, respectively) (Singh et al., 2 Oct 2025).
Long-context LLMs: Twilight achieves up to 98% KV cache entry pruning, 15.4× acceleration in self-attention, and 3.9× decoding latency reduction, with near-zero average accuracy loss across 32K–128K token contexts (LongBench) (Lin et al., 4 Feb 2025). Double-P achieves within 0.2–0.5 points of full attention for LLaMA3.1-8B with up to 1.8× reduction in attention compute and up to 1.26× decoding speedup over the best fixed-k/cluster baseline (Ni et al., 5 Feb 2026).
Retrieval-Augmented Generation: AttnComp attains an average compression rate of 17× with a +1.9 point gain in accuracy over uncompressed context, 49% end-to-end latency savings, and consistent F1 increases with higher response confidence on RAG benchmarks (Luo et al., 22 Sep 2025).

5. Computational Complexity and Implementation

AttnComp’s main overheads are score computation, sorting, and cumulative sum operations. In long-context LLMs, hierarchical selection reduces complexity from $O(Nd)$ (dense attention) to $O(Cd + N_{exact}d)$, where $C$ is the number of clusters and $N_{exact}$ the refined token budget. Score estimation can leverage quantization to curtail bandwidth consumption, e.g., with 4-bit INT quantization of keys on GPU (Lin et al., 4 Feb 2025). Efficient kernels, including parallel binary search rather than sorting for thresholding, limit implementation overhead relative to the base attention kernel.
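The idea of thresholding by binary search instead of sorting can be sketched on the CPU as follows. Each iteration needs only a masked sum, which maps to a parallel reduction on a GPU. The function name `top_p_threshold`, the iteration count, and the example scores are assumptions; this is not the kernel from the cited work.

```python
import numpy as np

def top_p_threshold(probs, p, iters=32):
    """Bisect for a score threshold t such that items with score >= t hold mass >= p."""
    probs = np.asarray(probs, dtype=float)
    lo, hi = 0.0, float(probs.max())
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # Masked sum: a single reduction, no sort required.
        if probs[probs >= mid].sum() >= p:
            lo = mid    # still enough mass above mid, so the threshold can rise
        else:
            hi = mid    # too little mass above mid, so the threshold must drop
    return lo

scores = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
threshold = top_p_threshold(scores, p=0.85)
kept_by_threshold = np.flatnonzero(scores >= threshold)   # indices 0, 1, 2
```

When several items share exactly the threshold score, this variant keeps all of them, so the result can be slightly larger than the strictly minimal set that sorting would give.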

Integration is straightforward for serving stacks supporting custom attention sparsity, including PagedAttention, FlashInfer, and SGLang, and supports both page-level and token-level sparsity primitives (Lin et al., 4 Feb 2025, Ni et al., 5 Feb 2026).

6. Practical Considerations and Limitations

Key practical hyperparameters include the mass threshold $p$ (typically in $[0.9, 0.98]$), minimum attention cutoff $\epsilon$, and structural factors such as clustering scheme and candidate selector choice. The top-p criterion is empirically robust but may require tuning in scenarios with highly non-uniform attention spread or cross-modal integration. The Double-P hierarchy addresses the bottleneck of token-level softmax on exceptionally long sequences by amortizing the cost over a coarse-to-fine pipeline (Ni et al., 5 Feb 2026).

In RAG settings, the framework allows simultaneous compression and confidence scoring; confidence is defined as $1-s_{ins}$, where $s_{ins}$ is the instruction's attention mass (Luo et al., 22 Sep 2025). This enables response reliability estimation without added inference passes.
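The confidence score can be computed from the same attention masses used for compression. In the sketch below, the layout of the score vector (instruction first, then one entry per document), the function name `rag_confidence`, and the example values are assumptions for illustration.

```python
import numpy as np

def rag_confidence(segment_scores, instruction_index=0):
    """Confidence = 1 - s_ins, where s_ins is the instruction's share of attention."""
    mass = np.asarray(segment_scores, dtype=float)
    mass = mass / mass.sum()                 # normalize over instruction + documents
    s_ins = float(mass[instruction_index])
    return 1.0 - s_ins

# Hypothetical attention masses: [instruction, doc 1, doc 2, doc 3].
confident = rag_confidence([1.0, 5.0, 3.0, 1.0])    # little mass on the instruction
uncertain = rag_confidence([7.0, 1.0, 1.0, 1.0])    # most mass on the instruction
```

A low score signals that the model attended mostly to the instruction rather than to any retrieved document, which suggests the retrieved context was not useful for the query.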

7. Impact and Extensions

AttnComp frameworks have catalyzed a shift from static, task-tuned compression budgets to data- and distribution-adaptive, model-guided sparsity in both context selection and attention computation. The statistical error bound of $(1-p)\|\mathbf{V}\|_2$ supplies accuracy guarantees absent in heuristic top-k methods. Potential extensions include dynamic $p$-scheduling (per-layer, per-head, or conditioned on inference cost), direct application to cross-attention and multi-modal contexts, and learned or metadata-driven selectors. Hierarchical pathways, as in Double-P, enable further balancing of compression accuracy, compute, and latency (Ni et al., 5 Feb 2026).

AttnComp continues to underpin competitive state-of-the-art results in long-context LLM efficiency and retrieval-augmented reasoning, providing a principled approach to adaptive context and computation management across modern NLP pipelines (Lin et al., 4 Feb 2025, Luo et al., 22 Sep 2025, Singh et al., 2 Oct 2025, Ni et al., 5 Feb 2026).


References

Lin et al. (2025). Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning.

Luo et al. (2025). AttnComp: Attention-Guided Adaptive Context Compression for Retrieval-Augmented Generation.

Singh et al. (2025). Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression.

Ni et al. (2026). Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs.
