Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling

Published 20 Apr 2026 in cs.AI | (2604.18103v1)

Abstract: Prefilling computational costs pose a significant bottleneck for LLMs and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuristics that break compatibility with hardware-efficient kernels like FlashAttention. In this work, we observe that tokens evolve toward semantic fixing points, making further processing redundant. To this end, we introduce Delta Attention Selective Halting (DASH), a training-free policy that monitors the layer-wise update dynamics of the self-attention mechanism to selectively halt stabilized tokens. Extensive evaluation confirms that DASH generalizes across language and vision benchmarks, delivering significant prefill speedups while preserving model accuracy and hardware efficiency. Code will be released at https://github.com/verach3n/DASH.git.

Authors (7)

Summary

The paper introduces DASH, a training-free token halting mechanism that uses delta-attention dynamics to identify and prune redundant tokens.
On long-context language and vision-language benchmarks, it delivers significant prefill speedups with minimal loss in accuracy.
DASH balances computation and effectiveness by retaining tokens with high semantic evolution, validated through extensive empirical experiments.

DASH: Delta Attention Selective Halting for Efficient Long-Context Prefilling

Introduction

The computational cost imposed by long context processing remains a limiting factor in deploying LLMs and LMMs for tasks requiring extended contexts, such as document-level summarization and vision-language understanding. Existing methods for computational reduction, especially token pruning, often depend on importance heuristics that are fundamentally incompatible with hardware-optimized attention kernels like FlashAttention. This work introduces Delta Attention Selective Halting (DASH), a novel, training-free inference-time policy leveraging the dynamic evolution of self-attention updates for efficient conditional computation during prefill. The method exploits token-level stabilization dynamics, identifying tokens for which further processing is redundant and halting their forward propagation, thus yielding substantial reductions in computation without retraining or loss of kernel compatibility.

Delta Attention as a Robust Proxy for Token Redundancy

The central premise in DASH is that, in deep transformers, most tokens quickly converge to semantic fixed points across layers—they are characterized by diminishing residual updates in the attention subbranch. This is evidenced by the highly skewed distribution of per-token $\Delta_{\text{attn}}$ (layer-wise attention updates): the majority of tokens accumulate near-zero update magnitudes, while a minority continue to evolve.

Figure 1: Layer-wise distributions of token-wise relative $\Delta_{\text{attn}}$, highlighting that most tokens rapidly reach negligible updates, indicating redundancy.
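As a rough illustration of the quantity plotted in Figure 1, the sketch below computes a per-token relative attention update and the share of tokens that fall below a threshold. It assumes that "relative" means the L2 norm of the attention sub-branch update divided by the L2 norm of the token's hidden state; the summary does not spell out the normalization, so that choice, the function names, and the toy values are all hypothetical.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a plain Python vector."""
    return math.sqrt(sum(x * x for x in vec))

def relative_delta_attn(hidden, attn_update, eps=1e-8):
    """Per-token ||U_t|| / (||h_t|| + eps).

    Assumption: normalizing by the hidden-state norm is one plausible
    reading of "relative"; it is not confirmed by the source.
    """
    return [l2_norm(u) / (l2_norm(h) + eps) for h, u in zip(hidden, attn_update)]

def fraction_below(values, threshold):
    """Share of tokens whose relative update is under the threshold."""
    return sum(v < threshold for v in values) / len(values)

# Toy data: four tokens with 2-dimensional states (illustrative only).
hidden = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [6.0, 8.0]]
attn_update = [[0.03, 0.04], [0.5, 0.0], [0.0, 0.01], [0.0, 0.0]]

rel = relative_delta_attn(hidden, attn_update)
stable_share = fraction_below(rel, 0.05)
```

In this toy case only the second token still receives a sizeable update, which mirrors the skewed distribution described above.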

Analysis demonstrates a positive correlation between the average delta attention magnitude and token importance, as measured by their cumulative downstream attention weights. Tokens with minimal $\Delta_{\text{attn}}$ become information sinks, ceasing to influence subsequent computation and thus admitting safe halting.

Figure 2: Correlation between normalized final-layer attention scores and mean $\Delta_{\text{attn}}$. Stabilized (low-delta) tokens are rarely attended to afterward, validating their redundancy.
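A correlation of this kind can be checked with a few lines of code. The sketch below is a minimal version under two assumptions that are not taken from the paper: a token's importance is the attention it receives summed over queries and normalized to sum to one, and the association is measured with a Pearson coefficient. The attention matrix and delta values are toy numbers.

```python
import math

def attention_received(attn):
    """Column sums of a (queries x keys) attention matrix, normalized to sum to 1."""
    n_keys = len(attn[0])
    col = [sum(row[j] for row in attn) for j in range(n_keys)]
    total = sum(col)
    return [c / total for c in col]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy final-layer attention (each row sums to 1) and toy mean deltas per token.
final_attn = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
]
mean_delta = [0.9, 0.4, 0.05]

received = attention_received(final_attn)
r = pearson(mean_delta, received)
```

A value of `r` close to 1 on real activations would match the positive relationship reported above; the toy numbers are constructed to show it and carry no evidential weight.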

Further, modality-specific analysis reveals that visual tokens—owing to the spatial redundancy in their feature representation—stabilize significantly earlier than textual ones, necessitating distinct strategies for language and vision inputs.

Figure 3: Layer-wise sparsity for visual (left) and textual (right) tokens, demonstrating earlier saturation of vision tokens.

Method: DASH Token Halting Mechanism

DASH operates via a single-shot, layer-local delta-attention signal $\Delta_t^{(l)} = \lVert U^{(l)}_t \rVert_2$, where $U^{(l)}_t$ is the attention sub-branch update for token $t$ at layer $l$, computed at a preselected start layer $l_s$. Upon reaching $l_s$, tokens are ranked by $\Delta_{\text{attn}}$, and only the top $K$ tokens (with $K$ set by a pruning ratio $\rho$) are propagated beyond $l_s$; all others are halted, skipping subsequent attention and FFN computation. The halting decision is applied uniformly across modalities, and the resulting active set is held fixed for all deeper layers, ensuring compatibility with efficient attention implementations.

Figure 4: Overview of DASH. At layer $l_s$, only tokens with the highest $\Delta_{\text{attn}}$ are retained for further computation; others are halted, producing computation savings while preserving capacity for critical context aggregation.

This simple, training-free, and hardware-efficient policy can be tuned via the start layer $l_s$ and the pruning ratio $\rho$ to trade off accuracy and efficiency.
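The halting policy can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the pruning ratio is read as the fraction of tokens removed, so that K = N - round(rho * N); the update computed at the start layer is still applied to every token before halting; and each "layer" is a hypothetical callable standing in for the attention sub-branch (FFN omitted). The names `select_active` and `prefill_with_halting` are invented for the example.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def select_active(deltas, prune_ratio):
    """Indices of the top-K tokens by delta, returned in original order.

    Assumption: prune_ratio is the fraction of tokens to halt.
    """
    n = len(deltas)
    k = max(1, n - round(prune_ratio * n))
    ranked = sorted(range(n), key=lambda i: deltas[i], reverse=True)
    return sorted(ranked[:k])

def prefill_with_halting(hidden, layers, start_layer, prune_ratio):
    """Run toy layers, halting low-delta tokens once at start_layer.

    hidden: list of token vectors.
    layers: callables mapping a list of token vectors to a list of
            attention updates of the same shape (hypothetical stand-ins).
    """
    hidden = [list(h) for h in hidden]   # do not mutate the caller's data
    active = list(range(len(hidden)))    # all tokens start active
    for idx, attn_branch in enumerate(layers):
        updates = attn_branch([hidden[i] for i in active])
        for i, u in zip(active, updates):
            hidden[i] = [h + d for h, d in zip(hidden[i], u)]  # residual add
        if idx == start_layer:
            deltas = [l2_norm(u) for u in updates]
            keep = select_active(deltas, prune_ratio)
            active = [active[j] for j in keep]  # fixed for all deeper layers
    return hidden, active

# Toy branch: the update is half of each token's current state.
toy_branch = lambda toks: [[0.5 * x for x in t] for t in toks]
tokens = [[1.0, 0.0], [0.0, 0.1], [2.0, 0.0], [0.0, 0.2]]

out, active = prefill_with_halting(tokens, [toy_branch] * 3, start_layer=0, prune_ratio=0.5)
```

Because the active set is chosen once and the kept indices stay in their original order, the deeper layers see an ordinary, shorter dense sequence, which is the property that keeps the approach compatible with fused attention kernels.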

Experimental Validation

Long-Context Language Results

On Qwen2.5-7B-Instruct-1M, evaluation over comprehensive benchmarks (LongBench-E, LooGLE) establishes that DASH consistently outperforms state-of-the-art prefill compression baselines, including SnapKV, D $\Delta_{\mathrm{attn}}$ 5, LLMLingua2, and FastV. For LongBench-E, DASH obtains an average aggregate score of 46.76 under strong compression, approaching the uncompressed backbone (48.87), while exceeding SnapKV (46.15) and FastV (43.99).

DASH’s halting strategy is robust across diverse tasks (QA, summarization, code completion), indicating that attention-branch deltas reliably select information-bearing tokens even under aggressive sequence length reduction.

Vision–Language Compression

In the context of vision-LLMs (Qwen2-VL-7B), DASH achieves the highest average decline ratio (ADR) across a range of VL benchmarks and under stringent token reduction regimes, significantly mitigating performance degradation compared to FastV, VisionZip, and DART.

Figure 5: Performance of DASH and baselines on VL token compression under varying reduction ratios. DASH preserves accuracy especially well under strong compression.

End-to-End Efficiency

Crucially, prefill FLOP reductions translate into improved end-to-end inference times. DASH delivers up to 1.74× latency improvement at matched accuracy compared to uncompressed inference, and an 8.5% higher score than the closest competitor (FastV) at equivalent runtime.

Figure 6: Trade-off between LongBench-E performance and E2E time. DASH achieves superior accuracy-latency operating points.
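To see why halting at a start layer reduces prefill compute, a back-of-envelope cost model helps. The sketch below is not the paper's FLOP accounting and does not reproduce its reported speedups. It assumes that every layer costs the same on the full sequence, that halted tokens are dropped from the sequence entirely, and that a configurable share of per-layer cost scales quadratically with token count. The function name and all parameter values are hypothetical.

```python
def relative_prefill_cost(num_layers, start_layer, prune_ratio, attn_quadratic_share=0.0):
    """Estimated prefill cost with halting, relative to a full run (1.0).

    Layers 0..start_layer process all tokens; deeper layers process only
    the kept fraction. attn_quadratic_share is the part of a full layer's
    cost assumed to scale with the square of the token count.
    """
    keep = 1.0 - prune_ratio
    full_layers = start_layer + 1
    reduced_layers = num_layers - full_layers
    per_reduced_layer = (
        (1.0 - attn_quadratic_share) * keep
        + attn_quadratic_share * keep * keep
    )
    return (full_layers + reduced_layers * per_reduced_layer) / num_layers

# Hypothetical setting: 28 layers, halting after layer index 13, half the tokens pruned.
linear_only = relative_prefill_cost(28, 13, 0.5)
with_quadratic = relative_prefill_cost(28, 13, 0.5, attn_quadratic_share=0.2)
```

Under these assumptions the estimate shows the basic trade-off: an earlier start layer or a larger pruning ratio lowers cost, and the saving grows when the quadratic attention term is a larger share of the work, as it is at long context lengths.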

Token-Level Qualitative Analysis

Token-level visualization demonstrates that $\Delta_{\text{attn}}$ at the halting layer assigns higher scores to semantically informative tokens, consistent with the selection of globally relevant context elements rather than merely local statistics.

Figure 7: Example showing that DASH assigns high $\Delta_{\text{attn}}$ to tokens with substantive semantic content, preserving critical information in the active set.

Ablations and Mechanistic Insights

Ablation experiments establish that:

Attention-branch deltas offer more faithful relevance signals than block-level output changes. Across all major tasks, halting based on attention-branch $\Delta_{\text{attn}}$ notably outperforms block-wise variants.
The directionality of the criterion is fundamental: pruning tokens with minimal $\Delta_{\text{attn}}$ preserves performance; removing those with high deltas (or removing tokens at random) catastrophically degrades accuracy.

Figure 8: Comparison of delta-signal variants for halting. Attention-branch delta (orange) consistently yields the best accuracy across tasks.

Together with correlation analysis, these results attribute DASH’s efficacy to the heavy-tailed distribution of delta values, enabling the method to capture a persistent core of influential tokens while safely eliminating redundant computation. Modality-specific dynamics, evidenced by earlier saturation of visual tokens, motivate adaptive compression regimes for multimodal inputs.
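The role of directionality under a heavy-tailed delta distribution can be illustrated with a toy comparison of three selection policies. The "retained update mass" metric below is a stand-in chosen for this example, not a measure used in the paper, and it is no substitute for task accuracy; the policy names and delta values are hypothetical.

```python
import random

def select(deltas, k, policy, seed=0):
    """Pick k token indices under one of three illustrative policies."""
    idx = list(range(len(deltas)))
    if policy == "keep_high":
        ranked = sorted(idx, key=lambda i: deltas[i], reverse=True)
    elif policy == "keep_low":
        ranked = sorted(idx, key=lambda i: deltas[i])
    elif policy == "random":
        ranked = idx[:]
        random.Random(seed).shuffle(ranked)
    else:
        raise ValueError(f"unknown policy: {policy}")
    return sorted(ranked[:k])

def retained_mass(deltas, kept):
    """Share of the total delta magnitude carried by the kept tokens."""
    return sum(deltas[i] for i in kept) / sum(deltas)

# Toy heavy-tailed deltas: two tokens carry almost all of the update magnitude.
deltas = [5.0, 0.01, 0.02, 3.0, 0.01, 0.03, 0.02, 0.01]

high = select(deltas, 2, "keep_high")
low = select(deltas, 2, "keep_low")
rand = select(deltas, 2, "random")

mass_high = retained_mass(deltas, high)
mass_low = retained_mass(deltas, low)
```

Keeping the two highest-delta tokens retains nearly all of the update mass in this toy case, while keeping the lowest retains almost none, which is the intuition behind the sharp accuracy gap between the two directions.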

Implications and Future Directions

From a practical standpoint, DASH offers a plug-and-play mechanism for reducing inference latency and cost in long-context settings, compatible with current and future efficient attention kernels. It also opens avenues for a unified, training-free approach to conditional computation in both language and multimodal transformers. Theoretically, the findings support a view of deep LLMs as rapidly converging dynamical systems, with substantial redundancy beyond a mid-network stabilization point.

Extensions might consider dynamic or input-adaptive policies for start layer and keep ratio, finer-grained selection mechanisms (head-wise or modality-aware), and formal integration into model training for improved compressibility and robustness.

Conclusion

DASH is an empirically validated approach that exploits dynamical stabilization in transformers to accelerate inference. By using layer-local delta-attention as a redundancy detector, it selectively halts tokens, markedly reducing computational cost with little impact on task performance and without modifying or retraining the backbone. This work offers a new perspective on model redundancy, information propagation, and efficient inference, and points toward scalable deployment of long-context generative models.

Reference: "Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling" (2604.18103)
