- The paper introduces DASH, a training-free token halting mechanism that uses delta-attention dynamics to identify and prune redundant tokens.
- It reports strong results on long-context and vision-language benchmarks, substantially improving efficiency with minimal loss in accuracy.
- DASH balances computation and effectiveness by retaining tokens with high semantic evolution, validated through extensive empirical experiments.
DASH: Delta Attention Selective Halting for Efficient Long-Context Prefilling
Introduction
The computational cost imposed by long context processing remains a limiting factor in deploying LLMs and LMMs for tasks requiring extended contexts, such as document-level summarization and vision-language understanding. Existing methods for computational reduction, especially token pruning, often depend on importance heuristics that are fundamentally incompatible with hardware-optimized attention kernels like FlashAttention. This work introduces Delta Attention Selective Halting (DASH), a novel, training-free inference-time policy leveraging the dynamic evolution of self-attention updates for efficient conditional computation during prefill. The method exploits token-level stabilization dynamics, identifying tokens for which further processing is redundant and halting their forward propagation, thus yielding substantial reductions in computation without retraining or loss of kernel compatibility.
Delta Attention as a Robust Proxy for Token Redundancy
The central premise in DASH is that, in deep transformers, most tokens quickly converge to semantic fixed points across layers, where they are characterized by diminishing residual updates in the attention sub-branch. This is evidenced by the highly skewed distribution of per-token $\Delta_{\text{attn}}$ (layer-wise attention updates): the majority of tokens accumulate near-zero update magnitudes, while a minority continue to evolve.
Figure 1: Layer-wise distributions of token-wise relative $\Delta_{\text{attn}}$, highlighting that most tokens rapidly reach negligible updates, indicating redundancy.
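To make the skew concrete, the sketch below computes a per-token update magnitude as the L2 norm of the attention-branch update and measures what fraction of tokens fall below a small threshold. The function names, the toy update vectors, and the threshold `eps` are illustrative assumptions, not values from the paper.

```python
import math

def delta_attn(updates):
    """L2 norm of each token's attention-branch update vector at one layer."""
    return [math.sqrt(sum(x * x for x in u)) for u in updates]

def stabilized_fraction(deltas, eps):
    """Share of tokens whose update magnitude is below eps (hypothetical threshold)."""
    return sum(d < eps for d in deltas) / len(deltas)

# Toy updates for four tokens: one still evolving, three nearly stationary.
updates = [[3.0, 4.0], [0.0, 0.01], [0.02, 0.0], [0.0, 0.0]]
deltas = delta_attn(updates)
frac = stabilized_fraction(deltas, eps=0.1)
print(frac)  # 0.75
```

In a real model the update vectors would be the attention sub-layer outputs before they are added to the residual stream; how those are extracted depends on the framework and is not specified here.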
Analysis demonstrates a positive correlation between the average delta attention magnitude and token importance, as measured by their cumulative downstream attention weights. Tokens with minimal $\Delta_{\text{attn}}$ become information sinks, ceasing to influence subsequent computation and thus admitting safe halting.
Figure 2: Correlation between normalized final-layer attention scores and mean $\Delta_{\text{attn}}$. Stabilized (low-delta) tokens are rarely attended to afterward, validating their redundancy.
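A correlation check of this kind can be sketched as follows: sum the attention weight each token receives as a key, then compute the Pearson coefficient against its mean delta. The data is a toy example invented for illustration, the helper names are hypothetical, and the toy attention matrix ignores causal masking for brevity.

```python
import math

def attention_received(attn):
    """Column sums of an attention matrix: total weight each key token receives."""
    n_keys = len(attn[0])
    return [sum(row[j] for row in attn) for j in range(n_keys)]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy values: tokens 0 and 2 keep evolving, tokens 1 and 3 have stabilized.
mean_delta = [0.9, 0.05, 0.6, 0.02]
attn = [
    [0.60, 0.05, 0.30, 0.05],
    [0.50, 0.10, 0.35, 0.05],
    [0.55, 0.05, 0.35, 0.05],
    [0.50, 0.05, 0.40, 0.05],
]
received = attention_received(attn)
r = pearson(mean_delta, received)
print(round(r, 3))
```

On this toy input the coefficient is strongly positive by construction; it says nothing about the strength of the correlation reported in the paper.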
Further, modality-specific analysis reveals that visual tokens, owing to the spatial redundancy in their feature representation, stabilize significantly earlier than textual ones, necessitating distinct strategies for language and vision inputs.
Figure 3: Layer-wise sparsity for visual (left) and textual (right) tokens, demonstrating earlier saturation of vision tokens.
Method: DASH Token Halting Mechanism
DASH operates via a single-shot, layer-local delta-attention signal $\Delta_t^{(l)} = \lVert U_t^{(l)} \rVert_2$, where $U_t^{(l)}$ is the attention-branch update of token $t$ at layer $l$, computed at a preselected start layer $l_s$. Upon reaching $l_s$, tokens are ranked by $\Delta_{\text{attn}}$, and only the top $K$ tokens (with $K$ set by the pruning ratio) are propagated beyond $l_s$; all others are halted, skipping subsequent attention and FFN computation. The halting decision is applied uniformly across modalities, and the resulting active set is held fixed for all deeper layers, ensuring compatibility with efficient attention implementations.
Figure 4: Overview of DASH. At layer $l_s$, only tokens with the highest $\Delta_{\text{attn}}$ are retained for further computation; others are halted, producing computation savings while preserving capacity for critical context aggregation.
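The selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `select_active` and `active_tokens_per_layer`, the rule used to turn the pruning ratio into K, and the convention that the start layer itself still processes every token are all assumptions.

```python
def select_active(deltas, prune_ratio):
    """Indices of the top-K tokens by delta, returned in original sequence order.

    K is derived as n - floor(prune_ratio * n), with at least one token kept
    (an assumed convention; the source only says K is set by the pruning ratio).
    """
    n = len(deltas)
    k = max(1, n - int(prune_ratio * n))
    ranked = sorted(range(n), key=lambda t: deltas[t], reverse=True)
    return sorted(ranked[:k])

def active_tokens_per_layer(num_tokens, num_layers, start_layer, active):
    """Tokens processed at each layer: all of them up to and including
    start_layer, then the fixed active set for every deeper layer."""
    return [num_tokens if l <= start_layer else len(active)
            for l in range(num_layers)]

# Toy deltas measured once, at the start layer.
deltas = [0.8, 0.01, 0.3, 0.02, 0.5, 0.0]
active = select_active(deltas, prune_ratio=0.5)
print(active)  # [0, 2, 4]

schedule = active_tokens_per_layer(len(deltas), num_layers=4,
                                   start_layer=1, active=active)
print(schedule)  # [6, 6, 3, 3]
```

Because the active set is chosen once and then held fixed, the deeper layers see a plain, shorter sequence, which is what keeps a policy of this shape compatible with fused attention kernels.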
This simple, training-free, and hardware-efficient policy can be tuned via the start layer $l_s$ and the pruning ratio to trade off accuracy and efficiency.
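A rough way to reason about this trade-off is to count token-layer work: layers up to the start layer process every token, and deeper layers process only the kept fraction. The estimate below covers only components whose cost is linear in the number of tokens (such as the FFN and projections) and ignores the attention score computation itself; the function name and the layer counts are hypothetical.

```python
def relative_linear_cost(num_layers, start_layer, prune_ratio):
    """Fraction of token-layer work remaining, for cost linear in token count.

    Layers 0..start_layer run on all tokens; the remaining layers run on a
    (1 - prune_ratio) fraction of them.
    """
    full = start_layer + 1
    pruned = num_layers - full
    return (full + pruned * (1.0 - prune_ratio)) / num_layers

# Hypothetical 28-layer model, halting after layer index 13, half the tokens pruned.
cost = relative_linear_cost(num_layers=28, start_layer=13, prune_ratio=0.5)
print(cost)  # 0.75

# An earlier start layer or a larger pruning ratio both lower the estimate.
for start in (6, 13, 20):
    print(start, relative_linear_cost(28, start, 0.75))
```

An earlier start layer saves more compute but ranks tokens before they have had as many layers to stabilize, which is the accuracy side of the trade-off.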
Experimental Validation
Long-Context Language Results
On Qwen2.5-7B-Instruct-1M, evaluation over comprehensive benchmarks (LongBench-E, LooGLE) establishes that DASH consistently outperforms state-of-the-art prefill compression baselines, including SnapKV, Dฮattnโ5, LLMLingua2, and FastV. For LongBench-E, DASH obtains an average aggregate score of 46.76 under strong compression, approaching the uncompressed backbone (48.87), while exceeding SnapKV (46.15) and FastV (43.99).
DASH's halting strategy is robust across diverse tasks (QA, summarization, code completion), indicating that attention-branch deltas reliably select information-bearing tokens even under aggressive sequence length reduction.
Vision-Language Compression
In the context of vision-LLMs (Qwen2-VL-7B), DASH achieves the highest average decline ratio (ADR) across a range of VL benchmarks and under stringent token reduction regimes, significantly mitigating performance degradation compared to FastV, VisionZip, and DART.
Figure 5: Performance of DASH and baselines on VL token compression under varying reduction ratios. DASH preserves accuracy especially well under strong compression.
End-to-End Efficiency
Crucially, prefill FLOP reductions translate into improved end-to-end inference times. DASH delivers up to 1.74× latency improvement at matched accuracy compared to uncompressed inference, and an 8.5% higher score than the closest competitor (FastV) at equivalent runtime.
Figure 6: Trade-off between LongBench-E performance and E2E time. DASH achieves superior accuracy-latency operating points.
Token-Level Qualitative Analysis
Token-level visualization demonstrates that $\Delta_{\text{attn}}$ at the halting layer assigns higher scores to semantically informative tokens, consistent with the selection of globally relevant context elements, not merely local statistics.

Figure 7: Example showing that DASH assigns high $\Delta_{\text{attn}}$ to tokens with substantive semantic content, preserving critical information in the active set.
Ablations and Mechanistic Insights
Ablation experiments ascertain that:
Together with correlation analysis, these results attribute DASH's efficacy to the heavy-tailed distribution of delta values, enabling the method to capture a persistent core of influential tokens while safely eliminating redundant computation. Modality-specific dynamics, evidenced by earlier saturation of visual tokens, motivate adaptive compression regimes for multimodal inputs.
Implications and Future Directions
From a practical standpoint, DASH offers a plug-and-play mechanism for reducing inference latency and cost in long-context settings, compatible with current and future efficient attention kernels. The contribution opens avenues for unified, training-free conditional compute in both language and multimodal transformers. Theoretically, the findings endorse a view of deep LLMs as rapidly converging dynamical systems, with substantial redundancy beyond a mid-network stabilization point.
Extensions might consider dynamic or input-adaptive policies for start layer and keep ratio, finer-grained selection mechanisms (head-wise or modality-aware), and formal integration into model training for improved compressibility and robustness.
Conclusion
DASH constitutes a rigorous, empirically validated, and theoretically motivated approach for harnessing dynamical stabilization in transformers for inference acceleration. By utilizing layer-local delta-attention as a redundancy detector, it enables selective halting of tokens, markedly reducing computational cost with negligible impact on task performance and without modification or retraining of the backbone. This work provides a new perspective on model redundancy, information propagation, and efficient inference, charting a pathway for scalable deployment of long-context generative models.
Reference: "Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling" (2604.18103)