When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

Published 29 Apr 2026 in cs.CL | (2604.26412v1)

Abstract: Speculative decoding accelerates LLM inference, but SOTA hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the speculative step increases. Existing work attributes this decay to train-inference mismatch and proposes test-time training (TTT) as a remedy, yet we observe that long-range decay persists even in TTT-trained drafters. We revisit long-range decay from the perspective of context information preservation. In hidden-state reuse, we argue the target hidden state acts as a biased context compression: it aggregates historical token information according to the attention query at the current position, yielding a compact representation optimized for immediate next-token prediction. This compression can suppress information less relevant to the current query but important for later speculative steps. In contrast, the target model's KV cache serves as an explicit context, retaining the complete set of token-wise KV representations. We therefore posit the KV-Reuse Hypothesis: allowing the draft model to reuse the target KV cache can provide richer signals for long-horizon drafting. To test this hypothesis, we introduce KVShot, a diagnostic framework that compares three reuse paradigms: hidden-only, KV-only, and hybrid. Extensive evaluations on Qwen3-8B show that KV-Reuse improves long-range acceptance, although end-to-end speedups remain marginal under current training pipelines. Our analysis identifies two key structural bottlenecks: shallow drafters struggle to estimate target queries accurately, and draft-side KV projections receive sparse gradient signals. These findings suggest that realizing the full potential of KV-aware decoding requires moving beyond TTT toward block-wise training paradigms. By exposing these bottlenecks, KVShot provides a foundational diagnostic testbed and a clear roadmap for designing next-generation inference architectures.

Summary

  • The paper demonstrates that KV cache reuse preserves complete token-wise information to mitigate long-range decay in speculative decoding.
  • It introduces the KVShot diagnostic framework to compare hidden-only, KV-only, and hybrid reuse paradigms through empirical evaluation on Qwen3-8B.
  • The study identifies structural bottlenecks like query estimation challenges and sparse gradient flows that limit overall inference speedup.

KV Cache Reuse in Speculative Decoding: An Information-Preservation Perspective

Motivation and Context

Speculative decoding is a key acceleration technique for LLM inference: a drafter proposes candidate tokens that the target model verifies in parallel, amortizing the otherwise sequential cost of autoregressive generation. Hidden-state reuse, exemplified by EAGLE and MTP architectures, has become the dominant drafter paradigm. However, these drafters suffer from long-range decay: acceptance rates for drafted tokens decline steeply as the speculative step increases, which limits attainable speedups. Prior work attributes this decay primarily to train-inference mismatch, in which a drafter trained on target hidden states must rely on its own recursively generated states at inference. Yet the decay persists in drafters aligned with autoregressive test-time training (TTT), which points to deeper structural constraints.

Information Preservation Analysis: Hidden States vs. KV Cache

The paper advances an information-theoretic view of long-range decay, formalizing the distinction between hidden-state reuse and KV cache reuse. A target hidden state is a query-dependent compression: attention aggregates value vectors using weights determined by the query at the current position, which optimizes the representation for immediate next-token prediction. Tokens that the current query weights lightly are suppressed, even if they become crucial for later speculative steps. Hidden-state-based drafting therefore becomes an ill-conditioned information recovery problem, and reconstructing the suppressed signals grows progressively harder as draft depth increases.

Figure 1: Hidden-state recursion highlights how recursive conditioning on self-generated hidden states leads to compounding distributional drift and information loss.

In contrast, the target model's KV cache retains all token-wise key-value pairs before aggregation, so a draft model can re-attend to any prefix position using estimated queries. Drafting is thereby recast as function approximation: the bottleneck becomes accurately estimating future queries rather than recovering lost information, and every historical token remains accessible regardless of how the target's attention initially weighted it.

Figure 2: Comparison between hidden-state reuse (biased compression) and KV reuse (re-attention) illustrates superior prefix accessibility and potential information preservation with KV caches.
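
The contrast between the two reuse modes can be illustrated with a toy single-head attention example. This is a minimal sketch, not the paper's code: the 4-token prefix, the orthogonal keys, the one-hot values, and the names `attend`, `q_now`, and `q_later` are all assumptions chosen so the effect is easy to read off.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)
    return weights @ values, weights

# Toy prefix of 4 tokens (assumed setup): orthogonal keys, one-hot values,
# so coordinate i of an attention output shows how much of token i survived.
d = 4
keys = 8.0 * np.eye(d)
values = np.eye(d)

q_now = 8.0 * np.eye(d)[0]    # query at the current position: focuses on token 0
q_later = 8.0 * np.eye(d)[3]  # a later speculative step that needs token 3

# Hidden-state reuse: the drafter only receives the output aggregated under q_now.
hidden, w_now = attend(q_now, keys, values)

# KV reuse: the drafter keeps (keys, values) and re-attends with its own query.
reattended, w_later = attend(q_later, keys, values)

print("token-3 content in hidden state:", hidden[3])
print("token-3 content after re-attention:", reattended[3])
```

In this toy setting the hidden state carries almost nothing of token 3, because the current query gave it negligible weight, whereas re-attending over the retained keys and values recovers it. The sketch assumes the later query is known exactly; in the drafting setting it must be estimated, which is the bottleneck discussed in the results.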

Three practical distinctions further separate hidden-state reuse from KV cache reuse: standard hidden-state reuse loses top-layer KV information; a non-trivial projection gap forces shallow drafters to implicitly learn expensive transformations; and hidden-state-based drafters face capacity competition because they must reconstruct projections and queries simultaneously.

KVShot Diagnostic Framework and Reuse Paradigms

To isolate these effects, the paper introduces KVShot, a unified diagnostic framework for controlled evaluation of three paradigms: hidden-only reuse (EAGLE/MTP), KV-only reuse (direct target KV cache injection via cross-attention), and hybrid reuse (gated fusion of hidden and KV signals).

Figure 3: KVShot architecture variants: KV-only reuse enables direct cross-attention to target cache; hybrid reuse applies a gated delta correction to hidden-state anchor.
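
One way to read the hybrid variant is as a hidden-state anchor plus a gated correction computed by cross-attending to the target KV cache. The sketch below follows that reading only; it is not the KVShot implementation. The scalar sigmoid gate, the single attention head, and the parameter names `w_query`, `w_gate`, and `b_gate` are assumptions introduced for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_attend(query, target_keys, target_values):
    """KV-only path: attend over the target model's cached keys and values."""
    scores = target_keys @ query / np.sqrt(query.shape[0])
    return softmax(scores) @ target_values

def hybrid_step(hidden_anchor, target_keys, target_values, w_query, w_gate, b_gate):
    """Hybrid path (assumed form): anchor + gate * cross-attention delta."""
    query = w_query @ hidden_anchor          # drafter's estimate of the target query
    delta = cross_attend(query, target_keys, target_values)
    gate = sigmoid(w_gate @ hidden_anchor + b_gate)  # scalar gate in (0, 1)
    return hidden_anchor + gate * delta, gate

# Hypothetical shapes: 5 cached prefix tokens, model width 8.
rng = np.random.default_rng(0)
d, n = 8, 5
h = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
Wq = rng.normal(size=(d, d))

closed, g_closed = hybrid_step(h, K, V, Wq, np.zeros(d), -50.0)  # gate ~ 0
opened, g_open = hybrid_step(h, K, V, Wq, np.zeros(d), 50.0)     # gate ~ 1
```

With the gate closed the output reduces to the hidden-only anchor, and with the gate open it is the anchor plus the full KV-derived correction, so the three paradigms sit on one continuum in this simplified form.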

Empirical investigation is performed on Qwen3-8B using autoregressive TTT, with acceptance rates and mean accepted tokens (MAT) as primary metrics. The controlled setup is designed to separate the effect of representation choice from confounding factors.
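
The two metrics can be computed from a log of how many draft tokens were accepted in each verification round. The sketch below uses one common convention, which is an assumption here: the step-wise rate is conditional on all earlier steps having been accepted, and MAT counts accepted draft tokens only (some papers also count the bonus token emitted by the target). The `rounds` data is made up for illustration.

```python
from typing import List, Sequence

def stepwise_acceptance(accepted_lengths: Sequence[int], max_steps: int) -> List[float]:
    """Conditional acceptance rate per draft step.

    rates[k-1] = P(step k accepted | steps 1..k-1 accepted), estimated from
    the number of draft tokens accepted in each verification round.
    """
    rates = []
    for k in range(1, max_steps + 1):
        reached = sum(1 for n in accepted_lengths if n >= k - 1)
        accepted = sum(1 for n in accepted_lengths if n >= k)
        rates.append(accepted / reached if reached else float("nan"))
    return rates

def mean_accepted_tokens(accepted_lengths: Sequence[int]) -> float:
    """MAT: average number of accepted draft tokens per verification round."""
    return sum(accepted_lengths) / len(accepted_lengths)

def expected_accepted_from_rates(rates: Sequence[float]) -> float:
    """Chain-draft identity: MAT = sum over k of prod_{i<=k} rates[i]."""
    total, survive = 0.0, 1.0
    for r in rates:
        survive *= r
        total += survive
    return total

# Hypothetical log: accepted draft tokens in 8 rounds, each drafting 4 tokens.
rounds = [4, 2, 0, 3, 1, 4, 2, 1]
rates = stepwise_acceptance(rounds, max_steps=4)
mat = mean_accepted_tokens(rounds)
print("step-wise acceptance:", [round(r, 3) for r in rates])
print("MAT:", mat)
```

Because MAT is the sum of cumulative products of the step-wise rates, a drop in acceptance at an early step lowers every later term, which is why long-range decay caps the attainable MAT.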

Numerical Results and Structural Bottlenecks

KV-only reuse is empirically shown to degrade more gracefully than hidden-state reuse as draft steps increase. However, gains are insufficient to yield significant end-to-end speedup due to structural bottlenecks:

  • Query Estimation Difficulty: Shallow drafters (1-layer) struggle to accurately approximate target queries, constraining step-wise acceptance rates. Increasing drafter depth from 1 to 4 layers yields diminishing returns, indicating query estimation is the primary capacity bottleneck.
  • Sparse Gradients in Draft KV Projection: Under autoregressive TTT, gradient flow through draft-side KV projections is sparse and dominated by a small number of positions, undermining optimization and limiting signal diversity.
  • Gate-Induced Gradient Starvation in Hybrid Models: In hybrid fusion architectures, the mean gate value collapses rapidly early in training and reopens only slowly. While the gate is nearly closed, the cross-attention branch is starved of gradients, which limits learning in the KV pathway.

    Figure 4: Gate-induced gradient starvation: Mean gate value rapidly collapses early in training, causing weak signal for the cross-attention (KV) pathway.
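
The starvation effect in the last bullet follows directly from the chain rule when the fused output has the form anchor + gate × delta: the gradient reaching the KV branch is multiplied by the gate. The sketch below assumes that fusion form and a squared-error loss, neither of which is confirmed as the paper's exact setup; the vectors are arbitrary illustrative values.

```python
import numpy as np

def fused_loss(hidden, delta, target, gate):
    """Squared error of the fused output hidden + gate * delta (assumed form)."""
    out = hidden + gate * delta
    return 0.5 * float(np.sum((out - target) ** 2))

def grad_wrt_delta(hidden, delta, target, gate):
    """Analytic gradient of fused_loss with respect to the KV-branch output."""
    return gate * (hidden + gate * delta - target)

# Arbitrary illustrative vectors.
hidden = np.zeros(4)
delta = np.array([1.0, 0.0, 0.0, 0.0])
target = np.ones(4)

# Gradient norm reaching the KV branch for progressively smaller gate values.
grad_norms = {
    g: float(np.linalg.norm(grad_wrt_delta(hidden, delta, target, g)))
    for g in (0.5, 0.05, 0.001)
}
for g, norm in grad_norms.items():
    print(f"gate={g}: gradient norm into KV branch = {norm:.4f}")
```

As the gate shrinks, the gradient into the cross-attention branch shrinks with it, so a gate that collapses early leaves the KV pathway with little signal to improve and therefore little reason for the gate to reopen.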

Although hybrid drafters exceed the EAGLE-3 baseline in step-wise MAT (2.54 vs. 2.37), end-to-end improvements are marginal: on large-scale data MAT rises by only 0.03, while drafting latency increases by 5–10%. This is attributed mainly to baselines that become stronger with larger datasets and to tree verification compressing the differences between drafters.
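
A back-of-the-envelope cost model shows why such a small MAT gain can be cancelled by extra drafting latency. Everything in this sketch is an assumption except the +0.03 MAT and the 5–10% latency figures: the model treats verification as costing one target forward pass, expresses drafting cost as a fraction of that pass, reads the 5–10% as a relative increase in drafting cost, and uses a made-up baseline (`BASE_MAT`, `BASE_DRAFT_COST`) that does not come from the paper.

```python
def speedup_over_autoregressive(tokens_per_cycle: float, draft_cost: float) -> float:
    """Simplified speedup model.

    One draft-and-verify cycle costs (1 + draft_cost) target forward passes
    and emits tokens_per_cycle tokens; plain decoding emits 1 token per pass.
    """
    if tokens_per_cycle <= 0 or draft_cost < 0:
        raise ValueError("tokens_per_cycle must be > 0 and draft_cost >= 0")
    return tokens_per_cycle / (1.0 + draft_cost)

# Hypothetical baseline, NOT taken from the paper.
BASE_MAT = 3.00          # tokens emitted per cycle
BASE_DRAFT_COST = 0.20   # drafting cost as a fraction of one target pass

base = speedup_over_autoregressive(BASE_MAT, BASE_DRAFT_COST)
# Reported deltas: +0.03 MAT, with 5% or 10% more drafting cost (assumed relative).
hybrid_5 = speedup_over_autoregressive(BASE_MAT + 0.03, BASE_DRAFT_COST * 1.05)
hybrid_10 = speedup_over_autoregressive(BASE_MAT + 0.03, BASE_DRAFT_COST * 1.10)

print(f"baseline speedup:       {base:.3f}")
print(f"hybrid, +5% draft cost:  {hybrid_5:.3f}")
print(f"hybrid, +10% draft cost: {hybrid_10:.3f}")
```

Under these assumed numbers the hybrid drafter lands within about one percent of the baseline in either direction, which is consistent with the qualitative conclusion that the end-to-end gain is marginal. Different baseline values would shift the break-even point.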

Practical Implications and Future Directions

The theoretical advantage of KV cache reuse is validated at the step-wise level: it provides more robust long-range context preservation and enables hybrid designs that outperform hidden-only drafters. However, the sequential constraints and sparse optimization dynamics in autoregressive TTT pipelines currently prevent full exploitation of these benefits in wall-clock inference.

This motivates the exploration of block-wise and non-autoregressive training paradigms, such as DFlash, that allow parallel prediction of token blocks, denser gradients for KV projection, and deeper drafters. These settings are theoretically better aligned with the structural needs of KV-aware drafting, offering pathways for unlocking practical acceleration.

Theoretical and Practical Implications

From a theoretical perspective, this work reframes speculative decoding bottlenecks as problems of information preservation and capacity allocation: the information retained in the KV cache is only as useful as the drafter's query estimation power and the diversity of its training signal allow. Practically, the findings suggest that hybrid drafters, though promising, require pipeline-level innovation beyond autoregressive TTT, particularly block-wise training, to realize speedups commensurate with their step-wise improvements.

Conclusion

This work delivers a rigorous analysis of long-range decay in speculative decoding, arguing for the KV-reuse hypothesis and substantiating it via the KVShot diagnostic framework. While KV reuse demonstrably alleviates step-wise decay, structural bottlenecks impede end-to-end gains under current pipelines. The central implication is clear: KV caches carry valuable long-range signals, but future drafter architectures and training strategies must explicitly address query estimation, gradient density, and fusion dynamics to translate theoretical advantages into practical inference speedups.
