Speculative Decoding (SD)
- Speculative Decoding decouples token proposal from verification: a fast draft model proposes several tokens, and a larger, precise target model checks them in a single parallel pass, accelerating inference.
- It achieves significant speedups by raising draft acceptance rates and verifying proposed tokens in parallel, with reported improvements of up to 4.5× in some scenarios.
- Advancements such as tailored knowledge distillation (e.g., DistillSpec, AdaSPEC) and dynamic draft-length control enable scalable deployment across edge and distributed systems.
Speculative decoding (SD) is a decoding paradigm for LLMs designed to mitigate the high cost and latency of standard auto-regressive generation. The approach introduces a “draft-then-verify” workflow, decoupling token proposal from verification: a computationally efficient draft model speculates several output tokens in advance, then a larger target model checks these tokens in parallel, ensuring the output sequence preserves the statistical fidelity of full auto-regressive decoding. Recent research has substantially advanced the methodology, scalability, and deployment of SD, enabling applications in both dense and sparse model architectures, heterogeneous vocabularies, resource-constrained devices, and multi-model or edge-networked environments.
1. Principles and Fundamentals of Speculative Decoding
The core SD scheme consists of two stages:
- Drafting: A small, fast draft model (denoted $M_q$, with output distribution $q$) generates a candidate block of tokens (of length $\gamma$).
- Verification: The large, accurate target model ($M_p$, with output distribution $p$) verifies the block in a single forward pass, accepting or rejecting each token based on the two models' predictive distributions. The accepted prefix is committed; the first rejected token triggers a corrective resample from the residual distribution $\mathrm{norm}(\max(0,\, p - q))$, so the output falls back to the target model's distribution.
A canonical lossless SD step (Algorithm 1 in (Zhou et al., 2023)) validates draft token $x_t$ by accepting it with probability
$$\min\left(1,\; \frac{p(x_t \mid x_{<t})}{f\big(q(x_t \mid x_{<t}),\, \epsilon\big)}\right),$$
where $f(q, \epsilon) = q$ in standard SD. The procedure is iterated until the end of the sequence.
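The acceptance rule can be made concrete with a short sketch. The following is a minimal, illustrative implementation of one lossless draft-then-verify step over explicit categorical distributions (function and variable names are placeholders, not any library's API); it assumes the target and draft distributions at each drafted position are already available.

```python
import numpy as np

def speculative_step(p_dists, q_dists, draft_tokens, rng):
    """One lossless verification pass over a drafted block (sketch).

    p_dists: (gamma + 1, V) target-model distributions, one per drafted
             position plus one extra position for the bonus token.
    q_dists: (gamma, V) draft-model distributions that produced draft_tokens.
    Returns the committed tokens: the accepted prefix of the draft plus either
    a corrective resample (on rejection) or a bonus token (if all accepted).
    """
    committed = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            committed.append(int(tok))
            continue
        # Rejection: resample from the residual distribution norm(max(0, p - q)),
        # which restores the exact target distribution at this position.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        committed.append(int(rng.choice(len(residual), p=residual)))
        return committed
    # Every draft accepted: emit one bonus token from the target's next-position
    # distribution, so each target pass yields at least one committed token.
    committed.append(int(rng.choice(p_dists.shape[1], p=p_dists[-1])))
    return committed

# Tiny synthetic example (vocabulary of 8 tokens, block length 4).
rng = np.random.default_rng(0)
V, gamma = 8, 4
q_dists = rng.dirichlet(np.ones(V), size=gamma)
p_dists = rng.dirichlet(np.ones(V), size=gamma + 1)
draft_tokens = [rng.choice(V, p=q_dists[i]) for i in range(gamma)]
print(speculative_step(p_dists, q_dists, draft_tokens, rng))
```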
Acceptance Rate and Efficiency: The performance of SD is fundamentally linked to the acceptance rate (per token and per block), and thus to the statistical alignment between draft and target models. Formally, the token-level acceptance rate at time step $t$ is given by
$$\alpha_t = 1 - \mathrm{TVD}\big(p(\cdot \mid x_{<t}),\; q(\cdot \mid x_{<t})\big),$$
where TVD denotes total variation distance. Higher alignment (lower TVD) implies fewer rejections, thus greater parallelism and speedup. The expected overall speedup is a function of the block efficiency $\tau$ (expected tokens committed per target forward pass) and the cost ratio $c$ between draft and target models:
$$\mathbb{E}[\text{speedup}] = \frac{\tau}{c\gamma + 1}, \qquad \tau = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha},$$
under the standard approximation of an i.i.d. per-token acceptance rate $\alpha$.
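As a worked illustration of these quantities, the helper below evaluates block efficiency and expected speedup for a given per-token acceptance rate $\alpha$, block length $\gamma$, and draft-to-target cost ratio $c$ (the numeric values in the example are illustrative only).

```python
def block_efficiency(alpha: float, gamma: int) -> float:
    """Expected tokens committed per target pass: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Speedup over plain auto-regressive decoding: tau / (c * gamma + 1)."""
    return block_efficiency(alpha, gamma) / (cost_ratio * gamma + 1.0)

# E.g. 80% per-token acceptance, 4 drafts per step, draft costing 5% of a target pass:
print(round(expected_speedup(0.8, 4, 0.05), 2))  # -> 2.8 (illustrative numbers)
```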
2. Model Alignment and Knowledge Distillation Strategies
The success of SD depends critically on how well the draft model’s predictive distribution approximates the target’s, especially under the actual on-policy generation. Initial works used standard knowledge distillation (minimizing forward KL divergence everywhere), but recent studies argue this is suboptimal (Zhou et al., 2023, Hu et al., 22 Oct 2025).
DistillSpec (Zhou et al., 2023):
- On-policy data generation: Instead of training on fixed datasets, drafting is guided using examples sampled from the draft model’s own outputs. Theoretical analysis shows that aligning the model on on-policy data directly improves sequence-level acceptance rates.
- Tailored divergence objective: DistillSpec systematically explores divergence functions (forward KL, reverse KL, Jensen–Shannon, TVD), revealing that the “best” divergence function for KD depends on the decoding strategy and downstream task. For instance, reverse KL is effective in temperature sampling while forward KL is preferable in greedy decoding.
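The divergences explored by DistillSpec can be written down directly; the snippet below computes them between a target distribution p and a draft distribution q over the same vocabulary (plain NumPy, strictly positive distributions assumed; in training these are applied per token position and averaged over on-policy samples).

```python
import numpy as np

def forward_kl(p, q):
    """KL(p || q): mass-covering; penalizes draft mass missing where the target has mass."""
    return float(np.sum(p * np.log(p / q)))

def reverse_kl(p, q):
    """KL(q || p): mode-seeking; penalizes draft mass placed where the target has little."""
    return float(np.sum(q * np.log(q / p)))

def jsd(p, q):
    """Jensen-Shannon divergence (symmetric, bounded)."""
    m = 0.5 * (p + q)
    return 0.5 * forward_kl(p, m) + 0.5 * forward_kl(q, m)

def tvd(p, q):
    """Total variation distance; 1 - TVD is the token acceptance rate from Section 1."""
    return 0.5 * float(np.abs(p - q).sum())
```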
AdaSPEC (Hu et al., 22 Oct 2025):
- Selective knowledge distillation: Instead of minimizing divergence uniformly, AdaSPEC uses a reference model to identify “hard-to-fit” tokens and filters them out during training, enabling the draft model to focus its limited capacity on tokens for which alignment is most feasible. The result is consistently higher acceptance rates (up to 15% absolute improvement over DistillSpec) and 10–20% walltime decoding speedup, empirically validated across multiple tasks.
Both methods depart from conventional KD by explicitly optimizing for acceptance and alignment, rather than global divergence, directly translating into higher block efficiency and practical generation speed.
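A minimal sketch of the selective-distillation idea behind AdaSPEC, assuming per-token losses from the draft model and from a reference model are available as tensors; the gap-based filter and the `keep_ratio` knob are illustrative stand-ins rather than the paper's exact criterion.

```python
import torch

def selective_kd_mask(draft_loss: torch.Tensor,
                      ref_loss: torch.Tensor,
                      keep_ratio: float = 0.7) -> torch.Tensor:
    """Mask out 'hard-to-fit' tokens before distillation (illustrative sketch).

    draft_loss, ref_loss: per-token losses of shape (batch, seq_len).
    Tokens where the draft lags the reference the most are dropped, so the
    draft's limited capacity is spent on tokens it can realistically align on.
    """
    gap = draft_loss - ref_loss                      # how much worse the draft is
    k = max(1, int(keep_ratio * gap.numel()))
    threshold = gap.flatten().kthvalue(k).values     # keep the k smallest gaps
    return (gap <= threshold).float()                # 1 = distill on this token

# The mask then weights a standard token-level KD objective, e.g.
#   loss = (mask * per_token_kl).sum() / mask.clamp_min(1e-8).sum()
```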
3. Architecture and System Design Variants
Classic SD: Employs a small, specialist drafter and a large frozen target model (Zhou et al., 2023, Byun et al., 14 Oct 2025).
Self-speculative and multi-target models:
- Early-exit/self-speculative approaches utilize shallow (i.e., truncated) versions of the target model as drafters, exploiting cached intermediate representations (Zarch et al., 8 Apr 2025, Zhong et al., 30 May 2024).
- S2D (Kavehzadeh et al., 2 Jul 2024): Introduces “sorted fine-tuning” to create sub-models within the same network, organized by depth, serving multiple target models adaptively. Draft selection is context- and confidence-driven, optimizing for cross-model efficiency in multi-target scenarios.
Hierarchical/multi-level SD (PyramidSD) (Byun et al., 14 Oct 2025):
- Introduces an intermediate “qualifier” model between the drafter and target, enabling the use of extremely small drafters by bridging the distributional gap across sizes.
- Employs stagewise, fuzzy-acceptance criteria governing passage through each layer.
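A hedged sketch of stagewise acceptance in a drafter → qualifier → target cascade: each draft token must survive a ratio test at every level of the hierarchy before being committed. Rejection handling and PyramidSD's specific fuzzy-acceptance thresholds are omitted, so this shows only the generic multi-level structure.

```python
import numpy as np

def cascade_accept(token_probs, rng) -> bool:
    """Pass one draft token up a chain of models (illustrative sketch).

    token_probs: probabilities assigned to the candidate token by each model,
                 ordered drafter -> qualifier -> ... -> target.  Each stage
                 applies the standard ratio test against the stage below it.
    Returns True only if the token survives every stage.
    """
    for lower, upper in zip(token_probs, token_probs[1:]):
        if rng.random() >= min(1.0, upper / lower):
            return False
    return True

# Example: overconfident drafter (0.9), more cautious qualifier and target.
rng = np.random.default_rng(0)
print(cascade_accept([0.9, 0.6, 0.5], rng))
```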
Quantized drafters and heterogeneous vocabularies:
- ML-SpecQD (Georganas et al., 17 Mar 2025): Replaces the need for a fully retrained small model with a quantized version (MXFP4) of the target, directly cast via weight-only quantization (WOQ) without retraining. This architecture enables plug-and-play SD deployment with minimal overhead and supports recursive speculation (multi-level cascading) to further accelerate draft computation.
- Heterogeneous vocabulary SD (Timor et al., 31 Jan 2025): Proposes three algorithms that enable SD between models with non-matching vocabularies (e.g., off-the-shelf drafters): string-level exact matching (SLEM), token-level intersection (TLI), and string-level rejection sampling (SLRS). These methods are proved lossless and eliminate the prior requirement that drafters and targets be tokenizer-aligned.
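To illustrate the string-level idea behind SLEM, the sketch below detokenizes a drafted block and re-encodes it with the target's tokenizer before verification. Tokenizer objects are assumed to follow the Hugging Face encode/decode convention; the longest-valid-prefix bookkeeping needed for exactness at token boundaries is deliberately omitted.

```python
def draft_block_in_target_vocab(draft_token_ids, draft_tokenizer, target_tokenizer):
    """Re-express a draft block in the target model's vocabulary (sketch).

    The drafted tokens are decoded to text with the draft tokenizer and
    re-encoded with the target tokenizer, so a target model with a different
    vocabulary can verify text proposed by an off-the-shelf drafter.
    """
    text = draft_tokenizer.decode(draft_token_ids, skip_special_tokens=True)
    return target_tokenizer.encode(text, add_special_tokens=False)
```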
4. Dynamic and Parallel Speculative Strategies
Recent work explores adaptations beyond static, sequential block-wise SD to maximize device utilization and reduce latency further.
- Dynamic draft-length control: PEARL (Liu et al., 13 Aug 2024) and SVIP (Zhang et al., 27 Nov 2024) introduce adaptive mechanisms that adjust the draft length at runtime, leveraging acceptance signals and token-level entropy, respectively. SVIP, for example, stops drafting once an entropy-derived lower bound on the expected acceptance probability of the next draft token falls below a threshold, resulting in up to 20% walltime improvements over fixed-length policies (see the sketch after this list).
- Branch parallelism (Shen et al., 16 May 2025): SpecBranch enables multiple speculative branches to be executed and verified in parallel, using a hybrid rollback-aware drafting structure (H-RAD) that fuses implicit drafter confidence with explicit target model features. This approach achieves up to 4.5× speedup and 50% reduction in rollback tokens on poorly aligned model pairs by parallelizing branch speculation and minimizing recomputation cost.
- Batch and memory optimization: Methods such as SPIRe (Neelam et al., 8 Apr 2025) employ sparse KV caches, pruned target-initialization, and feedback memory to enable high-throughput, low-latency SD in large-batch, long-context serving regimes.
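Returning to dynamic draft-length control, a minimal sketch of entropy-gated drafting is shown below: drafting stops as soon as the draft distribution at the current step becomes too uncertain. SVIP's mapping from draft entropy to an acceptance-probability lower bound is abstracted here into a simple entropy cap; `draft_next_dist` is a placeholder for the draft model's next-token distribution.

```python
import numpy as np

def draft_until_uncertain(draft_next_dist, prefix, rng,
                          max_draft=8, entropy_cap=2.5):
    """Adaptively choose the draft length from draft-model entropy (sketch).

    draft_next_dist(token_ids) -> probability vector over the vocabulary.
    Drafting stops when the entropy of the draft distribution exceeds
    entropy_cap (a stand-in for SVIP's acceptance lower bound), or after
    max_draft tokens have been proposed.
    """
    drafted = []
    for _ in range(max_draft):
        q = draft_next_dist(prefix + drafted)
        entropy = -float(np.sum(q * np.log(q + 1e-12)))
        if entropy > entropy_cap:            # likely rejection ahead: stop drafting
            break
        drafted.append(int(rng.choice(len(q), p=q)))
    return drafted
```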
5. System-Level Deployment and Edge Applications
Distributed and edge inference (Zhu et al., 13 Oct 2025):
- SD is extended to distributed, heterogeneous edge settings, with small drafters on resource-constrained edge nodes (small base stations, SBS) generating draft tokens, and powerful verifier (target) models at macro base stations (MBS) checking them in parallel. The system incorporates pipeline parallelism and batching to maximize throughput.
- Optimization problems for task batching, speculation length, and wireless resource allocation are analyzed and solved via dynamic programming and closed-form allocation, enabling up to 44.9% serving latency reduction over autoregressive edge serving.
Sparse model architectures (MoE) (Huang et al., 26 May 2025):
- Contrary to previous assumptions, SD is shown to be highly effective for sparse Mixture-of-Experts models, since batch sizes required to activate all experts simultaneously (maximizing target efficiency) are actually moderate in practice. As MoE sparsity increases, the batch size range for effective SD broadens, with speedups up to 2.29× demonstrated on large (sparse) MoEs.
6. Lossy, Contrastive, and Hybrid Approaches
Lossy SD (Zhou et al., 2023):
- Relaxing the acceptance criterion with lenience functions (e.g., replacing $q$ with $f(q, \epsilon) = \epsilon q$, $\epsilon \le 1$, in the acceptance ratio above) allows finer control over the quality-latency trade-off: intentionally small, empirically negligible drops in output quality yield dramatic reductions in latency (6–10×).
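A one-line sketch of lenient acceptance, using a linear lenience function $f(q, \epsilon) = \epsilon q$ as an assumed example: smaller $\epsilon$ accepts more draft tokens at the cost of drifting from the exact target distribution.

```python
def lenient_accept_prob(p_tok: float, q_tok: float, eps: float = 1.0) -> float:
    """Acceptance probability with a linear lenience function (sketch).

    eps = 1.0 recovers standard lossless SD; eps < 1.0 inflates the ratio,
    accepting more drafts and trading output fidelity for lower latency.
    """
    return min(1.0, p_tok / (eps * q_tok))
```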
Contrastive SD (Yuan et al., 2023):
- Combines speculative and contrastive decoding by defining acceptance in terms of the difference between target and amateur model probabilities, leading to improved quality (as demonstrated by higher diversity and MAUVE scores), and acceleration without the typical trade-off.
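A hedged sketch of the contrastive component: a contrastive next-token distribution that boosts tokens the large (expert) model prefers relative to the small (amateur) model, restricted to a plausibility set. The plausibility cutoff and the way this distribution enters the acceptance test are illustrative simplifications, not the paper's exact formulation.

```python
import numpy as np

def contrastive_distribution(p_expert, p_amateur, alpha=0.1):
    """Build a contrastive next-token distribution (illustrative sketch).

    Tokens whose expert probability falls below alpha * max(p_expert) are
    masked out; the remaining tokens are reweighted by the expert/amateur
    log-probability gap and renormalized.
    """
    plausible = p_expert >= alpha * p_expert.max()
    scores = np.where(plausible,
                      np.log(p_expert + 1e-12) - np.log(p_amateur + 1e-12),
                      -np.inf)
    scores -= scores.max()                 # numerical stability before exponentiation
    weights = np.exp(scores)
    return weights / weights.sum()
```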
7. Future Directions and Open Problems
- Divergence tuning: Further exploration of divergence functions or adaptive divergence strategies for knowledge distillation is suggested as a potential lever for improved draft–target alignment (Zhou et al., 2023, Xia et al., 1 Mar 2025).
- Training-aware self-speculation (Bhansali et al., 6 Oct 2025): Integrates continual online learning (KL→RL schedule) where verifier accept/reject feedback is recursively used to improve the drafter within the same LLM during inference; achieves state-of-the-art lossless speedups with orders-of-magnitude lower data requirements than EAGLE-2.
- Flexible system integration: SD methods are being incorporated into major open-source frameworks (e.g., Hugging Face Transformers), broadening their accessibility and ensuring practical relevance in heterogeneous and high-load serving contexts (Timor et al., 31 Jan 2025).
- Adaptive and context-aware parameterization: Dynamic selection of architecture-specific parameters, such as exit layer and draft length, improves context- and task-specific efficiency (Zarch et al., 8 Apr 2025).
- Robustness under distribution drift: Training-aware, continual or online draft alignment techniques and explicit drift detection are active areas of investigation (Bhansali et al., 6 Oct 2025).
- Composition with model compression, retrieval, and batched/tree-based speculation: Hybrid approaches that combine SD with retrieval-based drafts (Hu et al., 16 Nov 2024), tree-structured candidate generation, or batch-parallel rollout are seen as promising avenues.
In summary, speculative decoding now encompasses a broad class of techniques for accelerating LLM inference across model architectures, deployment settings, and use-case demands. The paradigm’s development is characterized by advances in draft–target alignment through selective distillation, context-adaptive control of speculation strategies, efficient memory and resource usage, composability with quantization and vocabulary heterogeneity, and integration into distributed or batch-serving architectures. These innovations position SD as a foundational approach for scalable, low-latency, high-throughput LLM deployment.