
DistillSpec: Efficient KD for Speculative Decoding

Updated 6 November 2025
  • DistillSpec is a knowledge distillation method that tailors a compact draft model to mimic a larger target model, enhancing token acceptance in speculative decoding.
  • It leverages on-policy data and task-specific divergence objectives to optimize model alignment, directly boosting acceptance rates during decoding.
  • Empirical benchmarks show 10–45% inference speedups across diverse NLP tasks, achieving reduced latency with minimal performance degradation.

DistillSpec refers to a methodology for knowledge distillation specifically designed to improve the alignment between a compact draft model and a larger target model in the context of speculative decoding for LLMs. Speculative decoding (SD) leverages a lightweight draft model to propose sequences of tokens, which are then verified by a computationally expensive target model. The overall decoding efficiency increases with the acceptance rate—the proportion of draft tokens that are accepted by the target as matching its own distribution. DistillSpec systematically tailors the distillation process to maximize acceptance rates and enables significant reductions in inference latency with minimal performance degradation (Zhou et al., 2023).

1. Speculative Decoding and the Alignment Bottleneck

In speculative decoding, a smaller, faster draft model $q$ generates proposals (token blocks), and a large, accurate target model $p$ verifies or rejects them. The efficiency of SD depends directly on the acceptance rate, which is a function of the distributional proximity between $q$ and $p$. If $q$ is poorly aligned with $p$, frequent rejections occur that negate the speed benefits of speculative generation. A key insight is that optimizing the draft model solely for standalone task performance is insufficient; maximizing the acceptance rate, i.e., explicitly minimizing the distributional discrepancies relevant to SD, is critical.
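
To make the verification step concrete, here is a minimal NumPy sketch of the standard token-level accept/reject rule used in speculative sampling; the function name and the assumption that full next-token distributions for $p$ and $q$ are available at each position are illustrative, not specific to DistillSpec:

```python
import numpy as np

def verify_draft_token(p_probs, q_probs, drafted_token, rng):
    """One step of the standard speculative-sampling verification rule.

    p_probs, q_probs: target and draft next-token distributions (1-D arrays
    over the vocabulary) at the current position.
    drafted_token: token id sampled from the draft distribution q.
    Returns (accepted, token): the drafted token if accepted, otherwise a
    token resampled from the residual distribution max(0, p - q), which keeps
    the overall output distribution identical to the target's.
    """
    accept_prob = min(1.0, p_probs[drafted_token] / q_probs[drafted_token])
    if rng.random() < accept_prob:
        return True, drafted_token
    residual = np.clip(p_probs - q_probs, 0.0, None)
    residual /= residual.sum()
    return False, rng.choice(len(residual), p=residual)

# Example: rng = np.random.default_rng(0); the closer q is to p, the higher
# accept_prob becomes, and the higher the realized acceptance rate.
```

The better the draft matches the target token by token, the more often the first branch is taken, which is precisely the quantity DistillSpec optimizes for.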

2. Core Design Principles of DistillSpec

DistillSpec is grounded in two core strategies:

  1. On-Policy Knowledge Distillation: Training the draft model using data generated on-policy—i.e., sequences sampled from the draft model itself. This ensures that the distillation is most effective on the trajectories most likely to be encountered at inference, i.e., where the draft is strongest and where acceptance will have the greatest impact on speed.
  2. Task-Specific Divergence Objectives: Tailoring the divergence function $D$ between $p$ and $q$ to the demands of speculative decoding, rather than to generic model accuracy. The choice of loss may include token-wise forward KL ($D_{\mathrm{KL}}(p \Vert q)$), reverse KL, Jensen-Shannon divergence, or total variation distance (TVD), depending on which best correlates with acceptance and block efficiency for the specific task and decoding strategy (a code sketch of these divergences follows the acceptance bound below).

The theoretical acceptance rate, for on-policy training loss $\epsilon$, is bounded as $\mathbb{E}_{x \sim X}\left[ \alpha(x) \right] \geq 1 - T \cdot \mathbb{E}_{x \sim X}\left[ \frac{T}{L_p(x)}\, \epsilon \right]$, where $\alpha(x)$ is the sequence-level acceptance, $T$ is the proposal block size, and $L_p(x)$ is the sequence length as measured by the target.
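
The divergence choices listed above can be made concrete with a short sketch. The following PyTorch snippet is a minimal illustration assuming both models expose full-vocabulary logits (it is not a reference implementation from the paper); it computes token-wise FKL, RKL, JSD, and TVD between the target and draft next-token distributions:

```python
import torch
import torch.nn.functional as F

def tokenwise_divergence(p_logits, q_logits, kind="tvd", eps=1e-8):
    """Token-level divergence D(p || q) between target and draft distributions.

    p_logits, q_logits: tensors of shape [batch, seq_len, vocab] holding the
    target and draft logits at each position. Returns a scalar averaged over
    batch and sequence positions.
    """
    p = F.softmax(p_logits, dim=-1)
    q = F.softmax(q_logits, dim=-1)
    if kind == "fkl":    # forward KL: D_KL(p || q)
        d = (p * (torch.log(p + eps) - torch.log(q + eps))).sum(-1)
    elif kind == "rkl":  # reverse KL: D_KL(q || p)
        d = (q * (torch.log(q + eps) - torch.log(p + eps))).sum(-1)
    elif kind == "jsd":  # Jensen-Shannon divergence
        m = 0.5 * (p + q)
        d = (0.5 * (p * (torch.log(p + eps) - torch.log(m + eps))).sum(-1)
             + 0.5 * (q * (torch.log(q + eps) - torch.log(m + eps))).sum(-1))
    elif kind == "tvd":  # total variation distance
        d = 0.5 * (p - q).abs().sum(-1)
    else:
        raise ValueError(f"unknown divergence: {kind}")
    return d.mean()
```

Which of these best predicts acceptance is task and decoding-strategy dependent, which is why DistillSpec treats the divergence as a tunable design choice rather than a fixed loss.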

3. DistillSpec Training Procedure and Practical Implementation

DistillSpec training comprises the following steps:

  1. Data Generation: Generate a mixture of reference sequences using ground-truth data, target (teacher model) outputs, and, critically, sequences sampled from the draft model (on-policy data).
  2. Loss Objective Selection: Select an appropriate divergence measure $D$. TVD directly optimizes for acceptance, but FKL, RKL, or JSD may be preferable depending on evaluation metrics and the intended SD setup (e.g., greedy or sampled decoding).
  3. Draft Model Optimization: With white-box access to the target model's logits, minimize $\theta^* = \arg\min_\theta\, \mathbb{E}_{(x, y) \sim \mathcal{G}} \left[ \frac{1}{|y|} \sum_{t=1}^{|y|} D\!\left( p(\cdot \mid y_{<t}, x) \,\Vert\, q^\theta(\cdot \mid y_{<t}, x) \right) \right]$, where $\mathcal{G}$ denotes the data-generation mixture from step 1.

Utilizing student-generated (on-policy) data rather than teacher-only or fixed data distributions is essential for maximizing downstream token acceptance. Training can be implemented efficiently in standard transformer training frameworks, given access to the necessary draft and target model forward passes.
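
As a rough illustration, the sketch below performs a single on-policy distillation step: the draft samples continuations, both models score the same sequences, and the chosen divergence is minimized with respect to the draft parameters only. The HuggingFace-style generate() and .logits interface is an assumption for illustration rather than the paper's exact setup; the tokenwise_divergence helper from the earlier sketch can be passed as divergence_fn.

```python
import torch

def onpolicy_distillation_step(draft_model, target_model, prompts,
                               optimizer, divergence_fn, max_new_tokens=64):
    """One DistillSpec-style training step on draft-generated (on-policy) data.

    Assumes HuggingFace-style models whose generate() returns token ids and
    whose forward pass returns an object with a .logits field; these names
    are illustrative rather than tied to a specific API.
    divergence_fn: e.g. the tokenwise_divergence helper sketched earlier.
    """
    # 1) Sample continuations from the current draft model (on-policy data).
    with torch.no_grad():
        sequences = draft_model.generate(
            prompts, do_sample=True, max_new_tokens=max_new_tokens)

    # 2) Score the sampled sequences with both models; the target (teacher)
    #    provides supervision only, so its forward pass needs no gradients.
    with torch.no_grad():
        p_logits = target_model(sequences).logits
    q_logits = draft_model(sequences).logits

    # 3) Minimize the chosen token-wise divergence D(p || q) with respect to
    #    the draft parameters only.
    loss = divergence_fn(p_logits, q_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the data-generation step mixes such on-policy samples with ground-truth and teacher-generated sequences, per step 1 above.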

4. Extensions: Lossy Speculative Decoding and Cascaded Distillation

DistillSpec applies not only to standard SD but also to lossy speculative decoding, where the acceptance rule is relaxed to admit more draft tokens at the cost of potential fidelity. This is formalized by including a lenience function $f(p, \epsilon)$ in the acceptance criterion: accept $y_t$ with probability $\min\left(1, \frac{f(p(y_t), \epsilon)}{q(y_t)}\right)$, where $f$ modulates the degree of lenience.
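
As a minimal sketch, the helper below computes the lenient acceptance probability for a single drafted token; the multiplicative form $f(p, \epsilon) = \epsilon \cdot p$ is only one illustrative choice of lenience function, not necessarily the one used in the paper:

```python
def lenient_accept_prob(p_tok, q_tok, lenience=1.0):
    """Acceptance probability under a multiplicative lenience function.

    p_tok, q_tok: target and draft probabilities of the drafted token y_t.
    lenience >= 1 scales up the target probability, so more draft tokens pass
    verification at the cost of drifting from the exact target distribution;
    f(p, eps) = eps * p is one illustrative lenience choice among several.
    """
    return min(1.0, (lenience * p_tok) / q_tok)
```

Setting lenience to 1 recovers the exact (lossless) acceptance rule, while larger values trade fidelity for further latency gains.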

In multi-stage deployment scenarios (the so-called "model garden"), DistillSpec can be used in conjunction with prior distillation passes: first distilling a large model to a moderately-sized target for task quality, and then further training a compact, well-aligned draft to maximize efficiency:

  • Large Model (distillation) → High-quality Medium Target (distillation + DistillSpec) → Compact, Aligned Draft

This yields substantial cumulative reductions (6–10×) in end-to-end decoding latency while retaining performance.
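
A compact sketch of this cascade is shown below; train_distilled is a hypothetical helper (not an API from the paper) that wraps a full distillation loop, for example by repeatedly calling the single step sketched in Section 3:

```python
def cascaded_distillation(large_model, medium_model, draft_model,
                          prompts, train_distilled):
    """Two-stage cascade: quality-focused KD, then acceptance-focused DistillSpec.

    train_distilled(student, teacher, prompts, objective) is a hypothetical
    helper that runs a complete distillation loop and returns the trained
    student; the objective strings are illustrative placeholders.
    """
    # Stage 1: distill the large model into a medium-sized target for task quality.
    medium_model = train_distilled(medium_model, large_model, prompts,
                                   objective="fkl")
    # Stage 2: DistillSpec-style on-policy distillation of a compact draft
    # against the medium target, optimizing for acceptance (e.g. TVD).
    draft_model = train_distilled(draft_model, medium_model, prompts,
                                  objective="tvd")
    # Deploy medium_model as the SD target and draft_model as the aligned draft.
    return medium_model, draft_model
```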

5. Empirical Performance and Benchmarks

DistillSpec was evaluated on multiple language modeling and NLP tasks (summarization, translation, mathematical reasoning). Measured speedups of 10–45% in actual decoding wall-clock time were observed over speculative decoding with a non-distilled draft model, with block acceptance rates and efficiency consistently higher across datasets. In transfer experiments, on-policy distilled drafts trained for one domain (e.g., GSM8K) also provided substantial efficiency gains on diverse tasks (average 26% speedup on 23 unseen BigBenchHard tasks). Crucially, the method enables smooth control over the latency-quality tradeoff by adjusting the loss and SD acceptance strategy.

A summary of empirical results for a 6-layer student distilled from BERT-Base:

| Model | WikiText PPL | GLUE | CoNLL-F1 | SQuAD-EM | SQuAD-F1 |
|---|---|---|---|---|---|
| DistilBERT Baseline | 15.69 | 75.80 | 92.12 | 70.23 | 79.99 |
| DIITO (FULL+Cos) | 13.45 | 77.14 | 92.35 | 71.94 | 81.35 |

Furthermore, cascading with task-specific distillation can reduce latency from 17.3s to 2.7s on XSum with unchanged performance.

6. Best Practices and Implementation Guidelines

  • Prioritize on-policy draft generations for distillation data, as these target the regions of action space most influential for acceptance and ultimate SD yield.
  • White-box access to target logits is critical for effective alignment; black-box (label-only) distillation gives weaker results.
  • Tune divergence functions and data mixtures based on downstream acceptance/block efficiency, not just held-out accuracy; empirical validation is required across tasks.
  • Consider multi-stage distillation if a deployment supports several model sizes, to maximize latency reductions without compromising output quality.
  • Monitor the acceptance rate directly, as it most accurately predicts realized SD speed improvements; a simple monitoring sketch follows this list.
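
As a simple monitoring aid, the sketch below computes the empirical acceptance rate and block efficiency from per-block verification outcomes; the log format and the block-efficiency accounting (accepted tokens plus the one token the target contributes per verification call) are assumptions to adapt to a particular SD implementation:

```python
def acceptance_stats(accept_flags_per_block):
    """Empirical acceptance rate and block efficiency from verification logs.

    accept_flags_per_block: list of per-block boolean lists, e.g.
    [[True, True, False], [True, True, True, True]] means 2 of 3 drafted
    tokens were accepted in the first block and all 4 in the second.
    """
    num_blocks = max(len(accept_flags_per_block), 1)
    total_drafted = sum(len(block) for block in accept_flags_per_block)
    total_accepted = sum(sum(block) for block in accept_flags_per_block)
    acceptance_rate = total_accepted / max(total_drafted, 1)
    # Tokens emitted per target verification call: accepted draft tokens plus
    # the one token the target always contributes (correction or bonus token).
    block_efficiency = (total_accepted + len(accept_flags_per_block)) / num_blocks
    return acceptance_rate, block_efficiency
```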

7. Context, Limitations, and Future Directions

DistillSpec represents a transition from generic KD to alignment-optimized distillation for efficient deployment of LLMs using SD. Unlike standard KD, which may improve accuracy but not SD acceptance, DistillSpec explicitly targets block-verified token efficiency.

Limitations include the need for white-box (logit) access to the target, which may not be available in all deployment scenarios, and potential sub-optimality when domain shifts are substantial or when the student model is exceedingly compact. The choice of divergence measure and data mixture remains empirical; further work is needed for robust task-general heuristics.

Ongoing research focuses on integrating selective and curriculum-based loss masking (as in AdaSPEC and SpecKD (Hu et al., 22 Oct 2025, Huang et al., 28 Oct 2025)), and adapting these practices to reinforcement learning and non-autoregressive model families.


DistillSpec thus establishes a methodology for producing highly efficient, well-aligned draft models tailored to speculative decoding, yielding substantial reductions in inference cost and latency for LLMs across domains and model sizes, and setting a practical foundation for future progress in alignment-focused model compression (Zhou et al., 2023).
