AdaSPEC: Efficient Speculative Decoding
- AdaSPEC is a method for adaptive selective distillation that improves speculative decoding by focusing distillation on tokens where the gap between the draft and reference models' per-token divergence from the target is small, i.e., tokens the draft can realistically learn.
- It employs a two-phase process that first computes per-token losses and then filters out hard-to-learn tokens to concentrate training on those with higher acceptance potential.
- Empirical results demonstrate up to a 15% increase in token acceptance and a 10–20% speedup in decoding, underscoring its practical benefits for real-world inference applications.
AdaSPEC is a method for efficient speculative decoding that utilizes selective knowledge distillation to improve alignment and inference speed in LLMs. Speculative Decoding (SD) employs an efficient draft model to generate candidate tokens, which are then verified by a larger target model. Conventional knowledge distillation (KD) techniques, typically focused on minimizing KL divergence over all tokens, are misaligned with the practical goal of SD: maximizing the rate at which draft model tokens are accepted by the target model. AdaSPEC introduces a mechanism for selective filtering of tokens during distillation, concentrating training capacity on "learnable" or easy tokens, thereby increasing acceptance rates without degrading output quality.
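To make the draft-and-verify mechanic concrete, below is a minimal, self-contained sketch of one speculative decoding round using the standard accept/reject rule (keep a drafted token with probability min(1, p_target/p_draft), otherwise resample from the residual distribution). The `draft_probs` and `target_probs` functions are hypothetical stand-ins for real model calls, not part of AdaSPEC itself.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size; real models use tens of thousands of tokens


def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()


def draft_probs(prefix):
    """Hypothetical stand-in for the small draft model's next-token distribution."""
    return softmax(rng.normal(size=VOCAB))


def target_probs(prefix):
    """Hypothetical stand-in for the large target model's next-token distribution."""
    return softmax(rng.normal(size=VOCAB))


def speculative_round(prefix, k=4):
    """One draft-then-verify round: the draft proposes k tokens; the target
    keeps each with probability min(1, p_target/p_draft) and, on the first
    rejection, resamples from the residual distribution max(p - q, 0)."""
    ctx, proposed, q_cache = list(prefix), [], []
    for _ in range(k):
        q = draft_probs(ctx)
        tok = int(rng.choice(VOCAB, p=q))
        proposed.append(tok)
        q_cache.append(q)
        ctx.append(tok)

    emitted, n_accepted = [], 0
    for tok, q in zip(proposed, q_cache):
        p = target_probs(list(prefix) + emitted)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)          # drafted token accepted
            n_accepted += 1
        else:
            resid = np.maximum(p - q, 0.0)
            resid = resid / resid.sum() if resid.sum() > 0 else p
            emitted.append(int(rng.choice(VOCAB, p=resid)))
            break                        # remaining drafted tokens are discarded
    return emitted, n_accepted


tokens, accepted = speculative_round(prefix=[1, 2, 3], k=4)
print(f"emitted {tokens}; drafted tokens accepted this round: {accepted}/4")
```

The acceptance counter makes the objective explicit: the more drafted tokens the target accepts per round, the fewer expensive target passes are needed per generated token.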
1. Motivation and Problem Definition
Speculative Decoding accelerates autoregressive generation by delegating initial token prediction to a fast, small draft model, whose outputs are validated by a more capable target model. The expected performance depends critically on the statistical alignment between these models. Standard KD procedures optimize for average KL divergence across the output distribution, forcing limited-capacity draft models to allocate training effort to all tokens, including those difficult or impossible to learn accurately due to a capacity gap. This indiscriminate pressure leads to suboptimal acceptance rates and resource inefficiency. AdaSPEC reframes the objective: by filtering out hard-to-learn tokens from the distillation target set, it ensures that the draft model optimally aligns with the target for tokens where success is feasible, directly addressing the true goal of SD—increasing accepted token rates.
2. Selective Knowledge Distillation Methodology
AdaSPEC employs a two-phase distillation and token filtering process:
- Phase 1: Reference Model Distillation & Token Filtering
- A reference model is initialized as a copy of the draft model and distilled from the target via forward KL divergence:
  $$\mathcal{L}_{\text{ref-KD}} = \sum_{t} D_{\mathrm{KL}}\big(p_{\text{tgt}}(\cdot \mid x_{<t}) \,\|\, p_{\text{ref}}(\cdot \mid x_{<t})\big).$$
- For every token position $t$, two per-token losses are computed:
  - $\mathcal{L}_{\text{ref}}(t) = D_{\mathrm{KL}}\big(p_{\text{tgt}}(\cdot \mid x_{<t}) \,\|\, p_{\text{ref}}(\cdot \mid x_{<t})\big)$ for the reference model;
  - $\mathcal{L}_{\text{draft}}(t) = D_{\mathrm{KL}}\big(p_{\text{tgt}}(\cdot \mid x_{<t}) \,\|\, p_{\text{draft}}(\cdot \mid x_{<t})\big)$ for the draft model.
- The difference $\Delta(t) = \mathcal{L}_{\text{draft}}(t) - \mathcal{L}_{\text{ref}}(t)$ quantifies the learnability of each token. Tokens with low $\Delta(t)$ are more learnable by the draft model. AdaSPEC selects the top fraction of the most learnable tokens (e.g., the top-$k\%$ ranked by ascending $\Delta(t)$) as the filtering set $\mathcal{F}$.
- Phase 2: Selective Distillation of the Draft Model
- The draft model is distilled exclusively on the filtered subset $\mathcal{F}$ of easy tokens, via the loss
  $$\mathcal{L}_{\text{AdaSPEC}} = \sum_{t} \mathbb{1}[t \in \mathcal{F}] \, D_{\mathrm{KL}}\big(p_{\text{tgt}}(\cdot \mid x_{<t}) \,\|\, p_{\text{draft}}(\cdot \mid x_{<t})\big),$$
  where $\mathbb{1}[\cdot]$ is the indicator function.
This process ensures draft model capacity is prioritized for learnable tokens, maximizing SD efficiency.
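The sketch below illustrates how the two-phase selection could look in PyTorch, operating on pre-computed logits for a batch of token positions. The function name `selective_kd_loss`, the hyperparameter `keep_frac`, and the sign convention of the learnability score follow the summary above and are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F


def selective_kd_loss(target_logits, draft_logits, ref_logits, keep_frac=0.5):
    """Selective forward-KL distillation over a batch of token positions.
    All logits have shape [num_tokens, vocab]. `ref_logits` come from the
    reference model (a distilled copy of the draft); `keep_frac` is an
    assumed name for the fraction of tokens retained, and the score's sign
    follows the summary above (low difference = more learnable)."""
    log_p = F.log_softmax(target_logits, dim=-1)        # target distribution
    p = log_p.exp()

    def forward_kl(student_logits):
        """Per-token KL(p_target || p_student), shape [num_tokens]."""
        log_q = F.log_softmax(student_logits, dim=-1)
        return (p * (log_p - log_q)).sum(dim=-1)

    with torch.no_grad():
        delta = forward_kl(draft_logits) - forward_kl(ref_logits)
        k = max(1, int(keep_frac * delta.numel()))
        keep = torch.zeros_like(delta, dtype=torch.bool)
        keep[torch.topk(-delta, k).indices] = True      # most learnable tokens

    # Phase-2 objective: forward KL restricted to the filtered token set.
    return forward_kl(draft_logits)[keep].mean()


# Toy usage: random logits standing in for real model outputs.
num_tokens, vocab = 16, 100
target = torch.randn(num_tokens, vocab)
ref = torch.randn(num_tokens, vocab)
draft = torch.randn(num_tokens, vocab, requires_grad=True)
loss = selective_kd_loss(target, draft, ref, keep_frac=0.5)
loss.backward()  # gradients flow only through the selected token positions
```

Because the mask is built under `no_grad`, token selection adds negligible overhead and the draft model's gradient signal comes solely from the retained, learnable positions.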
3. Model Configurations and Implementation Details
The reported experiments utilize two key model pairings:
| Configuration | Draft Model | Target Model | Tokenizer |
|---|---|---|---|
| Small-to-Large (same family) | Pythia-31M | Pythia-1.4B | Shared |
| Medium-to-Large (cross-family) | CodeGen-350M | Phi-2 (2.7B) | Aligned |
- In the first configuration, architectural consistency between draft and target models maximizes transfer efficiency.
- In the second, cross-family (CodeGen-350M → Phi-2) transfer is enabled by tokenizer alignment.
- The discrepancy in size (and capacity) is more pronounced in the medium-to-large setup, enhancing the utility of selective filtering.
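For concreteness, a minimal loading sketch for the two pairings is shown below. The Hugging Face Hub identifiers (`EleutherAI/pythia-31m`, `EleutherAI/pythia-1.4b`, `Salesforce/codegen-350M-mono`, `microsoft/phi-2`) are assumptions; the exact checkpoints and loading options used in the paper may differ.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub IDs for the two draft/target pairings described above.
PAIRINGS = {
    "small_to_large": {   # same family, shared tokenizer
        "draft": "EleutherAI/pythia-31m",
        "target": "EleutherAI/pythia-1.4b",
    },
    "medium_to_large": {  # cross-family, tokenizers aligned
        "draft": "Salesforce/codegen-350M-mono",
        "target": "microsoft/phi-2",
    },
}


def load_pair(name: str):
    """Load a draft/target pair plus the (shared or aligned) tokenizer."""
    cfg = PAIRINGS[name]
    draft = AutoModelForCausalLM.from_pretrained(cfg["draft"])
    target = AutoModelForCausalLM.from_pretrained(cfg["target"])
    tokenizer = AutoTokenizer.from_pretrained(cfg["draft"])
    return draft, target, tokenizer
```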
4. Empirical Performance and Evaluation
AdaSPEC is evaluated on arithmetic reasoning (GSM8K), instruction following (Alpaca), code generation (MBPP), and summarization (CNN/DailyMail, XSum), with acceptance rate as the primary metric. Experiments demonstrate:
- On GSM8K (Pythia-31M → Pythia-1.4B, 3 epochs), AdaSPEC achieves a higher acceptance rate than DistillSpec.
- Across all settings, AdaSPEC yields up to 15% higher acceptance rates than DistillSpec.
- Alignment gains, as evidenced by reduced token-level KL divergence and improved top-2 logit margins, translate into practical benefits, including a 10–20% speedup in vLLM decoding on A100 GPUs.
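The link between acceptance rate and wall-clock speedup can be seen with the standard speculative decoding analysis: under the simplifying assumption that each drafted token is accepted i.i.d. with probability alpha, the expected number of tokens emitted per target verification follows a geometric series. The values below are purely illustrative, not the paper's measurements.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification, assuming each of the
    k drafted tokens is accepted i.i.d. with probability alpha
    (geometric-series form from the standard speculative decoding analysis)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)


# Illustration: a 10-point acceptance-rate gain at k = 4 drafted tokens per round.
for alpha in (0.60, 0.70):
    print(f"alpha={alpha:.2f}: {expected_tokens_per_round(alpha, k=4):.2f} tokens/round")
```

Even a modest acceptance-rate improvement compounds across every verification round, which is why token-level alignment gains show up as end-to-end decoding speedups.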
5. Analysis and Comparison with DistillSpec
The principal distinction between AdaSPEC and DistillSpec lies in per-token selection:
- DistillSpec applies uniform KD, minimizing KL divergence across all tokens, regardless of draft capacity limits.
- AdaSPEC introduces token-level filtering using the learnability gap $\Delta(t)$, restricting distillation to tokens where the draft model has headroom to align. This avoids inefficient allocation of draft model capacity to unlearnable tokens.
- Ablation studies reveal improved confidence and precision in predictions via AdaSPEC's selection, with reductions in average KL divergence and sharper logit margin distributions.
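As an illustration of the two diagnostics cited above, the hedged sketch below computes per-token forward KL to the target and the draft's top-2 logit margin from raw logits; the function name and interface are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def alignment_diagnostics(target_logits, draft_logits):
    """Per-token forward KL to the target and the draft's top-2 logit margin
    (gap between its two largest logits, a rough confidence proxy).
    Both inputs have shape [num_tokens, vocab]."""
    log_p = F.log_softmax(target_logits, dim=-1)
    log_q = F.log_softmax(draft_logits, dim=-1)
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)      # [num_tokens]

    top2 = torch.topk(draft_logits, k=2, dim=-1).values   # [num_tokens, 2]
    margin = top2[:, 0] - top2[:, 1]                       # larger => sharper prediction
    return kl.mean().item(), margin.mean().item()
```

Lower average KL and larger average margins indicate a draft model that is both better aligned with the target and more decisive on the tokens it drafts.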
6. Application Scope and Deployment Implications
AdaSPEC's performance advancements yield several practical benefits:
- Inference speedup: Higher acceptance rates allow more tokens to be drafted per verification, reducing reliance on costly target model computation and enabling rapid response in real-time settings (conversational AI, interactive agents).
- Resource efficiency: Smaller draft models remain performant, lowering compute and energy requirements for scalable deployment.
- Design flexibility: The approach accommodates significant capacity gaps between draft and target models, facilitating wider application in coding assistants, summarization services, and complex reasoning platforms.
A plausible implication is enhanced environmental sustainability for large-scale model deployments due to lower energy consumption.
7. Prospective Research Directions
Outlined future avenues include:
- Development of adaptive or multi-stage token filtering strategies, potentially leveraging more sophisticated uncertainty metrics.
- Integration with advanced SD frameworks such as EAGLE, and exploration of compositional or multi-step verification architectures.
- Investigation of alternative or hybrid loss functions for distillation, including objectives beyond forward KL divergence, particularly for cross-family transfers.
- Broader generalization studies to assess applicability in multilingual and multimodal contexts.
This suggests the continuing evolution of selective distillation approaches could further improve practicality and efficiency for large model inference in diverse operational settings.
AdaSPEC represents a selective distillation paradigm, explicitly tailored to speculative decoding objectives, and has demonstrated robust empirical gains over state-of-the-art baselines. The architecture and empirical results suggest significant potential for industry-scale inference and continued methodological refinement.