Learning to Draft: Adaptive Speculative Decoding
- Learning to Draft (LTD) is a paradigm that generates draft tokens and uses a verification step to improve inference efficiency in large language models.
- It employs reinforcement learning and other techniques to optimize throughput while mitigating the computational overhead of drafting and verification.
- LTD applications span neural machine translation, reasoning tasks, and online adaptation for real-time serving systems.
Learning to Draft (LTD) denotes a family of methods in which a model first produces a draft object (most commonly draft tokens, draft trees, draft reasoning traces, or draft latent states) and then improves downstream efficiency or quality by verifying, refining, or selecting against that draft. In contemporary LLM work, the dominant meaning of LTD is adaptive drafter learning for speculative decoding: a lightweight draft policy proposes multiple future tokens, and a larger target or verifier model evaluates them in parallel, with the resulting feedback used to optimize throughput, acceptance length, or both (Zhang et al., 2 Mar 2026). The phrase also appears in earlier draft-and-refinement sequence generation, in concise reasoning curricula, and in draft-conditioned latent refinement, indicating that LTD is best understood as a broader drafting paradigm whose modern center of gravity lies in LLM inference acceleration (Li et al., 2017).
1. Terminology, scope, and historical lineage
An early precursor appears in neural machine translation as a two-stage drafting-and-refinement procedure. “Enhanced Neural Machine Translation by Learning from Draft” uses a conventional attention-based NMT system to produce a draft translation and a double-attention NMT system that refines it by attending to both the source sentence and the draft. On Chinese-English tasks, the reported gains were +0.88 BLEU on the 1 M-sentence NIST task and +2.49 BLEU on the 44 K-sentence IWSLT task, with the larger improvement occurring in the low-resource setting (Li et al., 2017).
In current LLM research, LTD most often refers to speculative decoding methods that learn or adapt the drafter itself rather than treating drafting depth, block size, or the draft model as static. “Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning” formalizes the draft-and-verify loop as an RL problem whose direct objective is throughput, namely accepted-tokens-per-second, and trains lightweight policies for draft depth and verification size (Zhang et al., 2 Mar 2026). Related work broadens the same motif in different directions: self-speculative online updating within one model (Bhansali et al., 6 Oct 2025), online draft adaptation inside a serving engine (Park et al., 5 Feb 2026), online-learning formulations with dynamic regret guarantees (Qian et al., 13 Mar 2026), and cross-vocabulary online drafting for on-device deployment (Ramakrishnan et al., 3 Jul 2025).
The phrase has also expanded beyond inference acceleration. “Draft-Thinking” uses draft-style reasoning to retain only the critical reasoning steps in long chain-of-thought generation (Cao et al., 28 Feb 2026). “DRAFT-RL” uses Chain-of-Draft reasoning in a multi-agent RL framework where each agent produces multiple drafts per query (Li et al., 25 Nov 2025). “When Latent Geometry Is Not Enough” reframes non-autoregressive text generation as draft-conditioned latent refinement rather than generation from noise (Zhang, 15 May 2026). This suggests that LTD now functions as a cross-cutting research pattern centered on learning from an intermediate, revisable draft.
2. Core LTD formulation in speculative decoding
The canonical modern LTD setting is speculative decoding. A small draft model proposes candidate tokens, while a large target model verifies them in parallel. The central observation is that decoding efficiency depends not only on how many tokens are accepted, but also on the joint trade-off between draft cost and verification cost. “Learning to Draft” makes this explicit by viewing each draft-and-verify iteration as a single episode in a Markov decision process with throughput reward
$$r = \frac{N_{\mathrm{acc}}}{T_{\mathrm{draft}} + T_{\mathrm{verify}}},$$
where $N_{\mathrm{acc}}$ is the number of accepted tokens in the cycle, $T_{\mathrm{draft}}$ is drafting time, and $T_{\mathrm{verify}}$ is verification time (Zhang et al., 2 Mar 2026).
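As a minimal sketch, the reward can be read as accepted tokens divided by the combined drafting and verification time, which matches the description of throughput as accepted tokens per second. The function name, argument names, and the timing values below are illustrative assumptions, not taken from the paper.

```python
def throughput_reward(accepted_tokens: int, draft_time_s: float, verify_time_s: float) -> float:
    """Accepted tokens per second for one draft-and-verify cycle."""
    cycle_time = draft_time_s + verify_time_s
    if cycle_time <= 0:
        raise ValueError("cycle time must be positive")
    return accepted_tokens / cycle_time


# Two hypothetical cycles: the second accepts more tokens but takes longer overall.
small_tree = throughput_reward(accepted_tokens=4, draft_time_s=0.010, verify_time_s=0.030)
large_tree = throughput_reward(accepted_tokens=5, draft_time_s=0.025, verify_time_s=0.045)
print(f"small tree: {small_tree:.1f} tok/s, large tree: {large_tree:.1f} tok/s")
```

In this made-up example the larger tree accepts one more token per cycle yet earns a lower reward, because the extra drafting and verification time outweighs the extra token.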
In that formulation, the state contains the current draft depth, the finalized prefix length, and the predicted log-probabilities of all nodes in the draft tree. Two interleaved policies act on this state: the depth policy makes a two-way choice at each expansion step, and the size policy chooses a verification size. The depth policy is a single-layer MLP with hidden size 1024 and ReLU, while the size policy is a two-layer MLP with ReLU. PPO is used with entropy bonus 0.01, a linearly decayed learning rate, 20 epochs per update, rollout buffer 2048, and minibatch 256. The two policies are pretrained independently on HumanEval and then co-adapted by alternating freeze-and-optimize steps; empirical gains saturate after two iterations (Zhang et al., 2 Mar 2026).
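The stated PPO settings and the alternating co-adaptation can be collected into a small configuration sketch. The class and function names are hypothetical; the initial learning rate is left unset because its value is not recoverable from the text, and both the decay endpoint (zero) and the order in which the two policies are trained within an iteration are assumptions.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple


@dataclass(frozen=True)
class PPOConfig:
    entropy_bonus: float = 0.01
    epochs_per_update: int = 20
    rollout_buffer: int = 2048
    minibatch_size: int = 256
    initial_lr: Optional[float] = None  # not recoverable from the text; left unset

    def minibatches_per_epoch(self) -> int:
        return self.rollout_buffer // self.minibatch_size

    def gradient_steps_per_update(self) -> int:
        return self.epochs_per_update * self.minibatches_per_epoch()


def linear_decay(initial_lr: float, step: int, total_steps: int) -> float:
    """Linearly decay the learning rate to zero (assumed endpoint)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return initial_lr * (1.0 - frac)


def alternating_schedule(iterations: int = 2) -> Iterator[Tuple[str, str]]:
    """Yield (policy being optimized, policy held frozen) pairs."""
    for _ in range(iterations):
        yield ("depth", "size")  # assumed order: depth policy first
        yield ("size", "depth")


cfg = PPOConfig()
print(cfg.minibatches_per_epoch(), cfg.gradient_steps_per_update())
print(list(alternating_schedule()))
```

With these settings each PPO update runs 8 minibatches per epoch, or 160 gradient steps over 20 epochs.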
The reported empirical results reflect the move from proxy optimization to direct throughput optimization. Across five LLMs and four tasks, LTD achieves speedups from 2.24× to 4.64× relative to vanilla autoregressive decoding. Relative to Eagle3, the reported gains reach +36.4% on Qwen3-32B, +10% on DeepSeek-8B, +6.5% on Llama-8B, +5% on Vicuna-13B, and +4% on Qwen3-14B. On MMLU with Llama-8B, LTD outperforms Eagle3 in 54/57 domains, with average +5% speedup (Zhang et al., 2 Mar 2026).
A recurrent conceptual point in this literature is that acceptance length is not the same objective as wall-clock speed. The LTD paper argues that maximizing acceptance length alone can produce oversized trees that actually slow inference, because drafting and verification are interdependent and the real objective is throughput rather than a proxy metric (Zhang et al., 2 Mar 2026).
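The gap between acceptance length and speed can be illustrated with a toy chain-draft model. The sketch assumes each draft token is accepted independently with a fixed probability, uses the standard closed form for the expected number of tokens emitted per cycle under that assumption, and charges a constant verification cost. The acceptance probability and relative costs are made-up values, and a chain is a simplification of the draft trees discussed above.

```python
def expected_tokens(accept_prob: float, depth: int) -> float:
    """Expected tokens emitted per cycle for a chain of `depth` draft tokens,
    assuming i.i.d. acceptance and one extra token from the verifier."""
    return (1.0 - accept_prob ** (depth + 1)) / (1.0 - accept_prob)


def throughput(accept_prob: float, depth: int, draft_cost: float, verify_cost: float) -> float:
    """Expected tokens per unit time; costs are in units of one target forward pass."""
    return expected_tokens(accept_prob, depth) / (depth * draft_cost + verify_cost)


ACCEPT_PROB = 0.8   # hypothetical per-token acceptance probability
DRAFT_COST = 0.1    # hypothetical cost of one draft step
VERIFY_COST = 1.0   # hypothetical cost of one verification pass

depths = range(1, 17)
best_depth = max(depths, key=lambda d: throughput(ACCEPT_PROB, d, DRAFT_COST, VERIFY_COST))

for d in (2, best_depth, 16):
    print(d, round(expected_tokens(ACCEPT_PROB, d), 3),
          round(throughput(ACCEPT_PROB, d, DRAFT_COST, VERIFY_COST), 3))
```

Expected tokens per cycle keep rising with depth, but throughput peaks at an intermediate depth and then falls, which is the behavior a throughput reward penalizes and an acceptance-length proxy does not.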
3. Training objectives for learning the drafter
A major branch of LTD research focuses on how to turn verification outcomes into drafter supervision. “Draft, Verify, and Improve” introduces DVI, a training-aware self-speculative framework that partitions a frozen decoder-only LLM into a shallow draft path and a frozen verify path. The drafter head is a LoRA-augmented classifier on the split-layer hidden state, and only the LoRA parameters are trainable. Accept/reject outcomes from speculative decoding are stored as tuples and used to optimize a combined loss consisting of online distillation, reward-masked cross-entropy, an on-policy REINFORCE term with KL regularization, and an entropy bonus, with a KL→RL schedule over update steps (Bhansali et al., 6 Oct 2025).
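One way to picture a schedule that moves from distillation toward reinforcement learning is a pair of weights that start at pure KL distillation and shift linearly toward the RL term after a warmup. This is only a hedged sketch: the linear shape, the warmup fraction, and the function names are assumptions, and the schedule actually used by DVI may differ.

```python
from typing import Tuple


def kl_to_rl_weights(step: int, total_steps: int, warmup_frac: float = 0.5) -> Tuple[float, float]:
    """Return (kd_weight, rl_weight) for a hypothetical linear KL-to-RL hand-off."""
    if not 0.0 <= warmup_frac < 1.0:
        raise ValueError("warmup_frac must be in [0, 1)")
    warmup_steps = warmup_frac * total_steps
    if step <= warmup_steps:
        return 1.0, 0.0  # distillation only during warmup
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return 1.0 - progress, progress


def combined_loss(kd_loss: float, rl_loss: float, step: int, total_steps: int) -> float:
    """Blend the two loss terms with the scheduled weights."""
    kd_w, rl_w = kl_to_rl_weights(step, total_steps)
    return kd_w * kd_loss + rl_w * rl_loss


for step in (0, 1000, 1500, 2000):
    print(step, kl_to_rl_weights(step, total_steps=2000))
```

Starting from distillation gives the drafter a dense learning signal before the sparser accept/reject reward takes over, which is consistent with the ablation pattern reported for the KL-only and PG-only variants.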
The DVI ablations are notable because they isolate which supervision signals actually work. On Spec-Bench with Vicuna-7B, KL-only online KD yields 1.435× speedup with MAT=1.933, PG-only yields 0.341×, CE-only yields 0.335×, and the full KL→RL method yields 2.16× speedup with MAT≈3.5 on many tasks. The reported learning curves state that KL-only gives smooth monotonic gains in batch acceptance but plateaus near 80%, while PG-only and CE-only fail due to sparse/censored feedback. DVI uses 2 000 ShareGPT prompts, one pass, or ≈2 000 updates; competing methods cited in the paper use 60 000–120 000 prompts over 2–40 epochs (Bhansali et al., 6 Oct 2025).
A second line of work addresses the offline-to-inference mismatch of supervised draft training. “Draft-OPD” argues that SFT plateaus because the draft model is trained on target-generated prefixes but evaluated on its own prefixes during speculative decoding. The method therefore performs target-assisted rollout for stable continuations, records the draft’s error positions, and replays those states to compute an acceptance-aware distillation objective: forward KL on accepted tokens and reverse KL on rejected tokens, with earlier rejected positions emphasized. Under matched FLOPs budgets, the reported average speedup on Qwen3-4B/8B in thinking mode is 4.86×, compared with 3.87× for EAGLE-3 and 4.33× for DFlash; average accepted lengths are reported alongside these speedups. With thinking mode disabled, Draft-OPD reports 5.31× (Lei et al., 28 May 2026).
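The acceptance-aware objective can be sketched on toy distributions. The sketch assumes the usual convention that forward KL is KL(target || draft) and reverse KL is KL(draft || target), and it uses a geometric decay so that earlier rejected positions receive larger weights. The decay form and all names are illustrative assumptions rather than the paper's exact weighting.

```python
import math
from typing import List, Sequence


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def acceptance_aware_loss(target_dists: List[Sequence[float]],
                          draft_dists: List[Sequence[float]],
                          accepted: List[bool],
                          decay: float = 0.5) -> float:
    """Forward KL on accepted positions, weighted reverse KL on rejected ones."""
    loss = 0.0
    rejected_seen = 0
    for p, q, ok in zip(target_dists, draft_dists, accepted):
        if ok:
            loss += kl(p, q)                      # forward KL: KL(target || draft)
        else:
            weight = decay ** rejected_seen       # earlier rejections weigh more (assumed form)
            loss += weight * kl(q, p)             # reverse KL: KL(draft || target)
            rejected_seen += 1
    return loss


target = [0.5, 0.5]
draft = [0.9, 0.1]
print(round(acceptance_aware_loss([target], [draft], [True]), 4))
print(round(acceptance_aware_loss([target], [draft], [False]), 4))
```

Forward KL pushes the drafter to cover everything the target might emit, while reverse KL on rejected positions pulls probability mass away from tokens the target would not accept.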
For diffusion-based draft models, the training problem changes form but preserves the same left-to-right objective. “Teaching Diffusion to Speculate Left-to-Right” studies three interventions for a block-diffusion drafter: token positional weighting, first-error focal loss, and a chain loss that acts as a differentiable surrogate for expected accepted length. On Llama-3-8B, the paper reports accepted length for the position-uniform baseline, for each intervention applied alone, and for all three stacked; the stacked configuration yields a +43.9% improvement over the baseline. Per-benchmark gains range from +21% to +76% over the position-uniform baseline (Whalen et al., 10 Jun 2026).
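Why a chain-style surrogate emphasizes early positions can be seen from the identity that expected accepted length equals the sum of prefix-survival probabilities. The sketch below assumes each entry is the probability that a position is accepted given that all earlier positions were accepted; the negated sum is one natural surrogate loss, not necessarily the exact chain loss of the paper.

```python
from typing import Sequence


def expected_accepted_length(accept_probs: Sequence[float]) -> float:
    """E[L] = sum over k of the product of the first k acceptance probabilities."""
    total, prefix = 0.0, 1.0
    for p in accept_probs:
        prefix *= p      # probability that positions 1..k are all accepted
        total += prefix
    return total


def chain_surrogate_loss(accept_probs: Sequence[float]) -> float:
    """Minimizing this maximizes expected accepted length (illustrative surrogate)."""
    return -expected_accepted_length(accept_probs)


early_error = expected_accepted_length([0.5, 0.9, 0.9, 0.9])
late_error = expected_accepted_length([0.9, 0.9, 0.9, 0.5])
print(round(early_error, 4), round(late_error, 4))
```

The same weak position costs far more accepted length when it sits at the front of the block, because every later position is multiplied by it.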
Across these variants, a common LTD principle emerges: the drafter should be trained on states exposed by drafting itself, with particular emphasis on the earliest errors that truncate the accepted prefix.
4. Online adaptation, serving systems, and cross-vocabulary drafting
Several LTD systems move adaptation from offline training into deployment. TIDE integrates online draft adaptation directly into the serving engine by reusing target-model hidden states generated during inference as supervision for a compact one-layer draft model. During verification, accepted token states are copied to a host-side ring buffer, later flushed to shared storage, and then consumed by an asynchronous training engine. The draft objective is standard cross-entropy on training pairs whose inputs are formed by concatenating selected hidden states. The system enables speculative decoding and training only when beneficial, using acceptance-rate EMAs and a practical speedup model, and disables speculation whenever that model indicates no benefit (Park et al., 5 Feb 2026).
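A gating rule of this kind can be sketched with an exponential moving average of the acceptance rate and a simple speedup estimate. The speedup model below is the i.i.d. chain-draft formula with a fixed relative draft cost, used only as a stand-in; it is not TIDE's speedup model, and the class name, smoothing factor, and break-even threshold of 1.0 are assumptions.

```python
class SpeculationGate:
    """Tracks an acceptance-rate EMA and decides whether speculation is worthwhile."""

    def __init__(self, draft_len: int = 4, draft_cost: float = 0.1,
                 alpha: float = 0.1, initial_rate: float = 0.5) -> None:
        self.draft_len = draft_len    # draft tokens proposed per cycle
        self.draft_cost = draft_cost  # cost of one draft step relative to a target pass
        self.alpha = alpha            # EMA smoothing factor
        self.ema = initial_rate

    def update(self, accepted: int, proposed: int) -> None:
        if proposed <= 0:
            return
        rate = accepted / proposed
        self.ema = (1 - self.alpha) * self.ema + self.alpha * rate

    def estimated_speedup(self) -> float:
        a = min(self.ema, 1.0 - 1e-9)  # avoid division by zero at a == 1
        expected = (1 - a ** (self.draft_len + 1)) / (1 - a)
        return expected / (1 + self.draft_len * self.draft_cost)

    def should_speculate(self) -> bool:
        return self.estimated_speedup() > 1.0


gate = SpeculationGate()
for _ in range(50):
    gate.update(accepted=0, proposed=4)  # a workload the drafter handles badly
print(round(gate.ema, 4), gate.should_speculate())
```

The moving average lets the gate react to workload drift without flapping on a single bad batch.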
The system-level contributions of TIDE are explicitly heterogeneous. The Inference Serving Engine is placed on NVIDIA H100, the Draft Model Training Engine on AMD MI250, with the rationale that H100 inference throughput is ∼6.8× that of MI250 while H100 training speedup is only ∼2.4×. End-to-end, TIDE reports up to 1.15× throughput improvement over static speculative decoding during live serving, 1.67× faster draft training than approaches that recompute training signals, and a storage reduction from 4.66 TB to 0.19 TB in the cited gpt-oss-120b comparison (Park et al., 5 Feb 2026).
OnlineSPEC provides a more formal online-learning interpretation. It defines a per-round loss, measures dynamic regret against a time-varying comparator, and connects that regret to the acceleration rate through results stated as Theorem 1. The framework then proposes optimistic online learning and online ensemble learning for draft adaptation, with empirical improvements of up to 24% speedup over seven benchmarks and three foundation models (Qian et al., 13 Mar 2026).
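For orientation, the generic textbook definition of dynamic regret is shown below. The notation is an assumption chosen for illustration, with $\ell_t$ the round-$t$ loss, $\theta_t$ the drafter parameters played in round $t$, and $u_t$ a time-varying comparator; it is not necessarily the paper's own notation, and the paper's specific loss and bound are not reproduced here.

```latex
\[
  \mathrm{D\text{-}Reg}_T(u_1,\dots,u_T)
  \;=\; \sum_{t=1}^{T} \ell_t(\theta_t) \;-\; \sum_{t=1}^{T} \ell_t(u_t)
\]
```

Static regret is the special case in which all comparators are equal, so a dynamic-regret guarantee is the stronger statement when the prompt distribution drifts during serving.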
OmniDraft addresses a different deployment obstacle: vocabulary mismatch between draft and target. Its “one drafter for all” design uses an online n-gram cache to lift the drafter distribution into the target token space and combines a direct-mapping KL term with an n-gram cross-entropy term, balanced by a weighting coefficient. A single Llama-68M drafter is reported to pair with Vicuna-7B, Qwen2-7B, and Llama3-8B, reaching speedups of 1.5–1.7×, 1.5–1.6×, and 1.6–1.7× respectively after online adaptation. The same framework also adds adaptive drafting through an acceptance-prediction head and an early-exit rule (Ramakrishnan et al., 3 Jul 2025).
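The role of an n-gram cache in bridging vocabularies can be illustrated at the token level. The sketch assumes a cache that maps a sequence of draft tokens to the single target token they spell, and applies greedy longest-match replacement. OmniDraft operates on distributions and builds its cache online, so this shows only the lookup idea; the function name and example tokens are hypothetical.

```python
from typing import Dict, List, Tuple


def lift_to_target(draft_tokens: List[str],
                   ngram_cache: Dict[Tuple[str, ...], str],
                   max_n: int = 4) -> List[str]:
    """Replace cached draft n-grams with the single target token they map to."""
    out: List[str] = []
    i = 0
    while i < len(draft_tokens):
        # Try the longest n-gram first, down to bigrams.
        for n in range(min(max_n, len(draft_tokens) - i), 1, -1):
            key = tuple(draft_tokens[i:i + n])
            if key in ngram_cache:
                out.append(ngram_cache[key])
                i += n
                break
        else:
            # No cached n-gram starts here; assume the token exists in both vocabularies.
            out.append(draft_tokens[i])
            i += 1
    return out


cache = {("spec", "ulative"): "speculative"}
print(lift_to_target(["a", "spec", "ulative", "decoder"], cache))
```

Tokens with no cache entry pass through unchanged, which corresponds to the directly mapped part of the vocabulary.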
Taken together, these systems recast LTD as a deployment-time control problem as much as a training problem: when to speculate, when to train, which target to pair with, and which hardware should execute each component.
5. LTD in reinforcement learning and reasoning
LTD has also been integrated into reinforcement learning pipelines where generation cost dominates training time. TLT targets the long-tail response distribution of reasoning RL training by combining an Adaptive Drafter with an Adaptive Rollout Engine. The drafter is a single Transformer decoder layer that shares the target model’s token embedding and LM-head matrices, and it is trained from cached prefill features using a hidden-state L1 loss plus a logit cross-entropy term. Spot training occurs on idle GPUs during rollout imbalance, so the drafter remains aligned with the evolving target at no extra cost. The reported results are over 1.7x end-to-end RL training speedup over state-of-the-art systems, 1.7×–2.1× higher token-throughput than VeRL, top-3 next-token accuracy rising from ≈45% to ≈65%, and accept length increasing from <2 to >6 on an RL-trained target, while average reward curves for TLT and VeRL lie on top of each other throughout 100+ steps (Hu et al., 20 Nov 2025).
FastGRPO uses a concurrency-aware speculative decoding framework for GRPO, together with online draft learning. The draft model minimizes the KL divergence from the target policy to the draft distribution on accepted speculative segments, and the verification budget is adjusted according to the current active concurrency. The paper also derives a drafting ratio that holds in a specific operating regime. The reported end-to-end speedups are 2.35× to 2.72× on mathematical reasoning datasets and models (Zhang et al., 26 Sep 2025).
A different meaning of LTD appears in reasoning-style generation. Draft-Thinking defines draft-style reasoning as a compressed sequence retaining only the decisive inferences necessary for correctness, then trains this behavior with a three-stage curriculum: draft SFT followed by two GRPO stages, each with its own maximum length. On MATH500, the paper reports an 82.6% reduction in reasoning budget at the cost of only a 2.6% performance drop; in the detailed table, accuracy changes from 93.0 to 90.6 while average tokens fall from 5,668 to 986. An adaptive prompt lets the model choose between detailed step-by-step and minimal draft reasoning without an external difficulty classifier (Cao et al., 28 Feb 2026).
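The headline percentages can be checked against the table values quoted above. The short calculation below uses only those four numbers; the variable names are illustrative.

```python
baseline_accuracy, draft_accuracy = 93.0, 90.6
baseline_tokens, draft_tokens = 5668, 986

token_reduction_pct = 100 * (1 - draft_tokens / baseline_tokens)
relative_accuracy_drop_pct = 100 * (baseline_accuracy - draft_accuracy) / baseline_accuracy
absolute_accuracy_drop = baseline_accuracy - draft_accuracy

print(f"token reduction: {token_reduction_pct:.1f}%")
print(f"relative accuracy drop: {relative_accuracy_drop_pct:.1f}%")
print(f"absolute accuracy drop: {absolute_accuracy_drop:.1f} points")
```

The 82.6% figure matches the token reduction, and the 2.6% figure matches the relative accuracy drop, which corresponds to 2.4 absolute points.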
Multi-agent reasoning extends the notion further. DRAFT-RL has each agent generate multiple drafts per query, each draft containing reasoning steps of at most 5 words, followed by a final answer. Peer agents score drafts, a reward model selects the best trajectory, and PPO with an auxiliary imitation loss updates the policy. The reported gains include MBPP Pass@1 rising from 78.1 to 82.6, HumanEval from 84.5 to 87.6, GSM8K from 91.8 to 94.2, and MATH from 52.1 to 55.8, with convergence steps reduced from 2,230 to 1,420 on HumanEval and from 2,850 to 1,650 on MATH (Li et al., 25 Nov 2025).
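The draft format and selection step can be sketched as follows. The sketch enforces the per-step word limit and picks the draft with the highest mean peer score; using the mean peer score as the selector is a simplifying assumption that stands in for the learned reward model, and the example drafts and scores are invented.

```python
from statistics import mean
from typing import List, Sequence, Tuple

Draft = Tuple[List[str], str]  # (reasoning steps, final answer)


def within_word_limit(steps: Sequence[str], max_words: int = 5) -> bool:
    """True if every reasoning step has at most `max_words` words."""
    return all(len(step.split()) <= max_words for step in steps)


def select_draft(drafts: List[Draft], peer_scores: List[List[float]], max_words: int = 5) -> int:
    """Return the index of the best-scoring draft that respects the word limit."""
    candidates = [i for i, (steps, _) in enumerate(drafts) if within_word_limit(steps, max_words)]
    if not candidates:
        raise ValueError("no draft satisfies the word limit")
    return max(candidates, key=lambda i: mean(peer_scores[i]))


drafts = [
    (["20 apples minus 8", "12 left"], "12"),
    (["first we carefully subtract the eight eaten apples"], "12"),  # step too long
    (["20 - 8 = 12"], "12"),
]
peer_scores = [[0.6, 0.7], [0.9, 0.95], [0.8, 0.9]]
best = select_draft(drafts, peer_scores)
print(best, drafts[best])
```

The over-long draft is excluded even though it has the highest peer scores, which mirrors how a hard format constraint interacts with scoring.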
In these RL and reasoning settings, LTD no longer refers only to speculative token verification. It becomes a general mechanism for making exploration concise, adaptive, and computationally efficient.
6. Empirical patterns, misconceptions, and adjacent directions
One recurring misconception is that better acceptance length alone guarantees better speed. The RL-based LTD paper directly disputes this by stating that maximizing acceptance length alone ignores the real wall-clock costs of drafting and verifying and can produce oversized trees that actually slow inference (Zhang et al., 2 Mar 2026). A second misconception is that more offline supervised data is sufficient. Draft-OPD reports that simply continuing SFT on the OPD prompts yields no further gain, and DVI reports that KL-only online KD plateaus near 80% batch acceptance while PG-only and CE-only fail under sparse or censored feedback (Lei et al., 28 May 2026, Bhansali et al., 6 Oct 2025).
A third misconception concerns representation quality. In draft-conditioned latent refinement, latent geometry metrics such as scale matching or cosine similarity do not guarantee good decoding. On ROCStories with 768-dimensional BERT latents, the DraftPrior target-token probability is 0.938 for clean drafts, 0.613 for 3% token dropout, 0.483 for 5% dropout, and 0.272 for 10% dropout. The paper’s main result is explicitly diagnostic: latent geometry alone is not enough; decoder recoverability, the quality of the start distribution, and preservation of decoder-readable structure are the relevant criteria (Zhang, 15 May 2026).
Static draft construction remains relevant as a baseline and deployment pathway. FastDraft trains vocabulary-compatible drafts by pre-training and then fine-tuning on synthetic data generated by the target model. The paper reports training on approximately 10 billion tokens on a single server with 8 Intel® Gaudi® 2 accelerators in under 24 hours, memory-bound speedup of up to 3× on code completion and up to 2× on summarization, text completion, and instruction tasks, and wall-clock speedup of up to 2× on Intel® Core™ Ultra. Acceptance rates for the Phi-3-mini 50M draft are reported as AR ≈ 0.37 on CNN-DM, AR ≈ 0.31 on TinyStories, AR ≈ 0.37 on Dolly, and AR ≈ 0.56 on HumanEval (Zafrir et al., 2024).
Dynamic draft-tree control provides another axis of LTD. RADAR models the decision to continue or stop draft expansion as an MDP over top-1 confidence scores, trains an LSTM policy with offline REINFORCE on data collected from EAGLE-3, and dynamically finalizes variable-depth trees for speculative sampling. Across LLaMA-Instruct 3.1 8B, Vicuna 13B, and DeepSeek-R1-Distill-LLaMA 8B, RADAR reports speedups of 3.17×–4.82× over autoregressive decoding while reducing average draft-model calls by 9.3%–34.3% (18.7% on average) relative to fixed 8-call EAGLE-3 (Ma et al., 16 Dec 2025).
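The continue-or-stop decision can be pictured with a hand-written rule over top-1 confidences. The rule below stops drafting once the running product of confidences drops under a threshold or the call budget is reached. It is a deliberately simple stand-in for RADAR's learned LSTM policy, and the threshold value and function name are assumptions.

```python
from typing import Sequence


def draft_calls_used(top1_confidences: Sequence[float],
                     threshold: float = 0.3,
                     max_calls: int = 8) -> int:
    """Number of draft-model calls made before the stop rule fires."""
    path_confidence = 1.0
    calls = 0
    for confidence in top1_confidences[:max_calls]:
        calls += 1
        path_confidence *= confidence
        if path_confidence < threshold:
            break  # further expansion is unlikely to be accepted
    return calls


print(draft_calls_used([0.9, 0.9, 0.2, 0.9, 0.9]))  # stops early on a low-confidence step
print(draft_calls_used([0.95] * 10))                # confident path uses the full budget
```

For scale, an 18.7% average reduction from a fixed budget of 8 calls corresponds to roughly 6.5 draft calls per cycle.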
Viewed together, these results indicate that LTD is not a single algorithmic recipe. It is a design stance in which the draft stage is treated as a learnable, adaptive component whose objective must be aligned with the final system criterion—BLEU in early NMT refinement, throughput in speculative decoding, rollout efficiency in RL training, or decoder recoverability in latent-generation settings.