Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning

Published 2 Mar 2026 in cs.CL | (2603.01639v1)

Abstract: Speculative decoding accelerates LLM inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time spent on drafting candidates and verifying them. However, current state-of-the-art methods rely on a static time allocation, while recent dynamic approaches optimize for proxy metrics like acceptance length, often neglecting the true time cost and treating the drafting and verification phases in isolation. To address these limitations, we introduce Learning to Draft (LTD), a novel method that directly optimizes for throughput of each draft-and-verify cycle. We formulate the problem as a reinforcement learning environment and train two co-adaptive policies to dynamically coordinate the draft and verification phases. This encourages the policies to adapt to each other and explicitly maximize decoding efficiency. We conducted extensive evaluations on five diverse LLMs and four distinct tasks. Our results show that LTD achieves speedup ratios ranging from 2.24x to 4.32x, outperforming the state-of-the-art method Eagle3 by up to 36.4%.


Authors (11)

Summary

The paper introduces LTD, a novel method that uses RL to jointly optimize draft tree depth and verification batch size for improved LLM throughput.
The paper demonstrates substantial speedups of up to 4.32x on large models, with LTD consistently outperforming static and heuristic-based speculative decoding strategies.
The paper shows that lightweight, co-adaptive MLP policies can minimize overhead (<1.5%) while ensuring scalability and robust performance across diverse benchmarks.


Problem Statement and Motivation

Speculative decoding is a widely adopted technique to accelerate LLM inference by leveraging a compact draft model for generating candidate tokens, and subsequently having a larger target model verify them in batches. The latency bottleneck in speculative decoding stems from the trade-off between the time spent during candidate drafting and verification. Existing approaches, including chain-structured (Chen et al., 2023) and tree-structured (Miao et al., 2023, Li et al., 3 Mar 2025) speculative decoding, typically employ either static configurations or heuristic-driven dynamics that optimize proxy metrics (e.g., acceptance length). Such strategies are fundamentally limited, as they ignore the real time costs and isolate drafting from verification, thus suboptimally utilizing resources and yielding non-maximal throughput.

Methodology: Reinforcement Learning for Throughput Maximization

The paper introduces LTD (Learning to Draft), formulating the speculative decoding cycle as a joint RL environment. LTD deploys two lightweight, co-adaptive MLP-based policies—a depth policy and a size policy. The depth policy dynamically determines draft tree depth, thereby controlling the draft model's computational cost. The size policy adaptively selects the verification batch size, tailoring the target model's workload for each cycle.
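As a rough illustration of how the two policies could divide one draft-and-verify cycle, the Python sketch below queries a depth policy after each drafted level and a size policy once drafting stops. The names run_cycle, draft_level, and verify, the stop/continue interface, and the toy stand-ins are all assumptions made for this example; the summary does not specify the actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CycleResult:
    depth: int        # number of draft levels generated
    verify_size: int  # number of candidates passed to the target model
    accepted: int     # tokens the target model accepted


def run_cycle(
    depth_policy: Callable[[int, List[float]], bool],
    size_policy: Callable[[int, List[float]], int],
    draft_level: Callable[[int], List[float]],
    verify: Callable[[int], int],
    max_depth: int = 8,
) -> CycleResult:
    """Run one hypothetical draft-and-verify cycle (illustrative only)."""
    probs: List[float] = []  # candidate probabilities collected so far
    depth = 0
    while depth < max_depth:
        probs.extend(draft_level(depth))  # draft model expands one level
        depth += 1
        if not depth_policy(depth, probs):  # depth policy: continue or stop
            break
    # Size policy picks how many candidates to verify, clamped to a valid range.
    size = max(1, min(size_policy(depth, probs), len(probs)))
    return CycleResult(depth, size, verify(size))


# Toy stand-ins, purely hypothetical, so the sketch runs end to end.
def toy_draft_level(level: int) -> List[float]:
    return [0.9 * 0.5 ** level, 0.4 * 0.5 ** level]


def toy_depth_policy(depth: int, probs: List[float]) -> bool:
    return max(probs[-2:]) > 0.3  # keep drafting while the newest level looks confident


def toy_size_policy(depth: int, probs: List[float]) -> int:
    return sum(p > 0.15 for p in probs)  # verify only reasonably likely candidates


def toy_verify(size: int) -> int:
    return min(size, 3)  # pretend the target model accepts at most 3 tokens


result = run_cycle(toy_depth_policy, toy_size_policy, toy_draft_level, toy_verify)
print(result)
```

In this toy run the depth policy stops drafting once the newest level's best probability falls below its threshold, and the size policy then trims the candidate set before verification. In LTD both decisions are made by learned MLP policies rather than fixed thresholds.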

The RL reward is explicitly aligned with throughput, defined as the number of accepted tokens per cycle divided by the total latency (draft plus verification time). Policies observe state vectors comprising context length, current draft depth, and candidate token probabilities. Training uses PPO (Schulman et al., 2017) with an iterative co-adaptation procedure: the policies are optimized in alternation, one frozen while the other is trained, so that each adapts to the other's behavior rather than being tuned in isolation.
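The throughput reward and the alternating training order described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names, the use of seconds as the time unit, and the choice to train the depth policy first are hypothetical, and the PPO update itself is omitted.

```python
from typing import Iterator, Tuple


def throughput_reward(accepted_tokens: int, draft_seconds: float,
                      verify_seconds: float) -> float:
    """Accepted tokens divided by total cycle latency (draft + verification)."""
    total = draft_seconds + verify_seconds
    if total <= 0:
        raise ValueError("cycle latency must be positive")
    return accepted_tokens / total


def alternating_schedule(num_rounds: int) -> Iterator[Tuple[int, str, str]]:
    """Yield (round, trainable_policy, frozen_policy) for each training round.

    Assumption: rounds strictly alternate, starting with the depth policy.
    """
    names = ("depth", "size")
    for r in range(num_rounds):
        yield r, names[r % 2], names[(r + 1) % 2]


for round_idx, trainable, frozen in alternating_schedule(4):
    # A PPO update of `trainable` against the fixed `frozen` policy would go here.
    print(f"round {round_idx}: train {trainable} policy, freeze {frozen} policy")
```

For example, a cycle that accepts 4 tokens after 10 ms of drafting and 30 ms of verification earns a reward of about 100 tokens per second.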

Technical Contributions

Direct Throughput Optimization: Unlike prior work that optimizes acceptance length (Miao et al., 2023, Li et al., 3 Mar 2025) or relies on static schedules, LTD maximizes throughput, i.e., accepted tokens per unit of draft-plus-verification time, which is the quantity that directly determines decoding speed in low-latency inference settings.
Co-Adaptive Policy Framework: The dual-policy architecture enables holistic control of the draft-and-verify cycle. The iterative training ensures mutual adaptation, overcoming limitations of isolated optimization.
Minimal Overhead: Empirical policy overhead is consistently <1.5% of total inference time, due to compact MLP architectures and carefully chosen state representations. Ablation studies validate this design.
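To give a feel for why a compact MLP policy adds little overhead, the sketch below implements a one-hidden-layer forward pass in plain Python and counts its parameters. The layer sizes (6 inputs, 32 hidden units, 2 outputs) and the helper names are hypothetical choices for illustration; the summary does not report the actual policy dimensions.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # one row of weights per output unit


def mlp_forward(x: Sequence[float], w1: Matrix, b1: Sequence[float],
                w2: Matrix, b2: Sequence[float]) -> List[float]:
    """One hidden layer with ReLU, followed by a linear output layer."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def parameter_count(layer_sizes: Sequence[int]) -> int:
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical policy shape: 6 observation features, 32 hidden units, 2 actions.
print(parameter_count([6, 32, 2]))
```

A network of this assumed size has a few hundred parameters, which is negligible next to a single forward pass of a multi-billion-parameter target model.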

Experimental Results

Evaluations are performed on five prominent LLMs (Llama3-8B, Vicuna-13B, DeepSeek-R1-Distill-LLaMA-8B, Qwen3-14B, Qwen3-32B) and four task benchmarks (MT-bench, GSM8K, Alpaca, Natural Questions). LTD achieves speedup ratios ranging from 2.24x to 4.32x, with improvements over the strong Eagle3 baseline reaching:

36.4% on Qwen3-32B (notably robust at larger model scales),
9.5% on DeepSeek-R1-Distill-LLaMA-8B,
6.5% on Llama3-8B,
5% on Vicuna-13B,
4% on Qwen3-14B.

Ablation studies demonstrate that throughput-based rewards are superior; optimizing for acceptance length leads to excessive verification time and diminished speedup, while time-cost-only rewards yield a suboptimal acceptance length. LTD is notably robust even under high-temperature sampling where other dynamic methods degrade.
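The ablation finding can be made concrete with a toy comparison of the three reward designs. The three configurations and all of their numbers below are invented for illustration only and are not measurements from the paper.

```python
from typing import Callable, Dict

Config = Dict[str, float]

# Hypothetical per-cycle outcomes for three draft/verify configurations.
CONFIGS: Dict[str, Config] = {
    "deep_wide":      {"accepted": 5.0, "draft_ms": 20.0, "verify_ms": 40.0},
    "shallow_narrow": {"accepted": 4.0, "draft_ms": 10.0, "verify_ms": 30.0},
    "minimal":        {"accepted": 1.5, "draft_ms": 3.0,  "verify_ms": 22.0},
}


def acceptance_reward(c: Config) -> float:
    return c["accepted"]  # proxy metric: acceptance length only


def time_cost_reward(c: Config) -> float:
    return -(c["draft_ms"] + c["verify_ms"])  # penalize latency only


def throughput_reward(c: Config) -> float:
    return c["accepted"] / (c["draft_ms"] + c["verify_ms"])  # tokens per ms


def best_config(reward: Callable[[Config], float]) -> str:
    """Name of the configuration that maximizes the given reward."""
    return max(CONFIGS, key=lambda name: reward(CONFIGS[name]))


for fn in (acceptance_reward, time_cost_reward, throughput_reward):
    print(fn.__name__, "->", best_config(fn))
```

With these made-up numbers, the acceptance-length reward favors the slowest configuration, the time-only reward favors the one that accepts the fewest tokens, and only the throughput reward selects the configuration with the most tokens per millisecond.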

Cross-domain generalization experiments using MMLU (57 subtasks) show LTD outperforms Eagle3 on 54/57 tasks, with >10% speedup improvements in mathematics and logic. The method's efficacy extends to other tree-based speculative decoding techniques, including Griffin (Hu et al., 16 Feb 2025), further validating architecture-agnostic applicability.

Analysis of Policy Interaction and Ablation

Iterative co-adaptation yields a marked improvement over naively deploying the two initial policies together. The depth policy is critical for acceleration: keeping draft trees shallow in low-confidence scenarios substantially reduces unnecessary draft computation. The size policy, in turn, enables aggressive candidate batching only when the observed probabilities warrant it. Including computationally expensive features (hidden states, entropy) in policy observations yields only marginal gains at the cost of increased latency; the best LTD configuration relies on lightweight statistics (token probabilities, context length, draft depth).
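A lightweight observation of the kind described above could be assembled as follows. The feature layout, the normalization constants, and the top-k truncation are assumptions made for this sketch; the summary only states which statistics are used, not how they are encoded.

```python
from typing import List, Sequence


def build_observation(
    context_length: int,
    depth: int,
    candidate_probs: Sequence[float],
    top_k: int = 4,          # hypothetical: keep only the k most likely candidates
    max_context: int = 2048,  # hypothetical normalization constant
    max_depth: int = 8,       # hypothetical normalization constant
) -> List[float]:
    """Fixed-size feature vector from cheap statistics (illustrative layout)."""
    top = sorted(candidate_probs, reverse=True)[:top_k]
    top += [0.0] * (top_k - len(top))  # pad so the vector length is constant
    return [
        min(context_length / max_context, 1.0),  # normalized context length
        min(depth / max_depth, 1.0),             # normalized draft depth
    ] + top


print(build_observation(512, 2, [0.1, 0.6, 0.3]))
```

Every feature here is a by-product of drafting that is already available, so building the observation requires no extra model forward pass, in contrast to hidden-state or entropy features.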

Implications and Future Directions

LTD's approach establishes a paradigm in which speculative decoding is governed by dynamic, throughput-maximizing controllers, improving latency and resource utilization without altering the output distribution. This RL formalization is scalable, modular, and deployable across LLM architectures and speculative decoding variants. The PPO-based training framework is reported to suit the problem well, balancing training efficiency with inference realism.

Theoretical implications include the insight that maximizing acceptance length alone can be counterproductive, and only a joint time-aware strategy realizes optimal acceleration. Practically, LTD is poised to enhance LLM serving in latency-critical settings, including batch processing and real-time applications. Future research can extend to more granular controller hierarchies, online adaptation under distribution shift, and integration with heterogeneous hardware and pipeline parallelism platforms.

Conclusion

The LTD method represents a rigorous and effective solution for dynamic speculative decoding acceleration in LLMs. By explicitly and jointly optimizing throughput with two co-adaptive RL policies, LTD robustly outperforms even heavily tuned baselines. Its empirical gains, low overhead, and generalization across tasks and architectures position it as a state-of-the-art foundation for future fast LLM inference research.
