- The paper presents a novel RL-based approach that dynamically controls draft tree generation, reducing redundant computations.
- It achieves inference speedups of up to 4.82x while maintaining acceptance lengths nearly identical to those of state-of-the-art methods.
- Evaluation across multiple open-source LLMs and tasks demonstrates RADAR’s effectiveness in context-adaptive speculative sampling.
RL-Based Dynamic Draft Trees for LLM Inference: The RADAR Method
Overview
The paper "RADAR: Accelerating LLM Inference With RL-Based Dynamic Draft Trees" (2512.14069) presents an advancement in speculative sampling for LLM inference. RADAR introduces a reinforcement learning (RL)-driven approach to dynamically manage the token generation pipeline—specifically, adaptively controlling calls to a draft model and the formation of draft token trees used for speculative sampling. This enables a more efficient, context-sensitive balance between inference speed and computational cost compared to previous static or hyperparameter-dependent methods.
Background and Motivation
Speculative sampling accelerates LLM inference by delegating initial token generation to a smaller "draft" model and verifying the suggested continuations with the larger target LLM. Chain- and tree-based speculative sampling methods have achieved marked reductions in latency, but they typically fix the number of draft-model invocations as a global, static hyperparameter, regardless of context. This rigidity results in inefficiency: in contexts where the draft model's tokens are likely to be rejected, unnecessary compute is expended.
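To ground the "lossless" guarantee discussed later, the following is a minimal sketch of the standard speculative sampling verification rule for a single draft chain; it is the generic algorithm rather than RADAR-specific code, and tree-based methods apply the same rule along each branch.

```python
import numpy as np

def verify_chain(draft_tokens, p_draft, p_target, rng=None):
    """Lossless verification of one draft chain (generic speculative sampling).

    draft_tokens: proposed token ids from the draft model.
    p_draft[i]:   draft-model distribution (1-D array over the vocabulary)
                  used to sample draft_tokens[i].
    p_target[i]:  target-model distribution at the same position; p_target is
                  assumed to contain one extra entry for the bonus token.
    """
    rng = rng or np.random.default_rng()
    accepted = []
    for i, tok in enumerate(draft_tokens):
        # Accept the draft token with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_target[i][tok] / p_draft[i][tok]):
            accepted.append(tok)
            continue
        # On rejection, resample from the residual max(0, p_target - p_draft),
        # renormalized; this correction keeps the overall output distribution
        # exactly equal to the target model's.
        residual = np.clip(p_target[i] - p_draft[i], 0.0, None)
        residual /= residual.sum()
        accepted.append(int(rng.choice(len(residual), p=residual)))
        return accepted
    # Every draft token was accepted: draw one bonus token from the target.
    accepted.append(int(rng.choice(len(p_target[-1]), p=p_target[-1])))
    return accepted
```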
Prior work such as EAGLE-2 and EAGLE-3 sought to improve draft acceptance through better tree structures and acceptance heuristics, but left the fundamental parameterization of draft-model usage unchanged. The high rate of completely rejected draft tokens, visible for example on MT-bench with models like LLaMA-Instruct, underscores the need for more context-aware, dynamic control over the draft stage in speculative sampling.
Proposed Method: RADAR
RADAR reframes draft tree generation as a Markov Decision Process (MDP). The core innovation is an RL-driven decision, made at each drafting step, of whether to invoke the draft model again to expand the draft tree or to halt and proceed to the verification stage.
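The resulting control flow can be sketched as follows; the helper objects and method names (`init_tree`, `expand`, `decide`, `verify`) are illustrative placeholders rather than the paper's actual API.

```python
def speculative_round(prefix, draft_model, target_model, controller,
                      max_depth=8):
    """One draft-then-verify round with a learned stop rule (sketch)."""
    tree = draft_model.init_tree(prefix)        # root = current context
    confidence_history = []
    for depth in range(max_depth):
        expansion = draft_model.expand(tree)    # one draft forward pass
        confidence_history.append(expansion.topk_confidence)
        # The controller inspects the confidence trajectory so far and
        # decides whether another draft call is worth its cost.
        if controller.decide(confidence_history) == "STOP":
            break
    # Verification by the target model is unchanged, so the output
    # distribution of the target LLM is preserved.
    return target_model.verify(tree)
```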
System Architecture
RADAR integrates three core components:
- Target LLM: The principal generator whose outputs are required for downstream tasks.
- Draft Model: A smaller, faster LLM proposing candidate token continuations.
- Prediction Model: An LSTM-based controller network, trained via offline RL, which ingests the confidence scores from the draft model and outputs a continue/stop decision for each step of draft tree construction.
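A minimal PyTorch sketch of what such a controller could look like is shown below; the layer sizes and the two-logit continue/stop head are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DraftController(nn.Module):
    """Hypothetical LSTM controller: maps the sequence of top-k confidence
    vectors observed so far to continue/stop logits."""

    def __init__(self, top_k: int = 10, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=top_k, hidden_size=hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, 2)   # logits for [CONTINUE, STOP]

    def forward(self, conf_seq: torch.Tensor) -> torch.Tensor:
        # conf_seq: (batch, steps_so_far, top_k) draft-model confidences.
        out, _ = self.lstm(conf_seq)
        return self.head(out[:, -1])       # decision at the latest step

    @torch.no_grad()
    def decide(self, conf_seq: torch.Tensor) -> str:
        # Assumes a single sequence, i.e. batch size 1.
        return "STOP" if self.forward(conf_seq).argmax(-1).item() == 1 else "CONTINUE"
```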
The MDP is defined by a state consisting of the top-k confidence vectors observed at each drafting step; actions for continuing or halting the draft phase; and a reward structure designed to penalize excessively deep draft trees (to avoid diminishing returns) and reward high acceptance lengths (indicating efficient draft coverage). The transition function captures both the deterministic effect of stopping and the stochastic outcome of continued drafting, as governed by the speculative sampling acceptance process.
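As a concrete illustration of that trade-off, a reward of roughly the following shape would reward accepted tokens while charging for each draft call; the linear form and the cost coefficient are assumptions for illustration, not values taken from the paper.

```python
def step_reward(accepted_len: int, draft_calls: int,
                call_cost: float = 0.1) -> float:
    """Toy reward: accepted draft tokens are worth +1 each, while every
    draft-model call costs `call_cost`, so deep trees with diminishing
    acceptance gains score poorly."""
    return accepted_len - call_cost * draft_calls
```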
RADAR's training data is generated offline by running the EAGLE-3 speculative sampling process and collecting sequences of confidence states together with the corresponding acceptance-length distributions at various draft tree depths. Because this data comes from the same speculative sampling process the controller later governs at inference time, the offline and online state distributions are closely aligned, so the RL controller can be trained without environment interaction and without the extrapolation error that typically afflicts offline RL.
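Data collection might look roughly like the following, where `run_eagle3` stands in for the instrumented EAGLE-3 pipeline and the field names are illustrative.

```python
def collect_offline_dataset(prompts, run_eagle3, max_depth=8):
    """Log (confidence-state sequence, realized acceptance length) pairs
    from full-depth EAGLE-3 runs; the controller is later trained purely
    from these trajectories, with no environment interaction."""
    dataset = []
    for prompt in prompts:
        for trajectory in run_eagle3(prompt, max_depth=max_depth):
            dataset.append({
                "states": trajectory.confidences,      # per-step top-k vectors
                "accepted_len": trajectory.accepted_len,
            })
    return dataset
```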
Distinguishing Features
- Dynamic Drafting: RADAR adaptively decides the number of draft calls per input context, discarding the notion of a fixed global number of calls.
- RL-Based Decision Making: By leveraging intrinsic and extrinsic reward signals based on empirical acceptance length distributions, RADAR's controller generalizes across tasks and contexts.
- Lossless Generation: The core speculative sampling procedure remains "lossless," guaranteeing the output distribution of the target LLM remains unchanged (only the efficiency of achieving it is optimized).
Experimental Evaluation
RADAR is evaluated on three prominent open-source LLMs (LLaMA-3.1-Instruct 8B, Vicuna 13B, DeepSeek-R1-Distill-LLaMA 8B) and four task datasets (MT-bench for dialogue, GSM8K for mathematical reasoning, Alpaca for instruction following, MBPP for code generation), measuring actual end-to-end acceleration relative to naive auto-regressive decoding.
Key Empirical Findings
- Inference Speedup: RADAR provides speedup ratios ranging from 3.17x to 4.82x over auto-regressive decoding, consistently outperforming EAGLE-3 across all settings.
- Draft Model Call Reduction: The average number of draft model invocations is reduced by 9.3%–34.3% (mean 18.7%) compared to fixed-depth methods, with large observed savings in cases where candidate draft continuations would otherwise be broadly rejected.
- Acceptance Length: RADAR maintains a high acceptance length, only marginally lower (by ≈1.2%) than the best prior methods, showing that early stopping under low-confidence conditions barely reduces the number of tokens accepted per verification round.
Notably, RADAR achieves the best overall speedups even in settings where its average acceptance length matches or slightly trails that of previous state-of-the-art methods. The RL controller's ability to skip inefficient draft invocations yields computational savings beyond what static tree-structure optimization alone can achieve.
Theoretical and Practical Implications
RADAR demonstrates that reinforcement learning-based control can outperform fixed or handcrafted draft heuristics in speculative sampling, setting a new paradigm for context-adaptive inference-time optimization in LLM serving. The MDP formulation and offline RL training strategy provide a generic blueprint for further controller refinement, opening the door to using more sophisticated reward shaping, alternative sequence models, or transfer learning across LLMs.
From a systems standpoint, integrating RADAR into deployed LLM pipelines can significantly reduce real-world serving costs and latency without sacrificing output fidelity. The dynamic reduction in draft computation suggests that, in resource-limited or latency-sensitive scenarios, RL-based control of speculative sampling can provide substantial operational advantages.
Future Directions
Future research could focus on:
- Enhancing the prediction model through deeper sequence modeling (e.g., transformers or attention-based variants);
- More advanced reward engineering, potentially incorporating signals on downstream task utility;
- Tight integration with hardware-aware scheduling and quantization for further latency minimization;
- Meta-learning approaches to rapidly adapt the controller to new LLM architectures or deployment environments.
Conclusion
RADAR establishes a principled, reinforcement learning-enabled mechanism for dynamic speculative sampling in LLM inference, achieving substantial speedups (up to 4.82x) over classic autoregressive decoding and state-of-the-art static draft-tree methods, all while reducing redundant computation. The RL-based controller’s offline trainability and generality make it a promising framework for next-generation efficient LLM deployment.