EAGLE-3: Accelerating LLM Inference

Updated 20 November 2025
  • EAGLE-3 is an inference acceleration technique for LLMs that employs direct token prediction and multi-layer feature fusion to overcome sequential decoding bottlenecks.
  • It replaces feature regression with token-level drafting, enabling the use of larger training datasets and significantly improving speculative sampling acceptance rates.
  • EAGLE-3 achieves notable speedups—up to 6.47× in tests—with enhanced throughput on diverse tasks such as multi-turn chat, code generation, and summarization.

EAGLE-3 is an inference acceleration technique for LLMs that employs speculative sampling with direct token-level drafting and multi-layer feature fusion facilitated by the "training-time test" procedure. Developed to address the persistent inference bottleneck in autoregressive LLMs, EAGLE-3 significantly improves throughput and speedup across diverse task domains and model families, advancing previous feature-level speculative sampling methods by eliminating feature prediction constraints and more effectively utilizing large training corpora (Li et al., 3 Mar 2025).

1. Inference Bottleneck and Speculative Sampling

Autoregressive LLMs, particularly those with hundreds of billions of parameters, require a sequential full forward pass for each generated token, resulting in slow and costly inference. Speculative sampling mitigates this by splitting generation into two phases: (1) a lightweight "draft" model that proposes multiple tokens in parallel and (2) verification by the full-scale "target" LLM that scores and accepts or rejects the draft outputs.

Vanilla speculative sampling employs a small draft model $q$ to generate $k$ candidate tokens in one pass, followed by parallel scoring in the target model $p$. The acceptance probability for the $i$-th draft token $\hat{t}_{j+i}$ is given by:

$$\alpha_i = \min\!\left(1,\; \frac{p(\hat{t}_{j+i} \mid T_{1:j+i-1})}{q(\hat{t}_{j+i} \mid T_{1:j+i-1})}\right)$$

A key limitation is the insufficient accuracy of the draft model, which leads to low acceptance rates and restricted speedups.
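To make the acceptance rule concrete, below is a minimal sketch (not the authors' implementation) of the accept-or-resample step for a single drafted token, assuming `p_probs` and `q_probs` are the target and draft distributions over the vocabulary at the current position:

```python
import torch

def accept_or_resample(p_probs: torch.Tensor, q_probs: torch.Tensor,
                       draft_token: int) -> tuple[int, bool]:
    """One vanilla speculative sampling step for a single drafted token.

    p_probs, q_probs: 1-D probability vectors over the vocabulary from the
    target and draft models; draft_token: token id sampled from q.
    Returns (token, was_accepted).
    """
    # Accept with probability alpha = min(1, p(t) / q(t)).
    alpha = min(1.0, (p_probs[draft_token] / q_probs[draft_token]).item())
    if torch.rand(()).item() < alpha:
        return draft_token, True
    # On rejection, resample from the normalized residual max(p - q, 0);
    # this correction keeps the overall output distribution exactly p.
    residual = torch.clamp(p_probs - q_probs, min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, 1)), False
```

The residual resampling step is what makes speculative sampling lossless: accepted or resampled tokens remain distributed exactly according to the target model $p$.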

The EAGLE and EAGLE-2 methods introduced feature-prediction-based drafting: instead of predicting tokens, the draft model generates top-layer hidden features $f_{t+1}$ of the target model, which are then passed to the LM head to obtain token distributions. The corresponding training loss is:

$$\mathcal{L}_{\rm EAGLE} = \underbrace{\|\hat{f}_{t+1} - f_{t+1}\|^2}_{\ell_{\rm fea}} + \underbrace{-\log p(t_{t+1} \mid \hat{f}_{t+1})}_{\ell_{\rm token}}$$

These approaches, particularly EAGLE-2 with dynamic tree pruning, improve speculative sampling, but they remain limited by the expressivity constraints that feature regression imposes and cannot fully leverage larger training sets.
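A minimal sketch of this combined objective, written to match the unweighted formula above (actual implementations may weight the two terms differently):

```python
import torch
import torch.nn.functional as F

def eagle_loss(f_hat: torch.Tensor, f_true: torch.Tensor,
               logits: torch.Tensor, next_token: torch.Tensor) -> torch.Tensor:
    """EAGLE/EAGLE-2-style loss: feature regression plus token cross-entropy.

    f_hat, f_true: predicted and true top-layer features, shape (B, d);
    logits: LM-head outputs on f_hat, shape (B, vocab_size);
    next_token: ground-truth next-token ids, shape (B,).
    """
    l_fea = F.mse_loss(f_hat, f_true)              # ||f_hat - f_{t+1}||^2
    l_token = F.cross_entropy(logits, next_token)  # -log p(t_{t+1} | f_hat)
    return l_fea + l_token
```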

2. Direct Token Prediction and Training-Time Test

EAGLE-3 introduces two principal innovations: the replacement of feature-level drafting with direct token prediction, and multi-layer feature fusion supported by a novel "training-time test" simulation.

The draft model in EAGLE-3 outputs a token distribution $q$ at each step, removing the feature regression loss entirely. The training objective for a generation prefix of length $t$ is:

$$\mathcal{L}_{\rm E3} = -\sum_{i=1}^{k} \log q(t_{t+i} \mid g_{1:t},\, a_{t+1:t+i-1})$$

where $g_{1:t}$ are fused features and $a_{t+1:t+i-1}$ are the draft model's previous outputs. This formulation enables the draft model to capture richer representations and fully exploit increases in training corpus size.
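A sketch of the token-only objective, assuming the draft logits for the $k$ simulated steps have been stacked into a single tensor:

```python
import torch
import torch.nn.functional as F

def eagle3_loss(draft_logits: torch.Tensor,
                target_tokens: torch.Tensor) -> torch.Tensor:
    """Direct token-prediction objective: sum of cross-entropies over the k
    simulated draft steps; no feature-regression term is required.

    draft_logits: (k, vocab_size); target_tokens: (k,) ground-truth token ids.
    """
    return F.cross_entropy(draft_logits, target_tokens, reduction="sum")
```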

Multi-layer feature fusion avoids the narrow top-layer constraint of prior EAGLE versions by collecting features from multiple transformer layers (e.g., low-, mid-, and high-level). These are fused via concatenation and a learned linear projection:

$$g_t = W_{\rm fuse}\,[f^{(1)}_t; \dots; f^{(L)}_t] \in \mathbb{R}^{d}$$

where $d$ is the target model's hidden dimension. The training-time test enables the draft model to simulate multi-step autoregressive generation during training, attending to its own previous predictions through custom causal masks. At each simulated position, the model restricts attention to the correct causal prefix, thereby learning to handle its own outputs as inputs, a crucial requirement for robust drafting at inference.
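A deliberately simplified, sequential sketch of the training-time test for a single prefix, with hypothetical `draft_model` and `lm_head` interfaces (the actual implementation parallelizes all simulated steps in one pass using custom attention masks rather than a Python loop):

```python
import torch

def training_time_test(draft_model, lm_head, g_prefix: torch.Tensor, k: int):
    """Simulate k autoregressive draft steps during training.

    g_prefix: (t, d) fused target-model features for the prefix. At each step
    the draft model's own output vector is appended to its input, so the model
    learns to condition on its own predictions, as it must at inference.
    Returns logits of shape (k, vocab_size) for the training loss.
    """
    inputs, step_logits = g_prefix, []
    for _ in range(k):
        n = inputs.size(0)
        causal = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal mask
        hidden = draft_model(inputs, attn_mask=causal)           # (n, d)
        a_next = hidden[-1]                   # draft output for this step
        step_logits.append(lm_head(a_next))   # token distribution q(.)
        inputs = torch.cat([inputs, a_next.unsqueeze(0)], dim=0)
    return torch.stack(step_logits)
```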

3. Draft Model Architecture and Pipeline Design

EAGLE-3 integrates multi-layer feature fusion and direct token prediction within a streamlined draft model architecture. The inference pipeline operates as follows (a simplified code sketch appears after the list):

  1. The target model $p$ pre-fills the prefix $T_{1:t}$, exposing low-, mid-, and high-level features $(l_t, m_t, h_t)$.
  2. These features are fused into $g_t = W_{\rm fuse}[l_t;\, m_t;\, h_t]$.
  3. The draft model receives as input the tuple $\{g_{1:t},\, a_{t+1:t+j-1}\}$ and processes it with a single-layer Transformer decoder (consisting of self-attention and feed-forward modules).
  4. The output vector $a_{t+j}$ is projected by the target model's LM head to yield the draft token distribution $q(\cdot)$, from which the draft token $\hat{t}_{t+j}$ is sampled.
  5. This drafting step is repeated for up to $k$ tokens, after which all drafted tokens are verified by the target LLM in a single parallel pass.
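Below is a minimal, illustrative sketch of one draft-then-verify round. The interfaces are assumptions for exposition, not the released API: `target` is assumed to return logits plus the three feature levels, `draft` is the one-layer decoder, `w_fuse` is the learned fusion projection, and `verify_with_target` is a hypothetical helper applying the acceptance rule from Section 1.

```python
import torch

@torch.no_grad()
def eagle3_step(target, draft, lm_head, w_fuse, tokens, k: int = 4):
    """One draft-then-verify round (illustrative only)."""
    # Steps 1-2: prefill the target, expose (l, m, h), fuse into g_{1:t}.
    _, (l, m, h) = target(tokens)                   # multi-level features
    state = w_fuse(torch.cat([l, m, h], dim=-1))    # (t, d) fused features
    # Steps 3-4: draft k tokens autoregressively with the one-layer decoder.
    drafted = []
    for _ in range(k):
        a = draft(state)[-1]                        # output vector a_{t+j}
        drafted.append(int(torch.argmax(lm_head(a))))  # greedy for simplicity
        state = torch.cat([state, a.unsqueeze(0)], dim=0)
    # Step 5: one parallel target pass accepts a prefix of the drafted tokens.
    return verify_with_target(target, tokens, drafted)  # hypothetical helper
```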

Compared to prior methods, the multi-layer fusion approach enables the draft model to access more diverse semantic representations, improving draft quality and acceptance rates.

4. Mathematical Formulation and Speedup Metrics

The target LLM's next-token distribution is $p(t_{t+1} \mid T_{1:t})$, while the draft model's approximation is $q(\hat{t}_{t+1} \mid g_{1:t},\, a_{<t+1})$. The speedup ratio $S$ is defined as:

$$S = \frac{T_{\rm base}}{T_{\rm EAGLE3}}$$

where $T_{\rm base}$ is the wall-clock time per generated token under standard autoregressive decoding and $T_{\rm EAGLE3}$ is the time per token with EAGLE-3's draft-plus-verify mechanism.
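Empirically, estimating $S$ amounts to timing both decoders on the same prompts. A rough sketch, where `generate_fn` stands for any hypothetical generation callable:

```python
import time

def seconds_per_token(generate_fn, prompt, n_tokens: int = 256) -> float:
    """Crude wall-clock estimate of per-token latency for one decoder."""
    start = time.perf_counter()
    generate_fn(prompt, max_new_tokens=n_tokens)
    return (time.perf_counter() - start) / n_tokens

# S = seconds_per_token(baseline_generate, p) / seconds_per_token(eagle3_generate, p)
```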

5. Experimental Validation

EAGLE-3 was evaluated on five tasks using various target models:

  • Multi-turn chat: MT-bench
  • Code generation: HumanEval
  • Mathematical reasoning: GSM8K
  • Instruction following: Alpaca dataset
  • Summarization: CNN/Daily Mail

Target models included Vicuna-13B, LLaMA-Instruct 3.1 (8B), LLaMA-Instruct 3.3 (70B), and DeepSeek-R1-Distill-LLaMA (8B).

Key speedup and throughput results include:

| Task/Config | SpS | EAGLE | EAGLE-2 | EAGLE-3 |
|---|---|---|---|---|
| MT-bench, Vicuna-13B, T=0 | 1.93× | 3.07× | 4.26× | 5.58× |
| Mean across tasks, Vicuna-13B, T=0 | | 3.05× | 4.22× | 5.51× |
| Mean across tasks, Vicuna-13B, T=1 | | 2.44× | 3.76× | 4.65× |
| HumanEval, Vicuna-13B, T=0 | | | | 6.47× |
| Batch 24 throughput (vLLM = 1.0×) | | | 1.03× | 1.42× |
| Batch 56 throughput (vLLM = 1.0×) | | | 0.71× | 1.01× |

SpS: vanilla speculative sampling; T: sampling temperature. Blank cells were not reported in this summary.

EAGLE-3 achieves up to 6.47× single-task speedup (HumanEval) and overall mean speedups of 5.51× (greedy sampling, Vicuna-13B, five-task average), with consistent improvements of approximately 1.4× over EAGLE-2. In large-batch configurations (e.g., SGLang framework, batch size 64), EAGLE-3 yields 1.38× throughput improvement relative to baseline (Li et al., 3 Mar 2025).

A plausible implication is that direct token drafting and multi-layer fusion more effectively scale with increased training data, overcoming expressivity saturation observed in feature-regression-based drafting.

6. Implementation and Reproducibility

The EAGLE-3 implementation leverages HuggingFace Transformers and PyTorch, using custom self-attention masks to implement the training-time test's simulation of multi-step autoregressive drafting. For large-batch throughput, EAGLE-3 operates within the vLLM and SGLang serving infrastructures.

Training utilized the AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.95$), a learning rate of $5 \times 10^{-5}$, and gradient clipping at 0.5. The training corpora comprised ShareGPT (68k conversations), UltraChat-200K (464k conversations), and OpenThoughts-114k-math for the math-specialized draft model, yielding approximately 8× more training data than EAGLE/EAGLE-2.
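A sketch of the reported optimizer configuration; `draft_model`, `loader`, and `compute_eagle3_loss` are placeholders rather than names from the released code:

```python
import torch

# Reported hyperparameters: AdamW with betas (0.9, 0.95), lr 5e-5, clip 0.5.
optimizer = torch.optim.AdamW(draft_model.parameters(),
                              lr=5e-5, betas=(0.9, 0.95))

for batch in loader:                                # placeholder data loader
    loss = compute_eagle3_loss(draft_model, batch)  # placeholder training loss
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping at a max global norm of 0.5, as reported.
    torch.nn.utils.clip_grad_norm_(draft_model.parameters(), max_norm=0.5)
    optimizer.step()
```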

Key engineering features include custom causal masks for efficient handling of draft outputs during training, and the dynamic pruning and tree-based draft structuring inherited from EAGLE-2, now combined with direct token-level drafting and multi-layer fusion. All code, scripts, and model checkpoints are available at https://github.com/SafeAILab/EAGLE, enabling reproduction of the reported results (Li et al., 3 Mar 2025).

References

1. Li, Y., Wei, F., Zhang, C., & Zhang, H. (3 Mar 2025). EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. arXiv:2503.01840.
