EAGLE-3: Accelerating LLM Inference

Updated 20 November 2025
  • EAGLE-3 is an inference acceleration technique for LLMs that employs direct token prediction and multi-layer feature fusion to overcome sequential decoding bottlenecks.
  • It replaces feature regression with token-level drafting, enabling the use of larger training datasets and significantly improving speculative sampling acceptance rates.
  • EAGLE-3 achieves notable speedups—up to 6.47× in tests—with enhanced throughput on diverse tasks such as multi-turn chat, code generation, and summarization.

EAGLE-3 is an inference acceleration technique for LLMs that employs speculative sampling with direct token-level drafting and multi-layer feature fusion facilitated by the "training-time test" procedure. Developed to address the persistent inference bottleneck in autoregressive LLMs, EAGLE-3 significantly improves throughput and speedup across diverse task domains and model families, advancing previous feature-level speculative sampling methods by eliminating feature prediction constraints and more effectively utilizing large training corpora (Li et al., 3 Mar 2025).

1. Inference Bottleneck and Speculative Sampling

Autoregressive LLMs, particularly those with hundreds of billions of parameters, require a sequential full forward pass for each generated token, resulting in slow and costly inference. Speculative sampling mitigates this by splitting generation into two phases: (1) a lightweight "draft" model that proposes multiple tokens in parallel and (2) verification by the full-scale "target" LLM that scores and accepts or rejects the draft outputs.

Vanilla speculative sampling employs a small draft model $q$ to generate $k$ candidate tokens in one pass, followed by parallel scoring in the target model $p$. The acceptance probability for the $i$-th draft token $\hat{t}_{j+i}$ is given by:

$$\alpha_i = \min\!\left(1,\; \frac{p(\hat{t}_{j+i} \mid T_{1:j+i-1})}{q(\hat{t}_{j+i} \mid T_{1:j+i-1})}\right)$$

A key limitation is the insufficient accuracy of the draft model, which leads to low acceptance rates and restricted speedups.
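To make the acceptance rule concrete, below is a minimal sketch (not the authors' implementation) of the accept-or-resample step for a single drafted token, assuming `p_probs` and `q_probs` are the target and draft distributions over the vocabulary at the current position:

```python
import torch

def accept_or_resample(p_probs: torch.Tensor, q_probs: torch.Tensor,
                       draft_token: int) -> tuple[int, bool]:
    """One vanilla speculative sampling step for a single drafted token.

    p_probs, q_probs: 1-D probability vectors over the vocabulary from the
    target and draft models; draft_token: token id sampled from q.
    Returns (token, was_accepted).
    """
    # Accept with probability alpha = min(1, p(t) / q(t)).
    alpha = min(1.0, (p_probs[draft_token] / q_probs[draft_token]).item())
    if torch.rand(()).item() < alpha:
        return draft_token, True
    # On rejection, resample from the normalized residual max(p - q, 0);
    # this correction keeps the overall output distribution exactly p.
    residual = torch.clamp(p_probs - q_probs, min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, 1)), False
```

The residual resampling step is what makes speculative sampling lossless: accepted or resampled tokens remain distributed exactly according to the target model $p$.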

The EAGLE and EAGLE-2 methods introduced feature-prediction-based drafting: instead of predicting tokens, the draft model generates top-layer hidden features $f_{t+1}$ of the target model, which are then passed to the LM head to obtain token distributions. The corresponding training loss is:

$$\mathcal{L}_{\rm EAGLE} = \underbrace{\|\hat{f}_{t+1} - f_{t+1}\|^2}_{\ell_{\rm fea}} + \underbrace{-\log p(t_{t+1} \mid \hat{f}_{t+1})}_{\ell_{\rm token}}$$

These approaches, particularly EAGLE-2 with dynamic tree pruning, improve speculative sampling, but they remain limited by the expressivity constraints that feature regression imposes and cannot fully leverage larger training sets.
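A minimal sketch of this combined objective, written to match the unweighted formula above (actual implementations may weight the two terms differently):

```python
import torch
import torch.nn.functional as F

def eagle_loss(f_hat: torch.Tensor, f_true: torch.Tensor,
               logits: torch.Tensor, next_token: torch.Tensor) -> torch.Tensor:
    """EAGLE/EAGLE-2-style loss: feature regression plus token cross-entropy.

    f_hat, f_true: predicted and true top-layer features, shape (B, d);
    logits: LM-head outputs on f_hat, shape (B, vocab_size);
    next_token: ground-truth next-token ids, shape (B,).
    """
    l_fea = F.mse_loss(f_hat, f_true)              # ||f_hat - f_{t+1}||^2
    l_token = F.cross_entropy(logits, next_token)  # -log p(t_{t+1} | f_hat)
    return l_fea + l_token
```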

2. Direct Token Prediction and Training-Time Test

EAGLE-3 introduces two principal innovations: the replacement of feature-level drafting with direct token prediction, and multi-layer feature fusion supported by a novel "training-time test" simulation.

The draft model in EAGLE-3 outputs a token distribution $q$ at each step, removing the feature regression loss entirely. The training objective for a generation prefix of length $t$ is:

$$\mathcal{L}_{\rm E3} = -\sum_{i=1}^{k} \log q(t_{t+i} \mid g_{1:t},\, a_{t+1:t+i-1})$$

where $g_{1:t}$ are fused features and $a_{t+1:t+i-1}$ are the draft model's previous outputs. This formulation enables the draft model to capture richer representations and fully exploit increases in training corpus size.
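A sketch of the token-only objective, assuming the draft logits for the $k$ simulated steps have been stacked into a single tensor:

```python
import torch
import torch.nn.functional as F

def eagle3_loss(draft_logits: torch.Tensor,
                target_tokens: torch.Tensor) -> torch.Tensor:
    """Direct token-prediction objective: sum of cross-entropies over the k
    simulated draft steps; no feature-regression term is required.

    draft_logits: (k, vocab_size); target_tokens: (k,) ground-truth token ids.
    """
    return F.cross_entropy(draft_logits, target_tokens, reduction="sum")
```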

Multi-layer feature fusion avoids the narrow top-layer constraint of prior EAGLE versions by collecting features from multiple transformer layers (e.g., low-, mid-, and high-level). These are fused via concatenation and a learned linear projection:

$$g_t = W_{\rm fuse}\,[f^{(1)}_t; \dots; f^{(L)}_t] \in \mathbb{R}^{d}$$

where $d$ is the target model's hidden dimension. The training-time test enables the draft model to simulate multi-step autoregressive generation during training, attending to its own previous predictions through custom causal masks. At each simulated position, the model restricts attention to the correct causal prefix, thereby learning to handle its own outputs as inputs, a crucial requirement for robust drafting at inference.
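A deliberately simplified, sequential sketch of the training-time test for a single prefix, with hypothetical `draft_model` and `lm_head` interfaces (the actual implementation parallelizes all simulated steps in one pass using custom attention masks rather than a Python loop):

```python
import torch

def training_time_test(draft_model, lm_head, g_prefix: torch.Tensor, k: int):
    """Simulate k autoregressive draft steps during training.

    g_prefix: (t, d) fused target-model features for the prefix. At each step
    the draft model's own output vector is appended to its input, so the model
    learns to condition on its own predictions, as it must at inference.
    Returns logits of shape (k, vocab_size) for the training loss.
    """
    inputs, step_logits = g_prefix, []
    for _ in range(k):
        n = inputs.size(0)
        causal = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal mask
        hidden = draft_model(inputs, attn_mask=causal)           # (n, d)
        a_next = hidden[-1]                   # draft output for this step
        step_logits.append(lm_head(a_next))   # token distribution q(.)
        inputs = torch.cat([inputs, a_next.unsqueeze(0)], dim=0)
    return torch.stack(step_logits)
```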

3. Draft Model Architecture and Pipeline Design

EAGLE-3 integrates multi-layer feature fusion and direct token prediction within a streamlined draft model architecture. The inference pipeline operates as follows (a simplified code sketch appears after the list):

  1. The target model $p$ pre-fills the prefix $T_{1:t}$, exposing low-, mid-, and high-level features $(l_t, m_t, h_t)$.
  2. These features are fused into $g_t = W_{\rm fuse}[l_t;\, m_t;\, h_t]$.
  3. The draft model receives as input the tuple $\{g_{1:t},\, a_{t+1:t+j-1}\}$ and processes it with a single-layer Transformer decoder (consisting of self-attention and feed-forward modules).
  4. The output vector $a_{t+j}$ is projected by the target model's LM head to yield the draft token distribution $q(\cdot)$, from which the draft token $\hat{t}_{t+j}$ is sampled.
  5. This drafting step is repeated for up to $k$ tokens, after which all drafted tokens are verified by the target LLM in a single parallel pass.
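Below is a minimal, illustrative sketch of one draft-then-verify round. The interfaces are assumptions for exposition, not the released API: `target` is assumed to return logits plus the three feature levels, `draft` is the one-layer decoder, `w_fuse` is the learned fusion projection, and `verify_with_target` is a hypothetical helper applying the acceptance rule from Section 1.

```python
import torch

@torch.no_grad()
def eagle3_step(target, draft, lm_head, w_fuse, tokens, k: int = 4):
    """One draft-then-verify round (illustrative only)."""
    # Steps 1-2: prefill the target, expose (l, m, h), fuse into g_{1:t}.
    _, (l, m, h) = target(tokens)                   # multi-level features
    state = w_fuse(torch.cat([l, m, h], dim=-1))    # (t, d) fused features
    # Steps 3-4: draft k tokens autoregressively with the one-layer decoder.
    drafted = []
    for _ in range(k):
        a = draft(state)[-1]                        # output vector a_{t+j}
        drafted.append(int(torch.argmax(lm_head(a))))  # greedy for simplicity
        state = torch.cat([state, a.unsqueeze(0)], dim=0)
    # Step 5: one parallel target pass accepts a prefix of the drafted tokens.
    return verify_with_target(target, tokens, drafted)  # hypothetical helper
```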

Compared to prior methods, the multi-layer fusion approach enables the draft model to access more diverse semantic representations, improving draft quality and acceptance rates.

4. Mathematical Formulation and Speedup Metrics

The target LLM's next-token distribution is $p(t_{t+1} \mid T_{1:t})$, while the draft model's approximation is $q(\hat{t}_{t+1} \mid g_{1:t},\, a_{<t+1})$. The speedup ratio $S$ is defined as:

$$S = \frac{T_{\rm base}}{T_{\rm EAGLE3}}$$

where $T_{\rm base}$ is the wall-clock time per generated token under standard autoregressive decoding and $T_{\rm EAGLE3}$ is the time per token with EAGLE-3's draft-plus-verify mechanism.
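Empirically, estimating $S$ amounts to timing both decoders on the same prompts. A rough sketch, where `generate_fn` stands for any hypothetical generation callable:

```python
import time

def seconds_per_token(generate_fn, prompt, n_tokens: int = 256) -> float:
    """Crude wall-clock estimate of per-token latency for one decoder."""
    start = time.perf_counter()
    generate_fn(prompt, max_new_tokens=n_tokens)
    return (time.perf_counter() - start) / n_tokens

# S = seconds_per_token(baseline_generate, p) / seconds_per_token(eagle3_generate, p)
```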

5. Experimental Validation

EAGLE-3 was evaluated on five tasks using various target models:

  • Multi-turn chat: MT-bench
  • Code generation: HumanEval
  • Mathematical reasoning: GSM8K
  • Instruction following: Alpaca dataset
  • Summarization: CNN/Daily Mail

Target models included Vicuna-13B, LLaMA-Instruct 3.1 (8B), LLaMA-Instruct 3.3 (70B), and DeepSeek-R1-Distill-LLaMA (8B).

Key speedup and throughput results include:

| Task/Config | SpS | EAGLE | EAGLE-2 | EAGLE-3 |
|---|---|---|---|---|
| MT-bench, Vicuna-13B, T=0 | 1.93× | 3.07× | 4.26× | 5.58× |
| Mean across tasks, Vicuna-13B, T=0 | | 3.05× | 4.22× | 5.51× |
| Mean across tasks, Vicuna-13B, T=1 | | 2.44× | 3.76× | 4.65× |
| HumanEval, Vicuna-13B, T=0 | | | | 6.47× |
| Batch 24 throughput (vLLM = 1.0×) | | | 1.03× | 1.42× |
| Batch 56 throughput (vLLM = 1.0×) | | | 0.71× | 1.01× |

SpS: vanilla speculative sampling; T: sampling temperature. Blank cells were not reported in this summary.

EAGLE-3 achieves up to 6.47× single-task speedup (HumanEval) and overall mean speedups of 5.51× (greedy sampling, Vicuna-13B, five-task average), with consistent improvements of approximately 1.4× over EAGLE-2. In large-batch configurations (e.g., SGLang framework, batch size 64), EAGLE-3 yields 1.38× throughput improvement relative to baseline (Li et al., 3 Mar 2025).

A plausible implication is that direct token drafting and multi-layer fusion more effectively scale with increased training data, overcoming expressivity saturation observed in feature-regression-based drafting.

6. Implementation and Reproducibility

The EAGLE-3 implementation leverages HuggingFace Transformers and PyTorch, using custom self-attention masks to implement the training-time test's simulation of multi-step autoregressive drafting. For large-batch throughput, EAGLE-3 operates within the vLLM and SGLang serving infrastructures.

Training utilized the AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.95$), a learning rate of $5 \times 10^{-5}$, and gradient clipping at 0.5. The training corpora comprised ShareGPT (68k conversations), UltraChat-200K (464k conversations), and OpenThoughts-114k-math for the math-specialized draft model, yielding approximately 8× more training data than EAGLE/EAGLE-2.
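A sketch of the reported optimizer configuration; `draft_model`, `loader`, and `compute_eagle3_loss` are placeholders rather than names from the released code:

```python
import torch

# Reported hyperparameters: AdamW with betas (0.9, 0.95), lr 5e-5, clip 0.5.
optimizer = torch.optim.AdamW(draft_model.parameters(),
                              lr=5e-5, betas=(0.9, 0.95))

for batch in loader:                                # placeholder data loader
    loss = compute_eagle3_loss(draft_model, batch)  # placeholder training loss
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping at a max global norm of 0.5, as reported.
    torch.nn.utils.clip_grad_norm_(draft_model.parameters(), max_norm=0.5)
    optimizer.step()
```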

Key engineering features include custom causal masks for efficient handling of draft outputs during training, and the dynamic pruning and tree-based draft structuring inherited from EAGLE-2, now combined with direct token-level drafting and multi-layer fusion. All code, scripts, and model checkpoints are available at https://github.com/SafeAILab/EAGLE, enabling reproduction of the reported results (Li et al., 3 Mar 2025).

References

1. Li, Y., Wei, F., Zhang, C., & Zhang, H. (3 Mar 2025). EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. arXiv:2503.01840.
