EAGLE-3: Accelerating LLM Inference
- EAGLE-3 is an inference acceleration technique for LLMs that employs direct token prediction and multi-layer feature fusion to overcome sequential decoding bottlenecks.
- It replaces feature regression with token-level drafting, enabling the use of larger training datasets and significantly improving speculative sampling acceptance rates.
- EAGLE-3 achieves notable speedups—up to 6.47× in tests—with enhanced throughput on diverse tasks such as multi-turn chat, code generation, and summarization.
EAGLE-3 is an inference acceleration technique for LLMs that employs speculative sampling with direct token-level drafting and multi-layer feature fusion facilitated by the "training-time test" procedure. Developed to address the persistent inference bottleneck in autoregressive LLMs, EAGLE-3 significantly improves throughput and speedup across diverse task domains and model families, advancing previous feature-level speculative sampling methods by eliminating feature prediction constraints and more effectively utilizing large training corpora (Li et al., 3 Mar 2025).
1. Inference Bottleneck and Speculative Sampling
Autoregressive LLMs, particularly those with hundreds of billions of parameters, require a sequential full forward pass for each generated token, resulting in slow and costly inference. Speculative sampling mitigates this by splitting generation into two phases: (1) drafting, in which a lightweight draft model cheaply proposes multiple candidate tokens, and (2) verification, in which the full-scale target LLM scores the drafts in parallel and accepts or rejects them.
Vanilla speculative sampling employs a small draft model $q$ to generate candidate tokens in one pass, followed by parallel scoring in the target model $p$. The acceptance probability for the $i$-th draft token $\hat{x}_i$ is given by:

$$\min\!\left(1,\; \frac{p(\hat{x}_i \mid x_{<i})}{q(\hat{x}_i \mid x_{<i})}\right)$$

A key limitation is the insufficient accuracy of the draft model, leading to low acceptance rates and restricted speedups.
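A minimal sketch of this accept/reject rule, assuming per-position next-token distributions `p_target` and `p_draft` from the target and draft models (names illustrative):

```python
import torch

def verify_draft(p_target, p_draft, draft_tokens):
    """Standard speculative-sampling verification (sketch).

    p_target, p_draft: (num_draft, vocab) next-token distributions at each draft position.
    draft_tokens: (num_draft,) token ids proposed by the draft model.
    Returns the number of leading draft tokens that are accepted.
    """
    accepted = 0
    for i, tok in enumerate(draft_tokens.tolist()):
        # Acceptance probability: min(1, p(x_i | x_<i) / q(x_i | x_<i))
        accept_prob = min(1.0, (p_target[i, tok] / p_draft[i, tok]).item())
        if torch.rand(()).item() < accept_prob:
            accepted += 1
        else:
            # On rejection, speculative sampling resamples from the normalized
            # residual distribution max(0, p - q); omitted here for brevity.
            break
    return accepted
```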
The EAGLE and EAGLE-2 methods introduced feature-prediction-based drafting: instead of predicting tokens, the draft model generates top-layer hidden features for the target model, which are then passed to the LM head to obtain token distributions. The corresponding training loss combines feature regression with token classification:

$$\mathcal{L} = \mathcal{L}_{\text{reg}} + w_{\text{cls}}\, \mathcal{L}_{\text{cls}}, \qquad \mathcal{L}_{\text{reg}} = \text{SmoothL1}\big(f_{t+1}, \hat{f}_{t+1}\big)$$

These approaches, particularly EAGLE-2 with dynamic tree-pruning, improve speculative sampling but remain limited by the expressivity constraints imposed by feature regression and cannot fully leverage larger training sets.
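A rough sketch of this combined objective, assuming the target model's top-layer features serve as regression labels (the weighting `w_cls` is illustrative):

```python
import torch.nn.functional as F

def eagle_feature_loss(pred_features, target_features, pred_logits, target_tokens, w_cls=0.1):
    """Feature-regression drafting loss in the style of EAGLE/EAGLE-2 (sketch).

    pred_features / target_features: (batch, seq, hidden) draft vs. target top-layer features.
    pred_logits: (batch, seq, vocab) logits from passing pred_features through the LM head.
    target_tokens: (batch, seq) next-token labels.
    """
    l_reg = F.smooth_l1_loss(pred_features, target_features)
    l_cls = F.cross_entropy(pred_logits.flatten(0, 1), target_tokens.flatten())
    return l_reg + w_cls * l_cls
```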
2. Direct Token Prediction and Training-Time Test
EAGLE-3 introduces two principal innovations: the replacement of feature-level drafting with direct token prediction, and the fusion of multi-layer features trained via a novel "training-time test" simulation.
The draft model in EAGLE-3 outputs a token distribution at each step, removing the feature regression loss entirely. The training objective for a generation prefix of length $T$ is:

$$\mathcal{L} = \sum_{j \ge 1} \mathrm{CE}\Big( p\big(\cdot \mid x_{\le T+j-1}\big),\; \hat{p}\big(\cdot \mid g_{\le T}, \hat{x}_{T+1:T+j-1}\big) \Big)$$

where $g_{\le T}$ are fused features and $\hat{x}_{T+1:T+j-1}$ are previous draft outputs. This formulation enables the draft model to capture richer representations and fully exploit increases in training corpus size.
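A minimal sketch of this token-level objective under those assumptions, with soft labels taken from the target model's next-token distribution (names illustrative):

```python
import torch.nn.functional as F

def eagle3_token_loss(draft_logits, target_probs):
    """Direct token-prediction loss (sketch): cross-entropy between the draft
    distribution and the target model's next-token distribution, with no
    feature-regression term.

    draft_logits: (batch, steps, vocab) draft-model logits at each simulated draft step.
    target_probs: (batch, steps, vocab) target-model next-token probabilities (soft labels).
    """
    log_q = F.log_softmax(draft_logits, dim=-1)
    # -sum_v p(v) log q(v), averaged over batch and simulated steps.
    return -(target_probs * log_q).sum(dim=-1).mean()
```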
Multi-layer feature fusion avoids the narrow top-layer constraint of prior EAGLE versions by collecting features from multiple transformer layers (e.g., low-, mid-, and high-level). These are fused via concatenation and a learned linear projection:

$$g_t = W\,\big[\, f_t^{\text{low}};\ f_t^{\text{mid}};\ f_t^{\text{high}} \,\big]$$

The training-time test enables the draft model to simulate multi-step autoregressive generation during training, attending to its own previous predictions through custom causal masks. At each simulated position, the model restricts attention to the correct causal prefix, thereby learning to handle its own outputs as inputs, a crucial requirement for robust drafting at inference.
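The sketch below illustrates both components under assumed names and shapes: a fusion module that concatenates three feature streams and projects them back to the hidden size, and a simplified attention mask for one prefix followed by several simulated draft positions, each restricted to its causal prefix (including earlier draft outputs):

```python
import torch
import torch.nn as nn

class FuseLayers(nn.Module):
    """Fuse low-, mid-, and high-level target features: g = W [f_low ; f_mid ; f_high]."""
    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(3 * hidden_size, hidden_size, bias=False)

    def forward(self, f_low, f_mid, f_high):
        return self.proj(torch.cat([f_low, f_mid, f_high], dim=-1))

def training_time_test_mask(prefix_len, draft_steps):
    """Simplified additive attention mask for the training-time test (sketch).

    The first prefix_len positions hold the verified prefix; the remaining
    draft_steps positions hold simulated draft outputs. Every position may
    attend only to its causal prefix, so each simulated step sees the prefix
    plus the draft tokens it has already produced, mirroring inference.
    """
    total = prefix_len + draft_steps
    mask = torch.full((total, total), float("-inf"))
    mask.masked_fill_(torch.tril(torch.ones(total, total, dtype=torch.bool)), 0.0)
    return mask  # added to the attention scores before the softmax
```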
3. Draft Model Architecture and Pipeline Design
EAGLE-3 integrates multi-layer feature fusion and direct token prediction within a streamlined draft model architecture. The inference pipeline operates as follows (a code sketch follows the list):
- The target model pre-fills the prefix $x_{1:T}$, exposing low-, mid-, and high-level features $f^{\text{low}}$, $f^{\text{mid}}$, $f^{\text{high}}$.
- These features are fused into a single vector $g_t$.
- The draft model receives as input the fused feature $g_t$ together with the embedding of the most recent token, and processes them with a single-layer Transformer decoder (consisting of self-attention and feed-forward modules).
- The output vector is projected by the target model's LM head to yield the draft token distribution $\hat{p}$, from which the draft token $\hat{x}$ is sampled.
- This draft step is repeated for up to $K$ tokens, after which all drafted tokens are verified by the target LLM in parallel.
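A heavily simplified sketch of this loop, assuming hypothetical callables (`target_model` exposing three feature levels, `draft_decoder`, `fuse`, `lm_head`, `embed`) whose interfaces are illustrative rather than the released implementation:

```python
import torch

def eagle3_draft(target_model, draft_decoder, fuse, lm_head, embed, prefix_ids, num_draft=4):
    """One draft phase of the EAGLE-3 pipeline (sketch); verification follows separately.

    target_model(ids) -> (logits, f_low, f_mid, f_high) for the last prefix position.
    draft_decoder(x)  -> hidden state of the single-layer Transformer draft model.
    fuse(...)         -> fused feature g (concatenation + linear projection).
    lm_head, embed    -> the target model's output head and input embedding.
    """
    # 1. The target model pre-fills the prefix and exposes multi-level features.
    logits, f_low, f_mid, f_high = target_model(prefix_ids)
    g = fuse(f_low, f_mid, f_high)                 # fused feature for the last position
    last_tok = logits.argmax(dim=-1)               # (batch,) first token from the target, greedy here

    draft_tokens = []
    for _ in range(num_draft):
        # 2. Draft step: fused feature + embedding of the latest token -> draft distribution.
        h = draft_decoder(torch.cat([g, embed(last_tok)], dim=-1))
        probs = torch.softmax(lm_head(h), dim=-1)
        last_tok = torch.multinomial(probs, 1).squeeze(-1)
        draft_tokens.append(last_tok)
        g = h                                       # the draft's own hidden state feeds the next step
    # 3. All drafted tokens are then scored by the target model in a single
    #    parallel pass and accepted/rejected as in Section 1.
    return torch.stack(draft_tokens, dim=-1)       # (batch, num_draft)
```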
Compared to prior methods, the multi-layer fusion approach enables the draft model to access more diverse semantic representations, improving draft quality and acceptance rates.
4. Mathematical Formulation and Speedup Metrics
The target LLM's next-token distribution is expressed as $p(x_{t+1} \mid x_{\le t})$, while the draft model's approximation is $\hat{p}(x_{t+1} \mid x_{\le t})$. The speedup ratio is formally defined as:

$$\text{Speedup} = \frac{T_{\text{AR}}}{T_{\text{EAGLE-3}}}$$

where $T_{\text{AR}}$ is the wall-clock per-token generation time for standard autoregressive decoding and $T_{\text{EAGLE-3}}$ is the per-token time with EAGLE-3's draft-plus-verify mechanism.
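As a worked example with illustrative timings:

```python
# Hypothetical per-token wall-clock times, for illustration only.
t_autoregressive = 0.025   # seconds per token, standard decoding
t_eagle3 = 0.0045          # seconds per token, amortized over accepted draft tokens

speedup = t_autoregressive / t_eagle3
print(f"speedup = {speedup:.2f}x")   # ~5.56x
```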
5. Experimental Validation
EAGLE-3 was evaluated on five tasks using various target models:
- Multi-turn chat: MT-bench
- Code generation: HumanEval
- Mathematical reasoning: GSM8K
- Instruction following: Alpaca dataset
- Summarization: CNN/Daily Mail
Target models included Vicuna-13B, LLaMA-Instruct 3.1 (8B), LLaMA-Instruct 3.3 (70B), and DeepSeek-R1-Distill-LLaMA (8B).
Key speedup and throughput results include:
| Task/Config | SpS | EAGLE | EAGLE-2 | EAGLE-3 |
|---|---|---|---|---|
| MT-bench, Vicuna-13B, T=0 | 1.93× | 3.07× | 4.26× | 5.58× |
| Mean across tasks, Vicuna-13B, T=0 | — | 3.05× | 4.22× | 5.51× |
| Mean across tasks, Vicuna-13B, T=1 | — | 2.44× | 3.76× | 4.65× |
| HumanEval, Vicuna-13B, T=0 | — | — | — | 6.47× |
| Batch 24 throughput (vLLM=1.0×) | — | 1.03× | — | 1.42× |
| Batch 56 throughput (vLLM=1.0×) | — | 0.71× | — | 1.01× |
SpS: Vanilla speculative sampling; T: Sampling temperature.
EAGLE-3 achieves up to 6.47× single-task speedup (HumanEval) and overall mean speedups of 5.51× (greedy sampling, Vicuna-13B, five-task average), with consistent improvements of approximately 1.4× over EAGLE-2. In large-batch configurations (e.g., SGLang framework, batch size 64), EAGLE-3 yields 1.38× throughput improvement relative to baseline (Li et al., 3 Mar 2025).
A plausible implication is that direct token drafting and multi-layer fusion more effectively scale with increased training data, overcoming expressivity saturation observed in feature-regression-based drafting.
6. Implementation and Reproducibility
The EAGLE-3 implementation leverages HuggingFace Transformers and PyTorch, integrating custom self-attention masks to enable the training-time test simulation of multi-step autoregressive drafting. For large-batch throughput, EAGLE-3 operates within the vLLM infrastructure.
Training utilized the AdamW optimizer with gradient clipping at 0.5. The training corpora comprised ShareGPT (68k conversations), UltraChat-200K (464k), and OpenThoughts-114k-math for the math-specialized draft model, yielding approximately 8× larger datasets than EAGLE/EAGLE-2.
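A minimal sketch of this optimization setup; the draft module and learning rate below are placeholders, since the exact hyperparameter values are not reproduced here:

```python
import torch

draft_model = torch.nn.Linear(4096, 4096)        # stand-in for the single-layer draft decoder
optimizer = torch.optim.AdamW(draft_model.parameters(), lr=1e-4)  # lr is a placeholder value

def training_step(loss):
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping at 0.5, as described above.
    torch.nn.utils.clip_grad_norm_(draft_model.parameters(), max_norm=0.5)
    optimizer.step()
```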
Key engineering features include custom causal masks for efficient handling of draft outputs during training, and dynamic pruning/tree-based structuring inherited from EAGLE-2, now enhanced by direct token-level drafting and multi-layer fusion. All code, scripts, and model checkpoints are available at https://github.com/SafeAILab/EAGLE, enabling full reproducibility of the reported results (Li et al., 3 Mar 2025).