EAGLE-2: Dynamic Lossless LLM Acceleration
- The paper introduces EAGLE-2, a method that employs a dynamic, context-aware draft tree to enhance speculative sampling without altering the generated distribution.
- EAGLE-2 uses a two-phase draft process—expansion guided by confidence-derived value estimates, followed by reranking of candidate tokens—achieving empirical speedups of 20–40% over the original EAGLE.
- The algorithm preserves the output distribution of the original model while significantly increasing parallelism and practical inference efficiency for large language models.
EAGLE-2 is a lossless acceleration algorithm for LLM inference that extends speculative sampling and the static-tree-based EAGLE method by introducing a context-aware dynamic draft tree. By leveraging the calibration properties of the draft model to dynamically adjust the speculative sampling process, EAGLE-2 achieves increased parallelism and higher token acceptance rates without altering the probability distribution of generated sequences (Li et al., 2024).
1. Motivation and Background
A principal bottleneck in decoding with LLMs is the linear dependence of inference cost on sequence length, since generation typically requires one forward pass per token. Speculative sampling mitigates this by using a smaller draft model to generate multiple candidate tokens in parallel, then verifying these candidates with the full model, accepting as many tokens as possible in a single batch.
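To ground the verify-accept mechanism, here is a minimal, self-contained sketch of chain (single-path) speculative sampling with toy distributions; the function name, vocabulary size, and distributions are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_target, q_draft, draft_tokens):
    """One verify cycle of chain speculative sampling.

    p_target: n+1 next-token distributions from the full model (the last
    one is for the bonus token); q_draft: n distributions from the draft
    model; draft_tokens: the n tokens the draft model proposed.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_target[i], q_draft[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):  # accept w.p. min(1, p/q)
            out.append(tok)
        else:
            # Rejected: resample from the normalized residual max(p - q, 0),
            # which keeps the overall output distribution exactly p.
            residual = np.maximum(p - q, 0.0)
            out.append(rng.choice(len(p), p=residual / residual.sum()))
            return out
    # Every draft accepted: take one bonus token from the last target dist.
    out.append(rng.choice(len(p_target[-1]), p=p_target[-1]))
    return out

# Toy usage: vocabulary of 4 tokens, two drafted tokens.
q = [np.array([0.7, 0.1, 0.1, 0.1]), np.array([0.25, 0.25, 0.25, 0.25])]
p = [np.array([0.6, 0.2, 0.1, 0.1]), np.array([0.25, 0.25, 0.25, 0.25]),
     np.array([0.4, 0.3, 0.2, 0.1])]
print(speculative_step(p, q, draft_tokens=[0, 2]))
```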
Prior methods such as EAGLE (Li et al., ICML 2024) employ static draft trees—predefined k-ary tree shapes for speculative drafting. These static shapes implicitly assume that draft-token acceptance rates depend only on tree position, neglecting the strong context dependence observed empirically. Consequently, they misallocate computational budget between "easy" and "hard" semantic contexts.
EAGLE-2 addresses these limitations by constructing the draft tree dynamically, informed by the draft model’s confidence scores, and targets maximization of the expected acceptance length per speculative batch—yielding empirically verified speedups of 20–40% over EAGLE (Li et al., 2024).
2. Core Algorithmic Structure
EAGLE-2 augments the classic speculative sampling framework with a dynamic, context-aware tree-building process comprising two principal phases: expansion (tree growth) and reranking (tree pruning), followed by token verification.
Expansion Phase
At each expansion depth, every node $i$ in the current draft tree is assigned a "value" $V_i$, estimating the global probability that the full model will accept the entire path from the root to that node. Formally, for node $i$:

$$V_i \;=\; \prod_{j \in \mathrm{Path}(\mathrm{root},\, i)} p_j,$$

where $p_j$ is the probability that the draft token at node $j$ is accepted.
Each $p_j$ is approximated by the draft model's confidence score $c_j$, justified by empirical calibration (see Section 3).
Only the top-$K$ nodes—those with the highest $V_i$—are expanded, each generating new children via the draft model; expansion is implemented with batched tree attention to maximize parallelism while containing computational cost.
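A minimal Python sketch of this expansion step, under an assumed tree representation: `DraftNode` and the `draft_top_k` callback (returning the draft model's top tokens and confidences for the context ending at a node) are hypothetical stand-ins, not the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class DraftNode:
    token: int
    confidence: float              # draft model's probability for this token
    parent: "DraftNode | None" = None
    children: list = field(default_factory=list)

    @property
    def value(self) -> float:
        # V_i: product of confidences along the path from the root, which
        # approximates the probability that the whole path is accepted.
        v, node = 1.0, self
        while node.parent is not None:
            v *= node.confidence
            node = node.parent
        return v

def expand(leaves, draft_top_k, k_expand=8):
    """One expansion step: grow only the k_expand most promising leaves.

    draft_top_k(node) is a hypothetical callback assumed to return
    [(token, confidence), ...] from the draft model; in a real system this
    is a single batched forward pass with tree attention over all leaves.
    """
    best = sorted(leaves, key=lambda n: n.value, reverse=True)[:k_expand]
    new_leaves = []
    for node in best:
        for tok, conf in draft_top_k(node):
            child = DraftNode(tok, conf, parent=node)
            node.children.append(child)
            new_leaves.append(child)
    return new_leaves
```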
Reranking Phase
After the tree reaches depth $D$, all nodes are rescored by $V_i$. The top $M$ nodes are selected for verification, with ties broken in favor of shallower nodes; since a node's value never exceeds its parent's (confidences are at most 1), this preserves a prefix tree structure. The selected nodes are flattened into a sequence according to tree ancestry, and a custom attention mask is constructed so each token can only attend to its ancestors—ensuring validity for downstream verification with the original LLM.
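Continuing the illustrative `DraftNode` sketch above, reranking and mask construction might look like this (a sketch, not the reference implementation):

```python
import numpy as np

def node_depth(n):
    d = 0
    while n.parent is not None:
        d, n = d + 1, n.parent
    return d

def rerank_and_mask(root, m=16):
    """Pick the top-m nodes by value and build an ancestor-only mask.

    Because V_i never exceeds the parent's value, the top-m set is
    prefix-closed up to ties; breaking ties shallow-first keeps it so.
    """
    nodes, stack = [], list(root.children)
    while stack:
        n = stack.pop()
        nodes.append(n)
        stack.extend(n.children)
    nodes.sort(key=lambda n: (-n.value, node_depth(n)))
    chosen = sorted(nodes[:m], key=node_depth)  # ancestors before descendants
    index = {id(n): i for i, n in enumerate(chosen)}
    mask = np.zeros((len(chosen), len(chosen)), dtype=bool)
    for i, n in enumerate(chosen):
        a = n
        while a is not None and id(a) in index:
            mask[i, index[id(a)]] = True  # token i attends to itself/ancestors
            a = a.parent
    return chosen, mask
```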
Verification
The pruned and masked sequence is supplied to the full LLM for batch scoring. Acceptance proceeds sequentially from the root, adopting each draft token $\hat{x}$ with probability

$$\min\!\left(1,\; \frac{p(\hat{x}\mid \text{context})}{q(\hat{x}\mid \text{context})}\right),$$

where $p$ denotes the original model's distribution and $q$ the draft model's. On the first rejection, a corrective token is sampled from the normalized residual $\max(p - q,\, 0)$ and autoregressive decoding resumes from that position, restarting the cycle.
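A simplified sketch of walking the draft tree under this rule, again using the hypothetical `DraftNode` structure; `target_prob(parent, token)` stands in for looking up the full model's probability from the single masked batch pass. Note this greedy sibling walk omits the residual-resampling bookkeeping that exact lossless multi-draft verification performs across rejected siblings.

```python
def verify_tree(root, target_prob, rng):
    """Greedy walk of the draft tree under the min(1, p/q) rule.

    target_prob(parent, token) is assumed to return the full model's
    probability of `token` given the context ending at `parent`. This
    sketch omits exact residual handling across rejected sibling drafts.
    """
    accepted, node = [], root
    while node.children:
        for child in node.children:
            p = target_prob(node, child.token)   # full model's probability
            if rng.random() < min(1.0, p / child.confidence):
                accepted.append(child.token)
                node = child
                break  # descend and try this child's children next
        else:
            break  # all drafted children rejected: resume autoregressively
    return accepted
```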
3. Calibration Properties and Theoretical Guarantees
EAGLE-2's dynamic draft selection hinges on the well-calibrated nature of the draft model: on Alpaca with Vicuna-7B, draft tokens with confidence scores near 1 are accepted by the full model the vast majority of the time, while confidence scores near 0 correspond to acceptance rates near zero [(Li et al., 2024), Fig. 6]. No additional training or calibration is needed; draft model probabilities are used directly for value estimation.
The algorithm preserves the exact output distribution of standard autoregressive sampling. For any generated sequence $x_{1:T}$,

$$\mathbb{P}_{\text{EAGLE-2}}(x_{1:T}) \;=\; \mathbb{P}_{\text{LLM}}(x_{1:T})$$

by construction, since only the draft selection is altered, with the acceptance criterion and verification unchanged. This inherits the lossless acceleration property proved for general speculative sampling schemes (Li et al., 2024).
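For a single draft token, the guarantee follows from the standard accept-resample identity (not specific to EAGLE-2): accepting $x \sim q$ with probability $\min(1, p(x)/q(x))$ and otherwise resampling from the normalized residual gives

$$\mathbb{P}[\text{output} = x] = q(x)\min\!\Big(1, \tfrac{p(x)}{q(x)}\Big) + \Big(\textstyle\sum_{y}\max\big(p(y)-q(y),0\big)\Big)\cdot \frac{\max\big(p(x)-q(x),0\big)}{\sum_{y}\max\big(p(y)-q(y),0\big)} = \min\big(p(x),q(x)\big) + \max\big(p(x)-q(x),0\big) = p(x),$$

using the identity $1-\sum_y q(y)\min\big(1, p(y)/q(y)\big) = \sum_y \max\big(p(y)-q(y),0\big)$.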
4. Key Mathematical Formulations
Several central definitions and identities structure the EAGLE-2 framework:
- Context-Dependent Acceptance Rate: For the draft token at offset $i$ with prefix $x_{<t}$,

$$A(i,\, x_{<t}) = \mathbb{P}\big[x_{t+i} \text{ is accepted by the full LLM}\big],$$

which, in classic speculative sampling with target distribution $p$ and draft distribution $q$, is

$$\mathbb{E}_{x \sim q}\!\left[\min\!\left(1, \frac{p(x)}{q(x)}\right)\right] = \sum_{x} \min\big(p(x),\, q(x)\big).$$

- Global Acceptance Probability ("value" $V_i$):

$$V_i = \prod_{j \in \mathrm{Path}(\mathrm{root},\, i)} A_j \;\approx\; \prod_{j \in \mathrm{Path}(\mathrm{root},\, i)} c_j.$$

- Expected Speedup Ratio: $S \approx \tau$ under idealized cost models (negligible drafting overhead), where $\tau$ is the average number of accepted tokens per draft-verify cycle; in practice the cost of the draft passes lowers $S$ below $\tau$.
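As a quick worked instance of these formulas (illustrative confidence values, not from the paper): a depth-3 path with draft confidences $0.9$, $0.8$, and $0.7$ receives

$$V_i = 0.9 \times 0.8 \times 0.7 \approx 0.50,$$

so it outranks any node whose path product has fallen below $0.50$, regardless of depth; and if a cycle accepts $\tau = 4$ tokens on average with negligible draft cost, the idealized speedup is $S \approx 4\times$.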
5. Empirical Results
Evaluation is conducted on three LLM series—Vicuna (7B, 13B), LLaMA2-Chat (7B, 13B, 70B), and LLaMA3-Instruct (8B, 70B)—across six representative tasks: MT-bench (chat), HumanEval (code), GSM8K (math reasoning), Alpaca (instruction following), CNN/DM (summarization), and Natural Questions (QA).
Summarized findings:
| Method | Mean Speedup (Vicuna 7B, 6 tasks) | Avg. Acceptance Length (tokens) |
|---|---|---|
| Speculative sampling | 1.62× | ~2–2.5 |
| EAGLE-1 (static tree) | 2.13× | ~3 |
| EAGLE-2 (dynamic tree) | 3.05× (up to 4.26× on HumanEval) | 4–5.5 |
Ablation studies show a 10% speedup reduction when value-based expansion is replaced with single-token confidence scores, and a further 6% reduction when reranking is disabled. Deploying both components yields the maximal 3.62× mean speedup, compared to 2.81× with neither (Li et al., 2024).
6. Algorithmic Workflow Sketch
A high-level outline of the dynamic draft tree process is as follows:
```
Input: prefix x, full LLM O, draft model q, tree depth L,
       expand-size K_exp, rerank-size M

Initialize tree T = { root = x }
for depth = 1 to L do
    // Expansion phase
    Compute value V_i for each leaf node i in T:
        V_i ← ∏_{j ∈ Path(root, i)} q.prob(token at j)
    Select top K_exp leaves by V_i
    Batch-query q on those leaves to get next tokens and confidences c
    Attach the resulting children to T
end for

// Reranking phase
Compute V_i for all nodes in T
Select top M nodes by V_i (favor shallower nodes on ties)
Flatten these M nodes into sequence S respecting tree ancestry
Build attention mask so each token in S attends only to its ancestors

// Verification phase
Send S and mask to O; accept tokens sequentially until first rejection
Append accepted tokens to the prefix and repeat until end-of-sequence
```
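Tying the earlier Python sketches together, a schematic outer loop might read as follows; `draft_top_k` and `target_prob` remain hypothetical model callbacks, and this is a sketch under those assumptions, not the authors' implementation.

```python
def generate(prefix, draft_top_k, target_prob, rng,
             depth=6, k_expand=8, m=16, max_tokens=256):
    """Outer draft-rerank-verify loop built from the earlier sketches.

    For brevity this sketch verifies the whole tree; a real system would
    score only the reranked top-m subset in one masked batch pass.
    """
    out = list(prefix)
    while len(out) < max_tokens:
        root = DraftNode(token=out[-1], confidence=1.0)
        leaves = [root]
        for _ in range(depth):                    # expansion phase
            leaves = expand(leaves, draft_top_k, k_expand)
        chosen, mask = rerank_and_mask(root, m)   # reranking phase; mask
        # would gate the single batched verification pass in a real system
        accepted = verify_tree(root, target_prob, rng)  # verification phase
        if not accepted:
            break  # real system: sample one corrective token from the LLM
        out.extend(accepted)
    return out
```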
7. Context and Implications
EAGLE-2 demonstrates that speculative sampling’s efficiency can be substantially improved via context-aware dynamic draft trees, utilizing draft model calibration to inform branching decisions. The approach scales across architectures (Vicuna, LLaMA2, LLaMA3) and tasks, with broad implications for fast LLM inference in time-sensitive and compute-constrained deployments. A plausible implication is that further improvements in draft model calibration and adaptive branching strategies could yield additional gains without impacting output fidelity (Li et al., 2024).