EAGLE-2: Dynamic Lossless LLM Acceleration

Updated 29 December 2025
  • The paper introduces EAGLE-2, a method that employs a dynamic, context-aware draft tree to enhance speculative sampling without altering the generated distribution.
  • EAGLE-2 uses a two-phase process—expansion based on confidence-derived value estimation and reranking of candidate tokens—to achieve empirical speedups of 20–40% over prior methods.
  • The algorithm maintains token acceptance fidelity of the original model while significantly increasing parallelism and practical inference efficiency in large language models.

EAGLE-2 is a lossless acceleration algorithm for LLM inference that extends speculative sampling and the static-tree-based EAGLE method by introducing a context-aware dynamic draft tree. By leveraging the calibration properties of the draft model to dynamically adjust the speculative sampling process, EAGLE-2 achieves increased parallelism and higher token acceptance rates without altering the probability distribution of generated sequences (Li et al., 2024).

1. Motivation and Background

A principal bottleneck in decoding with LLMs is the linear dependence of inference cost on sequence length, since generation typically requires one forward pass per token. Speculative sampling mitigates this by using a smaller draft model to generate multiple candidate tokens in parallel, then verifying these candidates with the full model, accepting as many tokens as possible in a single batch.

Prior methods such as EAGLE ("EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty," Li et al., ICML 2024) employ static draft trees: predefined k-ary tree shapes for speculative sampling. These static trees implicitly assume homogeneity in draft token acceptance rates by position, neglecting the strong context dependence observed empirically. Consequently, they fail to allocate computational budget optimally between "easy" and "hard" semantic contexts.

EAGLE-2 addresses these limitations by constructing the draft tree dynamically, informed by the draft model’s confidence scores, and targets maximization of the expected acceptance length per speculative batch—yielding empirically verified speedups of 20–40% over EAGLE (Li et al., 2024).

2. Core Algorithmic Structure

EAGLE-2 augments the classic speculative sampling framework with a dynamic, context-aware tree-building process comprising two principal phases: expansion (tree growth) and reranking (tree pruning), followed by token verification.

Expansion Phase

At each expansion depth, every node in the current draft tree is assigned a "value" $V_i$, estimating the global probability that the full model will accept the entire path from the root to that node. Formally, for node $t_i$:

$V_i = \prod_{j \in \mathrm{Path}(\mathrm{root},\, t_i)} p_j^{(\mathrm{orig})}$

Each $p_j^{(\mathrm{orig})}$ is approximated by the draft model's confidence score $c_j = p^D(t_j \mid \mathrm{prefix})$, justified by empirical calibration (see Section 3).

Only the top $K_{\text{expand}}$ nodes, i.e. those with the highest $V_i$, are expanded, each generating $k$ new children via the draft model; expansion is implemented with batched tree attention to maximize parallelism while containing computational cost.
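To make the expansion phase concrete, the sketch below grows a draft tree by expanding, at each level, only the leaves with the highest path-product value $V_i$. This is a minimal Python sketch: the Node class and the draft_top_k(node, k) callback (which would return candidate next tokens with draft confidences for the path ending at node) are hypothetical stand-ins for the actual draft-model interface, not the authors' implementation.

import heapq
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    token: str                      # drafted token at this node (root holds the prompt)
    confidence: float               # draft confidence c_j = p^D(token | path prefix)
    parent: "Node" = None
    children: list = field(default_factory=list)

    def value(self) -> float:
        # V_i: product of draft confidences along the path root -> this node
        v, node = 1.0, self
        while node.parent is not None:
            v *= node.confidence
            node = node.parent
        return v

def expand_tree(root, draft_top_k, depth, k_expand, k_children):
    # Grow the draft tree: at each level, expand only the k_expand leaves with
    # the highest value V_i, attaching k_children drafted tokens to each.
    leaves = [root]
    for _ in range(depth):
        best = heapq.nlargest(k_expand, leaves, key=lambda n: n.value())
        new_leaves = []
        for leaf in best:
            for token, conf in draft_top_k(leaf, k_children):
                child = Node(token, conf, parent=leaf)
                leaf.children.append(child)
                new_leaves.append(child)
        # Unexpanded leaves stay in the frontier; expanded ones are replaced by their children.
        leaves = new_leaves + [l for l in leaves if l not in best]
    return root

# Example with a dummy draft model that always proposes two children per leaf:
dummy_draft = lambda node, k: [("tok%d" % i, 0.5 / (i + 1)) for i in range(k)]
tree = expand_tree(Node("<prompt>", 1.0), dummy_draft, depth=3, k_expand=2, k_children=2)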

Reranking Phase

After the tree reaches depth $D$, all nodes are rescored by $V_i$. The top $M$ nodes are selected for verification, with ties broken in favor of shallower nodes to preserve a prefix tree structure. The selected nodes are flattened into a sequence according to tree ancestry, and a custom attention mask is constructed so each token can only attend to its ancestors, ensuring validity for downstream verification with the original LLM.
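A minimal sketch of the reranking step follows, assuming each draft node is described by a tuple (node_id, parent_id, value, depth) with parent_id = -1 for the root; the representation and names are illustrative, not the authors' code. It selects the top-$M$ nodes, orders them parents-before-children, and builds the ancestor-only attention mask used for verification:

import numpy as np

def rerank_and_flatten(nodes, m):
    # nodes: list of (node_id, parent_id, value, depth); parent_id == -1 marks the root.
    # Pick the top-m nodes by value, breaking ties in favor of shallower nodes.
    chosen = sorted(nodes, key=lambda n: (-n[2], n[3]))[:m]
    # Flatten parents-before-children (ancestry order) by sorting on depth.
    chosen.sort(key=lambda n: n[3])
    index = {nid: i for i, (nid, *_) in enumerate(chosen)}
    parent = {nid: pid for nid, pid, *_ in nodes}
    # Ancestor-only attention mask: token i may attend to itself and to its selected
    # ancestors; because a child's value never exceeds its parent's, the selected
    # nodes form a connected subtree rooted at the prompt.
    mask = np.zeros((len(chosen), len(chosen)), dtype=bool)
    for i, (nid, *_) in enumerate(chosen):
        mask[i, i] = True
        anc = parent[nid]
        while anc in index:
            mask[i, index[anc]] = True
            anc = parent[anc]
    return chosen, mask

Verification then feeds the flattened tokens together with this mask to the original LLM in a single batched forward pass.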

Verification

The pruned and masked sequence is supplied to the full LLM for batch scoring. Acceptance proceeds sequentially, adopting each token with probability

$\min\left(1,\ \frac{p^O(x_{t+i})}{p^D(x_{t+i})}\right)$

where $p^O$ denotes the original model and $p^D$ the draft model. On the first rejection, the remaining drafted tokens are discarded, a replacement token is sampled from the residual distribution (as in standard speculative sampling), and the cycle restarts from that position.
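The acceptance rule itself can be illustrated with a short sketch along a single drafted path; in the actual tree verification, sibling branches are also considered, so this should be read only as an illustration of the per-token test (the probabilities below are made up):

import numpy as np

rng = np.random.default_rng(0)

def verify_path(draft_tokens, p_draft, p_orig):
    # draft_tokens[i] was proposed with draft probability p_draft[i];
    # p_orig[i] is the full model's probability for the same token.
    accepted = []
    for tok, pd, po in zip(draft_tokens, p_draft, p_orig):
        # Accept with probability min(1, p^O / p^D); on the first rejection stop,
        # after which the full model resamples from the residual distribution.
        if rng.random() < min(1.0, po / pd):
            accepted.append(tok)
        else:
            break
    return accepted

print(verify_path(["the", "cat", "sat"], [0.9, 0.6, 0.2], [0.85, 0.7, 0.05]))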

3. Calibration Properties and Theoretical Guarantees

EAGLE-2's dynamic draft selection hinges on the well-calibrated nature of the draft model: for the pairs $(c, \text{acceptance rate})$ on Alpaca with Vicuna-7B, tokens with $c < 0.05$ are accepted $\approx 4\%$ of the time, while $c > 0.95$ corresponds to acceptance rates near $98\%$ (Li et al., 2024, Fig. 6). No additional training or calibration is needed; draft model probabilities are used directly for value estimation.
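This calibration claim can be checked empirically by bucketing logged (confidence, accepted) pairs and comparing mean confidence with observed acceptance rate per bucket; the sketch below assumes such a log is available (the data source is hypothetical):

import numpy as np

def calibration_table(confidences, accepted, n_bins=10):
    # confidences: draft-model probabilities c for each drafted token;
    # accepted: 1 if the full model accepted that token, else 0.
    confidences = np.asarray(confidences, dtype=float)
    accepted = np.asarray(accepted, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences >= lo) & (confidences < hi)
        if in_bin.any():
            # (bin range, mean confidence, empirical acceptance rate, count)
            rows.append((lo, hi, confidences[in_bin].mean(), accepted[in_bin].mean(), int(in_bin.sum())))
    return rows

A well-calibrated draft model produces rows where mean confidence roughly matches the acceptance rate, which is what licenses using $c_j$ directly as the estimate of $p_j^{(\mathrm{orig})}$.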

The algorithm preserves the exact output distribution of standard autoregressive sampling. For any generated sequence $y$,

$\Pr_{\text{EAGLE-2}}(y) = \prod_t p^O(y_t \mid y_{<t}) = \Pr_{\text{vanilla}}(y)$

by construction, since only the draft selection is altered, with the acceptance criterion and verification unchanged. This fulfills the lossless acceleration property proved for general speculative sampling schemes (Li et al., 2024).
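For reference, the per-token identity underlying this guarantee, standard in the speculative sampling literature rather than specific to EAGLE-2, combines the acceptance branch with the residual-resampling branch:

$p^D(x)\,\min\!\left(1, \frac{p^O(x)}{p^D(x)}\right) \;+\; \left(1 - \sum_{x'} p^D(x')\,\min\!\left(1, \frac{p^O(x')}{p^D(x')}\right)\right) \frac{\max\!\left(0,\, p^O(x) - p^D(x)\right)}{\sum_{x'} \max\!\left(0,\, p^O(x') - p^D(x')\right)} \;=\; p^O(x)$

The first term is the probability that the draft proposes $x$ and it is accepted; the second is the probability of a rejection followed by resampling $x$ from the normalized residual; together they reproduce $p^O(x)$ exactly at every position.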

4. Key Mathematical Formulations

Several central definitions and identities structure the EAGLE-2 framework:

  • Context-Dependent Acceptance Rate: For candidate $x_{t+i}$ with prefix $x_{<t}$,

$A(i,\, x_{<t}) = \mathbb{P}\left[x_{t+i} \text{ is accepted by the full LLM}\right]$

which, in classic speculative sampling, is

$A(i,\, x_{<t}) = \mathbb{E}_{c \sim \text{draft}} \left[ \min\left(1, \frac{p^O(c \mid x_{<t+i-1})}{p^D(c \mid x_{<t+i-1})} \right) \right]$

  • Global Acceptance Probability ("value" $V_i$):

$V_i = \prod_{j \in \mathrm{Path}(\mathrm{root},\, t_i)} p_j^{(\mathrm{orig})} \approx \prod_{j} c_j$

  • Expected Speedup Ratio (see the numeric sketch after this list):

$S := T_{\text{base}} / T_{\text{EAGLE-2}}$

with $T_{\text{EAGLE-2}} \approx T_{\text{base}} / \mathbb{E}[L_{\text{accept}}]$ under idealized cost models, where $\mathbb{E}[L_{\text{accept}}]$ is the average number of accepted tokens per cycle.
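As a numeric sketch of these definitions (all timings below are hypothetical), the snippet computes $S$ from a per-token base cost, a per-cycle cost, and $\mathbb{E}[L_{\text{accept}}]$:

def speedup(t_base_per_token, t_cycle, mean_accept_len):
    # Generating N tokens needs N base-model passes for vanilla decoding, but only
    # about N / E[L_accept] draft-and-verify cycles for EAGLE-2, each of cost t_cycle,
    # so S = (N * t_base_per_token) / ((N / mean_accept_len) * t_cycle).
    return (t_base_per_token * mean_accept_len) / t_cycle

print(speedup(t_base_per_token=1.0, t_cycle=1.0, mean_accept_len=4.5))  # idealized: S ~= 4.5
print(speedup(t_base_per_token=1.0, t_cycle=1.3, mean_accept_len=4.5))  # ~30% overhead per cycle (assumed): S ~= 3.5

This is consistent with the pattern in Section 5, where measured speedups sit somewhat below the average acceptance length once drafting and verification overheads are accounted for.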

5. Empirical Results

Evaluation is conducted on three LLM series—Vicuna (7B, 13B), LLaMA2-Chat (7B, 13B, 70B), and LLaMA3-Instruct (8B, 70B)—across six representative tasks: MT-bench (chat), HumanEval (code), GSM8K (math reasoning), Alpaca (instruction following), CNN/DM (summarization), and Natural Questions (QA).

Summarized findings:

Method             | Mean Speedup (Vicuna 7B, 6 tasks) | Acceptance Length (avg. tokens)
Spec. sampling     | 1.62×                             | ~2–2.5
EAGLE-1 (static)   | 2.13×                             | ~3
EAGLE-2 (dynamic)  | 3.05× (up to 4.26× on HumanEval)  | 4–5.5

Ablation studies show a 10% speedup reduction without the value-based expansion (using single-token confidences) and a further 6% reduction with reranking disabled. Deploying both components yields the maximal 3.62× mean speedup, compared to 2.81× with neither (Li et al., 2024).

6. Algorithmic Workflow Sketch

A high-level outline of the dynamic draft tree process is as follows:

Input: prefix x, full LLM O, draft model D, tree depth L, expand-size K_exp, rerank-size M
Initialize tree T = { root = x }
for depth = 1 to L do
  // Expansion phase
  Compute value V_i for each leaf node i in T:
    V_i ← ∏_{j ∈ Path(root, i)} D.prob(token at j)
  Select the top K_exp leaves by V_i
  Batch-query D on those leaves to get candidate child tokens and confidences c
  Attach the new children to T
end for
// Reranking phase
Compute V_i for all nodes in T
Select the top M nodes by V_i (favoring shallower nodes on ties)
Flatten these M nodes into a sequence S respecting tree ancestry
Build an attention mask so each token in S attends only to its ancestors
// Verification phase
Send S and the mask to O; accept tokens sequentially until the first rejection
Append the accepted tokens to the prefix and repeat until end of sequence
(Li et al., 2024)

7. Context and Implications

EAGLE-2 demonstrates that speculative sampling’s efficiency can be substantially improved via context-aware dynamic draft trees, utilizing draft model calibration to inform branching decisions. The approach scales across architectures (Vicuna, LLaMA2, LLaMA3) and tasks, with broad implications for fast LLM inference in time-sensitive and compute-constrained deployments. A plausible implication is that further improvements in draft model calibration and adaptive branching strategies could yield additional gains without impacting output fidelity (Li et al., 2024).

References

Li, Y., Wei, F., Zhang, C., & Zhang, H. (2024). EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. In Proceedings of EMNLP 2024.
