EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test (2503.01840v3)
Abstract: The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. In the SGLang framework, EAGLE-3 achieves a 1.38x throughput improvement at a batch size of 64. The code is available at https://github.com/SafeAILab/EAGLE.
Summary
- The paper presents a novel method that shifts from feature prediction to direct token prediction to improve LLM inference speed.
- It integrates multi-layer feature fusion during training, overcoming bottlenecks of previous EAGLE variants when scaling training data.
- Experimental results show up to 6.5x speedup and a 1.4x improvement over EAGLE-2 across diverse tasks, demonstrating robust acceleration.
EAGLE-3: Accelerating LLM Inference via Training-Time Test
EAGLE-3 introduces an approach to accelerate LLM inference by refining the speculative sampling paradigm (2503.01840). It builds upon prior work like EAGLE and EAGLE-2, which utilized feature-level autoregression and top-layer features from the target model for drafting candidate tokens. However, EAGLE-3 identifies limitations in this approach, particularly the diminishing returns observed when scaling up the training data for the draft model. The core innovations in EAGLE-3 are the shift from feature prediction to direct token prediction within the draft model and the introduction of a "training-time test" technique for multi-layer feature fusion, replacing the reliance solely on top-layer features. These modifications aim to enhance the draft model's predictive capability and enable it to better leverage large-scale training data, leading to significant inference speedups.
Limitations of Feature Prediction in Prior EAGLE Variants
Previous iterations, such as EAGLE and EAGLE-2, employed draft models trained to predict the features of the target LLM's final layer. The hypothesis was that matching the target model's internal representations would lead to more accurate token predictions. While effective to some extent, this approach faces constraints. Firstly, predicting high-dimensional feature vectors is inherently complex and may not be the most direct path to predicting the next token's probability distribution. Secondly, relying solely on the top-layer features ignores potentially valuable information encoded in intermediate layers of the target LLM. The authors of EAGLE-3 observed empirically that as the training data for the EAGLE draft model was scaled up, the performance gains (in terms of inference speedup) plateaued, suggesting that the feature prediction objective and the single-layer feature source were bottlenecks. The draft model struggled to fully capitalize on the richer information available in larger datasets when constrained by the feature prediction task and limited feature access.
EAGLE-3 Methodology: Token Prediction and Training-Time Test
EAGLE-3 fundamentally changes the draft model's objective and input features.
Direct Token Prediction: Instead of predicting the target model's final layer features, the EAGLE-3 draft model is trained to directly predict the probability distribution of the next token. This aligns the draft model's objective more closely with the ultimate goal of generating candidate tokens for speculative sampling. The loss function typically involves minimizing the cross-entropy between the draft model's predicted token probabilities and the target model's probabilities for the next token, given the context.
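The summary above does not fix a single loss formulation, so the following is only a minimal sketch of the two common options it mentions: hard-label cross-entropy against the ground-truth next token and a soft-label KL term against the target model's next-token distribution. The function name and the soft_weight parameter are illustrative assumptions, not taken from the paper.

import torch
import torch.nn.functional as F

def draft_loss(draft_logits, target_logits, target_labels, soft_weight=1.0):
    # Hard-label cross-entropy against the ground-truth next token.
    # draft_logits: (batch, vocab_size), target_labels: (batch,)
    ce = F.cross_entropy(draft_logits, target_labels)

    # Soft-label term: KL divergence between the target model's next-token
    # distribution and the draft model's distribution (a common alternative).
    kl = F.kl_div(
        F.log_softmax(draft_logits, dim=-1),
        F.softmax(target_logits, dim=-1),
        reduction="batchmean",
    )
    return ce + soft_weight * kl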
Training-Time Test (Multi-Layer Feature Fusion): To provide richer contextual information to the draft model, EAGLE-3 introduces the "training-time test". During the training phase of the draft model, features are extracted from multiple layers of the target LLM for a given input sequence. These multi-layer features are then fused and used as input or conditioning for the draft model. The specific fusion mechanism can vary, but common approaches include concatenation followed by a projection layer, or attention mechanisms that weigh the importance of features from different layers.
import torch
import torch.nn.functional as F

def train_step(target_model, draft_model, input_ids, target_labels):
    # Assumes `optimizer`, `selected_layer_indices`, and `fuse_features`
    # are defined in the surrounding scope.
    # Get hidden states from multiple layers of the target model (kept frozen).
    with torch.no_grad():
        target_outputs = target_model(input_ids, output_hidden_states=True)
        # Select hidden states from specific intermediate and final layers.
        target_hidden_states = [target_outputs.hidden_states[i]
                                for i in selected_layer_indices]
        # Extract features at the last token position for next-token prediction.
        last_token_features = [states[:, -1, :] for states in target_hidden_states]

    # Fuse the multi-layer features (e.g., concatenation + linear projection;
    # a sketch of fuse_features follows below).
    fused_features = fuse_features(last_token_features)

    # Draft model predicts the next-token distribution from the fused features.
    # Note: the draft model might also take input_ids or its own embeddings as input.
    draft_logits = draft_model(input_ids, conditioning_features=fused_features)  # (batch, vocab_size)

    # Loss against the target model's next-token prediction (target_labels).
    # A KL divergence against the target model's distribution is a common alternative.
    loss = F.cross_entropy(draft_logits, target_labels)

    # Backpropagate and update draft_model parameters only.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
This training regime allows the draft model (typically a much smaller transformer) to learn complex token sequence patterns by leveraging the rich, hierarchical representations learned by the larger target LLM across its depth, conditioned on the actual training data distribution.
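The fusion mechanism itself is left open above; the snippet below is a minimal sketch of the concatenation-plus-projection variant, with the class name FeatureFusion and all dimensions chosen purely for illustration. An attention- or gating-based fusion would replace the concatenation with learned per-layer weights at some extra cost.

import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuse last-token features from several target-model layers into one vector."""

    def __init__(self, num_layers: int, hidden_dim: int, draft_dim: int):
        super().__init__()
        # Project the concatenated multi-layer features down to the draft model's width.
        self.proj = nn.Linear(num_layers * hidden_dim, draft_dim)

    def forward(self, layer_features):
        # layer_features: list of tensors, each of shape (batch, hidden_dim).
        fused = torch.cat(layer_features, dim=-1)  # (batch, num_layers * hidden_dim)
        return self.proj(fused)                    # (batch, draft_dim)

# Example: fuse features from 3 layers of a 4096-dim target model into a 2048-dim draft input.
fuse_features = FeatureFusion(num_layers=3, hidden_dim=4096, draft_dim=2048)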
Inference with EAGLE-3
The inference process follows the standard speculative sampling framework but utilizes the EAGLE-3 trained draft model.
- Drafting: Given the current context (sequence of tokens) $x_{<t}$, the EAGLE-3 draft model autoregressively generates a sequence of $K$ candidate future tokens $\tilde{x}_{t:t+K-1}$. This generation is fast due to the small size of the draft model. During this drafting phase, features from the target model are not typically required, distinguishing it from the training phase; the draft model operates based on its learned parameters.
- Verification: The target LLM processes the combined sequence $x_{<t} \oplus \tilde{x}_{t:t+K-1}$ in a single forward pass. This yields the target model's probability distributions $P_{\text{target}}(x_i \mid x_{<i})$ for each position $i$ from $t$ to $t+K$.
- Acceptance/Rejection: The drafted tokens are compared against the target model's outputs sequentially (a minimal sketch of this loop follows the list). For each position $i$ from $t$ to $t+K-1$:
- Let $P_{\text{draft}}(\tilde{x}_i \mid x_{<t} \oplus \tilde{x}_{t:i-1})$ be the probability the draft model assigned to the token $\tilde{x}_i$ it generated.
- Let $P_{\text{target}}(\tilde{x}_i \mid x_{<t} \oplus \tilde{x}_{t:i-1})$ be the probability the target model assigns to the same token $\tilde{x}_i$.
- Accept $\tilde{x}_i$ if $P_{\text{target}}(\tilde{x}_i \mid \cdot) \ge P_{\text{draft}}(\tilde{x}_i \mid \cdot)$; otherwise accept it with probability $P_{\text{target}}(\tilde{x}_i \mid \cdot) / P_{\text{draft}}(\tilde{x}_i \mid \cdot)$ (the standard speculative sampling acceptance rule).
- If $\tilde{x}_i$ is accepted, proceed to check $\tilde{x}_{i+1}$.
- If $\tilde{x}_i$ is rejected, discard $\tilde{x}_{i:t+K-1}$, sample the next token $x_i$ from a modified distribution derived from $P_{\text{target}}$ and $P_{\text{draft}}$ (e.g., $\mathrm{normalize}(\max(0, P_{\text{target}} - P_{\text{draft}}))$), and restart drafting from $x_i$.
- Final Token: If all $K$ drafted tokens are accepted, sample an additional token $x_{t+K}$ from the target model's distribution $P_{\text{target}}(x \mid x_{<t} \oplus \tilde{x}_{t:t+K-1})$.
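The accept/reject loop above is standard speculative sampling rather than anything EAGLE-3-specific; the following is only a framework-agnostic sketch for a single sequence, where drafted_tokens, draft_probs, and target_probs are assumed to be precomputed per-position distributions.

import torch

def verify_drafted_tokens(drafted_tokens, draft_probs, target_probs):
    """Return the accepted prefix plus one corrective/bonus token.

    drafted_tokens: (K,) token ids proposed by the draft model
    draft_probs:    (K, vocab) draft distributions at each drafted position
    target_probs:   (K + 1, vocab) target distributions at those positions,
                    plus the position after the last drafted token
    """
    accepted = []
    K = drafted_tokens.shape[0]
    for i in range(K):
        tok = drafted_tokens[i]
        p_t, p_d = target_probs[i, tok], draft_probs[i, tok]
        # Accept with probability min(1, p_target / p_draft).
        if torch.rand(()) < torch.clamp(p_t / p_d, max=1.0):
            accepted.append(tok)
        else:
            # Rejection: resample from the residual distribution max(0, p_target - p_draft).
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            residual = residual / residual.sum()
            accepted.append(torch.multinomial(residual, 1).squeeze(0))
            return torch.stack(accepted)
    # All K drafted tokens accepted: sample a bonus token from the target model.
    accepted.append(torch.multinomial(target_probs[K], 1).squeeze(0))
    return torch.stack(accepted)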
The speedup arises because the expensive forward passes of the large target LLM are used to verify multiple tokens simultaneously, amortizing the cost. The effectiveness hinges on the draft model's ability to generate sequences that the target model frequently accepts.
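To see why the acceptance rate dominates the achievable speedup, a back-of-the-envelope calculation (following the standard speculative sampling analysis, and assuming each drafted token is accepted independently with probability alpha, which is an idealization) gives the expected number of tokens produced per target forward pass. Actual wall-clock speedup is lower than this ratio because the draft model's own forward passes are not free.

def expected_tokens_per_target_pass(alpha: float, K: int) -> float:
    # Expected accepted length (plus the corrective/bonus token) when each of the
    # K drafted tokens is accepted i.i.d. with probability alpha:
    #   E = (1 - alpha**(K + 1)) / (1 - alpha)
    if alpha >= 1.0:
        return K + 1
    return (1 - alpha ** (K + 1)) / (1 - alpha)

# Example: alpha = 0.8, K = 5  ->  about 3.69 tokens per target forward pass.
print(expected_tokens_per_target_pass(0.8, 5))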
Implementation Considerations
- Draft Model Architecture: Typically a smaller transformer (e.g., fewer layers, smaller hidden dimensions) than the target model. The optimal size depends on the target model and the desired trade-off between draft quality and drafting speed.
- Feature Fusion: The choice of layers from the target model and the fusion mechanism (concatenation, attention, gating) are hyperparameters. Experimentation is needed to find the best configuration for a given target LLM. Extracting features from multiple layers during training adds overhead compared to using only the final layer.
- Training Data: Requires access to the target model and, ideally, a dataset representative of the target model's training or fine-tuning distribution. The "training-time test" necessitates running inference on the target model to extract hidden states for each training sample, which can be computationally intensive (see the caching sketch after this list).
- Computational Requirements: Training the draft model requires significant resources, including GPUs capable of holding both the target and draft models (or strategies for distributed feature extraction) and processing the training dataset. Inference requires hosting both the target and draft models, increasing memory footprint compared to standard autoregression. However, the aggregate FLOPs per generated token are expected to decrease significantly if the acceptance rate is high.
- Integration: Implementing EAGLE-3 within existing serving frameworks (like vLLM, TensorRT-LLM, TGI) requires modifications to support the speculative sampling logic, including running the draft model, parallel verification by the target model, and the acceptance/rejection mechanism. The communication overhead between draft and target model components should be minimized. The provided codebase (github.com/SafeAILab/EAGLE) serves as a starting point.
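One common way to keep the hidden-state extraction cost manageable is to run the target model once over the corpus and cache the selected layers to disk, so draft training does not re-run the large model every epoch. The sketch below is an assumption-laden illustration: the function name, dataloader fields, and output path are hypothetical, and caching every position can be storage-heavy, so some setups cache only selected positions or regenerate features on the fly.

import torch

@torch.no_grad()
def cache_target_features(target_model, dataloader, selected_layer_indices, out_path):
    """Precompute and store multi-layer target features for draft-model training."""
    cached = []
    for batch in dataloader:
        outputs = target_model(batch["input_ids"], output_hidden_states=True)
        # Keep only the selected layers; move to CPU to bound GPU memory use.
        feats = torch.stack(
            [outputs.hidden_states[i] for i in selected_layer_indices], dim=0
        ).cpu()
        cached.append({"input_ids": batch["input_ids"].cpu(), "features": feats})
    torch.save(cached, out_path)  # e.g., "target_features.pt" (illustrative path)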
Experimental Results
The paper reports significant speedups using EAGLE-3 across various chat and reasoning models and tasks.
- Speedup: Achieves up to 6.5x speedup compared to standard autoregressive decoding on the target LLM.
- Improvement over EAGLE-2: Demonstrates approximately 1.4x higher speedup compared to the previous EAGLE-2 method, highlighting the benefits of direct token prediction and multi-layer feature fusion.
- Scalability: Crucially, the results suggest that EAGLE-3 draft models benefit more effectively from scaling up training data compared to prior EAGLE versions, validating the design changes made to overcome previous limitations. Evaluations were performed on five distinct tasks, demonstrating robustness.
These results position EAGLE-3 as a highly effective technique for reducing LLM inference latency. The numerical gains, particularly the substantial improvement over its predecessor and the high absolute speedup factor, underscore the efficacy of the proposed methodological shifts.
Conclusion
EAGLE-3 presents a refinement of speculative sampling for LLM inference acceleration. By transitioning the draft model from feature prediction to direct token prediction and incorporating multi-layer features from the target model via the "training-time test", it overcomes limitations observed in prior work. The method allows the draft model to better exploit large-scale training data, yielding higher acceptance rates and, in the reported experiments, speedups of up to 6.5x over standard autoregressive decoding and roughly 1.4x over EAGLE-2 across chat and reasoning models.