- The paper presents a dynamic draft tree method that adapts the tree's structure to context-dependent token acceptance rates, achieving lossless speedup in LLM inference.
- It leverages a well-calibrated draft model, whose confidence scores drive dynamic expansion and reranking of draft tokens, to optimize generation without sacrificing quality.
- Experiments across various tasks show speedups of 3.05x to 4.26x and a 20%-40% improvement over EAGLE-1, confirming its practical impact.
Dynamic Draft Trees for Faster Inference: An Overview of EAGLE-2
The advent of LLMs has propelled significant advances in applications such as natural language understanding, generation, and translation. However, the high computational cost and latency of LLM inference remain a crucial bottleneck. This paper introduces EAGLE-2, an enhancement to existing speculative sampling methods that accelerates LLM inference by leveraging context-aware dynamic draft trees.
Core Contributions
The primary contribution of EAGLE-2 is its dynamic adjustment of the draft tree's structure based on context, a departure from the static draft trees used in previous methods such as EAGLE and Medusa. This dynamism lets the tree be optimized for the context-dependent acceptance rates of draft tokens, enabling a more efficient and targeted inference process.
Methodology
EAGLE-2 builds upon the foundation laid by its predecessor, EAGLE-1. In speculative sampling, a lightweight draft model quickly proposes tokens, which the original LLM then verifies. Typical speculative sampling uses a static draft tree, implicitly assuming that a draft token's acceptance rate depends only on its position in the tree. This paper challenges that assumption by demonstrating that acceptance rates are also context-dependent. A minimal sketch of the verification step appears below.
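To make the draft-and-verify cycle concrete, here is a minimal sketch of chain-style speculative-sampling verification (illustrative code, not the authors' implementation; EAGLE-2 verifies a whole tree of drafts, but the per-token acceptance rule is the same):

```python
import torch

def verify_chain(draft_tokens, draft_probs, target_probs):
    """Chain-style speculative-sampling verification. draft_tokens: list[int];
    draft_probs / target_probs: [len, vocab] tensors of next-token
    distributions from the draft and target models at each position."""
    output = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i, tok], draft_probs[i, tok]
        # Accept with probability min(1, p/q); this keeps the output
        # distribution identical to sampling from the target model alone.
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            output.append(tok)
        else:
            # On rejection, sample from the normalized residual
            # max(0, target - draft) and end the cycle. (The extra "bonus"
            # token sampled when all drafts are accepted is omitted here.)
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            output.append(torch.multinomial(residual / residual.sum(), 1).item())
            break
    return output
```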
Key elements of the methodology include:
- Well-Calibrated Draft Model: The draft model used in EAGLE-2 is well calibrated, meaning its confidence scores (the probabilities it assigns to draft tokens) closely approximate those tokens' acceptance rates. This insight is central to the dynamic adjustment proposed in this work; a sketch of how such calibration can be checked follows this list.
- Dynamic Expansion and Reranking: EAGLE-2 grows a context-aware dynamic draft tree guided by the confidence scores of the draft tokens. Nodes whose paths have the highest product of confidence scores are prioritized for expansion; after expansion, the draft tokens are reranked by the same value to select the most promising set for verification by the original LLM (see the second sketch below).
- Lossless Acceleration: The verification step guarantees that the distribution of the generated text is identical to that of the original LLM, making EAGLE-2 a lossless acceleration method: substantial speedups with no change in output quality.
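The calibration claim can be checked empirically by binning draft-token confidences and comparing each bin's mean confidence against its measured acceptance rate. The sketch below assumes a hypothetical logging setup (the `confidences` and `accepted` arrays are illustrative inputs, not part of EAGLE-2's codebase):

```python
import numpy as np

def calibration_table(confidences, accepted, n_bins=10):
    """Bin draft-token confidences and compare each bin's mean confidence
    to its empirical acceptance rate. `confidences` holds the draft model's
    probabilities for the tokens it proposed; `accepted` holds 0/1 flags
    from verification. Both are assumed to come from hypothetical logging."""
    confidences = np.asarray(confidences, dtype=float)
    accepted = np.asarray(accepted, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            # For a well-calibrated draft model the two numbers align.
            print(f"conf [{lo:.1f}, {hi:.1f}): "
                  f"mean confidence {confidences[mask].mean():.2f}, "
                  f"acceptance rate {accepted[mask].mean():.2f}")
```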
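And here is a minimal sketch of the expansion and reranking phases as described above. The `Node` structure and the `draft_model.top_candidates` interface are illustrative assumptions, not EAGLE-2's actual API; values are tracked as log-probabilities for numerical stability, so the product of confidence scores becomes a sum of logs:

```python
import heapq
import math
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based equality so nodes can live in sets
class Node:
    token: int
    logp: float                      # log of the confidence product along the path
    parent: "Node | None" = None
    children: list = field(default_factory=list)

def grow_draft_tree(draft_model, root, depth, top_k):
    """Expansion phase: at each depth, expand only the top_k frontier nodes
    with the highest path value (product of confidences, kept in log space)."""
    frontier = [root]
    for _ in range(depth):
        next_frontier = []
        # Confidence approximates acceptance rate, so high-value nodes are
        # the ones most likely to survive verification.
        for node in heapq.nlargest(top_k, frontier, key=lambda n: n.logp):
            # Assumed interface: returns the top-k (token, prob) continuations.
            for token, prob in draft_model.top_candidates(node, k=top_k):
                child = Node(token, node.logp + math.log(prob), parent=node)
                node.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root

def rerank(root, num_tokens):
    """Reranking phase: keep the num_tokens highest-value nodes overall.
    Since every confidence is at most 1, a child's value never exceeds its
    parent's, so the selection is normally already a connected subtree; the
    ancestor loop below only guards against ties at the cutoff."""
    all_nodes, stack = [], [root]
    while stack:
        n = stack.pop()
        all_nodes.append(n)
        stack.extend(n.children)
    chosen = set(heapq.nlargest(num_tokens, all_nodes, key=lambda n: n.logp))
    for n in list(chosen):
        while n.parent is not None and n.parent not in chosen:
            chosen.add(n.parent)
            n = n.parent
    return chosen
```

In the actual system, the reranked tokens are flattened into a single sequence with a tree-structured attention mask, so the original LLM can verify the entire draft tree in one forward pass.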
Experimental Validation
EAGLE-2's performance was rigorously evaluated on several datasets and tasks, including multi-turn conversations (MT-bench), code generation (HumanEval), mathematical reasoning (GSM8K), instruction following (Alpaca), summarization (CNN/DailyMail), and question answering (Natural Questions). The experiments spanned three series of LLMs: Vicuna, LLaMA2-Chat, and LLaMA3-Instruct.
Significant numerical results include:
- Achieving speedups ranging from 3.05x to 4.26x across various tasks and models.
- Outperforming EAGLE-1 by 20%-40% in terms of speedup ratios.
- Demonstrating a marked improvement in average acceptance length (the number of tokens accepted per draft-verification cycle), which translates directly into faster inference; a back-of-the-envelope sketch of that relationship follows.
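As a rough illustration of why acceptance length drives speedup (the formula and the numbers below are simplifying assumptions for illustration, not measurements from the paper):

```python
def estimated_speedup(avg_accept_len: float, draft_overhead: float = 0.1) -> float:
    """Vanilla decoding emits one token per target forward pass; speculative
    decoding emits avg_accept_len tokens per cycle at the cost of one target
    pass plus drafting overhead (expressed here as a fraction of a target
    pass). Both values are illustrative placeholders."""
    return avg_accept_len / (1.0 + draft_overhead)

# E.g., ~4.5 accepted tokens per cycle with ~10% drafting overhead
# would suggest roughly a 4.1x speedup.
print(round(estimated_speedup(4.5), 2))  # 4.09
```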
Implications and Future Directions
The implications of this research are twofold:
- Practical: By providing substantial speedups in LLM inference, EAGLE-2 can significantly reduce the computational resources required for deploying large models in practical applications. This has direct benefits for areas where response time and efficiency are critical, such as conversational agents and real-time translation.
- Theoretical: The context-aware adjustment of the draft tree opens new avenues in the study of token acceptance rates and their relation to context, challenging longstanding assumptions in speculative sampling.
Future research could explore further refinements in the calibration of draft models and extend the dynamic strategies introduced in EAGLE-2 to other domains of model inference. Additionally, integrating EAGLE-2 with emerging LLM architectures and extending its applicability to a broader range of generative tasks could provide further insights and enhancements.
Conclusion
EAGLE-2 presents a notable advancement in the field of speculative sampling for LLM inference. Its dynamic, context-aware draft tree structure showcases a well-thought-out approach to optimizing the generation and verification stages, achieving enhanced speed and efficiency without compromising the integrity of the output. By addressing the context dependency of token acceptance rates, it sets a new benchmark for lossless acceleration in LLM inference and opens the door for future innovations in this rapidly evolving field.