
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (2401.15077v3)

Published 26 Jan 2024 in cs.LG and cs.CL

Abstract: Autoregressive decoding makes the inference of LLMs time-consuming. In this paper, we reconsider speculative sampling and derive two key observations. Firstly, autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level. Secondly, the inherent uncertainty in feature (second-to-top-layer) level autoregression constrains its performance. Based on these insights, we introduce EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a simple yet highly efficient speculative sampling framework. By incorporating a token sequence advanced by one time step, EAGLE effectively resolves the uncertainty, enabling precise second-to-top-layer feature prediction with minimal overhead. We conducted comprehensive evaluations of EAGLE, including all models from the Vicuna and LLaMA2-Chat series, the MoE model Mixtral 8x7B Instruct, and tasks in dialogue, code generation, mathematical reasoning, and instruction following. For LLaMA2-Chat 70B, EAGLE achieved a latency speedup ratio of 2.7x-3.5x, doubled throughput, while maintaining the distribution of the generated text.

Citations (70)

Summary

  • The paper introduces EAGLE, a framework that accelerates auto-regressive inference by leveraging feature-level speculative sampling without fine-tuning the target LLM.
  • The paper demonstrates that the feature&shifted-token method combined with tree attention achieves a threefold speed boost while preserving the original output distribution.
  • The paper highlights EAGLE’s low training overhead and compatibility with techniques like quantization, making it practical for deployment in latency-sensitive environments.

Overview

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) tackles the latency of auto-regressive inference in LLMs by rethinking the speculative sampling paradigm. Its distinguishing goal is lossless acceleration: generation is sped up without fine-tuning the target LLM, so the distribution of the generated text is preserved exactly.
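
To make the lossless claim concrete, the sketch below shows the standard speculative-sampling acceptance step that frameworks like EAGLE inherit. The function name and tensor shapes are illustrative; this is the generic verification rule, not EAGLE-specific code.

```python
import torch

def verify_token(p: torch.Tensor, q: torch.Tensor, draft_token: int) -> int:
    """Generic speculative-sampling acceptance step (illustrative sketch).

    p: target-model next-token distribution over the vocabulary.
    q: draft-model next-token distribution over the vocabulary.
    Accepting the draft token with probability min(1, p/q), and otherwise
    resampling from the normalized residual max(p - q, 0), leaves the
    overall output distribution exactly equal to the target model's,
    which is what makes the acceleration lossless.
    """
    accept_prob = torch.clamp(p[draft_token] / q[draft_token], max=1.0)
    if torch.rand(()) < accept_prob:
        return draft_token
    residual = torch.clamp(p - q, min=0.0)
    return int(torch.multinomial(residual / residual.sum(), 1))
```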

Drafting Process

EAGLE accelerates generation by drafting auto-regressively over second-to-top-layer features, which is more tractable than drafting at the token level. Conventional speculative sampling relies on a subsidiary 'draft' model, typically a scaled-down version of the target LLM, to generate preliminary tokens; EAGLE instead trains a lightweight head that extrapolates the target model's features. Feature-level prediction, however, carries inherent uncertainty: the next feature depends on which token the sampling process actually selects. EAGLE resolves this through its input arrangement, which feeds the draft head not only the feature sequence but also the token sequence advanced by one time step, so the sampled token is known at drafting time. This 'feature&shifted-token' method outperforms alternative input arrangements precisely because it accounts for the sampling randomness inherent in LLM output prediction.
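
A minimal sketch of this input arrangement follows, assuming a PyTorch setting. The class name, the fusion layer, and the single transformer block are hypothetical stand-ins for the paper's lightweight draft model, not its exact architecture.

```python
import torch
import torch.nn as nn

class DraftHead(nn.Module):
    """Illustrative sketch of feature-level drafting with shifted tokens.

    Predicts the next second-to-top-layer feature from the current
    feature concatenated with the embedding of the token sampled one
    step ahead (the feature&shifted-token arrangement). Layer choices
    are assumptions for illustration.
    """

    def __init__(self, hidden_size: int, nhead: int = 8):
        super().__init__()
        # Fuse [feature ; shifted-token embedding] back to hidden_size.
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)
        # One transformer layer stands in for the small draft model
        # (hidden_size must be divisible by nhead).
        self.block = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=nhead, batch_first=True
        )

    def forward(self, features: torch.Tensor,
                shifted_tok_emb: torch.Tensor) -> torch.Tensor:
        # features:        (batch, seq, hidden) second-to-top-layer features
        # shifted_tok_emb: (batch, seq, hidden) embeddings of the tokens
        #   advanced by one time step; knowing the sampled token is what
        #   resolves the feature-level uncertainty.
        x = self.fuse(torch.cat([features, shifted_tok_emb], dim=-1))
        return self.block(x)  # predicted next features

# Draft tokens would then come from reusing the frozen target model's
# LM head on the predicted features, e.g. logits = lm_head(next_feats).
```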

Acceleration and Efficiency

Empirical results on benchmarks such as MT-bench show EAGLE's acceleration clearly: it delivers roughly a threefold speedup over vanilla auto-regressive decoding and outpaces other recent speculative sampling frameworks. In the verification phase, EAGLE uses 'tree attention' to score an entire tree of candidate tokens in a single forward pass of the target model, further improving throughput. The framework is also robust to feature-prediction errors and keeps error accumulation across draft steps in check. Moreover, EAGLE composes with other latency-reducing techniques, such as quantization and compilation.
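
To illustrate how tree attention isolates candidate branches, here is a small, self-contained sketch of a mask for a flattened draft tree; the `parents` encoding and the function name are hypothetical, but the ancestor-only visibility rule is the core idea.

```python
import torch

def tree_attention_mask(parents: list[int]) -> torch.Tensor:
    """Attention mask for a flattened token tree (illustrative sketch).

    parents[i] is the index of node i's parent (-1 for the root).
    Each node may attend to itself and its ancestors only, so every
    root-to-leaf branch is verified in one forward pass while
    unrelated branches remain invisible to each other.
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root
            mask[i, j] = True
            j = parents[j]
    return mask

# A root with two children, each child having one child of its own:
# 0 -> {1, 2}, 1 -> 3, 2 -> 4
print(tree_attention_mask([-1, 0, 0, 1, 2]).int())
```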

Practical Applications and Training

Notably, EAGLE is largely insensitive to the source of its training data: a fixed public dataset can replace text generated by the target LLM, which keeps training overhead low. A single training run suffices for ongoing use, so the amortized training cost becomes negligible. This ease of deployment, combined with the guarantee that generated content is unchanged, positions EAGLE for immediate and broad adoption in production LLM inference.
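
For concreteness, a plausible objective for training the draft head (with the target LLM frozen) is sketched below: a regression term pulling predicted features toward the target model's recorded second-to-top-layer features, plus a token-level cross-entropy term. The smooth-L1 choice and the weights are assumptions for illustration; see the paper for the exact loss.

```python
import torch
import torch.nn.functional as F

def draft_training_loss(pred_feats: torch.Tensor,
                        target_feats: torch.Tensor,
                        pred_logits: torch.Tensor,
                        target_tokens: torch.Tensor,
                        w_feat: float = 1.0,
                        w_tok: float = 0.1) -> torch.Tensor:
    """Hypothetical combined objective for a feature-level draft head.

    pred_feats / target_feats: (batch, seq, hidden) predicted vs.
        recorded second-to-top-layer features of the frozen target LLM.
    pred_logits: (batch, seq, vocab) from the frozen LM head applied
        to the predicted features; target_tokens: (batch, seq).
    """
    feat_loss = F.smooth_l1_loss(pred_feats, target_feats)
    tok_loss = F.cross_entropy(
        pred_logits.flatten(0, 1), target_tokens.flatten()
    )
    return w_feat * feat_loss + w_tok * tok_loss
```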

Conclusion

EAGLE sets a precedent in LLM acceleration by providing a framework that is not only faster but also preserves the fidelity of the LLM's output. Its combination of shifted-token inputs with auto-regression at the feature level is a promising advance for deploying LLMs in latency-sensitive environments. Keeping the output distribution unchanged while substantially improving speed is the cornerstone of EAGLE's contribution to the field.
