- The paper introduces EAGLE, a framework that accelerates auto-regressive inference by leveraging feature-level speculative sampling without fine-tuning the target LLM.
- The paper demonstrates that the feature&shifted-token method combined with tree attention achieves a threefold speed boost while preserving the original output distribution.
- The paper highlights EAGLE’s low training overhead and compatibility with techniques like quantization, making it practical for deployment in latency-sensitive environments.
Overview
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) addresses the latency of auto-regressive inference in LLMs by building on the speculative sampling paradigm, in which a cheap draft model proposes several tokens and the target LLM verifies them in a single forward pass. EAGLE's distinguishing goal is lossless acceleration: it speeds up generation without fine-tuning the target LLM, and the distribution of the generated text provably remains unchanged.
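To make the losslessness claim concrete, here is a minimal sketch (in PyTorch; names like `verify_draft` are illustrative) of the standard speculative-sampling acceptance rule that EAGLE inherits: each drafted token is accepted with probability min(1, p/q), and a rejection triggers resampling from the normalized residual, so the final output matches the target LLM's distribution exactly.

```python
import torch

def verify_draft(draft_tokens, draft_probs, target_probs):
    """Accept or reject drafted tokens so the output matches the target LLM.

    draft_tokens: (k,) token ids proposed by the draft model
    draft_probs:  (k, vocab) draft distribution at each drafted position
    target_probs: (k+1, vocab) target distribution at each position
                  (one extra row for the bonus token after full acceptance)
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i, tok], draft_probs[i, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted.append(int(tok))  # accepted: distribution unchanged
        else:
            # Rejected: resample from the normalized residual max(p - q, 0),
            # which exactly corrects for the acceptance bias.
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            accepted.append(int(torch.multinomial(residual / residual.sum(), 1)))
            return accepted
    # All drafts accepted: sample one bonus token from the target model.
    accepted.append(int(torch.multinomial(target_probs[len(draft_tokens)], 1)))
    return accepted
```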
Drafting Process
EAGLE performs drafting auto-regressively at the feature level, operating on the target model's second-top-layer features rather than on tokens; feature sequences are more regular than token sequences and therefore easier to extrapolate. This is a departure from prior speculative sampling, which has relied on a subsidiary 'draft' model, typically a scaled-down version of the target LLM, to generate preliminary tokens. A key aspect of EAGLE's drafting phase is its input arrangement: alongside the feature sequence, the draft head also receives the token sequence advanced by one time step. This 'feature&shifted-token' scheme outperforms the alternatives because the shifted tokens tell the draft head which outcome was actually sampled, resolving the randomness inherent in the LLM's sampling step that would otherwise make the next feature unpredictable.
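Below is a minimal sketch of what such a feature&shifted-token input could look like. The module structure here (`DraftHead`, the `fuse` projection, a single Transformer encoder layer) is an illustrative assumption, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DraftHead(nn.Module):
    def __init__(self, hidden: int, vocab: int):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)   # shares the target LLM's vocabulary
        self.fuse = nn.Linear(2 * hidden, hidden)  # merge feature + token embedding per position
        self.block = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)

    def forward(self, features, shifted_tokens):
        # features:       (B, T, hidden) second-top-layer features f_1..f_T from the target LLM
        # shifted_tokens: (B, T) tokens advanced one step, t_2..t_{T+1}
        x = self.fuse(torch.cat([features, self.embed(shifted_tokens)], dim=-1))
        causal = nn.Transformer.generate_square_subsequent_mask(features.size(1))
        return self.block(x, src_mask=causal)  # predictions for the next features f_2..f_{T+1}
```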
Acceleration and Efficiency
Empirical results on benchmarks such as MT-bench show significant acceleration: EAGLE delivers roughly a threefold speedup over vanilla auto-regressive decoding and surpasses other recent speculative sampling frameworks. Its verification phase uses 'tree attention', which lets the target model process an entire tree of drafted candidates in a single forward pass, further improving throughput; see the sketch below. The framework is also robust to feature errors and keeps error accumulation in check. Moreover, EAGLE composes with other latency-reducing techniques, such as quantization and compilation.
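As an illustration of tree attention, the following sketch flattens a tree of draft candidates into one sequence and builds a mask that lets each node attend only to its ancestors (and itself), so all branches are verified in one forward pass. The `parents` encoding is an assumption for illustration, not EAGLE's exact layout.

```python
import torch

def tree_attention_mask(parents):
    """parents[i] is the index of node i's parent, or -1 for a root."""
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True  # node i attends to ancestor j (and itself)
            j = parents[j]
    return mask  # apply as a boolean attention mask during verification

# Example: a root (0) with two children (1, 2); node 1 has a child (3).
print(tree_attention_mask([-1, 0, 0, 1]))
```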
Practical Applications and Training
Notably, EAGLE is largely insensitive to the source of its training data: a fixed dataset works about as well as data generated by the target LLM itself, which keeps training overhead low. In practice, EAGLE is ready for use after a single training run, and ongoing use renders the amortized training cost negligible. Combined with its ease of deployment in production and its guarantee that generated content is unchanged, this makes EAGLE well suited to immediate, widespread adoption in LLM inference.
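For a sense of what that single training run optimizes, here is a hedged sketch of a draft-head objective combining feature regression with token-level cross-entropy through the target LLM's frozen output head; the Smooth L1 choice and the weight `w` are assumptions here, not guaranteed to match the paper's exact setup.

```python
import torch.nn.functional as F

def draft_loss(pred_features, true_features, pred_logits, true_tokens, w=0.1):
    # pred_features/true_features: (B, T, hidden); pred_logits: (B, T, vocab); true_tokens: (B, T)
    l_reg = F.smooth_l1_loss(pred_features, true_features)             # match the target model's features
    l_cls = F.cross_entropy(pred_logits.transpose(1, 2), true_tokens)  # match the target model's tokens
    return l_reg + w * l_cls
```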
Conclusion
EAGLE sets a precedent in LLM acceleration by being both faster and faithful to the LLM's output. Its combination of shifted-token inputs and auto-regression at the feature level is a promising advance for deploying LLMs in latency-sensitive environments. Keeping the LLM's output distribution unchanged while substantially improving speed is the cornerstone of EAGLE's contribution to the field.