Accelerating LLM Inference with Adaptive N-gram Parallel Decoding (ANPD)
Introduction
Recent advancements in LLMs have been accompanied by increased computational demands, particularly during the inference phase. This paper introduces an innovative method, Adaptive N-gram Parallel Decoding (ANPD), which improves the inference efficiency of LLMs. ANPD uses a unique two-stage approach that combines rapid parallel token generation via an N-gram model with a subsequent verification process by the LLM, ensuring output accuracy while markedly reducing latency.
Core Contributions
The primary contributions of this paper are:
- The development of ANPD, a plug-and-play approach designed to speed up LLM inference without the need for additional training or significant changes to existing model architectures.
- Integration of an adaptive N-gram modeling technique that simplifies language modeling while retaining high accuracy and reducing reliance on large textual datasets.
- Introduction of a Multi-Level N-gram (MLN) algorithm to increase the precision of the draft outputs, thus enhancing the efficiency of the acceleration process.
- Comprehensive validation of ANPD's effectiveness across various models and datasets, demonstrating substantial improvements in processing speed.
Methodology
Adaptive N-gram Parallel Decoding (ANPD)
ANPD enhances LLM inference by generating candidate token sequences with an N-gram module, which are then verified by the LLM. The process begins with tokenization of the input; the N-gram predictions are then updated dynamically from the running text, forming a cyclical draft-and-verify loop.
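To make the cycle concrete, the Python sketch below illustrates one possible draft-and-verify loop. It is only an illustration: it assumes a greedy single-token LLM step exposed as `llm_next_token(tokens)` and a simple bigram table rebuilt from the running text, neither of which is the paper's actual interface. In practice the drafted tokens would be verified in a single batched forward pass (which is where the latency savings come from); this sketch checks them one at a time for readability.

```python
# Illustrative sketch of an ANPD-style draft-and-verify cycle (not the paper's code).
# Assumption: llm_next_token(tokens) returns the LLM's greedy next token.

def build_bigram_table(tokens):
    """Map each token to the token that most recently followed it in the running text."""
    table = {}
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev] = nxt
    return table

def anpd_generate(prompt_tokens, llm_next_token, max_new=64, draft_len=4):
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new:
        # Adaptive step: the N-gram table is rebuilt from the text generated so far.
        table = build_bigram_table(tokens)

        # Draft step: the N-gram module proposes up to `draft_len` tokens.
        draft, cur = [], tokens[-1]
        while len(draft) < draft_len and cur in table:
            cur = table[cur]
            draft.append(cur)

        # Verify step: accept the longest prefix of the draft that matches the
        # LLM's own prediction; on the first mismatch, keep the LLM's token instead.
        accepted = 0
        for d in draft:
            target = llm_next_token(tokens)
            if target == d:
                tokens.append(d)
                accepted += 1
            else:
                tokens.append(target)
                break
        if accepted == len(draft):
            # All drafted tokens were accepted (or no draft was available),
            # so fall back to a single standard LLM step.
            tokens.append(llm_next_token(tokens))
    return tokens
```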
Multi-Level N-gram (MLN)
To balance performance and accuracy, ANPD incorporates a Multi-Level N-gram setup that mitigates the sparsity of higher-order N-grams: when a high-order context fails to match, the model falls back to smaller, more frequently observed N-grams.
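A minimal sketch of such a fallback scheme is shown below, assuming plain Python dictionaries keyed by context tuples; the paper's MLN algorithm may differ in its exact bookkeeping and data structures.

```python
# Illustrative Multi-Level N-gram fallback (assumed interface, not the paper's implementation).

def build_multilevel_tables(tokens, max_order=3):
    """For each order k, map a k-token context tuple to the token that followed it."""
    tables = {}
    for k in range(1, max_order + 1):
        table = {}
        for i in range(len(tokens) - k):
            table[tuple(tokens[i:i + k])] = tokens[i + k]
        tables[k] = table
    return tables

def mln_predict(tokens, tables, max_order=3):
    """Try the highest-order N-gram first; fall back to lower orders on a miss."""
    for k in range(max_order, 0, -1):
        context = tuple(tokens[-k:])
        if context in tables[k]:
            return tables[k][context]
    return None  # no level matched; the LLM alone produces the next token
```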
Experiments
Comprehensive experiments were conducted with several models, including LLaMA and its variants, across tasks such as text summarization and code generation. Across the models assessed, speed improvements ranged from 1.95x to 3.67x. The experiments used multiple datasets, including CNN/Daily Mail and XSUM for summarization and HumanEval for code, to ensure a diverse, robust assessment.
Results and Discussion
The ANPD method demonstrated consistent performance improvements across different models and tasks. For instance, in text summarization with LLaMA-7B, a speedup of approximately 2.98x was observed. In code generation with the CodeLLaMA-13B model, the speedup reached 3.67x. These results underline the effectiveness of ANPD in reducing inference latency while preserving the output quality of the original LLMs.
Future Directions
Moving forward, the exploration could extend to:
- Adapting ANPD to exploit specific features of different LLMs to optimize its effectiveness further.
- Extending the parallel processing capabilities during the LLM's verification phase to enhance performance gains.
Conclusion
The paper presents a scalable and efficient method to accelerate LLM inference without compromising output quality. Because ANPD integrates seamlessly as a plug-and-play solution, it is well suited to real-world use, providing a substantial boost in processing speed across a range of AI-driven applications.