Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding (2404.08698v2)

Published 10 Apr 2024 in cs.CL and cs.LG

Abstract: While LLMs have shown remarkable abilities, they are hindered by significant resource consumption and considerable latency due to autoregressive processing. In this study, we introduce Adaptive N-gram Parallel Decoding (ANPD), an innovative and lossless approach that accelerates inference by allowing the simultaneous generation of multiple tokens. ANPD incorporates a two-stage approach: it begins with a rapid drafting phase that employs an N-gram module, which adapts based on the current interactive context, followed by a verification phase, during which the original LLM assesses and confirms the proposed tokens. Consequently, ANPD preserves the integrity of the LLM's original output while enhancing processing speed. We further leverage a multi-level architecture for the N-gram module to enhance the precision of the initial draft, consequently reducing inference latency. ANPD eliminates the need for retraining or extra GPU memory, making it an efficient and plug-and-play enhancement. In our experiments, models such as LLaMA and its fine-tuned variants have shown speed improvements up to 3.67x, validating the effectiveness of our proposed ANPD.

Accelerating LLM Inference with Adaptive N-gram Parallel Decoding (ANPD)

Introduction

Recent advancements in LLMs have been accompanied by increased computational demands, particularly during the inference phase. This paper introduces an innovative method, Adaptive N-gram Parallel Decoding (ANPD), which improves the inference efficiency of LLMs. ANPD uses a unique two-stage approach that combines rapid parallel token generation via an N-gram model with a subsequent verification process by the LLM, ensuring output accuracy while markedly reducing latency.

Core Contributions

The primary contributions of this paper are:

  • The development of ANPD, a plug-and-play approach designed to speed up LLM inference without the need for additional training or significant changes to existing model architectures.
  • Integration of an adaptive N-gram modeling technique that simplifies language modeling while retaining high accuracy and reducing the reliance on large textual datasets.
  • Introduction of a Multi-Level N-gram (MLN) algorithm to increase the precision of the draft outputs, thus enhancing the efficiency of the acceleration process.
  • Comprehensive validation of ANPD's effectiveness across various models and datasets, demonstrating substantial improvements in processing speed.

Methodology

Adaptive N-gram Parallel Decoding (ANPD)

ANPD accelerates LLM inference by drafting multiple candidate tokens with an N-gram module and confirming them with the LLM. Generation starts from the tokenized prompt; the N-gram predictions are then updated dynamically from the running text, so drafting and verification alternate in a cyclical loop.
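
The following is a minimal Python sketch of this draft-and-verify loop, not the authors' implementation: `llm_next_token` is a hypothetical greedy decoding step for the base LLM, the N-gram table is rebuilt from the running context on each iteration, and verification is shown token by token for clarity, whereas in practice the LLM checks the whole draft in a single parallel forward pass.

```python
from collections import defaultdict

def build_ngram_table(tokens, n):
    """Map each (n-1)-token prefix seen in the context to the tokens that followed it."""
    table = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i:i + n - 1])
        table[prefix].append(tokens[i + n - 1])
    return table

def draft_tokens(tokens, n, k):
    """Draft up to k tokens by repeatedly querying the context N-gram table."""
    table = build_ngram_table(tokens, n)
    draft, ctx = [], list(tokens)
    for _ in range(k):
        prefix = tuple(ctx[-(n - 1):])
        if prefix not in table:
            break                         # no match in the context: stop drafting
        nxt = table[prefix][-1]           # illustrative choice: most recent continuation
        draft.append(nxt)
        ctx.append(nxt)
    return draft

def anpd_generate(llm_next_token, prompt_tokens, max_new, n=3, k=4):
    """Hypothetical driver: alternate N-gram drafting with LLM verification."""
    tokens = list(prompt_tokens)
    produced = 0
    while produced < max_new:
        draft = draft_tokens(tokens, n, k)
        accepted = []
        for d in draft:
            t = llm_next_token(tokens + accepted)
            accepted.append(t)            # keep the LLM's own choice either way
            if t != d:                    # mismatch: reject the rest of the draft
                break
        if not draft:                     # no draft available: plain autoregressive step
            accepted = [llm_next_token(tokens)]
        tokens.extend(accepted)
        produced += len(accepted)
    return tokens
```

Because every emitted token is the LLM's own greedy choice, the output matches plain autoregressive decoding exactly; the saving comes from verifying several drafted tokens per LLM forward pass instead of one.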

Multi-Level N-gram (MLN)

To balance speed and draft accuracy, ANPD incorporates a Multi-Level N-gram setup: when a higher-order N-gram fails to match (a consequence of sparsity), the module falls back to lower-order, more reliable N-grams.
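
A correspondingly hedged sketch of the multi-level lookup is below, reusing `build_ngram_table` from the sketch above; the fallback order and the choice of the most recent continuation are illustrative assumptions rather than the paper's exact policy.

```python
def build_mln_tables(tokens, max_n):
    """Build one N-gram table per order, 2..max_n, from the running context."""
    return {n: build_ngram_table(tokens, n) for n in range(2, max_n + 1)}

def draft_next_token_mln(tokens, tables, max_n):
    """Try the highest-order table first; fall back to lower orders on a miss."""
    for n in range(max_n, 1, -1):
        prefix = tuple(tokens[-(n - 1):])
        continuations = tables[n].get(prefix)
        if continuations:
            return continuations[-1]      # illustrative: take the most recent continuation
    return None                           # no level matched; defer this step to the LLM
```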

Experiments

Comprehensive experiments were conducted with LLaMA and its fine-tuned variants across tasks such as text summarization and code generation. For the models assessed, speed improvements ranged from 1.95x to 3.67x. The experiments used multiple datasets, CNN/Daily Mail and XSUM for summarization and HumanEval for code, to ensure a diverse, robust assessment.

Results and Discussion

The ANPD method demonstrated consistent performance improvements across different models and tasks. For instance, in text summarization tasks with LLaMA-7B, a speed improvement of approximately 2.98x was observed. Moreover, in code generation tasks using the CodeLLaMa-13B model, the speed improvement was as high as 3.67x. These results underline the effectiveness of ANPD in reducing inference times while maintaining the output quality of the original LLMs.

Future Directions

Future work could extend to:

  • Adapting ANPD to exploit specific features of different LLMs to optimize its effectiveness further.
  • Extending the parallel processing capabilities during the LLM's verification phase to enhance performance gains.

Conclusion

The paper presents a scalable and efficient method to accelerate the inference time of LLMs without compromising on the quality of output. ANPD's ability to integrate seamlessly as a plug-and-play solution makes it highly applicable for real-world uses, providing a substantial boost in processing speeds across various AI-driven applications.

Authors (3)
  1. Jie Ou (13 papers)
  2. Yueming Chen (1 paper)
  3. Wenhong Tian (24 papers)
Citations (6)