
Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy (2312.12728v3)

Published 20 Dec 2023 in cs.IR, cs.AI, and cs.LG

Abstract: As LLMs have made significant advancements across various tasks, such as question answering, translation, text summarization, and dialogue systems, the need for accuracy in information becomes crucial, especially for serious financial products serving billions of users like Alipay. However, for a real-world product serving millions of users, the inference speed of LLMs becomes a critical factor compared to a mere experimental model. Hence, this paper presents a generic framework for accelerating the inference process, resulting in a substantial increase in speed and cost reduction for our LLM-based scenarios, with lossless generation accuracy. In the traditional inference process, each token is generated sequentially by the LLM, leading to a time consumption proportional to the number of generated tokens. To enhance this process, our framework, named lookahead, introduces a multi-branch strategy. Instead of generating a single token at a time, we propose a Trie-based retrieval and verification mechanism that accepts several tokens in a single forward step. Our strategy offers two distinct advantages: (1) it guarantees absolute correctness of the output, avoiding any approximation algorithms, and (2) the worst-case performance of our approach is equivalent to the conventional process. We conduct extensive experiments to demonstrate the significant improvements achieved by applying our inference acceleration framework. Our framework has been widely deployed in Alipay since April 2023 and obtains a remarkable 2.66x to 6.26x speedup. Our code is available at https://github.com/alipay/PainlessInferenceAcceleration.

Overview of the Lookahead Framework

Introduction

In the landscape of transformer-based LLMs, while significant progress has been made in language-based tasks, inference latency during generative tasks remains a critical challenge. At scale, latency becomes especially pressing in real-world applications such as those deployed by financial services. An analysis reveals that I/O bandwidth, rather than computational (FLOPs) capacity, is frequently the main performance bottleneck. Existing methods to reduce inference latency, including quantization, sparsity, and distillation, typically involve a trade-off with accuracy. As such, there is a pivotal need for a solution that not only accelerates inference but also preserves the accuracy of generations.
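A back-of-envelope calculation (not from the paper, with assumed hardware and model numbers) illustrates why batch-1 decoding is bandwidth-bound rather than compute-bound: the time to stream the weights from memory dwarfs the time to do the arithmetic, which is exactly the headroom Lookahead exploits by verifying several drafted tokens in one pass.

```python
# Back-of-envelope sketch (assumptions, not figures from the paper): a
# 7B-parameter model in fp16 on a GPU with ~312 TFLOPS of fp16 compute
# and ~2 TB/s of HBM bandwidth, decoding with batch size 1.

PARAMS = 7e9                  # model parameters (assumption)
BYTES_PER_PARAM = 2           # fp16
FLOPS_PER_TOKEN = 2 * PARAMS  # ~2 FLOPs per parameter per generated token

GPU_FLOPS = 312e12            # peak fp16 throughput (assumption)
GPU_BANDWIDTH = 2e12          # HBM bandwidth in bytes/s (assumption)

compute_time = FLOPS_PER_TOKEN / GPU_FLOPS           # ~0.045 ms per token
io_time = PARAMS * BYTES_PER_PARAM / GPU_BANDWIDTH   # ~7 ms per token

print(f"compute-bound estimate:   {compute_time * 1e3:.3f} ms/token")
print(f"bandwidth-bound estimate: {io_time * 1e3:.3f} ms/token")
# Weight loading dominates by roughly two orders of magnitude, so scoring a
# handful of extra drafted tokens in the same forward pass is nearly free.
```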

Acceleration Framework

Lookahead, as developed in this work, is a framework designed to accelerate LLM inference without compromising generation accuracy. This is particularly important for scenarios where every token's correctness is paramount. The cornerstone of Lookahead's methodology is a novel multi-branch strategy, which marks a departure from conventional sequential token generation. In a standard inference process, tokens are generated one by one. Lookahead changes this by using Trie-based retrieval to draft multiple candidate branches of token sequences at once; each branch then undergoes a verification-and-accept step, and the longest drafted prefix that matches the model's own predictions is kept for the final output. A simplified sketch of this retrieve-and-verify loop is given below.
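The following is a minimal, single-branch sketch of the retrieve-then-verify idea, a simplification of the paper's multi-branch strategy rather than the authors' implementation. The function `model_greedy_step` is a hypothetical stand-in that returns the model's greedy next-token prediction after the context and after each drafted token, all from one forward pass.

```python
from collections import defaultdict

class TokenTrie:
    """Trie over token n-grams seen in prompts and earlier outputs."""
    def __init__(self):
        self.children = defaultdict(TokenTrie)

    def insert(self, tokens):
        node = self
        for t in tokens:
            node = node.children[t]

    def retrieve(self, prefix, max_len):
        """Return one drafted continuation of `prefix`, or [] if none is stored."""
        node = self
        for t in prefix:
            if t not in node.children:
                return []
            node = node.children[t]
        branch = []
        while node.children and len(branch) < max_len:
            t = next(iter(node.children))  # a real system would rank candidate branches
            branch.append(t)
            node = node.children[t]
        return branch

def lookahead_step(context, trie, model_greedy_step, draft_len=8):
    """Accept the longest drafted prefix that matches the model's own greedy choices."""
    draft = trie.retrieve(context[-2:], draft_len)   # retrieve by a short suffix of the context
    # One forward pass scores the context plus the whole draft and yields
    # len(draft) + 1 greedy predictions: after the context and after each drafted token.
    predicted = model_greedy_step(context + draft)
    accepted = []
    for i, drafted_token in enumerate(draft):
        if predicted[i] != drafted_token:            # first mismatch stops acceptance
            break
        accepted.append(drafted_token)
    accepted.append(predicted[len(accepted)])        # always emit one verified token
    return accepted
```

Because the step always emits at least the one token the model would have produced anyway, the worst case degenerates to ordinary sequential decoding, which mirrors the paper's losslessness guarantee.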

Comparative Analysis and Performance Enhancement

The authors provide a comparative analysis of the effectiveness of various acceleration methods applied to LLMs. Lookahead's multi-branch strategy, built on Trie data structures, greatly accelerates token generation, showing substantial speedups over other state-of-the-art acceleration methods while maintaining lossless generation accuracy. Importantly, Lookahead is compatible with a range of current LLMs such as GLM, Llama, OPT, BLOOM, and others, and requires only minimal code changes to integrate, as sketched below.
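The sketch below shows only the intended shape of a minimal integration with a Hugging Face model; the lookahead-specific wrapper name and parameters are illustrative assumptions, not the repository's actual API, which is documented at https://github.com/alipay/PainlessInferenceAcceleration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # any supported decoder-only LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Hypothetical one-line wrapper (illustrative name only) that would replace the
# sequential decoding loop with Trie-based multi-branch drafting while leaving
# weights and outputs untouched:
# model = wrap_with_lookahead(model, branch_length=8, branch_count=12)

prompt = "How do I raise my daily transfer limit?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```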

Conclusion

With an emphasis on empirical data, the authors convincingly argue that they have not only identified latency as the primary challenge in LLM inference but also formulated a robust solution. The Lookahead framework demonstrates that efficiency and accuracy need not be mutually exclusive goals in LLM deployment. By exploiting the compute capacity that sits idle while decoding is bound by memory bandwidth, it delivers a significant improvement in inference speed. Its successful deployment across a variety of real-world applications within Alipay is a testament to its efficacy, and its open-source release positions it as a potentially transformative contribution to LLM infrastructure.

Authors (5)
  1. Yao Zhao (272 papers)
  2. Zhitian Xie (2 papers)
  3. Chenyi Zhuang (20 papers)
  4. Jinjie Gu (50 papers)
  5. Chen Liang (140 papers)
Citations (7)