Accelerating Inference of LLMs with Token Recycling
Overview
The paper "Turning Trash into Treasure: Accelerating Inference of LLMs with Token Recycling" presents a novel method for accelerating the inference of LLMs. The proposed approach, referred to as Token Recycling, capitalizes on candidate tokens generated during the decoding process, typically discarded in conventional methods. This method leverages an adjacency matrix and breadth-first search (BFS)-like algorithm to recycle these candidate tokens, facilitating faster generation of subsequent sequences without additional training or substantial computational overhead.
Methodology
Token Recycling distinguishes itself through several key components and innovations (a minimal code sketch of the full loop follows this list):
- Adjacency Matrix Initialization: The method maintains an adjacency matrix that stores the candidate tokens generated at each decoding step. The matrix requires less than 2MB of additional storage, and a hot-start initialization pre-fills it so that acceleration is available from the first decoding steps.
- Draft Tree Retrieval: Using a BFS-like algorithm, the approach retrieves a draft tree from the adjacency matrix. The tree has a static, imbalanced structure that allocates more branches to higher-probability candidates, maximizing the likelihood that drafted tokens are accepted.
- Verification and Update: The draft tree is verified with tree attention, and the adjacency matrix is then updated with the new candidate tokens produced during verification. This keeps the retrieval space dynamic and adaptable to different inputs.
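To make the loop concrete, below is a minimal, runnable sketch under stated assumptions; it is not the paper's implementation. The model is replaced by a deterministic toy_top_k stand-in, the vocabulary is shrunk so the hot start runs instantly, the draft-tree template is invented for illustration, and verification checks one branch position per iteration rather than scoring the whole tree in a single tree-attention pass.

```python
import numpy as np

# Minimal sketch of the Token Recycling loop (simplified, illustrative only).

VOCAB = 1000    # toy vocabulary; with a real ~32K vocabulary and int32 entries,
K = 8           # a (VOCAB x K) adjacency matrix stays well under 2 MB

def toy_top_k(context, k=K):
    """Stand-in for the model's top-k candidates at the next position.

    A real implementation would read these off the logits of a forward pass;
    here they are derived deterministically from the last token so the
    sketch runs without a model.
    """
    seed = int(context[-1])
    return np.random.default_rng(seed).choice(VOCAB, size=k, replace=False)

# Adjacency matrix: row t holds the k most recent candidate successors of
# token t. Hot start: pre-fill every row so drafts are useful immediately.
M = np.zeros((VOCAB, K), dtype=np.int32)
for t in range(VOCAB):
    M[t] = toy_top_k([t])

# Static, imbalanced draft-tree template: (parent_slot, candidate_rank).
# parent_slot -1 is the root (the last generated token); rank-0 candidates
# receive more descendants than lower-ranked ones.
TREE = [(-1, 0), (-1, 1), (0, 0), (0, 1), (1, 0), (2, 0)]

def retrieve_draft(root_token):
    """BFS-like expansion of the draft tree from the adjacency matrix."""
    tokens = []
    for parent_slot, rank in TREE:
        parent = root_token if parent_slot == -1 else tokens[parent_slot]
        tokens.append(int(M[parent, rank]))
    return tokens

def verify_and_update(context, draft_tokens):
    """Accept the longest draft branch that matches greedy decoding and
    recycle the fresh candidates into the adjacency matrix.

    The real method checks every draft position at once with tree attention;
    this loop verifies one position per iteration for clarity.
    """
    accepted, slot = [], -1
    while True:
        last = (context + accepted)[-1]
        candidates = toy_top_k(context + accepted)
        M[last] = candidates                      # recycle: refresh this row
        greedy = int(candidates[0])
        accepted.append(greedy)                   # greedy token is always kept
        children = [i for i, (p, _) in enumerate(TREE) if p == slot]
        match = next((i for i in children if draft_tokens[i] == greedy), None)
        if match is None:
            break                                 # draft diverged: stop here
        slot = match
    return accepted

# One decoding iteration: draft from the matrix, then verify and recycle.
prompt = [11, 42, 7]
draft = retrieve_draft(prompt[-1])
print("accepted this step:", verify_and_update(prompt, draft))
```

Because the matrix rows are refreshed with the verifier's own candidates, drafts improve as generation proceeds, which is the core of the recycling idea.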
Experimental Validation
Extensive experiments were conducted on the SpecBench and MBPP benchmarks using Vicuna and Code Llama models across varying parameter sizes (7B, 13B, 33B). The results indicate that Token Recycling consistently achieves a speedup of roughly 2x over standard autoregressive decoding across all model sizes.
Key Metrics
- Mean Accepted Tokens (MAT): the average number of tokens produced per decoding step. Token Recycling significantly outperformed existing training-free methods, improving MAT by more than 31%.
- Tokens per Second (tokens/s): The proposed method also demonstrated high throughput, indicating efficient token generation.
- Speedup: Both general and specialized tasks benefited from the method, with notable improvements in scenarios that require generating new content (a small worked example of how these metrics relate appears below).
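For clarity on how these metrics relate, the snippet below computes MAT, throughput, and speedup from hypothetical per-step acceptance counts and timings; none of the numbers are results from the paper.

```python
# Hypothetical numbers for illustration only -- not results from the paper.
accepted_per_step = [4, 3, 5, 2, 4]     # tokens produced at each decoding step
wall_clock_s = 0.45                     # time spent on those steps with Token Recycling
baseline_tokens_per_s = 18.0            # plain autoregressive throughput of the same model

mat = sum(accepted_per_step) / len(accepted_per_step)     # Mean Accepted Tokens per step
tokens_per_s = sum(accepted_per_step) / wall_clock_s      # throughput
speedup = tokens_per_s / baseline_tokens_per_s            # relative to vanilla decoding

print(f"MAT={mat:.2f}  tokens/s={tokens_per_s:.1f}  speedup={speedup:.2f}x")
```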
Implications
This research challenges current speculative decoding paradigms by demonstrating that valuable information persists in candidate tokens that are typically discarded. By reusing these "trash" tokens, Token Recycling improves LLM inference efficiency without modifying the model architecture or requiring additional training.
Theoretical Implications
Token Recycling introduces a novel speculative decoding mechanism that integrates seamlessly with existing LLM infrastructure without retraining. More broadly, it suggests that intermediate outputs usually treated as waste, such as candidate tokens, may be a useful source of signal for further optimization strategies.
Practical Implications
The practical implications are substantial, especially for real-time applications and environments with limited computational resources. The reduction in inference latency opens avenues for deploying LLMs in more dynamic, resource-constrained settings.
Future Directions
Future work could explore different tree structures or adaptive strategies during the construction and update phases. Another potential development lies in examining the applicability of Token Recycling in other neural network architectures beyond LLMs, thereby broadening its utility.
Conclusion
Token Recycling is a compelling approach that leverages previously overlooked candidate tokens to accelerate LLM inference effectively. Its ability to be seamlessly integrated with existing models, combined with its robust performance across various tasks and model sizes, positions it as a valuable strategy for overcoming the latency bottleneck in LLM inference. The paper's insights and methodology pave the way for further research and optimization in speculative decoding frameworks.