Accelerating Inference of LLMs with Token Recycling
Overview
The paper "Turning Trash into Treasure: Accelerating Inference of LLMs with Token Recycling" presents a novel method for accelerating the inference of LLMs. The proposed approach, referred to as Token Recycling, capitalizes on candidate tokens generated during the decoding process, typically discarded in conventional methods. This method leverages an adjacency matrix and breadth-first search (BFS)-like algorithm to recycle these candidate tokens, facilitating faster generation of subsequent sequences without additional training or substantial computational overhead.
Methodology
Token Recycling distinguishes itself through several key components and innovations (a minimal code sketch of the full loop follows this list):
- Adjacency Matrix Initialization: The method maintains an adjacency matrix that stores the candidate tokens generated at each decoding step. The matrix requires less than 2MB of additional storage, and a hot-start initialization pre-fills it so that acceleration is available from the first decoding steps.
- Draft Tree Retrieval: Using a BFS-like algorithm, the approach retrieves a draft tree from the adjacency matrix. The tree has a static, imbalanced structure that allocates more branches to higher-probability candidates, maximizing the likelihood that drafted tokens are accepted.
- Verification and Update: The draft tree is verified with tree attention, and the adjacency matrix is then updated with the new candidate tokens produced during verification. This keeps the retrieval space dynamic and adaptable to different inputs.
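To make the loop concrete, below is a minimal, runnable sketch under stated assumptions; it is not the paper's implementation. The model is replaced by a deterministic toy_top_k stand-in, the vocabulary is shrunk so the hot start runs instantly, the draft-tree template is invented for illustration, and verification checks one branch position per iteration rather than scoring the whole tree in a single tree-attention pass.

```python
import numpy as np

# Minimal sketch of the Token Recycling loop (simplified, illustrative only).

VOCAB = 1000    # toy vocabulary; with a real ~32K vocabulary and int32 entries,
K = 8           # a (VOCAB x K) adjacency matrix stays well under 2 MB

def toy_top_k(context, k=K):
    """Stand-in for the model's top-k candidates at the next position.

    A real implementation would read these off the logits of a forward pass;
    here they are derived deterministically from the last token so the
    sketch runs without a model.
    """
    seed = int(context[-1])
    return np.random.default_rng(seed).choice(VOCAB, size=k, replace=False)

# Adjacency matrix: row t holds the k most recent candidate successors of
# token t. Hot start: pre-fill every row so drafts are useful immediately.
M = np.zeros((VOCAB, K), dtype=np.int32)
for t in range(VOCAB):
    M[t] = toy_top_k([t])

# Static, imbalanced draft-tree template: (parent_slot, candidate_rank).
# parent_slot -1 is the root (the last generated token); rank-0 candidates
# receive more descendants than lower-ranked ones.
TREE = [(-1, 0), (-1, 1), (0, 0), (0, 1), (1, 0), (2, 0)]

def retrieve_draft(root_token):
    """BFS-like expansion of the draft tree from the adjacency matrix."""
    tokens = []
    for parent_slot, rank in TREE:
        parent = root_token if parent_slot == -1 else tokens[parent_slot]
        tokens.append(int(M[parent, rank]))
    return tokens

def verify_and_update(context, draft_tokens):
    """Accept the longest draft branch that matches greedy decoding and
    recycle the fresh candidates into the adjacency matrix.

    The real method checks every draft position at once with tree attention;
    this loop verifies one position per iteration for clarity.
    """
    accepted, slot = [], -1
    while True:
        last = (context + accepted)[-1]
        candidates = toy_top_k(context + accepted)
        M[last] = candidates                      # recycle: refresh this row
        greedy = int(candidates[0])
        accepted.append(greedy)                   # greedy token is always kept
        children = [i for i, (p, _) in enumerate(TREE) if p == slot]
        match = next((i for i in children if draft_tokens[i] == greedy), None)
        if match is None:
            break                                 # draft diverged: stop here
        slot = match
    return accepted

# One decoding iteration: draft from the matrix, then verify and recycle.
prompt = [11, 42, 7]
draft = retrieve_draft(prompt[-1])
print("accepted this step:", verify_and_update(prompt, draft))
```

Because the matrix rows are refreshed with the verifier's own candidates, drafts improve as generation proceeds, which is the core of the recycling idea.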
Experimental Validation
Extensive experiments were conducted on the SpecBench and MBPP benchmarks using Vicuna and Code Llama models across varying parameter sizes (7B, 13B, 33B). The results indicate that Token Recycling consistently achieves a speedup of roughly 2x over standard autoregressive decoding across all model sizes.
Key Metrics
- Mean Accepted Tokens (MAT): the average number of tokens produced per decoding step. Token Recycling significantly outperformed existing training-free methods, improving MAT by more than 31%.
- Tokens per Second (tokens/s): The proposed method also demonstrated high throughput, indicating efficient token generation.
- Speedup: Both general and specialized tasks benefited from the method, with notable improvements in scenarios that require generating new content (a small worked example of how these metrics relate appears below).
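For clarity on how these metrics relate, the snippet below computes MAT, throughput, and speedup from hypothetical per-step acceptance counts and timings; none of the numbers are results from the paper.

```python
# Hypothetical numbers for illustration only -- not results from the paper.
accepted_per_step = [4, 3, 5, 2, 4]     # tokens produced at each decoding step
wall_clock_s = 0.45                     # time spent on those steps with Token Recycling
baseline_tokens_per_s = 18.0            # plain autoregressive throughput of the same model

mat = sum(accepted_per_step) / len(accepted_per_step)     # Mean Accepted Tokens per step
tokens_per_s = sum(accepted_per_step) / wall_clock_s      # throughput
speedup = tokens_per_s / baseline_tokens_per_s            # relative to vanilla decoding

print(f"MAT={mat:.2f}  tokens/s={tokens_per_s:.1f}  speedup={speedup:.2f}x")
```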
Implications
This research challenges current speculative decoding paradigms by demonstrating that valuable information persists in candidate tokens that are typically discarded. By reusing these "trash" tokens, Token Recycling improves LLM inference efficiency without modifying the model architecture or requiring additional training.
Theoretical Implications
Token Recycling introduces a novel speculative decoding mechanism that integrates seamlessly with existing LLM infrastructure without retraining. More broadly, it suggests that intermediate outputs usually treated as waste, such as candidate tokens, may be a useful source of signal for further optimization strategies.
Practical Implications
The practical implications are substantial, especially for real-time applications and environments with limited computational resources. The reduction in inference latency opens avenues for deploying LLMs in more dynamic, resource-constrained settings.
Future Directions
Future work could explore different tree structures or adaptive strategies during the construction and update phases. Another potential development lies in examining the applicability of Token Recycling in other neural network architectures beyond LLMs, thereby broadening its utility.
Conclusion
Token Recycling is a compelling approach that leverages previously overlooked candidate tokens to accelerate LLM inference effectively. Its ability to be seamlessly integrated with existing models, combined with its robust performance across various tasks and model sizes, positions it as a valuable strategy for overcoming the latency bottleneck in LLM inference. The paper's insights and methodology pave the way for further research and optimization in speculative decoding frameworks.