
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding (2404.11912v3)

Published 18 Apr 2024 in cs.CL and cs.LG

Abstract: With LLMs widely deployed in long content generation recently, there has emerged an increasing demand for efficient long-sequence inference support. However, the key-value (KV) cache, which is stored to avoid re-computation, has emerged as a critical bottleneck by growing linearly in size with the sequence length. Due to the auto-regressive nature of LLMs, the entire KV cache is loaded for every generated token, resulting in low utilization of computational cores and high latency. While various KV cache compression methods have been proposed to alleviate this issue, they suffer from degradation in generation quality. We introduce TriForce, a hierarchical speculative decoding system that is scalable for long sequence generation. This approach leverages the original model weights and a dynamic sparse KV cache via retrieval as a draft model, which serves as an intermediate layer in the hierarchy and is further speculated by a smaller model to reduce its drafting latency. TriForce not only facilitates impressive speedups for Llama2-7B-128K, achieving up to 2.31$\times$ on an A100 GPU, but also showcases scalability in handling even longer contexts. For the offloading setting on two RTX 4090 GPUs, TriForce achieves 0.108s/token, only half as slow as the auto-regressive baseline on an A100, and attains 7.78$\times$ on our optimized offloading system. Additionally, TriForce is 4.86$\times$ faster than DeepSpeed-Zero-Inference on a single RTX 4090 GPU. TriForce's robustness is highlighted by its consistently outstanding performance across various temperatures. The code is available at https://github.com/Infini-AI-Lab/TriForce.

Hierarchical Speculative Decoding System for Efficient Long-Sequence Inference in LLMs

Introduction

LLMs excel in diverse applications but face efficiency problems during long-sequence generation. A key challenge is managing the key-value (KV) cache, which stores intermediate attention states to avoid redundant computation but grows linearly with sequence length, creating a memory bottleneck. Recent solutions use speculative decoding, where a lightweight draft model proposes future tokens that are then verified in parallel by the target model. However, these solutions still suffer from various performance drawbacks when scaled to longer sequences.
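To make the draft-and-verify mechanism concrete, here is a minimal sketch of the standard lossless acceptance rule used in speculative sampling: accept a drafted token with probability min(1, p(x)/q(x)), otherwise resample from the residual distribution. The toy distributions and names are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of lossless speculative-sampling acceptance (illustrative only).
import numpy as np

def accept_or_resample(x, p, q, rng):
    """x: drafted token id; q: draft model's next-token distribution;
    p: target model's next-token distribution at the same position."""
    if rng.random() < min(1.0, p[x] / max(q[x], 1e-12)):
        return x, True                       # draft token accepted
    residual = np.maximum(p - q, 0.0)        # on rejection, resample from the
    residual /= residual.sum()               # normalized residual distribution
    return rng.choice(len(p), p=residual), False

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])                # target distribution (toy)
q = np.array([0.3, 0.5, 0.2])                # draft distribution (toy)
print(accept_or_resample(1, p, q, rng))      # accepted token or resampled id
```

Accepted tokens cost only a parallel verification pass of the target model, which is where the speed-up comes from; the acceptance rule guarantees the output distribution matches ordinary decoding.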

Dual Bottleneck Observations and Core Insights

The paper identifies two primary memory bottlenecks: model weights and the KV cache. Key insights from the paper indicate:

  • Attention Sparsity: A large portion of the KV cache is redundant; only a small subset of entries receives most of the attention mass, so a partial cache can be used for drafting without significant loss in quality (a minimal sketch of how this can be measured follows this list).
  • Contextual Locality: Adjacent tokens tend to attend to similar parts of the context, so a retrieved cache segment can be reused across several consecutive decoding steps, reducing retrieval overhead.
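As a concrete illustration of the sparsity observation, the short sketch below measures what fraction of a single query's attention mass falls on its top-k cached keys; the shapes, the value of k, and the function name are assumptions for illustration.

```python
# Illustrative sketch: how much attention mass do the top-k cached keys capture?
import torch

def topk_attention_coverage(q: torch.Tensor, K: torch.Tensor, k: int) -> float:
    """q: (d,) query of the current token; K: (seq_len, d) cached keys."""
    d = q.shape[-1]
    scores = (K @ q) / d ** 0.5               # raw attention logits
    probs = torch.softmax(scores, dim=-1)     # attention weights over the cache
    return probs.topk(k).values.sum().item()  # mass captured by the top-k keys

torch.manual_seed(0)
q, K = torch.randn(64), torch.randn(4096, 64)
# With random keys the mass is spread out; on real long-context attention maps
# a small k typically captures most of the mass, which is the sparsity claim.
print(f"top-256 coverage: {topk_attention_coverage(q, K, 256):.2f}")
```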

These insights motivate the design of TriForce, a hierarchical speculative decoding system that pairs a retrieval-based partial KV cache with a lightweight draft model to address the two bottlenecks.

TriForce System Overview

TriForce integrates retrieval-based drafting with hierarchical speculative decoding to address the KV cache and model weight bottlenecks effectively:

  • Retrieval-Based Drafting: Instead of permanently discarding KV pairs as traditional eviction methods do, TriForce uses a dynamic retrieval strategy that selects the most relevant KV chunks for the current query, preserving the information needed for a high acceptance rate.
  • Hierarchical Speculation: A small, lightweight model with a sparse cache drafts tokens for the target model equipped with the retrieved partial cache, and that intermediate draft is in turn verified by the target model with its full KV cache. This staged speculation tackles the model weight and KV cache bottlenecks in sequence while keeping the final output distribution unchanged; a simplified sketch of this two-level loop follows.
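A simplified, self-contained sketch of this two-level draft-and-verify loop is shown below. For brevity it uses greedy matching instead of the paper's lossless rejection-sampling verification, and the three callables (tiny_draft, target_partial_kv, target_full_kv) are toy stand-ins for the draft model, the target model with a retrieved partial cache, and the target model with its full cache.

```python
# Greedy two-level speculation sketch (toy stand-ins, not the paper's code).

def draft(step, prefix, n):
    """Autoregressively draft n tokens with the given next-token function."""
    ctx, out = list(prefix), []
    for _ in range(n):
        t = step(ctx)
        out.append(t)
        ctx.append(t)
    return out

def verify(step, prefix, drafted):
    """Keep the longest drafted prefix the verifier agrees with, then append
    the verifier's own token at the first mismatch (or one bonus token)."""
    ctx, accepted = list(prefix), []
    for t in drafted:
        v = step(ctx)
        if v != t:
            accepted.append(v)
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(step(ctx))
    return accepted

# Toy "models": progressively better approximations of the same next-token rule.
tiny_draft        = lambda ctx: (sum(ctx) * 3) % 50
target_partial_kv = lambda ctx: (sum(ctx) * 3 + len(ctx) % 2) % 50
target_full_kv    = lambda ctx: (sum(ctx) * 3 + len(ctx) % 2) % 50

prefix = [1, 2, 3]
# Level 1: the tiny model drafts; the target model with a retrieved partial
# cache verifies, producing an intermediate draft cheaply.
intermediate = verify(target_partial_kv, prefix, draft(tiny_draft, prefix, 4))
# Level 2: the intermediate draft is verified by the target model with its
# full KV cache, which fixes the final output.
print(verify(target_full_kv, prefix, intermediate))
```

In the real system each verification is a single batched forward pass over all drafted tokens, so the hierarchy benefits both from the small model's speed and from the partial cache's reduced memory traffic.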

Empirical Evaluation

TriForce was tested on NVIDIA A100 and RTX 4090 GPUs with models like Llama2-7B-128K:

  • Speed Improvement: Up to 2.31x speed-up for Llama2-7B-128K on a single A100 GPU and up to 7.78x speed-up on two RTX 4090 GPUs in the offloading setting.
  • Robustness and Scalability: High acceptance rates and consistent performance across a range of sampling temperatures and settings. Theoretical estimates suggest the speed-up grows further as the context length increases.

Observations on KV Cache Handling

In-depth analysis of KV cache management shows:

  • Optimal Cache Budget: TriForce performs best with a 4K KV cache budget, which balances drafting overhead against the acceptance rate.
  • Chunk Size Selection: Chunks that are too small risk selecting context that matches only a few tokens, while excessively large chunks dilute the most informative tokens with irrelevant neighbors; a balanced chunk size is therefore important in retrieval-based strategies (a retrieval sketch parameterized by chunk size and budget follows).
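To illustrate how the chunk size and cache budget interact, the sketch below partitions cached keys into fixed-size chunks, scores each chunk by the current query against its mean key, and keeps the highest-scoring chunks until the budget is filled. The function name, shapes, and mean-key scoring are illustrative assumptions rather than the paper's exact retrieval implementation.

```python
# Illustrative budgeted chunk retrieval over a long KV cache.
import torch

def retrieve_kv(q, K, V, chunk_size=32, budget=4096):
    """q: (d,) current query; K, V: (seq_len, d) cached keys/values.
    Returns the retained rows of K and V (at most `budget` of them)."""
    seq_len, d = K.shape
    n_chunks = seq_len // chunk_size
    chunk_keys = K[: n_chunks * chunk_size].view(n_chunks, chunk_size, d).mean(dim=1)
    scores = chunk_keys @ q / d ** 0.5                    # chunk relevance
    keep = scores.topk(min(budget // chunk_size, n_chunks)).indices
    idx = (keep[:, None] * chunk_size + torch.arange(chunk_size)).flatten()
    return K[idx], V[idx]

# Toy usage: a 120K-token cache reduced to a 4K retrieval budget.
torch.manual_seed(0)
q = torch.randn(64)
K, V = torch.randn(120_000, 64), torch.randn(120_000, 64)
K_sel, V_sel = retrieve_kv(q, K, V, chunk_size=32, budget=4096)
print(K_sel.shape)  # torch.Size([4096, 64])
```

Varying chunk_size under a fixed budget trades how precisely the kept chunks match the current query against how much unrelated context each kept chunk brings along, which is the balance the paper's analysis points to.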

Future Directions and Theoretical Implications

TriForce's architecture suggests significant potential for extending LLM applicability to real-world scenarios that require long-context generation, such as document summarization and long-running conversational agents. Integration with tree-based speculative decoding could further improve throughput and efficiency, pointing to a promising direction for future work on LLM inference optimization.

Conclusion

This paper presents a compelling approach to the efficiency problems of LLMs when processing long sequences. By combining hierarchical speculative decoding with retrieval-based KV cache management, TriForce improves inference speed while preserving generation quality, since the final verification step uses the full target model and cache. It thus represents a substantial advance in the practical deployment of long-context LLMs.

Authors (5)
  1. Hanshi Sun (9 papers)
  2. Zhuoming Chen (15 papers)
  3. Xinyu Yang (109 papers)
  4. Yuandong Tian (128 papers)
  5. Beidi Chen (61 papers)