
Post-Training Sparse Attention with Double Sparsity (2408.07092v2)

Published 11 Aug 2024 in cs.LG, cs.AI, and cs.CL

Abstract: The inference process for LLMs is slow and memory-intensive, with one of the most critical bottlenecks being excessive Key-Value (KV) cache accesses. This paper introduces "Double Sparsity," a novel post-training sparse attention technique designed to alleviate this bottleneck by reducing KV cache access. Double Sparsity combines token sparsity, which focuses on utilizing only the important tokens for computing self-attention, with channel sparsity, an approach that uses important feature channels for identifying important tokens. Our key insight is that the pattern of channel sparsity is relatively static, allowing us to use offline calibration to make it efficient at runtime, thereby enabling accurate and efficient identification of important tokens. Moreover, this method can be combined with offloading to achieve significant memory usage reduction. Experimental results demonstrate that Double Sparsity can achieve $\frac{1}{16}$ token and channel sparsity with minimal impact on accuracy across various tasks, including wiki-2 perplexity, key-value retrieval, and long context benchmarks with models including Llama-2-7B, Llama-2-70B, and Mixtral-8x7B. It brings up to a 14.1$\times$ acceleration in attention operations and a 1.9$\times$ improvement in end-to-end inference on GPUs. With offloading, it achieves a decoding speed acceleration of 16.3$\times$ compared to state-of-the-art solutions at a sequence length of 256K. Our code is publicly available at https://github.com/andy-yang-1/DoubleSparse.

Post-Training Sparse Attention with Double Sparsity

The paper "Post-Training Sparse Attention with Double Sparsity" addresses the challenge of improving the efficiency of inference in LLMs by proposing a novel post-training sparse attention mechanism. The method, coined "Double Sparsity", aims to alleviate the bottleneck posed by excessive KV cache accesses, thereby accelerating the attention computation process and enhancing overall inference speed.

Introduction

The inference process in LLMs, which requires token-by-token decoding, is known to be slow and memory-intensive due to low arithmetic intensity, making it largely memory-bound. During decoding, the model must frequently access both model weights and the KV cache in the self-attention layers. The KV cache often becomes a critical bottleneck, especially as the sequence length and batch size increase. While previous research has extensively explored reducing access to model weights through quantization and sparsification, the reduction of access to the KV cache has remained less examined.

Methodology

Double Sparsity seeks to reduce the number of accesses to the KV cache by employing a combination of token sparsity and channel sparsity. Token sparsity utilizes only the important tokens for computing self-attention, while channel sparsity identifies important feature channels to determine the significant tokens. Notably, the channel sparsity pattern is relatively static, which allows for efficient offline calibration.
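To make the interaction between the two forms of sparsity concrete, the following PyTorch sketch outlines one decoding step under these assumptions; names such as `double_sparsity_decode_step` and `label_idx` are illustrative and do not reflect the authors' released code.

```python
import torch

def double_sparsity_decode_step(q, K, V, label_idx, k_tokens=None):
    """Sketch of one decoding step in the spirit of Double Sparsity.

    q:          (1, d) query for the current token
    K, V:       (T, d) cached keys/values for T previous tokens
    label_idx:  indices of important channels chosen by offline calibration
    k_tokens:   number of tokens to keep (e.g. T // 16 for 1/16 token sparsity)
    """
    T, d = K.shape
    if k_tokens is None:
        k_tokens = max(1, T // 16)

    # 1) Channel sparsity: approximate attention scores using only the
    #    calibrated important channels of the key cache.
    approx_scores = q[:, label_idx] @ K[:, label_idx].T / d**0.5     # (1, T)

    # 2) Token sparsity: keep only the top-k tokens under that approximation.
    top_idx = approx_scores.topk(k_tokens, dim=-1).indices.squeeze(0)

    # 3) Exact attention over the selected tokens only, so full-precision
    #    K/V rows are fetched for just k_tokens entries of the KV cache.
    scores = q @ K[top_idx].T / d**0.5                   # (1, k_tokens)
    attn = torch.softmax(scores, dim=-1)
    return attn @ V[top_idx]                             # (1, d)
```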

Token Sparsity

Token sparsity is predicated on the observation that not all tokens contribute equally to the decoding of the next token. Therefore, sparse attention methods can achieve nearly the same results by relying on a sparse subset of important tokens. Previous approaches to sparse attention have been limited by either significant accuracy loss or a lack of runtime efficiency. Double Sparsity addresses these shortcomings by efficiently identifying important tokens at runtime using channel sparsity.
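As a small illustration of this observation (not taken from the paper), the snippet below measures how much attention probability mass the top-k tokens capture for a single query; a value near 1.0 at small k is exactly the situation token sparsity exploits.

```python
import torch

def topk_attention_mass(attn_weights, k):
    """Fraction of attention probability mass captured by the top-k tokens.

    attn_weights: (T,) softmax attention weights of one query over T cached tokens.
    """
    topk = attn_weights.topk(min(k, attn_weights.numel())).values
    return (topk.sum() / attn_weights.sum()).item()

# Example: a peaked distribution where 1/16 of the tokens carry most of the mass.
weights = torch.softmax(torch.randn(256) * 4.0, dim=0)
print(topk_attention_mass(weights, k=256 // 16))
```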

Channel Sparsity

Channel sparsity involves selecting important feature channels offline, leveraging the relatively static nature of channel sparsity patterns. This static information is then used at runtime to identify significant tokens for each attention layer. The approach includes offline calibration to determine the critical channels, ensuring accurate and efficient identification during decoding, as sketched below.
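The calibration step could look roughly like the following sketch; the specific channel-ranking heuristic (average per-channel contribution to query-key dot products over calibration data) is an assumption for illustration, not the paper's exact procedure.

```python
import torch

def calibrate_label_channels(Q_calib, K_calib, num_channels):
    """Offline selection of important channels (illustrative heuristic).

    Q_calib, K_calib: (N, d) queries and keys collected from a calibration set
                      for one attention layer/head.
    Returns the indices of the `num_channels` channels that contribute most,
    on average, to the query-key dot products.
    """
    # Per-channel contribution |q_c * k_c| averaged over calibration pairs.
    contribution = (Q_calib * K_calib).abs().mean(dim=0)     # (d,)
    return contribution.topk(num_channels).indices           # channel indices

# Usage sketch (hypothetical names): keep d/16 channels and build a compact
# copy of the key cache restricted to them for fast score approximation.
# label_idx = calibrate_label_channels(Q_calib, K_calib, num_channels=d // 16)
# label_cache = K_cache[:, label_idx]
```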

Results

The experimental results indicate that Double Sparsity achieves $\frac{1}{16}$ token and channel sparsity with minimal impact on accuracy across a variety of benchmarks, including language modeling, question answering, and retrieval tasks. The paper reports up to a 14.1x acceleration in attention operations and a 1.9x improvement in end-to-end inference speed on GPUs. Furthermore, the Double Sparsity-Offload variant reduces GPU memory usage to 1/16 of the original KV cache size without increasing latency, thus significantly enhancing memory efficiency.

Practical and Theoretical Implications

The practical implications of Double Sparsity are substantial for real-world applications requiring efficient LLM deployment, especially in resource-constrained environments. By alleviating the KV cache bottleneck, Double Sparsity enables faster inference speeds, which is crucial for applications in natural language processing, real-time data analysis, and interactive AI systems.

From a theoretical standpoint, the research highlights the importance of understanding the static and dynamic characteristics of model components, such as the sparsity patterns in KV caches. This understanding can lead to the development of more sophisticated methods for optimizing not only model weights but also other critical memory access patterns.

Future Directions

Future research could explore enhancing the asynchronous capabilities of Double Sparsity to better mask communication overheads, aiming for more significant acceleration with minimal memory footprint. Moreover, examining the applicability of Double Sparsity to other types of neural network architectures or expanding it to models with different attention mechanisms could yield further insights and improvements.

Conclusion

Double Sparsity offers a significant advancement in the field of LLM inference optimization by introducing a post-training sparse attention mechanism that effectively combines token and channel sparsity. The approach provides substantial speedups in attention operations and overall inference, with minimal accuracy loss and reduced memory usage. These improvements have the potential to transform the deployment and efficiency of LLMs in various practical applications.

Authors (5)
  1. Shuo Yang
  2. Ying Sheng
  3. Joseph E. Gonzalez
  4. Ion Stoica
  5. Lianmin Zheng