RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval (2409.10516v2)

Published 16 Sep 2024 in cs.LG and cs.CL

Abstract: Transformer-based LLMs have become increasingly important. However, due to the quadratic time complexity of attention computation, scaling LLMs to longer contexts incurs extremely slow inference latency and high GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach to both accelerate attention computation and reduce GPU memory consumption. By leveraging the dynamic sparsity of the attention mechanism, RetrievalAttention proposes to use approximate nearest neighbor search (ANNS) indexes for KV vectors in CPU memory and retrieves the most relevant ones with vector search during generation. Unfortunately, we observe that off-the-shelf ANNS indexes are often ineffective for such retrieval tasks due to the out-of-distribution (OOD) problem between query vectors and key vectors in the attention mechanism. RetrievalAttention addresses the OOD challenge by designing an attention-aware vector search algorithm that can adapt to the distribution of query vectors. Our evaluation shows that RetrievalAttention only needs to access 1--3% of data while maintaining high model accuracy. This leads to a significant reduction in the inference cost of long-context LLMs with a much lower GPU memory footprint. In particular, RetrievalAttention only needs a single NVIDIA RTX 4090 (24GB) for serving 128K tokens in LLMs with 8B parameters, and is capable of generating one token in 0.188 seconds.

Effective Acceleration of Long-Context LLM Inference with RetrievalAttention

The paper "RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval" provides an advanced solution to the challenge of the high computational cost associated with long-context processing in Transformer-based LLMs. The core motivation behind this work is the quadratic time complexity of attention operations, which results in substantial inference latency and GPU memory consumption, especially for long-context scenarios.

Overview of the Approach

The proposed solution, RetrievalAttention, is a training-free method that accelerates attention computation by building approximate nearest neighbor search (ANNS) indexes over the key-value (KV) vectors stored in CPU memory. It addresses two key challenges in scaling LLMs to longer contexts:

  1. Inference Latency and GPU Memory: Traditional KV caching keeps every KV vector in GPU memory during inference, which requires massive amounts of memory and becomes infeasible as context lengths extend into millions of tokens (a rough size estimate follows this list).
  2. Dynamic Sparsity of Attention: Typically, only a minimal subset of KV vectors contributes to the attention result for any given query, highlighting the inefficiency of full attention computation.
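
To make item 1 concrete, consider a rough back-of-the-envelope estimate assuming Llama-3-8B-style dimensions (32 layers, 8 grouped-query KV heads, head dimension 128, FP16 storage). These configuration figures are illustrative assumptions, not numbers reported in the paper.

```python
# Rough KV-cache size estimate for a Llama-3-8B-style model at 128K context.
# The model configuration below is an illustrative assumption.
n_layers, n_kv_heads, head_dim = 32, 8, 128   # grouped-query attention: 8 KV heads
bytes_per_elem = 2                            # FP16
context_len = 128 * 1024                      # 128K tokens

# Each token stores one key and one value vector per layer and per KV head.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
kv_cache_gib = context_len * kv_bytes_per_token / 2**30
weights_gib = 8e9 * bytes_per_elem / 2**30    # ~8B parameters in FP16

print(f"KV cache: {kv_cache_gib:.1f} GiB, weights: {weights_gib:.1f} GiB")
# ~16 GiB of KV cache plus ~15 GiB of weights already exceeds a single
# 24 GB RTX 4090, which is why the KV vectors are offloaded to CPU memory.
```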

During generation, the method dynamically retrieves the most relevant KV vectors via vector search, avoiding an exhaustive scan of the full KV cache. A minimal sketch of this retrieval step is shown below.
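
The sketch illustrates the idea for a single query vector of one attention head, with an exact top-k inner-product search standing in for the paper's attention-aware ANNS index; the function name, the NumPy implementation, and the chosen top_k value are assumptions for illustration, not the authors' code.

```python
import numpy as np

def retrieval_attention(q, K, V, top_k=1024):
    """Approximate single-head attention for one query: score only the top_k
    keys returned by a nearest-neighbor search over the cached keys instead of
    attending to the full cache.

    q: (d,) query vector; K, V: (n, d) cached key/value vectors.
    """
    d = q.shape[-1]
    # Stand-in for the ANNS index: brute-force inner-product search.
    # RetrievalAttention replaces this with an attention-aware approximate
    # index built over keys held in CPU memory.
    scores_all = K @ q
    idx = np.argpartition(-scores_all, top_k)[:top_k]

    # Softmax over the retrieved subset only (~1-3% of the cache in the paper).
    scores = scores_all[idx] / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx]

# Usage: 128K cached tokens, head dimension 128.
rng = np.random.default_rng(0)
n, d = 128 * 1024, 128
K = rng.standard_normal((n, d), dtype=np.float32)
V = rng.standard_normal((n, d), dtype=np.float32)
q = rng.standard_normal(d, dtype=np.float32)
out = retrieval_attention(q, K, V)
```

Replacing the brute-force search with an approximate index is what turns the per-token lookup from a full scan into a sub-linear search, and that is exactly where the paper's OOD-aware index design matters.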

Technical Contributions and Execution Strategy

  1. Out-of-Distribution (OOD) Challenge: Query vectors and key vectors in attention follow different distributions, so off-the-shelf ANNS indexes must scan a large portion of the KV vectors (roughly 30%) to find the truly relevant keys. RetrievalAttention mitigates this with an attention-aware vector search algorithm that adapts the index to the query distribution, cutting the scanned fraction to approximately 1–3%.
  2. Efficiency and Memory Use: RetrievalAttention significantly lowers the GPU memory footprint and performs attention efficiently, even for 8B-parameter models handling 128K-token contexts on a single NVIDIA RTX 4090 GPU (24GB). The solution retains consistently important KV vectors in GPU memory and offloads the rest to CPU memory, from which dynamically relevant vectors are retrieved on demand (see the sketch after this list).
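
The CPU/GPU split described in item 2 can be sketched as follows: a small set of always-attended positions stays resident on the GPU, while the remaining vectors sit in CPU memory behind a vector index and are fetched only when retrieved. The specific split policy (initial sink tokens plus a recent window) and all class and method names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class PartitionedKVCache:
    """Illustrative CPU/GPU partition of a single head's KV cache: a small set
    of predictable positions (initial sink tokens plus a recent window, an
    assumed policy) stays GPU-resident, while everything else is offloaded to
    CPU memory behind a vector index and fetched only when retrieved."""

    def __init__(self, K, V, n_sink=4, n_recent=512):
        n = K.shape[0]
        static = np.r_[np.arange(n_sink), np.arange(n - n_recent, n)]
        dynamic = np.setdiff1d(np.arange(n), static)
        self.K_gpu, self.V_gpu = K[static], V[static]    # always GPU-resident
        self.K_cpu, self.V_cpu = K[dynamic], V[dynamic]  # offloaded, behind an index

    def gather(self, q, top_k=1024):
        """Return the KV subset used for this query: all static vectors plus
        the top_k dynamically retrieved ones (brute force here; the paper uses
        an attention-aware approximate index instead)."""
        idx = np.argpartition(-(self.K_cpu @ q), top_k)[:top_k]
        K_sel = np.concatenate([self.K_gpu, self.K_cpu[idx]])
        V_sel = np.concatenate([self.V_gpu, self.V_cpu[idx]])
        return K_sel, V_sel
```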

Numerical Results and Implications

  • Decoding Latency: For models such as Llama-3-8B with a prompt length of 128K tokens, RetrievalAttention achieves up to a 4.9× reduction in decoding latency compared to traditional KNN-based retrieval, generating one token in 0.188 seconds.
  • Accuracy: Across long-context benchmarks such as ∞-Bench and RULER, RetrievalAttention's accuracy shows negligible deviation from full attention computation, which is critical for practical applicability.

Implications and Future Directions

Theoretically, RetrievalAttention sets a precedent for employing dynamic sparse retrieval methods to manage attention computation. Practically, this translates into a substantial reduction in computational overhead and GPU memory usage for LLMs, facilitating the deployment of sophisticated models on commodity hardware, a crucial consideration for scalability and accessibility.

The performance improvements highlighted suggest a promising future for employing ANNS-based strategies in attention mechanisms. Looking ahead, potential developments could focus on:

  • Quantization Methods: Exploring how scalar quantization of KV vectors could further reduce CPU memory usage without sacrificing retrieval accuracy (a minimal sketch follows this list).
  • Dynamic Patterns and Scheduling: Refining static attention patterns such as those used in StreamingLLM, or employing learned approaches to adapt such patterns to the input dynamically.
  • Next-Generation Retrieval Algorithms: Designing new vector indexes better suited to the high-dimensional query and key vectors produced by LLM attention.
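
As a concrete illustration of the first bullet, a minimal int8 scalar-quantization round trip for CPU-resident key vectors might look like the sketch below; the symmetric per-vector scaling scheme is an assumption for illustration, not a method evaluated in the paper.

```python
import numpy as np

def quantize_int8(K):
    """Symmetric per-vector int8 quantization of key vectors: roughly 2x
    smaller than FP16 storage, at the cost of a small reconstruction error
    that the retrieval step must tolerate."""
    scale = np.abs(K).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)   # guard against all-zero rows
    K_q = np.round(K / scale).astype(np.int8)
    return K_q, scale

def dequantize_int8(K_q, scale):
    return K_q.astype(np.float32) * scale

# Round-trip check on random keys.
K = np.random.default_rng(0).standard_normal((1000, 128)).astype(np.float32)
K_q, scale = quantize_int8(K)
max_err = np.abs(dequantize_int8(K_q, scale) - K).max()
```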

In conclusion, this paper provides a robust solution to a long-standing bottleneck in LLM inference, addressing both theoretical and practical components with thorough experimental validation. The RetrievalAttention method illustrates a significant step forward in the efficient deployment of large-scale transformer models, paving the way for their utilization in real-world applications with extensive context requirements.

Authors (14)
  1. Di Liu (107 papers)
  2. Meng Chen (98 papers)
  3. Baotong Lu (9 papers)
  4. Huiqiang Jiang (32 papers)
  5. Zhenhua Han (18 papers)
  6. Qianxi Zhang (6 papers)
  7. Qi Chen (194 papers)
  8. Chengruidong Zhang (11 papers)
  9. Bailu Ding (8 papers)
  10. Kai Zhang (542 papers)
  11. Chen Chen (752 papers)
  12. Fan Yang (877 papers)
  13. Yuqing Yang (83 papers)
  14. Lili Qiu (50 papers)
Citations (6)