
RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference

Published 5 May 2025 in cs.LG (arXiv:2505.02922v2)

Abstract: The growing context lengths of LLMs pose significant challenges for efficient inference, primarily due to GPU memory and bandwidth constraints. We present RetroInfer, a novel system that reconceptualizes the key-value (KV) cache as a vector storage system which exploits the inherent attention sparsity to accelerate long-context LLM inference. At its core is the wave index, an Attention-aWare VEctor index that enables efficient and accurate retrieval of critical tokens through techniques such as tripartite attention approximation, accuracy-bounded attention estimation, and segmented clustering. Complementing this is the wave buffer, which coordinates KV cache placement and overlaps computation and data transfer across GPU and CPU to sustain high throughput. Unlike prior sparsity-based methods that struggle with token selection and hardware coordination, RetroInfer delivers robust performance without compromising model accuracy. Experiments on long-context benchmarks show up to 4.5X speedup over full attention within GPU memory limits and up to 10.5X over sparse attention baselines when KV cache is extended to CPU memory, all while preserving full-attention-level accuracy.

Summary

  • The paper presents RetroInfer, a system that redefines the key-value cache as vector storage to exploit attention sparsity for scalable LLM inference.
  • It introduces innovative techniques such as tripartite attention approximation and segmented clustering to optimize GPU-CPU memory management and reduce latency.
  • RetroInfer achieves up to 10.5× speedup and comparable accuracy to full attention, demonstrating strong scalability for long-context tasks.

The paper "RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference" (2505.02922) addresses the significant challenges in achieving efficient inference for LLMs with increasingly long context windows, primarily due to GPU memory and bandwidth limitations. The authors propose RetroInfer, a system that reconceptualizes the Key-Value (KV) cache as a vector storage system to exploit attention sparsity and accelerate inference without compromising accuracy.

At the heart of RetroInfer are two main components: the wave index and the wave buffer.

The wave index is an "Attention-aWare VEctor index" designed for accurate and low-latency retrieval of important tokens. It employs several novel techniques:

  1. Tripartite Attention Approximation: This method logically divides tokens into three zones:
    • Steady Zone: Tokens at the beginning and end of the context, which are consistently important and always included.
    • Retrieval Zone: Important tokens identified and retrieved using a cluster-based vector index (centroids represent clusters of similar key vectors). These tokens are used for precise attention computation.
    • Estimation Zone: Tokens in non-retrieved clusters whose contribution to attention is approximated using their cluster centroids and summed value vectors, ensuring coverage for varying sparsity ratios with guaranteed accuracy bounds.
  2. Accuracy-Bounded Attention Estimation: For the estimation zone, attention weights are estimated using cluster centroids. The sum of value vectors within each cluster is pre-calculated and stored, allowing for efficient approximation of their collective attention contribution. This is based on Jensen's inequality, which provides a lower bound for the sum of exponential inner product values within a cluster.
  3. Segmented Clustering: To reduce the overhead of index construction (performed during prefilling), the input sequence is divided into segments, and k-means clustering is performed independently within each segment. This leverages the observation of coarse-grained spatial locality of key vectors.
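The accuracy-bounded estimation step above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's kernel; the function name, array shapes, and calling convention are assumptions made here for clarity:

```python
import numpy as np

def estimate_zone(q, centroids, cluster_sizes, value_sums, scale):
    """Approximate the attention contribution of non-retrieved clusters.

    Every token in cluster i shares the estimated weight exp(q . c_i); by
    Jensen's inequality, n_i * exp(q . c_i) lower-bounds the true sum of
    exp(q . k_j) over the cluster when c_i is the mean of its key vectors.
    """
    w = np.exp(centroids @ q * scale)   # (C,) one estimated weight per cluster
    out_unnorm = w @ value_sums         # sum_i w_i * (precomputed sum of v_j in i)
    denom = float(w @ cluster_sizes)    # sum_i n_i * w_i (softmax normalizer share)
    return out_unnorm, denom
```

The pair `(out_unnorm, denom)` would then be merged with the exact partial results from the steady and retrieval zones; when every key coincides with its centroid, the estimate is exact.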

The wave buffer manages the physical placement of KV vectors and coordinates their movement between GPU and CPU memory, orchestrating computation across these heterogeneous hardware components. Its key features include:

  1. Buffer Organization: It includes a GPU-side block cache for frequently accessed KV vectors from the retrieval zone, a buffer for the steady zone, and an execution buffer that contiguously arranges KV vectors needed for the current attention computation. The control plane (buffer manager) resides on the CPU.
  2. KV Block Cache Management:
    • Temporal Locality: Exploits the observation that important tokens often exhibit temporal locality across neighboring decoding steps.
    • Decoupled Access and Update: Cache access (determining if a cluster is in GPU cache or CPU memory) is synchronous and fast. Cache updates (replacement policy like LRU, moving data into the cache) are handled asynchronously by the CPU to minimize overhead on the critical path. Clusters are the logical unit for admission/replacement, while fixed-sized blocks are the physical unit.
  3. Assembling the Execution Buffer: Data for the execution buffer is efficiently copied in parallel from three potential sources: the steady zone (GPU-GPU), the KV block cache (GPU-GPU), and CPU memory for cache misses (CPU-GPU).
  4. Constrained Prefilling Latency: KV vectors are offloaded to CPU asynchronously during prefilling, while the GPU performs segmented clustering. The wave buffer's data structures are then built on the CPU in parallel, minimizing the impact on time-to-first-token.
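The decoupled access/update pattern from point 2 can be sketched as a small cache whose lookups never mutate state, with admissions and LRU evictions deferred to a background step. Class and method names here are illustrative, not the paper's API, and a Python `object()` stands in for a GPU block handle:

```python
from collections import OrderedDict
from queue import Queue

class BlockCache:
    """Sketch of a decoupled-access/update cluster cache (hypothetical API)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()   # cluster_id -> GPU block handle
        self.pending = Queue()          # deferred admissions

    def access(self, cluster_id):
        """Fast synchronous path: report hit/miss without mutating the cache."""
        if cluster_id in self.resident:
            return True                 # serve from the GPU block cache
        self.pending.put(cluster_id)    # schedule asynchronous admission
        return False                    # caller copies from CPU memory instead

    def apply_updates(self):
        """Slow path, run off the critical path (e.g. by a CPU worker thread)."""
        while not self.pending.empty():
            cid = self.pending.get()
            if cid in self.resident:
                self.resident.move_to_end(cid)      # refresh LRU position
                continue
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)   # evict least recently used
            self.resident[cid] = object()           # stand-in for a GPU block
```

Because `access` only reads and enqueues, the decode step never waits on eviction or data movement, mirroring the paper's separation of synchronous cache access from asynchronous CPU-driven updates.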

Implementation Details:

  • Segmented clustering uses a custom Triton kernel.
  • Specialized CUDA kernels handle complex, non-contiguous memory copies for the execution buffer and block cache, with dynamic thread allocation.
  • FlashAttention kernels are modified to support weighted attention for the estimation zone and efficiently merge outputs from the three zones.
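Merging outputs from the three zones follows the standard partial-softmax combination that FlashAttention uses for blocks; a NumPy sketch under that assumption (the real system performs this inside fused CUDA kernels, and these function names are invented for the sketch):

```python
import numpy as np

def zone_partial(q, K, V, scale):
    """Unnormalized attention over one zone, with its local max for stability."""
    logits = K @ q * scale
    m = float(logits.max())
    w = np.exp(logits - m)
    return w @ V, float(w.sum()), m     # (partial output, normalizer, local max)

def merge_zones(outs, denoms, maxes):
    """Combine per-zone partials into the exact softmax-attention output."""
    m = max(maxes)
    scales = [np.exp(mi - m) for mi in maxes]   # rescale each zone to the global max
    num = sum(s * o for s, o in zip(scales, outs))
    den = sum(s * d for s, d in zip(scales, denoms))
    return num / den
```

Because each zone carries its own normalizer and max, the merged result equals full softmax attention over the union of the zones' tokens, regardless of how the tokens were partitioned.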

Evaluation:

RetroInfer was evaluated on models like Llama 3.1-8B, Qwen2.5-7B, and Llama3-8B-1048K using benchmarks such as RULER, Needle-in-a-haystack (NIAH), and LongBench.

  • Accuracy: RetroInfer achieved accuracy comparable to full attention, outperforming other sparse attention baselines (Quest, MagicPIG, InfiniGen) by 3.34%–25.37% on RULER tasks under the same retrieval budget (1.8%). It successfully handled NIAH tasks up to 1 million tokens.
  • Throughput:
    • Achieved up to 4.5× speedup over full attention when context fits in GPU memory.
    • Achieved up to 10.5× speedup over other sparse-attention systems when the KV cache is extended to CPU memory.
    • Showed strong scalability with increasing batch sizes and context lengths.
  • Prefilling Latency: Added negligible overhead (e.g., 2-7% for 120K-480K contexts) compared to full attention, which becomes insignificant for longer contexts.
  • Micro-Analysis:
    • GPU caching and asynchronous cache updates in the wave buffer significantly boosted throughput.
    • A 5% GPU cache size (relative to total KV vectors) provided a good balance between hit ratio (0.79-0.94) and memory consumption.
    • Attention estimation improved task accuracy by up to 20%.
    • Segmented clustering (e.g., 8K segment size) significantly reduced index build time (by 80%) with minimal impact on retrieval recall (<1% drop).
    • Latency breakdown showed effective overlap of CPU and GPU operations.

The paper concludes that by treating the KV cache as a vector storage system, RetroInfer effectively addresses the challenges of long-context LLM inference, delivering high throughput and maintaining full-attention-level accuracy. The code is open-sourced.