- The paper presents a hybrid sparse attention framework that unifies static streaming heads and dynamic KV cache pruning to accelerate long-sequence LLM serving.
- It employs block-sparse CUDA kernels and hierarchical paging to achieve up to 2.9x faster prefilling and 1.3x-2.1x faster decoding compared to state-of-the-art methods.
- Experimental results show that LServe preserves long-context accuracy while significantly reducing computational and memory requirements.
The paper "LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention" (2502.14866) presents a system designed to mitigate the computational and memory bottlenecks associated with serving LLMs operating on long input sequences. The primary challenges addressed are the quadratic complexity (O(N2)) of self-attention during the prefilling stage, where the input prompt of length N is processed, and the substantial memory footprint and bandwidth demands of the Key-Value (KV) cache during the autoregressive decoding stage, which scales linearly with sequence length S per token but involves accessing a potentially massive KV cache.
LServe System Architecture and Core Concept
LServe introduces a serving system optimized for long-context LLMs by leveraging a novel hybrid sparse attention mechanism. This mechanism operates within a unified framework that applies structured, hardware-friendly sparsity patterns to the attention computation, aiming to reduce the computational load in prefilling and the memory bandwidth pressure in decoding. The core idea is to skip computations on blocks of tokens deemed less important, thereby accelerating inference without significantly compromising the model's ability to utilize long-range context.
The system architecture integrates optimizations for both the prefilling and decoding phases. It builds upon QServe, inheriting support for quantization (e.g., W4A8KV4), and demonstrates that sparsity optimizations are largely orthogonal and complementary to quantization techniques. LServe processes attention computations in fixed-size blocks or tiles (T_Q × T_K), enabling efficient skipping of entire blocks based on pre-defined (static) or input-dependent (dynamic) sparsity criteria.
Hybrid Sparse Attention Mechanisms
A key innovation in LServe is the combination of static and dynamic sparsity patterns within a single, unified attention framework. This hybrid approach allows for multiplicative performance gains by addressing different aspects of the attention bottleneck.
- Static Sparsity (Streaming Heads): LServe adapts the concept of streaming attention (similar to DuoAttention) by converting a significant fraction (e.g., 50%) of the attention heads into "streaming heads." These heads use a fixed, Λ-shaped attention mask that restricts attention to a small set of initial "sink" tokens plus the most recent tokens, making their computation cost nearly constant with respect to the total sequence length. This pattern is applied during both prefilling and decoding, and specialized fused GPU kernels compute the standard dense heads and the statically sparse streaming heads concurrently (the first sketch after this list illustrates the mask).
- Dynamic Sparsity (Query-Centric KV Page Selection): For the remaining dense attention heads, particularly during decoding, LServe employs a dynamic KV cache pruning strategy. It is based on the observation that, even for very long sequences, preserving the model's long-context capabilities often requires accessing only a relatively small, constant number of KV cache pages (e.g., corresponding to ~4096 tokens), irrespective of the total context length S. LServe implements a query-centric page selection policy that dynamically identifies and retains the KV cache pages most important to the current query token(s) (the second sketch after this list illustrates this selection).
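As a rough illustration of the streaming-head pattern, the sketch below builds the Λ-shaped mask (a few attention-sink tokens plus a recent-token window) and applies it in a reference attention routine; the `sink` and `window` sizes are placeholder values rather than the paper's tuned settings.

```python
import numpy as np

def streaming_mask(n: int, sink: int = 4, window: int = 1024) -> np.ndarray:
    """Lambda-shaped causal mask for a streaming head: every query attends to
    the first `sink` tokens plus the most recent `window` tokens, so its cost
    is O(sink + window) regardless of the total sequence length n."""
    q_idx = np.arange(n)[:, None]
    k_idx = np.arange(n)[None, :]
    causal = k_idx <= q_idx
    keep = (k_idx < sink) | (q_idx - k_idx < window)
    return causal & keep

def masked_attention(q, k, v, mask):
    """Reference (dense) attention with the mask applied; a real streaming
    kernel skips the masked blocks instead of computing and discarding them."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```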
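The dynamic side can be sketched in a similar spirit: per-page min/max key statistics give an optimistic bound on the query-key dot product, and only the top-scoring pages within a fixed token budget are kept. This mirrors the query-centric selection described above, but the code is a simplified single-head, single-query approximation with assumed page size and budget.

```python
import numpy as np

def page_scores(query: np.ndarray, keys: np.ndarray, page_size: int) -> np.ndarray:
    """Query-centric upper bound on the attention score each KV page could
    contribute: per channel, the best case is max(q * k_max, q * k_min),
    using the page's per-channel key min/max statistics."""
    n_pages = keys.shape[0] // page_size
    pages = keys[: n_pages * page_size].reshape(n_pages, page_size, -1)
    k_max, k_min = pages.max(axis=1), pages.min(axis=1)
    return np.maximum(query * k_max, query * k_min).sum(axis=-1)

def select_pages(query, keys, page_size=32, budget_tokens=4096):
    """Keep only the pages within a fixed token budget, ranked by score."""
    scores = page_scores(query, keys, page_size)
    n_keep = min(len(scores), budget_tokens // page_size)
    top = np.argpartition(scores, -n_keep)[-n_keep:]
    return np.sort(top)
```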
Implementation Details and Optimizations
Several implementation techniques are employed to realize the efficiency gains of LServe:
- Unified Block Sparse Attention Kernel: The attention computation is implemented with CUDA kernels designed around block sparsity. These kernels use iterators to loop over only the token blocks kept by the sparsity patterns (both static and dynamic), minimizing control-flow overhead and maximizing hardware utilization (a simplified version of this block-skipping loop is sketched after this list).
- Hierarchical Paging for Dynamic Selection: To resolve the "page size dilemma", in which hardware efficiency favors larger KV cache pages (e.g., 256 tokens) while finer-grained selection improves accuracy, LServe uses a hierarchical approach. It defines smaller logical pages (e.g., 32 tokens) for more accurate importance estimation (using per-channel min/max statistics of the keys) and aggregates these scores (e.g., taking the max) to score the larger physical pages used for memory layout and access. This balances selection granularity with hardware efficiency (see the hierarchical scoring sketch after this list).
- Reusable Page Selection: Computing the importance scores for dynamic page selection can introduce overhead, since the scoring itself potentially scales linearly with context length. LServe exploits the temporal locality observed during decoding (queries for consecutive tokens tend to attend to similar parts of the context) by reusing the selected set of KV pages across a small window of decoding steps (e.g., 4 steps), which amortizes the selection cost significantly (see the selection-reuse sketch after this list).
- Integration with Quantization: By building on the QServe framework, LServe naturally incorporates KV cache quantization (e.g., KV4), further reducing the memory footprint and bandwidth requirements, demonstrating the synergy between sparsity and quantization.
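To make the block-skipping idea concrete, here is a simplified NumPy analogue of what the block-sparse kernel does for one query tile; the real CUDA implementation fuses this with a streaming (FlashAttention-style) softmax, which is omitted here for clarity.

```python
import numpy as np

def block_sparse_attention(q_tile, keys, values, active_blocks, block_size=64):
    """Attend one query tile against only the key/value blocks listed in
    `active_blocks` (the analogue of the kernel's block iterator). Skipped
    blocks cost nothing, so work scales with the number of active blocks
    rather than with the full sequence length."""
    scores, vals = [], []
    for b in active_blocks:
        k_blk = keys[b * block_size:(b + 1) * block_size]
        v_blk = values[b * block_size:(b + 1) * block_size]
        scores.append(q_tile @ k_blk.T / np.sqrt(q_tile.shape[-1]))
        vals.append(v_blk)
    scores = np.concatenate(scores, axis=-1)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ np.concatenate(vals, axis=0)
```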
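The hierarchical-paging step can be sketched as a simple reduction, assuming 32-token logical pages nested inside 256-token physical pages (the ratios follow the paper's examples; the function name is invented): logical-page scores, such as those produced by `page_scores` in the earlier sketch, are reduced with a max so a physical page is kept whenever any of its logical pages looks important.

```python
import numpy as np

def physical_page_scores(logical_scores: np.ndarray,
                         logical_per_physical: int = 8) -> np.ndarray:
    """Reduce fine-grained logical-page scores (e.g. 32-token pages) to one
    score per physical page (e.g. 256 tokens = 8 logical pages) via max, so
    a physical page survives whenever any part of it matters to the query."""
    n_phys = len(logical_scores) // logical_per_physical
    grouped = logical_scores[: n_phys * logical_per_physical]
    return grouped.reshape(n_phys, logical_per_physical).max(axis=1)
```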
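Finally, the reuse of page selections across decoding steps amounts to a small caching wrapper around any selection routine; the 4-step interval follows the example in the text, and the class below is a hypothetical illustration rather than LServe's actual interface.

```python
class ReusablePageSelector:
    """Reuse one page selection across a small window of decoding steps,
    exploiting the temporal locality of consecutive queries. The selection
    routine itself (e.g. select_pages above) is paid only once per window."""

    def __init__(self, select_fn, reuse_interval: int = 4):
        self.select_fn = select_fn
        self.reuse_interval = reuse_interval
        self._cached = None
        self._step = 0

    def pages_for(self, query, keys):
        if self._step % self.reuse_interval == 0:
            self._cached = self.select_fn(query, keys)  # fresh selection
        self._step += 1
        return self._cached  # reused for the remaining steps in the window
```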
Experimental Evaluation
LServe was evaluated against strong baselines, including vLLM, QServe, and DuoAttention, using models like Llama-3-8B, Llama-2-7B, and Minitron-4B across various context lengths, up to 512k tokens.
- Performance: The results indicate significant speedups. LServe achieved up to 2.9x faster prefilling and 1.3x-2.1x faster decoding on average compared to vLLM. The combination of static (streaming heads) and dynamic (page selection) sparsity provided multiplicative benefits over using either technique alone.
- Accuracy: Importantly, these performance improvements were achieved while maintaining the long-context accuracy of the original dense models. Accuracy was measured on benchmarks such as LongBench (covering various long-context tasks), Needle-in-a-Haystack (NIAH) retrieval tasks, and the RULER benchmark. LServe achieved accuracy comparable to dense-attention baselines on these tasks, suggesting that the applied sparsity patterns preserve the necessary long-range dependencies.
- Constant KV Cache Claim: The paper provides empirical evidence supporting the claim that a constant number of KV cache pages (around 4096 tokens' worth) is sufficient for the dynamic selection mechanism to maintain high accuracy on long-context tasks, regardless of the actual context length.
Conclusion
LServe introduces a practical and effective approach to accelerate long-sequence LLM serving by unifying static and dynamic block sparse attention within a single system. By converting a portion of heads to efficient streaming heads and dynamically selecting a constant-sized subset of the KV cache for the remaining heads using a hierarchical paging strategy, LServe achieves substantial speedups in both prefilling (up to 2.9x) and decoding (up to 2.1x) compared to state-of-the-art systems. These gains are realized while preserving model accuracy on demanding long-context benchmarks, making LServe a promising solution for deploying LLMs with extended context windows efficiently.