DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads (2410.10819v1)

Published 14 Oct 2024 in cs.CL

Abstract: Deploying long-context LLMs is essential but poses significant computational and memory challenges. Caching all Key and Value (KV) states across all attention heads consumes substantial memory. Existing KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements. In this paper, we identify that only a fraction of attention heads, a.k.a, Retrieval Heads, are critical for processing long contexts and require full attention across all tokens. In contrast, all other heads, which primarily focus on recent tokens and attention sinks--referred to as Streaming Heads--do not require full attention. Based on this insight, we introduce DuoAttention, a framework that only applies a full KV cache to retrieval heads while using a light-weight, constant-length KV cache for streaming heads, which reduces both LLM's decoding and pre-filling memory and latency without compromising its long-context abilities. DuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately. Our method significantly reduces long-context inference memory by up to 2.55x for MHA and 1.67x for GQA models while speeding up decoding by up to 2.18x and 1.50x and accelerating pre-filling by up to 1.73x and 1.63x for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention. Notably, combined with quantization, DuoAttention enables Llama-3-8B decoding with 3.3 million context length on a single A100 GPU. Code is provided in https://github.com/mit-han-lab/duo-attention.

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

The paper presents DuoAttention, a novel framework designed to make LLM inference efficient in long-context applications. To address the computational and memory challenges of deploying long-context LLMs, DuoAttention separates attention heads into two categories: Retrieval Heads and Streaming Heads.

Key Insights and Methodology

The primary observation driving this research is that only a subset of attention heads, termed Retrieval Heads, is essential for processing long contexts and requires full attention across all tokens. The remaining heads, identified as Streaming Heads, primarily attend to recent tokens and attention sinks and do not need a comprehensive KV cache. This distinction allows the cache for most heads to be truncated to a small, constant size, as sketched below.
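To make the two cache policies concrete, the following sketch (not the authors' released code; the sink and window sizes are illustrative assumptions) shows how a streaming head's KV cache can stay at a constant length while a retrieval head's cache grows with the context.

```python
# Minimal sketch of per-head KV cache policies (illustrative, not the
# released DuoAttention implementation). Sink/window sizes are assumptions.
import torch

SINK_TOKENS = 4        # initial "attention sink" tokens kept by streaming heads
RECENT_WINDOW = 256    # most recent tokens kept by streaming heads

def update_kv_cache(cache_k, cache_v, new_k, new_v, is_retrieval_head):
    """Append new KV states for one head; evict middle tokens for streaming heads.

    cache_k, cache_v: [cached_len, head_dim]; new_k, new_v: [new_len, head_dim].
    """
    cache_k = torch.cat([cache_k, new_k], dim=0)
    cache_v = torch.cat([cache_v, new_v], dim=0)

    if is_retrieval_head:
        # Retrieval heads keep the full history: the cache grows with context length.
        return cache_k, cache_v

    # Streaming heads keep only attention sinks plus a recent window, so their
    # cache stays constant-length no matter how long the context becomes.
    if cache_k.shape[0] > SINK_TOKENS + RECENT_WINDOW:
        cache_k = torch.cat([cache_k[:SINK_TOKENS], cache_k[-RECENT_WINDOW:]], dim=0)
        cache_v = torch.cat([cache_v[:SINK_TOKENS], cache_v[-RECENT_WINDOW:]], dim=0)
    return cache_k, cache_v
```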

DuoAttention employs a lightweight, optimization-based algorithm that uses synthetic data to identify retrieval heads accurately. The algorithm optimizes a gate value for each head that determines whether the head requires full attention. This differentiation shrinks the KV cache, and with it memory consumption, without compromising the LLM's ability to manage extended contexts.
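A simplified sketch of this gate-optimization step follows. It assumes, per the description above, one trainable gate per head that blends full-attention output with streaming-attention output, trained with a reconstruction loss plus a sparsity penalty; the loss weights, the threshold, and the synthetic per-head tensors standing in for real model outputs are all illustrative rather than the paper's exact recipe.

```python
# Simplified sketch of retrieval-head identification via per-head gates.
# The tensors below are synthetic stand-ins for real attention outputs, and
# the loss weights / threshold are illustrative, not the paper's settings.
import torch

num_heads, batch, seq, dim = 32, 2, 128, 64
gates = torch.nn.Parameter(torch.full((num_heads,), 0.5))  # one gate per head
optimizer = torch.optim.Adam([gates], lr=1e-2)
l1_weight = 0.05

# Pretend the first 4 heads behave like retrieval heads: streaming attention
# approximates them poorly, so the reconstruction loss keeps their gates high.
mismatch = torch.full((num_heads,), 0.01)
mismatch[:4] = 1.0
mismatch = mismatch.view(1, -1, 1, 1)

for step in range(200):
    full_out = torch.randn(batch, num_heads, seq, dim)        # full-attention output
    streaming_out = full_out + mismatch * torch.randn_like(full_out)

    # Blend the two outputs per head with the current (clamped) gate values.
    g = gates.clamp(0.0, 1.0).view(1, -1, 1, 1)
    mixed = g * full_out + (1.0 - g) * streaming_out

    # Match the full-attention output while pushing gates toward zero, i.e.
    # toward treating a head as streaming whenever that costs nothing.
    loss = (mixed - full_out).pow(2).mean() + l1_weight * gates.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Heads whose learned gates stay above a threshold are kept as retrieval heads.
retrieval_heads = gates.detach().clamp(0.0, 1.0) > 0.5
print(retrieval_heads.nonzero().flatten())   # roughly the first 4 heads here
```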

Numerical Results

Empirical evaluations demonstrate that DuoAttention achieves substantial reductions in memory usage and decoding latency:

  • Memory reductions of up to 2.55× for Multi-Head Attention (MHA) models and 1.67× for Grouped-Query Attention (GQA) models.
  • Decoding speed improvements of up to 2.18× for MHA and 1.50× for GQA.
  • Pre-filling acceleration of up to 1.73× and 1.63× for MHA and GQA models, respectively.

These results confirm that DuoAttention effectively balances efficiency and performance, maintaining minimal accuracy loss compared to models utilizing full attention.

Implications and Future Directions

DuoAttention has significant implications for practical applications that rely on processing extensive sequences, such as document summarization and complex dialogue systems. By reducing the computational load and memory requirements, this research facilitates the deployment of LLMs in resource-constrained environments.

The framework is fully compatible with existing optimization techniques such as quantization, further enhancing its applicability. For instance, when combined with quantization, DuoAttention enables Llama-3-8B to decode with a context length of 3.3 million tokens on a single A100 GPU.
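As a rough sanity check on that figure, the back-of-envelope sketch below estimates the KV cache footprint under a DuoAttention-style head split with quantized KV states. The Llama-3-8B shape (32 layers, 8 grouped-query KV heads, head dimension 128) is public; the 25% retrieval-head fraction, 1024-token streaming cache, and 4-bit KV quantization are illustrative assumptions rather than the paper's reported configuration.

```python
# Back-of-envelope KV cache estimate; the head split, streaming cache length,
# and 4-bit quantization are assumptions for illustration, not paper settings.
layers = 32            # Llama-3-8B transformer layers
kv_heads = 8           # grouped-query KV heads per layer
head_dim = 128
bytes_per_elem = 0.5   # assume 4-bit quantized KV states (FP16 would be 2.0)

context_len = 3_300_000
retrieval_fraction = 0.25     # assumed share of heads keeping a full cache
streaming_cache_len = 1_024   # assumed constant sink + recent window

def head_kv_bytes(tokens: int) -> float:
    """Bytes for the K and V states of a single head over `tokens` positions."""
    return 2 * tokens * head_dim * bytes_per_elem

total_heads = layers * kv_heads
full_heads = total_heads * retrieval_fraction
stream_heads = total_heads * (1 - retrieval_fraction)

total_bytes = (full_heads * head_kv_bytes(context_len)
               + stream_heads * head_kv_bytes(streaming_cache_len))
print(f"approx KV cache: {total_bytes / 1e9:.1f} GB")  # ~27 GB under these assumptions
```

Under these assumptions the cache fits comfortably alongside the roughly 16 GB of FP16 weights on an 80 GB A100, broadly consistent with the reported result, whereas a full FP16 KV cache at the same context length would require hundreds of gigabytes.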

Future research could explore the application of DuoAttention in different architectures and its integration with emerging LLM enhancements. By continuously advancing the boundaries of model efficiency, the paper paves the way for more scalable and accessible large-scale AI systems.

Conclusion

DuoAttention represents a focused advancement in the domain of LLMs, offering a strategic solution to the complexities of long-context inference. By judiciously leveraging the distinct roles of retrieval and streaming heads, it provides a pathway to significantly improved efficiency and practicality in real-world AI applications. As the field evolves, the principles outlined in this work will likely inform subsequent innovations and optimizations, underscoring the ongoing efforts to refine and enhance large-scale AI technologies.

Authors (8)
  1. Guangxuan Xiao (16 papers)
  2. Jiaming Tang (8 papers)
  3. Jingwei Zuo (12 papers)
  4. Junxian Guo (6 papers)
  5. Shang Yang (12 papers)
  6. Haotian Tang (28 papers)
  7. Yao Fu (83 papers)
  8. Song Han (155 papers)
Citations (6)