Squeezed Attention: Accelerating Long Context Length LLM Inference (2411.09688v2)

Published 14 Nov 2024 in cs.CL

Abstract: Emerging LLM applications require long input prompts to perform complex downstream tasks like document analysis and code generation. For these long context length applications, the length of the input prompt poses a significant challenge in terms of inference efficiency since the inference costs increase linearly with sequence length. However, for many of these applications, much of the context in the prompt is fixed across different user inputs, thereby providing the opportunity to perform offline optimizations to process user inputs quickly, as they are received. In this work, we propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed. We first leverage K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value. During inference, we compare query tokens from the user input with the centroids to predict which of the keys from the fixed context are semantically relevant and need to be loaded during inference. We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs. We also extend our method to use a hierarchical centroid lookup to identify important keys, which can reduce the complexity of attention from linear to logarithmic with respect to the context length. We implement optimized Triton kernels for centroid comparison and sparse FlashAttention with important keys, achieving more than 4x speedups during both the prefill and generation phases for long-context inference. Furthermore, we have extensively evaluated our method on various long-context benchmarks including LongBench, where it achieves a 3x reduction in KV cache budget without accuracy loss and up to an 8x reduction with <0.5 point accuracy gap for various models.

Authors (8)
  1. Coleman Hooper (16 papers)
  2. Sehoon Kim (30 papers)
  3. Hiva Mohammadzadeh (3 papers)
  4. Monishwaran Maheswaran (2 papers)
  5. June Paik (1 paper)
  6. Michael W. Mahoney (233 papers)
  7. Kurt Keutzer (200 papers)
  8. Amir Gholami (60 papers)
Citations (2)

Summary

Accelerating Long Context Length LLM Inference

The advancement of LLMs has facilitated their application in a variety of long-context tasks, such as document analysis and code generation. However, inference efficiency degrades as input prompts grow longer, since attention and KV-cache costs scale with sequence length. In many of these applications, a large portion of the prompt is fixed across different user queries, which creates an opportunity for offline optimization.

This paper introduces a novel approach for accelerating LLM inference in long-context applications where a substantial portion of the prompt is fixed. The authors cluster the fixed-context keys offline with K-means based on semantic similarity and represent each cluster by a single centroid. At inference time, query tokens are compared against these centroids to predict which fixed-context keys are relevant, and exact attention is computed over only those keys, substantially reducing the bandwidth and compute required by the attention mechanism.
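As a rough illustration of the offline step, the fixed-context keys for a given attention head can be clustered with off-the-shelf K-means, storing the centroids and each cluster's member indices alongside the KV cache. The sketch below is a minimal, single-head version using scikit-learn; the function and variable names are illustrative and are not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_centroid_index(fixed_keys: np.ndarray, n_clusters: int = 256, seed: int = 0):
    """Offline step (sketch): cluster one head's fixed-context keys.

    fixed_keys: [context_len, head_dim] key vectors for a single attention head.
    Returns the cluster centroids and, for each cluster, the indices of the
    keys assigned to it. All names here are illustrative, not the paper's code.
    """
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(fixed_keys)        # cluster id for each key
    centroids = kmeans.cluster_centers_            # [n_clusters, head_dim]
    members = [np.flatnonzero(labels == c) for c in range(n_clusters)]
    return centroids, members
```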

Methodology

The proposed method operates in two primary phases:

  1. Offline Clustering: The keys of the fixed context are clustered offline into representative centroids using K-means. Because semantically similar keys end up in the same cluster, only the centroids, rather than every individual key, need to be compared against incoming queries.
  2. Online Inference: During inference, each query token is compared against the centroids to predict which fixed-context keys are relevant. Only those keys are loaded and used for exact attention, instead of processing the entire fixed context. The authors further introduce a hierarchical centroid lookup that reduces the complexity of this step from linear to logarithmic with respect to the fixed context length (a simplified sketch follows this list).
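For the online step, a correspondingly minimal sketch scores the query against the centroids, gathers keys and values only from the top-scoring clusters, and runs exact softmax attention over that reduced set. This is a flat, single-head NumPy illustration with hypothetical names and a simple top-k cluster budget; the paper's actual implementation differs, e.g. via the hierarchical centroid lookup and optimized Triton kernels described above.

```python
import numpy as np

def squeezed_attention_step(query, centroids, members, fixed_keys, fixed_values, top_c=8):
    """Online step (sketch): pick important keys via centroid lookup, then
    compute exact attention over only those keys (single head, no batching).

    query:        [head_dim]                 current query vector
    centroids:    [n_clusters, head_dim]     from the offline step
    members:      list of index arrays, keys belonging to each cluster
    fixed_keys:   [context_len, head_dim]
    fixed_values: [context_len, head_dim]
    top_c:        number of clusters to keep (illustrative budget knob)
    """
    d = query.shape[-1]

    # 1) Score each centroid against the query as a proxy for key importance.
    centroid_scores = centroids @ query / np.sqrt(d)     # [n_clusters]
    keep = np.argsort(centroid_scores)[-top_c:]          # highest-scoring clusters

    # 2) Gather keys/values only from the selected clusters.
    idx = np.concatenate([members[c] for c in keep])
    k, v = fixed_keys[idx], fixed_values[idx]

    # 3) Exact softmax attention over the reduced key set.
    logits = k @ query / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ v                                   # [head_dim]
```

In practice the selection would be batched across heads and query tokens, the flat scan over all centroids would be replaced by the coarse-to-fine hierarchical lookup that yields the logarithmic scaling noted in step 2, and the exact attention over the selected keys roughly corresponds to the paper's sparse FlashAttention kernel.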

Numerical Results

The approach delivers substantial acceleration in LLM inference while maintaining accuracy. Notable results include over 4x speedups during both the prefill and generation phases of long-context inference. The method also achieves a 3.1x reduction in KV cache budget on benchmarks such as LongBench without accuracy loss; for applications that can tolerate minor degradation, the KV cache budget can be reduced by up to 8x with a less than 0.5 point accuracy gap across various models, including LLaMA-2 and LongChat.

Implications and Future Directions

This work presents promising implications for the application of LLMs in scenarios requiring long-context analysis by offering a practical solution to the computational challenges therein. The optimization of the attention mechanism for such models is crucial as LLMs continue to expand their context capabilities.

Theoretically, this research highlights the importance of semantic clustering in attention mechanisms, paving the way for further exploration into dynamic context retrieval strategies. Moreover, a prospective development could involve automating the configuration of clustering parameters based on desired accuracy levels and context length, enhancing the adaptability of this method across diverse applications.

In summary, the proposed method effectively mitigates computational and memory overhead in LLMs with long contexts through semantic-based clustering and centroid-based lookup. This work not only provides substantial efficiency improvements but also maintains, if not enhances, the performance of LLMs in long-context application domains. Future advancements may see the integration of more sophisticated clustering techniques and real-time adaptability to further bolster this methodology's impact on LLM efficiency.
