HashAttention: Semantic Sparsity for Faster Inference (2412.14468v2)

Published 19 Dec 2024 in cs.LG and cs.AI

Abstract: Leveraging long contexts is crucial for advanced AI systems, but attention computation poses a scalability challenge. While scaled dot-product attention (SDPA) exhibits token sparsity, i.e. only a few pivotal tokens significantly contribute to output, exploiting this sparsity remains challenging. Existing methods either suffer from quality degradation or require substantial additional resources. We show that identifying pivotal tokens is a Maximum Inner Product Search (MIPS) problem. However, existing MIPS solutions are not well-suited for SDPA, as they are not GPU-friendly and often underperform due to the separated query and key distributions. This paper introduces HashAttention, framing pivotal token identification as a recommendation problem. Given a query, HashAttention encodes keys and queries in Hamming space, capturing the required semantic similarity, using learned mapping functions. HashAttention efficiently identifies pivotal tokens for a given query using bitwise operations and computes attention using only these tokens, improving the overall attention efficiency. Trained on generic data, HashAttention reduces tokens used by up to $16\times$ with minimal quality loss, requiring only 32 bits of auxiliary memory per token. Sparsity can be further improved to $32\times$ through task-specific fine-tuning. On A100 GPU, at $32\times$ sparsity, incorporating HashAttention reduces attention latency by up to $4.3\times$ in GPT-FAST and $2.54\times$ in FlashDecode, and achieves up to $3.12\times$ higher throughput for GPT-FAST.

Summary

  • The paper introduces HashAttention, a novel method leveraging semantic sparsity to efficiently identify pivotal tokens in long-context LLMs, framing it as a recommendation problem in Hamming space.
  • HashAttention achieves up to a 32x reduction in the tokens attended to and a 3-6x speed-up over LightLLM on Llama-3.1-8B with minimal quality loss, using only 32 bits of auxiliary memory per token.
  • This approach offers significant improvements in memory efficiency and computational speed, enabling more efficient LLM deployment, especially in resource-constrained environments requiring long contexts.

HashAttention: Semantic Sparsity for Faster Inference

The paper "HashAttention: Semantic Sparsity for Faster Inference" presents a novel approach to enhance the efficiency of LLMs by efficiently exploiting the inherent sparsity in scaled dot-product attention (SDPA). As LLMs increasingly adopt longer contexts to improve performance, the computational cost escalates due to the quadratic complexity of SDPA. The authors propose HashAttention, a method that utilizes semantic sparsity for faster inference without significant degradation in model performance.

Methodology

HashAttention is based on the observation that not all tokens in a long context are equally important to the attention computation. Rather than attending to every token, HashAttention identifies a small subset of pivotal tokens that meaningfully influence the output. By framing this identification as a recommendation problem, the authors bring principles from information retrieval to bear on attention efficiency.
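
To make the sparsity idea concrete, the following minimal NumPy sketch computes attention for a single query over only its top-k highest-scoring keys. The exact-score selection here is an idealized stand-in for pivotal-token identification; the function name `sparse_attention` and the chosen sizes are illustrative and not taken from the paper.

```python
import numpy as np

def sparse_attention(q, K, V, k):
    """Attention for one query over only its top-k 'pivotal' keys.

    q: (d,) query; K: (n, d) keys; V: (n, dv) values.
    Selecting by exact scores is an idealization -- HashAttention's goal
    is to approximate this selection far more cheaply.
    """
    scores = K @ q / np.sqrt(K.shape[1])          # scaled dot-product scores, (n,)
    top = np.argpartition(scores, -k)[-k:]        # indices of the k pivotal tokens
    w = np.exp(scores[top] - scores[top].max())   # softmax restricted to those tokens
    w /= w.sum()
    return w @ V[top]                             # output uses k tokens instead of n

# Example: 8192-token context, attend to only 256 tokens (32x sparsity)
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((8192, 64))
V = rng.standard_normal((8192, 64))
out = sparse_attention(q, K, V, k=256)
```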

HashAttention employs learnable mapping functions to encode both keys and queries into a Hamming space, where semantic similarity is represented in terms of Hamming distance. The key innovation is performing retrieval in this Hamming space using simple bitwise operations, allowing rapid identification of crucial tokens that contribute to the attention computation. This process significantly reduces the computational load by focusing only on the most relevant tokens.
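
A rough sketch of this retrieval step is shown below, assuming sign-bit codes produced by projection matrices. In HashAttention the mapping functions are learned; the random projections, the names `encode` and `select_pivotal`, and the NumPy popcount via `unpackbits` here are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 64, 32   # 32 bits of auxiliary memory per token, as in the paper

# Stand-ins for the learned mapping functions: HashAttention trains these,
# whereas random projections are used here purely for illustration.
W_q = rng.standard_normal((d, n_bits))
W_k = rng.standard_normal((d, n_bits))

def encode(x, W):
    """Map each row of x to an n_bits-bit code (packed into a uint32) via sign bits."""
    bits = (np.atleast_2d(x) @ W > 0).astype(np.uint8)         # (m, n_bits) of 0/1
    return np.packbits(bits, axis=1).view(np.uint32).ravel()    # (m,) packed codes

def select_pivotal(q, K, k):
    """Return indices of the k keys whose codes are closest to the query code."""
    q_code = encode(q, W_q)                                     # (1,) code for the query
    k_codes = encode(K, W_k)                                    # (n,) codes, 4 bytes per token
    xor = np.bitwise_xor(k_codes, q_code)                       # differing bits via XOR
    ham = np.unpackbits(xor.view(np.uint8).reshape(-1, 4), axis=1).sum(axis=1)  # popcount
    return np.argpartition(ham, k)[:k]                          # k smallest Hamming distances
```

The selected indices can then feed a restricted attention computation such as the `sparse_attention` sketch above, so the expensive score computation touches only the k retrieved tokens rather than the full context.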

Key Findings and Results

The approach was evaluated using the Llama-3.1-8B model on the LongBench benchmark. The results are notable:

  • HashAttention achieves up to a 32x reduction in the number of tokens processed while keeping the average quality loss within 0.6 points.
  • At this level of sparsity, HashAttention demonstrates a 3-6x speed-up over LightLLM and a 2.5-4.5x speed-up over gpt-fast on an NVIDIA L4 GPU.
  • The method requires only 32 bits of auxiliary memory per token, a notable reduction compared to other approaches.

The experimental results indicate that HashAttention can almost match the attention quality of the full model under extreme sparsity conditions that other methods cannot efficiently sustain.
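
To put the 32-bit figure in perspective, a back-of-the-envelope calculation suggests the auxiliary codes are a small fraction of the key cache alone. The head dimension of 128 and fp16 keys below are assumptions for illustration, not values stated in this summary.

```python
context_len = 131_072                   # hypothetical per-head KV-cache length
aux_bytes = context_len * 32 // 8       # 32 bits per token -> 512 KiB of codes
key_bytes = context_len * 128 * 2       # assumed fp16 keys, head_dim=128 -> 32 MiB
print(f"auxiliary / key memory = {aux_bytes / key_bytes:.1%}")  # ~1.6%
```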

Implications and Future Work

HashAttention’s approach has both practical and theoretical implications. Practically, it offers a path to markedly more efficient LLM deployment across applications where long-context understanding is crucial, from text processing to real-time communication agents. Theoretically, it invites a deeper study of how semantic similarity within attention mechanisms can be exploited for efficiency.

Future research could extend HashAttention to account for value-based token importance, improve the method's generalizability, and explore models trained end to end with HashAttention built in, rather than adapting existing models after the fact.

In summary, this paper proposes a highly efficient method for exploiting semantic sparsity in modern attention mechanisms, offering substantial improvements in both memory efficiency and computational speed. This direction could reshape how attention is implemented, especially in resource-constrained environments that require long contexts.
