Hardware-Efficient Attention for Fast Decoding
The paper "Hardware-Efficient Attention for Fast Decoding" addresses the problem of inference efficiency in large language models (LLMs) by redesigning attention mechanisms to better exploit modern hardware. It examines how attention can be restructured to reduce memory movement from the KV cache and to increase parallelism during the decoding stage.
Key Contributions and Methodologies
The paper introduces two novel attention mechanisms—Grouped-Tied Attention (GTA) and Grouped Latent Attention (GLA). Both are designed to enhance hardware efficiency without sacrificing model quality or scalability:
- Grouped-Tied Attention (GTA): This variant ties the key and value (KV) representations into a single shared state per head group. Consolidating them shrinks the KV cache and raises arithmetic intensity, because each cached state is loaded once and reused as both key and value across a group of query heads. The authors report quality on par with Grouped-Query Attention (GQA) at roughly half the KV-cache footprint (see the sketch after this list).
- Grouped Latent Attention (GLA): This mechanism extends latent attention in a parallelism-friendly way. Instead of a single large latent head that must be replicated on every device under tensor parallelism, GLA caches a small number of latent heads that can be sharded across devices. By compressing the hidden state into low-rank latent vectors, it preserves quality comparable to Multi-head Latent Attention (MLA) while reducing per-device memory and data-movement overhead during decoding.
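To make the tied-KV idea concrete, here is a minimal decoding-step sketch in PyTorch. It assumes a simplified layout in which one tied KV state per head group is read once and used as both key and value by every query head in that group; RoPE handling and the exact cache layout from the paper are omitted, and the function names and shapes are illustrative rather than the authors' implementation.

```python
# Minimal sketch of a Grouped-Tied Attention (GTA) decode step.
# Assumption: one tied KV head per group, shared by all query heads in the group;
# the same cached tensor is read as both key and value. RoPE details are omitted.
import torch
import torch.nn.functional as F

def gta_decode_step(q, tied_kv_cache, num_groups):
    """
    q:             (batch, num_q_heads, head_dim)          query for the new token
    tied_kv_cache: (batch, num_groups, seq_len, head_dim)  one tied K/V state per group
    """
    batch, num_q_heads, head_dim = q.shape
    heads_per_group = num_q_heads // num_groups

    # Each group of query heads attends to the same tied state, so there is
    # one cache read per group instead of separate K and V reads.
    q = q.view(batch, num_groups, heads_per_group, head_dim)
    k = v = tied_kv_cache  # tied: the same tensor serves as key and value

    scores = torch.einsum("bghd,bgsd->bghs", q, k) / head_dim ** 0.5
    probs = F.softmax(scores, dim=-1)
    out = torch.einsum("bghs,bgsd->bghd", probs, v)
    return out.reshape(batch, num_q_heads, head_dim)

# Illustrative shapes (hypothetical, not from the paper):
q = torch.randn(2, 8, 64)           # 8 query heads
cache = torch.randn(2, 2, 128, 64)  # 2 tied KV groups, 128 cached tokens
print(gta_decode_step(q, cache, num_groups=2).shape)  # torch.Size([2, 8, 64])
```

The point of the tied cache is that each group's single state is loaded from memory once per decoded token yet participates in both the QK^T and PV products for every query head in the group.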
Numerical Results and Implications
The numerical results from experiments on language modeling tasks demonstrate significant quantitative improvements:
- GTA halves the KV-cache size relative to a comparable GQA configuration while maintaining similar perplexity. For example, at the 876M-parameter scale, GTA reaches a validation perplexity of 11.2 versus 11.3 for GQA (a back-of-the-envelope cache comparison follows this list).
- GLA achieves up to twice the speed of similar MLA configurations during speculative decoding scenarios, indicating its superior efficiency in handling larger batch sizes and longer sequence lengths.
- In online serving benchmarks, GLA improved end-to-end latency and throughput by up to 2x compared to FlashMLA, underscoring its suitability for latency-sensitive, high-throughput serving.
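As a rough illustration of the cache-halving claim, the following back-of-the-envelope calculation compares per-token KV-cache size for a GQA-style cache (separate K and V states) and a GTA-style tied cache. The head count, head dimension, and fp16 storage below are hypothetical choices for illustration, not the paper's exact configurations.

```python
# Per-token KV-cache size, showing why tying K and V roughly halves the cache
# relative to GQA. All hyperparameters below are hypothetical.
def kv_cache_bytes_per_token(num_kv_heads, head_dim, tied=False, dtype_bytes=2):
    states_per_head = 1 if tied else 2  # GQA keeps K and V; GTA keeps one tied state
    return num_kv_heads * head_dim * states_per_head * dtype_bytes

head_dim = 128
print("GQA, 4 KV heads  :", kv_cache_bytes_per_token(4, head_dim))              # 2048 bytes/token
print("GTA, 4 tied heads:", kv_cache_bytes_per_token(4, head_dim, tied=True))   # 1024 bytes/token
```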
Theoretical and Practical Implications
The implications of these advancements are both theoretical and practical. Theoretically, the paper sharpens the understanding of how architectural choices can exploit hardware more effectively, for example by raising arithmetic intensity through reduced memory movement per unit of compute (a rough estimate follows below). Practically, the proposed attention mechanisms promise better scalability and lower inference costs, which are crucial for deploying LLMs in real-world applications where latency and resource utilization are pivotal concerns.
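For intuition, the sketch below estimates decode-time arithmetic intensity (FLOPs per byte of KV cache loaded) under the common simplification that every loaded cache element performs one multiply-add per query head that shares it, and that a tied element is reused in both the QK^T and PV products. The paper's accounting is more detailed; this is only an order-of-magnitude illustration assuming fp16 storage.

```python
# Rough decode-time arithmetic intensity: FLOPs per byte of KV cache loaded.
# Simplification: each loaded cache element does one multiply-add (2 FLOPs)
# per query head sharing it; a tied K/V element is reused in both QK^T and PV.
def decode_arithmetic_intensity(q_heads_per_kv_state, tied=False, dtype_bytes=2):
    flops_per_element = 2 * q_heads_per_kv_state * (2 if tied else 1)
    return flops_per_element / dtype_bytes

print(decode_arithmetic_intensity(1))             # MHA-like: 1.0 FLOP/byte
print(decode_arithmetic_intensity(4))             # GQA-like sharing: 4.0 FLOPs/byte
print(decode_arithmetic_intensity(4, tied=True))  # GTA-like tied sharing: 8.0 FLOPs/byte
```

Higher values mean decoding is less memory-bandwidth-bound, which is exactly the lever GTA and GLA pull.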
Future Directions
Future research can explore how these mechanisms scale to even larger models and how they adapt to hardware beyond GPUs, such as tensor processing units (TPUs). Investigating the trade-off between model quality and cache-size reduction across different attention architectures could also yield insights that optimize other neural network components or inform entirely new designs.
In conclusion, this paper provides essential advancements in the design of efficient attention mechanisms, offering a robust foundation for subsequent research and development in AI models that must contend with hardware limitations.