Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers (2405.10480v1)

Published 17 May 2024 in cs.AR and cs.LG

Abstract: Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of the state-of-the-art models has increased steadily reaching billions of parameters. These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators, such as GPUs. Specifically, the time and memory complexity of the attention operation is quadratic in terms of the total context length, i.e., prompt and output tokens. Thus, several optimizations such as key-value tensor caching and FlashAttention computation have been proposed to deliver the low latency demands of applications relying on such large models. However, these techniques do not cater to the computationally distinct nature of different phases during inference. To that end, we propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase (decode-phase) of decoder-only transformer models. LeanAttention enables scaling the attention mechanism implementation for the challenging case of long context lengths by re-designing the execution flow for the decode-phase. We identify that the associative property of online softmax can be treated as a reduction operation thus allowing us to parallelize the attention computation over these large context lengths. We extend the "stream-K" style reduction of tiled calculation to self-attention to enable parallel computation resulting in an average of 2.6x attention execution speedup over FlashAttention-2 and up to 8.33x speedup for 512k context lengths.

LeanAttention: Speeding Up Attention Mechanisms for Transformer Models

Background

In the field of AI and NLP, transformer-based models have become extremely valuable due to their performance on tasks such as text generation, machine translation, and sentiment analysis. These transformers, with self-attention mechanisms at their heart, require significant memory and computation power, particularly as contexts lengthen and models scale up to billions of parameters.

The Challenge

The main bottleneck in transformer inference is the attention computation, especially for long contexts: its execution time and memory usage grow quadratically with the total context length. Approaches such as FlashAttention and FlashAttention-2 mitigate this by improving memory and computational efficiency, but they do not cater to the distinct computational demands of the two phases of inference: the prefill phase (where the model processes the input prompt) and the decode phase (where the model generates tokens one at a time).
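
As a rough sketch of where the quadratic cost comes from (standard per-head attention cost accounting, with assumed context length n and head dimension d; these are not figures reported in the paper):

```latex
% Assumed standard cost accounting per attention head, not taken from the paper.
\begin{aligned}
\text{prefill:} \quad & \mathrm{FLOPs} \approx 2\,n^{2}d, \qquad \text{score-matrix memory} = O(n^{2}) \\
\text{decode (with KV cache):} \quad & \mathrm{FLOPs} \approx 2\,n\,d \ \text{per generated token}
\end{aligned}
```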

Introducing LeanAttention

LeanAttention is proposed as a technique for optimizing the attention mechanism during the decode phase of decoder-only transformer models. The decode phase is notoriously challenging because tokens are generated sequentially and each step must attend over the entire accumulated context. LeanAttention exploits the associativity of the softmax re-scaling step in attention to achieve significant performance gains.

Critical Concepts Behind LeanAttention

1. Softmax Re-scaling as a Reduction Operation

LeanAttention recasts the softmax re-scaling step of attention as a reduction. Because combining un-scaled partial attention outputs (together with their running softmax statistics) is an associative operation, partial results computed over different chunks of the context can be merged in any order, which lets LeanAttention parallelize attention across very long context lengths. This is crucial for handling decode-phase workloads effectively.
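
A minimal NumPy sketch of this idea (not the paper's GPU kernel; the chunk size, shapes, and function names are illustrative): each context chunk produces an un-scaled partial output plus its softmax statistics, and merging two partials is associative, so chunks can be reduced in any order or in parallel.

```python
import numpy as np

def partial_attention(q, k_chunk, v_chunk):
    """Attend q (d,) over one chunk of keys/values; return (m, l, o_unscaled)."""
    s = k_chunk @ q                       # (chunk_len,) attention scores
    m = s.max()                           # local max for numerical stability
    p = np.exp(s - m)                     # un-normalized probabilities
    return m, p.sum(), p @ v_chunk        # local denominator, un-scaled output

def merge(a, b):
    """Associative reduction: combine two partial (m, l, o) triples."""
    m_a, l_a, o_a = a
    m_b, l_b, o_b = b
    m = max(m_a, m_b)
    # Re-scale each partial to the shared max before adding.
    l = l_a * np.exp(m_a - m) + l_b * np.exp(m_b - m)
    o = o_a * np.exp(m_a - m) + o_b * np.exp(m_b - m)
    return m, l, o

rng = np.random.default_rng(0)
d, n, chunk = 64, 4096, 512
q = rng.standard_normal(d)
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))

# Reduce per-chunk partials (here sequentially; on a GPU, in parallel).
parts = [partial_attention(q, k[i:i + chunk], v[i:i + chunk]) for i in range(0, n, chunk)]
m, l, o = parts[0]
for part in parts[1:]:
    m, l, o = merge((m, l, o), part)
out = o / l                               # final re-scaling

# Compare against the reference softmax(K q) @ V.
s = k @ q
p_ref = np.exp(s - s.max())
ref = (p_ref / p_ref.sum()) @ v
print("max abs diff vs. reference:", np.abs(out - ref).max())
```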

2. Stream-K Style Decomposition

Taking inspiration from optimized matrix-multiplication strategies in GPU computing, LeanAttention divides the attention workload into minimal computational units called LeanTiles. It then distributes these LeanTiles evenly across the available processing units, balancing the workload and maximizing hardware utilization. Unlike previous methods, LeanAttention maintains near-100% GPU occupancy regardless of problem size.
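
A hypothetical sketch of this Stream-K style assignment (names such as `assign_lean_tiles` and `lean_tiles_per_head` are illustrative, not the paper's API): flatten the tiles of every attention head into one global iteration space, then hand each SM a near-equal contiguous slice. Slices that straddle a head boundary yield partial results, which are later combined with the associative softmax merge sketched above.

```python
def assign_lean_tiles(num_heads: int, lean_tiles_per_head: int, num_sms: int):
    """Split a flattened tile space as evenly as possible across SMs."""
    total = num_heads * lean_tiles_per_head
    base, extra = divmod(total, num_sms)          # near-equal split
    assignments, start = [], 0
    for sm in range(num_sms):
        count = base + (1 if sm < extra else 0)   # first `extra` SMs take one more tile
        end = start + count
        assignments.append((sm, start, end))      # this SM computes global tiles [start, end)
        start = end
    return assignments

# Example: 32 heads x 10 LeanTiles = 320 tiles spread over 108 SMs (A100-like count).
for sm, lo, hi in assign_lean_tiles(32, 10, 108)[:4]:
    head_lo, tile_lo = divmod(lo, 10)
    head_hi, tile_hi = divmod(hi - 1, 10)
    print(f"SM {sm}: tiles {lo}-{hi - 1} "
          f"(head {head_lo}, tile {tile_lo} .. head {head_hi}, tile {tile_hi})")
```

Because every SM receives nearly the same number of tiles regardless of how many heads or how long the context is, no SM sits idle waiting for a straggler, which is what keeps occupancy close to 100%.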

Performance Gains

LeanAttention has demonstrated impressive speedups in the attention execution process. In benchmark tests, LeanAttention achieved:

  • An average speedup of 2.6x over FlashAttention-2.
  • Up to 8.33x speedup for very long context lengths (512k tokens).

Additionally, in multi-GPU environments, it showed even greater gains, confirming its scalability and efficiency for large-scale AI models.

Practical and Theoretical Implications

Practical Implications

LeanAttention's capability to handle lengthy contexts efficiently means that transformer-based models can now support richer, more coherent interactions. This improvement is particularly beneficial for applications requiring long contextual understanding, such as document search and retrieval, dialogue systems, and large-scale content generation.

  • Reduced Latency: By cutting down the execution time, LeanAttention facilitates faster responses in real-time applications.
  • Enhanced Scalability: It supports the ongoing trend of increasing model sizes and context lengths without compromising performance.

Theoretical Implications

From a theoretical perspective, LeanAttention contributes to the understanding of how matrix decomposition and associative properties in computations can be leveraged to optimize complex machine learning operations. It suggests directions for further research, such as exploring similar optimizations for other phases of model inference or extending these principles to other types of models.

What's Next?

LeanAttention opens up numerous avenues for future research and development in AI:

  • Integrating with Larger Models: Applying LeanAttention to even larger transformers and comparing its performance across different architectures.
  • Extending to Other Phases: Investigating how the associative property and Stream-K decomposition can optimize other inference phases or even training processes.
  • Multi-GPU and Distributed Systems: Further optimizing LeanAttention for more complex hardware setups, enabling seamless scalability across distributed systems.

Conclusion

LeanAttention presents a significant advance in the efficient execution of the attention mechanism within transformer-based models, particularly during the decode phase. By rethinking the softmax operation and utilizing a novel decomposition strategy, LeanAttention offers substantial performance improvements and provides a robust framework for scaling up transformer models in the future. Whether you're working with extensive text generation tasks or building models that require deep contextual understanding, LeanAttention is a step forward in making large-scale NLP more efficient and scalable.

References (37)
  1. “Cute layouts.” https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/01_layout.md, [Accessed 19-04-2024].
  2. “Cute tensors.” https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/03_tensor.md, [Accessed 19-04-2024].
  3. “Cute’s support for matrix multiply-accumulate instructions.” https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/0t_mma_atom.md, [Accessed 19-04-2024].
  4. “Flashdecoding: Stanford CRFM — crfm.stanford.edu,” https://crfm.stanford.edu/2023/10/12/flashdecoding.html, [Accessed 22-04-2024].
  5. “GitHub - Dao-AILab/flash-attention: Fast and memory-efficient exact attention — github.com,” https://github.com/Dao-AILab/flash-attention, [Accessed 19-04-2024].
  6. “GitHub - NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines — github.com,” https://github.com/NVIDIA/cutlass, [Accessed 01-04-2024].
  7. “Introducing ChatGPT — openai.com,” https://openai.com/blog/chatgpt, [Accessed 01-04-2024].
  8. “Introducing the next generation of claude,” https://www.anthropic.com/news/claude-3-family, [Accessed 19-04-2024].
  9. W. Brandon, A. Nrusimha, K. Qian, Z. Ankner, T. Jin, Z. Song, and J. Ragan-Kelley, “Striped attention: Faster ring attention for causal transformers,” 2023.
  10. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
  11. J. Choquette, E. Lee, R. Krashinsky, V. Balan, and B. Khailany, “3.2 the a100 datacenter gpu and ampere architecture,” in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64.   IEEE, 2021, pp. 48–50.
  12. A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
  13. T. Dao, “Flashattention-2: Faster attention with better parallelism and work partitioning,” 2023.
  14. T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” 2022.
  15. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2019.
  16. Y. Fu, R. Panda, X. Niu, X. Yue, H. Hajishirzi, Y. Kim, and H. Peng, “Data engineering for scaling language models to 128k context,” arXiv preprint arXiv:2402.10171, 2024.
  17. K. Hong, G. Dai, J. Xu, Q. Mao, X. Li, J. Liu, K. Chen, Y. Dong, and Y. Wang, “Flashdecoding++: Faster large language model inference on gpus,” 2024.
  18. G. Inc., “An important next step on our AI journey — blog.google,” https://blog.google/technology/ai/bard-google-ai-search-updates/, [Accessed 31-03-2024].
  19. A. Ivanov, N. Dryden, T. Ben-Nun, S. Li, and T. Hoefler, “Data movement is all you need: A case study on optimizing transformers,” Proceedings of Machine Learning and Systems, vol. 3, pp. 711–732, 2021.
  20. Z. Jia and P. Van Sandt, “Dissecting the ampere gpu architecture via microbenchmarking,” in GPU Technology Conference, 2021.
  21. S. Kim, C. Hooper, T. Wattanawong, M. Kang, R. Yan, H. Genc, G. Dinh, Q. Huang, K. Keutzer, M. W. Mahoney et al., “Full stack optimization of transformer inference: a survey,” arXiv preprint arXiv:2302.14017, 2023.
  22. T. Li, G. Zhang, Q. D. Do, X. Yue, and W. Chen, “Long-context llms struggle with long in-context learning,” 2024.
  23. Y. Li, S. Bubeck, R. Eldan, A. D. Giorno, S. Gunasekar, and Y. T. Lee, “Textbooks are all you need ii: phi-1.5 technical report,” 2023.
  24. H. Liu, M. Zaharia, and P. Abbeel, “Ring attention with blockwise transformers for near-infinite context,” 2023.
  25. M. Milakov and N. Gimelshein, “Online normalizer calculation for softmax,” 2018.
  26. OpenAI, J. Achiam, and S. Adler, “Gpt-4 technical report,” 2024.
  27. M. Osama, D. Merrill, C. Cecka, M. Garland, and J. D. Owens, “Stream-k: Work-centric parallel decomposition for dense matrix-matrix multiplication on the gpu,” 2023.
  28. R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean, “Efficiently scaling transformer inference,” Proceedings of Machine Learning and Systems, vol. 5, 2023.
  29. J. Spataro and M. Inc., “Introducing Microsoft 365 Copilot – your copilot for work - The Official Microsoft Blog — blogs.microsoft.com,” https://blogs.microsoft.com/blog/2023/03/16/introducing-microsoft-365-copilot/-your-copilot-for-work/, [Accessed 31-03-2024].
  30. H. Touvron and L. Martin, “Llama 2: Open foundation and fine-tuned chat models,” 2023.
  31. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2023.
  32. S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
  33. T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, “Huggingface’s transformers: State-of-the-art natural language processing,” 2020.
  34. G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A distributed serving system for Transformer-Based generative models,” in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 521–538.
  35. G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A distributed serving system for Transformer-Based generative models,” in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22).   Carlsbad, CA: USENIX Association, Jul. 2022, pp. 521–538. [Online]. Available: https://www.usenix.org/conference/osdi22/presentation/yu
  36. P. Zhang, Z. Liu, S. Xiao, N. Shao, Q. Ye, and Z. Dou, “Soaring from 4k to 400k: Extending llm’s context with activation beacon,” arXiv preprint arXiv:2401.03462, 2024.
  37. S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “Opt: Open pre-trained transformer language models,” arXiv preprint arXiv:2205.01068, 2022.
Authors (5)
  1. Rya Sanovar (1 paper)
  2. Srikant Bharadwaj (6 papers)
  3. Renee St. Amant (9 papers)
  4. Victor Rühle (18 papers)
  5. Saravan Rajmohan (85 papers)
Citations (4)