
A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library (2312.11918v1)

Published 19 Dec 2023 in cs.LG and cs.DC

Abstract: We provide an optimized implementation of the forward pass of FlashAttention-2, a popular memory-aware scaled dot-product attention algorithm, as a custom fused CUDA kernel targeting NVIDIA Hopper architecture and written using the open-source CUTLASS library. In doing so, we explain the challenges and techniques involved in fusing online-softmax with back-to-back GEMM kernels, utilizing the Hopper-specific Tensor Memory Accelerator (TMA) and Warpgroup Matrix-Multiply-Accumulate (WGMMA) instructions, defining and transforming CUTLASS Layouts and Tensors, overlapping copy and GEMM operations, and choosing optimal tile sizes for the Q, K and V attention matrices while balancing the register pressure and shared memory utilization. In head-to-head benchmarks on a single H100 PCIe GPU for some common choices of hyperparameters, we observe 20-50% higher FLOPs/s over a version of FlashAttention-2 optimized for last-generation NVIDIA Ampere architecture.

Introduction to Kernel Fusion

In GPU computing, kernel fusion is an optimization technique that combines multiple computational kernels into a single kernel, reducing the amount of data transferred between GPU memory and the processing units. This is particularly advantageous when memory bandwidth is the limiting factor, which is often the case on modern GPUs, where computational throughput has grown faster than memory bandwidth. Among the workloads that benefit from kernel fusion are the training and inference of LLMs, which underpin major breakthroughs in AI and natural language processing.
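To make the idea concrete, the following toy CUDA example (not from the paper; the kernel names and the scale/bias operation are illustrative) contrasts an unfused pair of elementwise kernels, whose intermediate result must round-trip through global memory, with a single fused kernel that keeps the intermediate value in a register.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Unfused version: two kernel launches, so the intermediate result makes a
// round trip through global memory between them.
__global__ void scale_kernel(const float* x, float* tmp, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = alpha * x[i];
}
__global__ void bias_kernel(const float* tmp, float* y, float beta, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = tmp[i] + beta;
}

// Fused version: one kernel, the intermediate value stays in a register,
// roughly halving global memory traffic for this toy example.
__global__ void scale_bias_fused(const float* x, float* y,
                                 float alpha, float beta, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = alpha * x[i] + beta;
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    scale_bias_fused<<<(n + 255) / 256, 256>>>(x, y, 2.0f, 1.0f, n);
    cudaDeviceSynchronize();
    printf("fused kernel finished: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

FlashAttention applies the same principle at a much larger scale, keeping attention-score tiles on chip instead of writing the full score matrix out to global memory.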

Implementing Advanced Attention Algorithms

The paper focuses on optimizing the forward pass of FlashAttention-2, a memory-aware scaled dot-product attention algorithm used in transformer models such as GPT-3. FlashAttention-2 is implemented as a fused CUDA kernel that exploits the capabilities of the NVIDIA Hopper architecture, using the CUTLASS library, whose Layout and Tensor abstractions simplify GPU kernel development. The paper demonstrates how to fuse an online-softmax computation with two back-to-back General Matrix Multiply (GEMM) operations using Hopper's specialized Tensor Memory Accelerator (TMA) and Warpgroup Matrix-Multiply-Accumulate (WGMMA) instructions, and reports 20-50% higher FLOPs/s than a FlashAttention-2 implementation optimized for the previous-generation Ampere architecture, measured on a single H100 PCIe GPU.
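The recurrence that makes this fusion possible is the online-softmax update, which rescales previously accumulated results whenever a new running maximum is found. The sketch below is a scalar, host-side reference of that math for a single query row, offered as an illustration rather than the paper's CUTLASS kernel; in the actual kernel, the two inner dot-product loops become tile-wide WGMMA GEMMs on operands staged into shared memory by TMA.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>

// Scalar reference of the fused "online softmax + two GEMMs" recurrence for a
// single query row q. The running max m and running sum l allow each K/V tile
// to be consumed as soon as it arrives, without materializing the full score
// matrix.
void attention_row_online(const std::vector<float>& q,                  // [d]
                          const std::vector<std::vector<float>>& K,     // [N][d]
                          const std::vector<std::vector<float>>& V,     // [N][d]
                          std::vector<float>& o) {                      // [d]
    const int d = static_cast<int>(q.size());
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    float m = -INFINITY, l = 0.0f;           // running max and running sum
    std::vector<float> acc(d, 0.0f);         // unnormalized output accumulator

    for (size_t j = 0; j < K.size(); ++j) {  // one "tile" = one key/value here
        float s = 0.0f;                      // s_j = scale * (q . k_j)  -- first GEMM
        for (int t = 0; t < d; ++t) s += q[t] * K[j][t];
        s *= scale;

        const float m_new = std::max(m, s);
        const float corr  = std::exp(m - m_new);   // rescale previous contributions
        const float p     = std::exp(s - m_new);   // current probability numerator
        l = l * corr + p;
        for (int t = 0; t < d; ++t)
            acc[t] = acc[t] * corr + p * V[j][t];  // O += P V  -- second GEMM
        m = m_new;
    }
    o.resize(d);
    for (int t = 0; t < d; ++t) o[t] = acc[t] / l; // final normalization
}

int main() {
    std::vector<float> q = {1.0f, 0.0f};
    std::vector<std::vector<float>> K = {{1.0f, 0.0f}, {0.0f, 1.0f}};
    std::vector<std::vector<float>> V = {{1.0f, 2.0f}, {3.0f, 4.0f}};
    std::vector<float> o;
    attention_row_online(q, K, V, o);
    printf("o = (%.4f, %.4f)\n", o[0], o[1]);
    return 0;
}
```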

Performance Enhancements and Coding Abstractions

The paper explains how to choose tile sizes for the Q, K, and V matrices and how to balance register pressure against shared memory utilization. Fusing the operations requires defining and transforming layouts and orchestrating data copies so that they overlap with GEMM computation, increasing parallelism and reducing overhead. The resulting custom kernel shows notable performance gains over a version of the algorithm tailored to last-generation hardware.
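As a rough illustration of the shared-memory side of that trade-off, the sketch below tallies the bytes occupied by one Q tile plus pipelined K/V tiles for a few candidate tile shapes. The head dimension, FP16 element size, stage counts, and the roughly 228 KB per-SM shared-memory figure for H100 are assumptions used for illustration, not the paper's chosen configuration.

```cuda
#include <cstdio>

// Back-of-the-envelope shared-memory budget for candidate Q/K/V tile shapes.
// Assumed for illustration: FP16 operands (2 bytes), head dimension 128, and
// roughly 228 KB of shared memory available per SM on H100.
int main() {
    const int kHeadDim      = 128;          // d
    const int kBytesPerElem = 2;            // FP16
    const int kSmemBudget   = 228 * 1024;   // approx. H100 shared memory per SM

    const int tile_candidates[][2] = {      // {bM (Q rows), bN (K/V rows)}
        {64, 64}, {128, 64}, {128, 128}, {256, 128},
    };

    for (const auto& t : tile_candidates) {
        const int bM = t[0], bN = t[1];
        // One Q tile stays resident; pipelining K/V tiles (to overlap TMA
        // copies with WGMMA) multiplies the K/V portion by the stage count.
        const int q_bytes  = bM * kHeadDim * kBytesPerElem;
        const int kv_bytes = 2 * bN * kHeadDim * kBytesPerElem;
        for (int stages = 1; stages <= 3; ++stages) {
            const int total = q_bytes + stages * kv_bytes;
            printf("bM=%3d bN=%3d stages=%d -> %6.1f KB %s\n",
                   bM, bN, stages, total / 1024.0,
                   total <= kSmemBudget ? "fits" : "exceeds budget");
        }
    }
    return 0;
}
```

Larger tiles amortize more work per WGMMA instruction but leave less room for multi-stage pipelining and increase register pressure, which is the balance described above.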

Directions for Future Work

Although the results are promising, the paper identifies areas for further optimization, including more sophisticated use of warpgroups, deeper pipelining to better overlap memory operations with computation, and new shared memory features expected in upcoming GPU architectures. The authors anticipate that future improvements in GPU hardware and attention-algorithm implementations will continue to push the efficiency and performance of LLM processing.

Authors (2)
  1. Ganesh Bikshandi (2 papers)
  2. Jay Shah (29 papers)
Citations (5)
