Overview of "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning"
The paper presents FlashAttention-2, which improves the efficiency of the attention mechanism in Transformer models. The central challenge is the quadratic growth in runtime and memory of the attention layer as the sequence length increases, which remains a bottleneck when scaling models to longer contexts. FlashAttention-2 builds on the original FlashAttention by improving parallelism and work partitioning, yielding notable speedups and higher GPU utilization.
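For reference, the exact attention operation in question takes queries, keys, and values Q, K, V ∈ R^(N×d), for sequence length N and head dimension d, and computes (standard notation, not quoted from the paper):

```latex
S = \frac{QK^\top}{\sqrt{d}} \in \mathbb{R}^{N \times N}, \qquad
P = \mathrm{softmax}(S) \ \text{(row-wise)}, \qquad
O = PV \in \mathbb{R}^{N \times d}.
```

A naive implementation materializes the N × N matrices S and P in GPU memory, which is the source of the quadratic cost; FlashAttention avoids this by computing O block by block, and FlashAttention-2 improves how that blockwise computation is scheduled on the GPU.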
Key Contributions
- Algorithm Optimization: The paper tweaks the FlashAttention algorithm to reduce the number of non-matrix-multiply FLOPs. Because GPU matrix-multiply units (Tensor Cores) deliver far higher throughput than general-purpose units, keeping as much of the work as possible inside matrix multiplications raises overall throughput. Concretely, the online-softmax update is rewritten so that the output is rescaled only once at the end of each row block rather than at every step, and only the log-sum-exp (instead of both the row max and the row sum) is stored for the backward pass (see the sketch after this list).
- Enhanced Parallelism: FlashAttention-2 parallelizes the computation over the sequence-length dimension, in addition to the batch and attention-head dimensions used by the original FlashAttention. The extra dimension of parallelism keeps more of the GPU's streaming multiprocessors occupied, which matters most for long sequences, where batch sizes are typically small and batch × heads alone does not generate enough thread blocks to fill the device; the backward pass is parallelized over sequence blocks as well.
- Work Partitioning Strategy: The paper also rebalances how work is divided among the warps within each thread block to minimize shared-memory traffic. In the forward pass, FlashAttention-2 splits the query block across warps while all warps share the key and value blocks, instead of the original "split-K" scheme in which each warp held a slice of K and V and had to write partial results to shared memory and synchronize to combine them; removing that communication yields a further speedup.
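To make the tiling, the deferred rescaling, and the sequence-length parallelism concrete, here is a minimal NumPy sketch of the blockwise forward pass (single head, no masking or dropout). The function name, block sizes, and loop structure are illustrative assumptions, not the paper's CUDA implementation:

```python
import numpy as np

def blockwise_attention(Q, K, V, block_q=64, block_kv=64):
    """Exact attention computed block by block (single head, no masking)."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)

    # Outer loop: each query (row) block is independent -- this is the loop
    # FlashAttention-2 parallelizes across GPU thread blocks.
    for i in range(0, N, block_q):
        Qi = Q[i:i + block_q] * scale
        m = np.full(Qi.shape[0], -np.inf)    # running row maxima
        l = np.zeros(Qi.shape[0])            # running softmax denominators
        acc = np.zeros((Qi.shape[0], d))     # unnormalized output accumulator

        # Inner loop: stream over key/value blocks with an online softmax.
        for j in range(0, N, block_kv):
            S = Qi @ K[j:j + block_kv].T             # (block_q, block_kv) scores
            m_new = np.maximum(m, S.max(axis=1))
            P = np.exp(S - m_new[:, None])           # block-local numerator
            alpha = np.exp(m - m_new)                # correction for the old max
            l = alpha * l + P.sum(axis=1)
            acc = alpha[:, None] * acc + P @ V[j:j + block_kv]
            m = m_new

        # Rescale once per row block (deferred), not once per inner step.
        O[i:i + block_q] = acc / l[:, None]
    return O

# Sanity check against the naive quadratic-memory reference.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
S = (Q / np.sqrt(64)) @ K.T
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blockwise_attention(Q, K, V), ref, atol=1e-6)
```

Each iteration of the outer loop is independent of the others, which is the parallelism FlashAttention-2 exploits: row blocks map to separate thread blocks, so long sequences keep the GPU busy even at small batch sizes, and the division by the softmax denominator happens once per row block rather than once per inner step.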
Empirical Results
Empirically, FlashAttention-2 is roughly 2x faster than the original FlashAttention and substantially faster still than standard (non-fused) PyTorch attention implementations. In the attention benchmarks, throughput reaches up to 73% of the A100 GPU's theoretical maximum FLOPs/s in some configurations. When used end-to-end to train GPT-style models, FlashAttention-2 sustains up to 225 TFLOPs/s per A100 GPU, a substantial gain in training efficiency and hardware utilization.
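As a rough sanity check (assuming the commonly cited 312 TFLOPs/s peak for dense BF16/FP16 Tensor Core throughput on an A100), the reported end-to-end training throughput corresponds to:

```latex
\frac{225\ \text{TFLOPs/s}}{312\ \text{TFLOPs/s}} \approx 72\% \ \text{model FLOPs utilization},
```

consistent with the roughly 72% model FLOPs utilization the paper reports.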
Theoretical and Practical Implications
From a theoretical perspective, FlashAttention-2 computes exact attention, with no approximation or sparsification, so its speed gains come without any change to model outputs. This keeps it applicable wherever accuracy is critical, such as language modeling and other demanding AI tasks.
Practically, the method makes it feasible to scale models to much longer sequence lengths, enabling deeper context understanding in natural language processing and richer modeling of long inputs such as high-resolution images and video. It also reduces the computational expense of training large models, making more extensive and resource-intensive AI workloads economical to run.
Future Directions
The paper points to further optimization of attention on newer hardware such as H100 GPUs, exploiting features like the Tensor Memory Accelerator (TMA) and FP8 Tensor Cores. Additionally, combining FlashAttention-2 with techniques such as block-sparse attention could allow even longer contexts and more sophisticated AI applications. The authors also see room for collaboration with compiler researchers to make these low-level optimizations easier to program and portable across computing platforms.
Conclusion
FlashAttention-2 is a robust advance toward removing the attention bottleneck in large-scale Transformer models. Its improvements are both theoretical and practical, broadening the feasibility and scope of modern AI systems that rely on understanding long sequences.