FlashAttention-2: Optimized Transformer Attention

Updated 3 August 2025
  • FlashAttention-2 is an optimized, IO-aware exact attention algorithm for Transformers that minimizes memory traffic and maximizes GPU throughput.
  • It employs block-wise memory tiling, deferred normalization, and activation recomputation to significantly reduce non-matmul operations and memory footprint.
  • Advanced parallelism and kernel fusion techniques yield up to 2× speedup, closing the gap between actual performance and hardware peak.

FlashAttention-2 is an optimized, IO-aware exact attention algorithm designed to accelerate and scale the core attention layer in Transformer architectures by minimizing memory traffic and maximizing computational throughput on GPU hardware. As the successor to the original FlashAttention algorithm, it systematically reduces non-matmul overhead and introduces advanced parallelization strategies, yielding significant speedups and closing the gap between practical attention kernel efficiency and theoretical hardware peak performance.

1. Algorithmic Refinements and Memory Optimization

FlashAttention-2 restructures the self-attention computation by enhancing both the arithmetic and IO characteristics of the forward and backward passes. The attention kernel is implemented block-wise, partitioning the query (Q), key (K), and value (V) matrices into memory-friendly tiles that reside wholly in on-chip SRAM, thus minimizing the number of reads and writes to high-bandwidth memory (HBM).

Central algorithmic advances include:

  • Deferred normalization: The algorithm accumulates an “unscaled” output tensor while processing each K/V block and postpones the softmax normalization (the division by the running row sum) until the end of the loop over blocks. This removes a rescaling of the output at every block update and substantially reduces the number of non-matmul FLOPs executed.
  • Simplified backward storage: FlashAttention-2 stores only the row-wise logsumexp L = m + log(l), rather than both the per-row maximum m and the per-row sum of exponentials l, for backpropagation. This further cuts memory traffic and non-matmul computational cost.
  • Recomputation for memory savings: The backward pass reconstructs activations on the fly using stored normalization statistics and the input blocks streamed from HBM, avoiding the need to store (or repeatedly read) the full N×N attention or softmax matrix.

Theoretical IO complexity is thus reduced from the Θ(Nd + N²) HBM accesses of standard attention to Θ(N² d² / M), where N is the sequence length, d the head dimension, and M the on-chip SRAM capacity.
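The following is a minimal NumPy sketch of the blocked forward pass described above, assuming a single head, no causal masking, and illustrative block sizes; it is meant to show the deferred normalization and the single logsumexp statistic, not the fused GPU kernel itself.

```python
import numpy as np

def blocked_attention_forward(Q, K, V, block_q=64, block_k=64):
    """Exact attention computed tile by tile, normalizing only once per query block.

    Returns the output O and the row-wise logsumexp L = m + log(l), the only
    statistic the backward pass needs in order to recompute softmax tiles.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.empty((N, d))
    L = np.empty(N)

    for qs in range(0, N, block_q):                 # one "thread block" per query tile
        q = Q[qs:qs + block_q]
        m = np.full(len(q), -np.inf)                # running row-wise max
        l = np.zeros(len(q))                        # running row-wise sum of exponentials
        acc = np.zeros((len(q), d))                 # unscaled output accumulator

        for ks in range(0, N, block_k):             # stream K/V tiles through "SRAM"
            s = (q @ K[ks:ks + block_k].T) * scale
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])          # unnormalized probabilities
            corr = np.exp(m - m_new)                # rescale what was accumulated so far
            l = corr * l + p.sum(axis=1)
            acc = corr[:, None] * acc + p @ V[ks:ks + block_k]
            m = m_new

        O[qs:qs + block_q] = acc / l[:, None]       # normalization deferred to here
        L[qs:qs + block_q] = m + np.log(l)          # single statistic stored for backward
    return O, L

# Sanity check against a reference softmax(QK^T / sqrt(d)) V on random inputs.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
assert np.allclose(blocked_attention_forward(Q, K, V)[0], P @ V)
```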

2. Advanced Parallelism and Work Partitioning

FlashAttention-2 introduces new levels of parallelism that improve GPU occupancy and the balance of work across thread blocks and warps, which is critical for closing the efficiency gap with optimized GEMM routines; a minimal sketch of the partitioning scheme follows the list below.

  • Inter-block parallelism: The forward pass is parallelized over the sequence-length dimension in addition to batch and heads; each thread block is assigned a distinct, non-overlapping block of queries, so all blocks operate independently.
  • Backward pass concurrency: Each thread block handles a block of key/value columns and accumulates its contribution to the query gradient dQ with atomic operations, since several column blocks update the same rows of dQ. This balances the workload even when model or tile dimensions are not large.
  • Intra-block warp partitioning: The new intra-block scheduling scheme splits the query tile across warps while K and V are kept shared, so each warp computes its slice of the output independently, without excess shared-memory communication or expensive synchronization.
  • Improved occupancy: These design choices ensure high streaming multiprocessor (SM) utilization, especially crucial when the batch size or number of heads is small, overcoming one limiting factor in prior algorithms.
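As a rough illustration of both ideas (independent query tiles in the forward grid, and per-column-block accumulation into dQ in the backward pass), the following single-head NumPy sketch uses a plain `+=` where the real kernel issues atomic adds; block sizes and function names are illustrative assumptions.

```python
import numpy as np
from itertools import product

def forward_work_units(batch, heads, seqlen, block_q=64):
    """Forward-pass grid: one independent work unit (thread block) per
    (batch, head, query-block) triple; no unit ever shares rows of Q."""
    n_q_blocks = -(-seqlen // block_q)              # ceiling division
    return list(product(range(batch), range(heads), range(n_q_blocks)))

def backward_dq(Q, K, V, O, dO, L, block_k=64):
    """Backward-pass partitioning over key/value column blocks (single head).

    Each column block would be handled by a different thread block; because
    several of them touch the same rows of dQ, the real kernel uses atomic
    adds where this sketch simply does a serial `+=`.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    dQ = np.zeros_like(Q)
    D = np.sum(dO * O, axis=1)                      # row-wise dot(dO_i, O_i)
    for ks in range(0, N, block_k):
        Kb, Vb = K[ks:ks + block_k], V[ks:ks + block_k]
        S = (Q @ Kb.T) * scale
        P = np.exp(S - L[:, None])                  # recomputed from stored logsumexp
        dP = dO @ Vb.T
        dS = P * (dP - D[:, None])
        dQ += (dS @ Kb) * scale                     # atomic add in the fused kernel
    return dQ
```

In the actual kernel the loop over query rows inside the column-block computation is also tiled and split across warps; Q is processed whole here only to keep the sketch short.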

3. Integration with Hardware and Fused Kernel Implementations

On recent GPU architectures (e.g., NVIDIA Hopper), FlashAttention-2 implementations leverage features such as the Tensor Memory Accelerator (TMA) for asynchronous global-to-shared memory copies and Warpgroup Matrix Multiply-Accumulate (WGMMA) instructions for fused matmul operations.

Key hardware-level strategies include:

  • Kernel fusion: QKᵀ GEMM, online row-wise softmax (with per-row max and row-sum), and PV GEMM are fused into a single kernel, minimizing intermediate memory writes.
  • Asynchronous copy–compute overlap: TMA enables concurrent loading of the next block (from global to shared memory) during ongoing GEMM/softmax of the current block.
  • Layout transformation: Custom CUTLASS matrix layouts for Q, K, and V enable efficient tile-wise transposition and in-register reshaping between consecutive GEMM operations.
  • Tile tuning: Careful selection of Q/K/V tile shapes (e.g., 64×128, 128×64) balances computational throughput, register usage, and occupancy, avoiding register overflow or underutilization.
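As a back-of-the-envelope illustration of the tile-tuning trade-off, the sketch below estimates the shared-memory footprint of a candidate tile shape; the 228 KB capacity and the double-buffering factor are assumptions, and register pressure (the other half of the trade-off) is not modeled.

```python
def tile_smem_bytes(block_q, block_k, head_dim, bytes_per_elem=2, stages=2):
    """Rough shared-memory cost of one fused attention tile: the Q tile plus
    `stages` buffered K and V tiles (the output accumulator and softmax
    statistics are assumed to live in registers and are not counted)."""
    q_tile = block_q * head_dim
    kv_tiles = stages * 2 * block_k * head_dim
    return (q_tile + kv_tiles) * bytes_per_elem

# Example: a 128x64 tiling at head_dim=128 in fp16, checked against an assumed
# 228 KB per-SM shared-memory budget.
needed = tile_smem_bytes(block_q=128, block_k=64, head_dim=128)
print(f"{needed / 1024:.0f} KiB needed, fits: {needed <= 228 * 1024}")
```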

On Hopper hardware, these optimizations provide an additional 20‒50% FLOPs/s improvement over prior SM80/Ampere generation kernels (Bikshandi et al., 2023).

4. Performance Metrics and Empirical Speedups

FlashAttention-2 achieves substantial empirical performance improvements on language modeling and sequence modeling tasks. Benchmark results include:

  • 2× speedup over the original FlashAttention kernel across a range of sequence lengths.
  • Up to 230 TFLOPs/s (forward and backward combined) on NVIDIA A100, corresponding to approximately 73% of the theoretical device peak for matmul FLOPs; the original FlashAttention achieved only 25–40%.
  • In end-to-end training, GPT-style models with up to 2.7B parameters and 8k context reach up to 225 TFLOPs/s per A100 GPU, up to 1.3× faster than the original FlashAttention and up to 2.8× faster than non-FlashAttention baselines (Dao, 2023).
  • The kernel approaches the efficiency of GEMM routines, which typically reach ~80–90% of hardware peak with hand-optimized implementations.
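The utilization figures above follow directly from the raw throughput numbers; the 312 TFLOPs/s value used below is the commonly cited dense FP16/BF16 tensor-core peak of an A100 and is an assumption of this back-of-the-envelope check.

```python
A100_PEAK_TFLOPS = 312  # assumed dense FP16/BF16 tensor-core peak

for label, achieved in [("attention kernel (fwd + bwd)", 230),
                        ("end-to-end GPT-style training", 225)]:
    print(f"{label}: {achieved} TFLOPs/s ≈ {achieved / A100_PEAK_TFLOPS:.0%} of peak")
# ≈ 74% and 72%, consistent with the ~73% kernel utilization quoted above.
```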

These metrics indicate that the throughput gap between hardware peak and practical software kernels has been substantially narrowed relative to prior art.

5. Theoretical Optimality and I/O Complexity

FlashAttention-2's design is guided by formal IO complexity bounds. For ordinary cache size regimes (M ≫ d²), the achieved IO complexity of Θ(N² d² / M) is proven to be optimal up to constant or polylogarithmic factors (Saha et al., 12 Feb 2024). For small caches (M = o(d²)), the lower bound becomes Ω(N² d / √M), indicating that FlashAttention's benefits are intrinsic to the large-cache regime in modern GPUs.
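A small numeric comparison of the two regimes is given below; the sequence length, head dimension, and on-chip capacity are illustrative assumptions, with M measured in matrix elements rather than bytes.

```python
import math

def large_cache_io(N, d, M):
    """Θ(N² d² / M) HBM accesses, tight when M >> d² (up to polylog factors)."""
    return N * N * d * d / M

def small_cache_lower_bound(N, d, M):
    """Ω(N² d / sqrt(M)) lower bound when M = o(d²)."""
    return N * N * d / math.sqrt(M)

# Illustrative setting: 8k tokens, head dimension 128, and M = 2**17 elements
# of on-chip SRAM (on the order of a few hundred KB in fp16).
N, d, M = 8192, 128, 2 ** 17
print(large_cache_io(N, d, M) / N ** 2)   # ≈ 0.125 HBM accesses per attention score,
                                          # versus ~1 for materializing the N x N matrix
```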

Recent theoretical results leverage advances from communication complexity to establish that even with the most aggressive matrix multiplication or communication protocols, no attention algorithm can surpass this IO complexity lower bound in the relevant regimes.

6. Extensions, Applications, and Ecosystem

The modular work partitioning and hardware-friendly structure of FlashAttention-2 underpin compatibility with:

  • Dynamic sparsity patterns (“QK-sparse” and “hash-sparse”): Extends the original tiling scheme to irregular attention masks, as required by dynamic or hashed attention (Pagliardini et al., 2023).
  • Block-sparse attention and S2-Attention: Further reduces computation by context sharding among heads, achieving up to 25.3× speed-up versus dense FlashAttention-2 baselines without loss of downstream model quality (Lin et al., 25 Jul 2024).
  • Third-party frameworks and models: Integrated as core infrastructure in open-source models (TinyLlama (Zhang et al., 4 Jan 2024)) and advanced LLM engineering stacks (QiMeng-Attention (Zhou et al., 14 Jun 2025)).
  • Specialized low-memory settings (NPUs, Volta GPUs): Adopted and re-architected for low-resource environments, including support for ultra-long sequences and CPU-GPU cooperative mode (Lin et al., 22 Oct 2024).
  • Downstream domains: Enabled efficient, scalable attention for high-resolution vision tasks (ELFATT (Wu et al., 10 Jan 2025)) and for sophisticated 3D learning backbones with geometric locality alignment (Flash3D (Chen et al., 21 Dec 2024)).

7. Impact, Limitations, and Future Directions

FlashAttention-2 marks a substantial advance in the design of efficient, hardware-optimized attention operators for foundational models and very long context sequences. Its IO-optimal design, coupled with GPU-centric parallelism and kernel fusion, dramatically boosts throughput and unlocks practical scaling for both dense and structured sparse attention patterns.

Remaining limitations include the lack of hardware-accelerated nonlinear function support (such as exponentiation in softmax) on some accelerators—addressed in follow-on work with ISA extensions (Wang et al., 15 Apr 2025)—and the need for further speedups in the context of extremely long sequences and generation-phase scenarios (the focus of follow-on designs such as LeanAttention (Sanovar et al., 17 May 2024) and FlashAttention-3 (Shah et al., 11 Jul 2024)).

Future research directions likely include: further fusion of quantization (see MiniKV (Sharma et al., 27 Nov 2024) and TurboAttention (Kang et al., 11 Dec 2024)), expansion of support to non-GPU accelerators, and the development of flexible, LLM-driven attention optimization pipelines (QiMeng-Attention (Zhou et al., 14 Jun 2025)). FlashAttention-2 provides the canonical kernel reference and performance baseline for these next-generation transformer architectures and their deployment at scale.