Flash-Attention Kernels in Transformer Models

Updated 2 September 2025
  • Flash-Attention kernels are GPU-optimized algorithms that tile and fuse attention computations to minimize memory IO and accelerate softmax-based self-attention in Transformers.
  • They integrate score computation, softmax normalization, and value aggregation into on-chip operations, achieving optimal IO complexity and noticeable speedups.
  • Extensions such as block-sparse attention and quantized implementations enable longer context processing and broader hardware compatibility while maintaining accuracy.

Flash-Attention kernels are a family of GPU-optimized attention algorithms and implementations designed to accelerate the computation of softmax-based self-attention in Transformer models. The canonical FlashAttention kernel achieves this by reorganizing and tiling the attention computation such that the memory and computational costs per GPU pass are minimized, particularly with respect to IO between high-bandwidth memory (HBM) and on-chip scratchpad SRAM. By fusing the computation of attention scores, softmax normalization, and value-weighted sums into hardware-aware blocks, FlashAttention delivers both theoretically optimal IO complexity for a wide range of practical SRAM sizes and significant empirical speedups, all without sacrificing the exactness of softmax attention. Extensions to the kernel framework provide further speed and memory improvements for block-sparse attention, quantized data types, and new hardware architectures.

1. Algorithmic Foundations and Tiled Attention Computation

Traditional (dense) softmax-based self-attention for a sequence of length $N$ and head dimension $d$ computes an $N \times N$ matrix of attention scores $S = QK^\top$, reduces each row via softmax (numerically stabilized as $P_{i,:} = \exp(S_{i,:} - m_i)/\ell_i$, with $m_i = \max_j S_{i,j}$ and $\ell_i = \sum_j \exp(S_{i,j} - m_i)$), and computes the output $O = PV$. This construction incurs $O(N^2)$ memory and compute, particularly when $S$ must be explicitly materialized in large matrices stored in GPU HBM.

FlashAttention (Dao et al., 2022) eliminates the need to materialize $S$ (and the normalized matrix $P$) in global memory by introducing tiling: $Q$, $K$, and $V$ are partitioned into submatrices, and the attention computation is performed block-by-block using only on-chip SRAM. This is feasible due to the decomposability of the softmax, which allows incremental and numerically stable aggregation of partial softmax normalizers and outputs:

  • Each query block $Q_i$ is matched in turn with key and value blocks $K_j$, $V_j$, with the partial scores $S_{ij} = Q_i K_j^\top$ computed and normalized.
  • Running softmax statistics (block-wise row maxima $m_i$ and sums $\ell_i$) are maintained and incrementally updated using the algebra of the exponential and max operators:

$$m_{\text{total}} = \max(m_{\text{prev}}, m_{\text{curr}}),$$

$$\ell_{\text{total}} = \exp(m_{\text{prev}} - m_{\text{total}})\,\ell_{\text{prev}} + \exp(m_{\text{curr}} - m_{\text{total}})\,\ell_{\text{curr}}.$$

  • Final outputs are updated via an aggregation step, fusing computations across blocks without intermediate writes of $S$ or $P$ to slow memory.

This approach is formalized in the algorithmic pseudocode and block algebra of (Dao et al., 2022), where the explicit goal is to never materialize the $N \times N$ attention matrix in HBM, instead updating the output via local, SRAM-resident computation.
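
To make the block algebra concrete, the following is a minimal NumPy sketch of the online-softmax recurrence described above. It is illustrative only: the function name and block size are arbitrary choices of this sketch, and only $K$ and $V$ are tiled (a real FlashAttention kernel also tiles $Q$ and keeps every block in on-chip SRAM), yet it reproduces exact softmax attention.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=128):
    """Exact softmax attention computed block-by-block over K/V, mirroring the
    running-statistics update above. Pure NumPy sketch for clarity only."""
    N, d = Q.shape
    O = np.zeros_like(Q)            # running (unnormalized) output
    m = np.full(N, -np.inf)         # running row maxima
    l = np.zeros(N)                 # running softmax denominators

    for j in range(0, N, block_size):
        Kj, Vj = K[j:j + block_size], V[j:j + block_size]
        S = Q @ Kj.T / np.sqrt(d)                 # partial scores for this block
        m_curr = S.max(axis=1)                    # block-wise row maxima
        P = np.exp(S - m_curr[:, None])           # stabilized partial exponentials
        m_new = np.maximum(m, m_curr)             # merge running statistics
        alpha, beta = np.exp(m - m_new), np.exp(m_curr - m_new)
        O = alpha[:, None] * O + beta[:, None] * (P @ Vj)
        l = alpha * l + beta * P.sum(axis=1)
        m = m_new

    return O / l[:, None]           # final normalization

# Sanity check against a naive reference implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
S = Q @ K.T / np.sqrt(64)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = ref / ref.sum(axis=1, keepdims=True) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```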

2. IO-Awareness and Memory Complexity

The principal innovation in FlashAttention is explicit IO-aware algorithm design. Analysis demonstrates that with input tiling and on-chip SRAM of capacity $M$, the number of HBM accesses is reduced from $O(N^2)$ for conventional attention to $O(N^2 d^2 / M)$ (with $d$ the head dimension). This is optimal for a broad range of $M$: each block is loaded into fast memory once and reused maximally, and only the output and minimal softmax statistics are written out per kernel pass.

IO, rather than FLOPs, is the dominant cost of attention on modern GPUs. Traditional methods, including most “approximate” attention algorithms, often reduce the complexity of individual matrix operations but remain bottlenecked by memory traffic. By treating reads and writes as a first-class design objective and balancing shared-memory utilization, FlashAttention consistently reduces wall-clock runtime even though all computations remain exact.
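
As a rough illustration of these bounds, the back-of-envelope sketch below plugs one hypothetical configuration into the two access counts; the values of N, d, and M are arbitrary, and constants and bytes-per-element are ignored, so only the ratio is meaningful.

```python
# Back-of-envelope HBM-access comparison (element counts; constants omitted,
# so treat the absolute numbers as order-of-magnitude estimates only).
N = 4096          # sequence length (illustrative)
d = 64            # head dimension (illustrative)
M = 128 * 1024    # on-chip SRAM capacity in elements (illustrative)

standard_io = N * d + N * N        # reading Q/K/V plus materializing S and P
flash_io = N * N * d * d // M      # tiled, IO-aware access bound

print(f"standard ~{standard_io:.2e} accesses, FlashAttention ~{flash_io:.2e}")
# For this configuration the tiled kernel needs roughly 30x fewer HBM accesses.
```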

3. Performance Metrics and Empirical Improvements

In head-to-head empirical evaluations, FlashAttention and IO-aware extensions demonstrate consistent speedups:

  • BERT-large (sequence length 512): 15% end-to-end training speedup over the prior MLPerf 1.1 record.
  • GPT-2 (sequence length 1K): 3× faster than HuggingFace/Megatron-LM baselines.
  • Long-Range Arena (sequence lengths 1K–4K): 2.4× speedup.
  • IO reductions of up to an order of magnitude, with corresponding gains in the backward pass even though it recomputes attention scores block-wise and therefore incurs additional FLOPs.

The reduced memory footprint from not materializing large matrices enables longer contexts, which improves downstream quality (0.7 better perplexity on GPT-2, +6.4 points on long-document classification) and unlocks entirely new benchmark regimes, e.g., the Path-X and Path-256 long-context challenges at 16K and 64K sequence lengths.

4. Block-Sparse and Exact/Approximate Extensions

Beyond the standard dense formulation, FlashAttention kernels extend to block-sparse attention. When a block-sparsity mask is supplied, the kernel skips the fixed zero blocks, reducing IO further in proportion to the sparsity factor $s$ (Dao et al., 2022). The IO for block-sparse FlashAttention is $O(Nd + N^2 d^2 s / M)$, with $s$ the nonzero proportion of the attention map.

Appropriate design of the block mask (e.g., using butterfly or regular sparse patterns) allows both high speedup and quality preservation. For instance, block-sparse FlashAttention can outperform approximate methods at equivalent memory footprints, and is competitive with, or faster than, leading approaches when scaling to hundreds of thousands of tokens.
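
The sketch below shows, under the same illustrative conventions as the earlier NumPy example, how a block mask enters the tiled loop: masked-out key/value blocks are never loaded or scored, which is precisely where the sparsity factor $s$ in the IO bound comes from. The mask layout, function name, and block size are hypothetical simplifications, not the interface of any particular library.

```python
import numpy as np

def block_sparse_tiled_attention(Q, K, V, block_mask, block_size=128):
    """Tiled attention that skips (query-block, key-block) pairs whose entry in
    `block_mask` is False: skipped blocks are never loaded and never scored.
    Assumes every query block keeps at least one key block. Sketch only."""
    N, d = Q.shape
    out = np.empty_like(Q)
    for qb, i in enumerate(range(0, N, block_size)):
        Qi = Q[i:i + block_size]
        O = np.zeros_like(Qi)
        m = np.full(len(Qi), -np.inf)
        l = np.zeros(len(Qi))
        for kb, j in enumerate(range(0, N, block_size)):
            if not block_mask[qb, kb]:
                continue                          # zero block: no IO, no FLOPs
            Kj, Vj = K[j:j + block_size], V[j:j + block_size]
            S = Qi @ Kj.T / np.sqrt(d)
            m_curr = S.max(axis=1)
            P = np.exp(S - m_curr[:, None])
            m_new = np.maximum(m, m_curr)
            alpha, beta = np.exp(m - m_new), np.exp(m_curr - m_new)
            O = alpha[:, None] * O + beta[:, None] * (P @ Vj)
            l = alpha * l + beta * P.sum(axis=1)
            m = m_new
        out[i:i + block_size] = O / l[:, None]
    return out

# Example: a block-lower-triangular mask (causal at block granularity)
# over 4 blocks of 128 tokens each.
mask = np.tril(np.ones((4, 4), dtype=bool))
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
out = block_sparse_tiled_attention(Q, K, V, mask)
```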

5. Implications for Transformer Training and Model Design

The FlashAttention kernel and its extensions have several direct implications:

  • Linear Memory Scaling: Avoidance of $N \times N$ intermediates allows practical batch sizes and long sequence lengths on fixed GPU memory.
  • Training Efficiency: Wall-clock acceleration (e.g., BERT-large 15%, GPT-2 up to 3x) enables more rapid iteration and reduces cost in large-model development.
  • Quality and Capacity: The ability to train on substantially longer contexts improves perplexity, classification accuracy, and generalization in long-sequence domains.
  • Enabling New Capabilities: FlashAttention is foundational to the first Transformers demonstrating better-than-chance performance on challenging tasks requiring long-range context integration, e.g., Path-X (16K) and Path-256 (64K) (Dao et al., 2022).

6. Contextualization within the Broader Efficient Attention Landscape

While FlashAttention focuses on exact softmax attention via IO-aware, tiled kernels, subsequent developments have built on these principles:

  • Kernel-fused and asynchrony-optimized variants (FlashAttention-2, FlashAttention-3) further overlap data transfer and computation for specific GPU microarchitectures.
  • Hardware implementations (e.g., with fused exponential-multiply units, hardware-friendly simplifications) optimize the basic kernel for ASIC and FPGA accelerators.
  • Compiler-driven and LLM-generated approaches (e.g., FlexAttention, QiMeng-Attention) abstract away hardware-specific implementation, generalizing the core FlashAttention principles to a wider set of architectures and variants.
  • Extensions for quantized and low-precision computation (FP8, INT8), as well as generalizations to mask-rich and block-sparse paradigms (FlashMask, Flash Sparse Attention), demonstrate the flexibility and long-term viability of the IO-aware, block-aggregated design.

7. Summary Table: Core Characteristics of Flash-Attention Kernels

| Feature | Standard Attention | FlashAttention Kernel | Block-Sparse FlashAttention |
| --- | --- | --- | --- |
| Memory IO | $O(Nd + N^2)$ | $O(N^2 d^2 / M)$ | $O(Nd + N^2 d^2 s / M)$ |
| Intermediate $S$ | Materialized in HBM | Never materialized | Never materialized; masked blocks skipped |
| Mask support | Dense, basic | Dense, basic / block-sparse | Block-sparse masks |
| Precision | FP32/BF16 | FP32/BF16, extended to FP8/INT8 | As for FlashAttention |
| Empirical wall-clock speedup | Baseline | 1.1–3×+ | Up to >2× on some benchmarks |

8. Future Directions

The framework established by FlashAttention kernels catalyzes a broad ongoing research agenda: highly efficient tiling and aggregation schemes for a diversity of attention types, compatibility with quantized and low-precision memory layouts, and algorithm–hardware co-design for scalable Transformers on both existing and future accelerator architectures. The core algebraic and IO-aware principles established in (Dao et al., 2022) now undergird much of the broader ecosystem of GPU-optimized attention operators in modern LLM systems.

References

  • Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems 35 (NeurIPS 2022).