Flash-Attention Kernels in Transformer Models

Updated 2 September 2025
  • Flash-Attention kernels are GPU-optimized algorithms that tile and fuse attention computations to minimize memory IO and accelerate softmax-based self-attention in Transformers.
  • They integrate score computation, softmax normalization, and value aggregation into on-chip operations, achieving optimal IO complexity and noticeable speedups.
  • Extensions such as block-sparse attention and quantized implementations enable longer context processing and broader hardware compatibility while maintaining accuracy.

Flash-Attention kernels are a family of GPU-optimized attention algorithms and implementations designed to accelerate the computation of softmax-based self-attention in Transformer models. The canonical FlashAttention kernel achieves this by reorganizing and tiling the attention computation such that the memory and computational costs per GPU pass are minimized, particularly with respect to IO between high-bandwidth memory (HBM) and on-chip scratchpad SRAM. By fusing the computation of attention scores, softmax normalization, and value-weighted sums into hardware-aware blocks, FlashAttention delivers both theoretically optimal IO complexity for a wide range of practical SRAM sizes and significant empirical speedups, all without sacrificing the exactness of softmax attention. Extensions to the kernel framework provide further speed and memory improvements for block-sparse attention, quantized data types, and new hardware architectures.

1. Algorithmic Foundations and Tiled Attention Computation

Traditional (dense) softmax-based self-attention for a sequence of length $N$ and head dimension $d$ computes an $N \times N$ matrix of attention scores $S = QK^\top$, reduces each row via softmax (numerically stabilized as $P_{i,:} = \exp(S_{i,:} - m_i)/\ell_i$, with $m_i = \max_j S_{i,j}$ and $\ell_i = \sum_j \exp(S_{i,j} - m_i)$), and computes the output $O = PV$. This construction incurs $O(N^2)$ memory and compute, particularly when $S$ must be explicitly materialized in large matrices stored in GPU HBM.

FlashAttention (Dao et al., 2022) eliminates the need to materialize $S$ (and the normalized matrix $P$) in global memory by introducing tiling: $Q$, $K$, and $V$ are partitioned into submatrices, and the attention computation is performed block-by-block using only on-chip SRAM. This is feasible due to the decomposability of the softmax, which allows incremental and numerically stable aggregation of partial softmax normalizers and outputs:

  • Each query block $Q_i$ is matched in turn with key and value blocks $K_j$, $V_j$, with the partial scores $S_{ij} = Q_i K_j^\top$ computed and normalized.
  • Running softmax statistics (block-wise row maxima $m_i$ and sums $\ell_i$) are maintained and incrementally updated using the algebra of the exponential and max operators:

$$m_{\text{total}} = \max(m_{\text{prev}}, m_{\text{curr}}),$$

$$\ell_{\text{total}} = \exp(m_{\text{prev}} - m_{\text{total}})\,\ell_{\text{prev}} + \exp(m_{\text{curr}} - m_{\text{total}})\,\ell_{\text{curr}}.$$

  • Final outputs are updated via an aggregation step, fusing computations across blocks without intermediate writes of $S$ or $P$ to slow memory.

This approach is formalized in the algorithmic pseudocode and block algebra of (Dao et al., 2022), where the explicit goal is to never materialize the $N \times N$ attention matrix in HBM, instead updating the output via local, SRAM-resident computation.
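
To make the block algebra concrete, the following is a minimal NumPy sketch of the online-softmax recurrence described above. It is illustrative only: the function name and block size are arbitrary choices of this sketch, and only $K$ and $V$ are tiled (a real FlashAttention kernel also tiles $Q$ and keeps every block in on-chip SRAM), yet it reproduces exact softmax attention.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=128):
    """Exact softmax attention computed block-by-block over K/V, mirroring the
    running-statistics update above. Pure NumPy sketch for clarity only."""
    N, d = Q.shape
    O = np.zeros_like(Q)            # running (unnormalized) output
    m = np.full(N, -np.inf)         # running row maxima
    l = np.zeros(N)                 # running softmax denominators

    for j in range(0, N, block_size):
        Kj, Vj = K[j:j + block_size], V[j:j + block_size]
        S = Q @ Kj.T / np.sqrt(d)                 # partial scores for this block
        m_curr = S.max(axis=1)                    # block-wise row maxima
        P = np.exp(S - m_curr[:, None])           # stabilized partial exponentials
        m_new = np.maximum(m, m_curr)             # merge running statistics
        alpha, beta = np.exp(m - m_new), np.exp(m_curr - m_new)
        O = alpha[:, None] * O + beta[:, None] * (P @ Vj)
        l = alpha * l + beta * P.sum(axis=1)
        m = m_new

    return O / l[:, None]           # final normalization

# Sanity check against a naive reference implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
S = Q @ K.T / np.sqrt(64)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = ref / ref.sum(axis=1, keepdims=True) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```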

2. IO-Awareness and Memory Complexity

The principal innovation in FlashAttention is explicit IO-aware algorithm design. Analysis demonstrates that with input tiling and on-chip SRAM of capacity $M$, the number of HBM accesses is reduced from $O(N^2)$ for conventional attention to $O(N^2 d^2 / M)$ (with $d$ the head dimension). This is optimal for a broad range of $M$: each block is loaded into fast memory once and reused maximally, and only the output and minimal softmax statistics are written out per kernel pass.

IO, rather than FLOPs, is the dominant cost of attention on modern GPUs. Traditional methods, including most “approximate” attention algorithms, often reduce the complexity of individual matrix operations but remain bottlenecked by memory traffic. By treating reads and writes as a first-class design objective and balancing shared-memory utilization, FlashAttention consistently reduces wall-clock runtime even though all computations remain exact.
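
As a rough illustration of these bounds, the back-of-envelope sketch below plugs one hypothetical configuration into the two access counts; the values of N, d, and M are arbitrary, and constants and bytes-per-element are ignored, so only the ratio is meaningful.

```python
# Back-of-envelope HBM-access comparison (element counts; constants omitted,
# so treat the absolute numbers as order-of-magnitude estimates only).
N = 4096          # sequence length (illustrative)
d = 64            # head dimension (illustrative)
M = 128 * 1024    # on-chip SRAM capacity in elements (illustrative)

standard_io = N * d + N * N        # reading Q/K/V plus materializing S and P
flash_io = N * N * d * d // M      # tiled, IO-aware access bound

print(f"standard ~{standard_io:.2e} accesses, FlashAttention ~{flash_io:.2e}")
# For this configuration the tiled kernel needs roughly 30x fewer HBM accesses.
```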

3. Performance Metrics and Empirical Improvements

In head-to-head empirical evaluations, FlashAttention and IO-aware extensions demonstrate consistent speedups:

  • BERT-large (sequence length 512): 15% end-to-end training speedup over the prior MLPerf 1.1 record.
  • GPT-2 (sequence length 1K): 3× faster than HuggingFace/Megatron-LM baselines.
  • Long-Range Arena (sequence lengths 1K–4K): 2.4× speedup.
  • IO reductions of up to an order of magnitude, with corresponding gains in the backward pass even though it recomputes attention scores block-wise and therefore incurs additional FLOPs.

The reduced memory footprint from not materializing large matrices enables longer contexts, which improves downstream quality (0.7 better perplexity on GPT-2, +6.4 points on long-document classification) and unlocks entirely new benchmark regimes, e.g., the Path-X and Path-256 long-context challenges at 16K and 64K sequence lengths.

4. Block-Sparse and Exact/Approximate Extensions

Beyond the standard dense formulation, FlashAttention kernels extend to block-sparse attention. When a block-sparsity mask is supplied, the kernel skips the fixed zero blocks, reducing IO further in proportion to the sparsity factor $s$ (Dao et al., 2022). The IO for block-sparse FlashAttention is $O(Nd + N^2 d^2 s / M)$, with $s$ the nonzero proportion of the attention map.

Appropriate design of the block mask (e.g., using butterfly or regular sparse patterns) allows both high speedup and quality preservation. For instance, block-sparse FlashAttention can outperform approximate methods at equivalent memory footprints, and is competitive with, or faster than, leading approaches when scaling to hundreds of thousands of tokens.
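
The sketch below shows, under the same illustrative conventions as the earlier NumPy example, how a block mask enters the tiled loop: masked-out key/value blocks are never loaded or scored, which is precisely where the sparsity factor $s$ in the IO bound comes from. The mask layout, function name, and block size are hypothetical simplifications, not the interface of any particular library.

```python
import numpy as np

def block_sparse_tiled_attention(Q, K, V, block_mask, block_size=128):
    """Tiled attention that skips (query-block, key-block) pairs whose entry in
    `block_mask` is False: skipped blocks are never loaded and never scored.
    Assumes every query block keeps at least one key block. Sketch only."""
    N, d = Q.shape
    out = np.empty_like(Q)
    for qb, i in enumerate(range(0, N, block_size)):
        Qi = Q[i:i + block_size]
        O = np.zeros_like(Qi)
        m = np.full(len(Qi), -np.inf)
        l = np.zeros(len(Qi))
        for kb, j in enumerate(range(0, N, block_size)):
            if not block_mask[qb, kb]:
                continue                          # zero block: no IO, no FLOPs
            Kj, Vj = K[j:j + block_size], V[j:j + block_size]
            S = Qi @ Kj.T / np.sqrt(d)
            m_curr = S.max(axis=1)
            P = np.exp(S - m_curr[:, None])
            m_new = np.maximum(m, m_curr)
            alpha, beta = np.exp(m - m_new), np.exp(m_curr - m_new)
            O = alpha[:, None] * O + beta[:, None] * (P @ Vj)
            l = alpha * l + beta * P.sum(axis=1)
            m = m_new
        out[i:i + block_size] = O / l[:, None]
    return out

# Example: a block-lower-triangular mask (causal at block granularity)
# over 4 blocks of 128 tokens each.
mask = np.tril(np.ones((4, 4), dtype=bool))
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
out = block_sparse_tiled_attention(Q, K, V, mask)
```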

5. Implications for Transformer Training and Model Design

The FlashAttention kernel and its extensions have several direct implications:

  • Linear Memory Scaling: Avoidance of $N \times N$ intermediates allows practical batch sizes and long sequence lengths on fixed GPU memory.
  • Training Efficiency: Wall-clock acceleration (e.g., BERT-large 15%, GPT-2 up to 3x) enables more rapid iteration and reduces cost in large-model development.
  • Quality and Capacity: The ability to train on substantially longer contexts improves perplexity, classification accuracy, and generalization in long-sequence domains.
  • Enabling New Capabilities: FlashAttention is foundational to the first Transformers demonstrating better-than-chance performance on challenging tasks requiring long-range context integration, e.g., Path-X (16K) and Path-256 (64K) (Dao et al., 2022).

6. Contextualization within the Broader Efficient Attention Landscape

While FlashAttention focuses on exact softmax attention via IO-aware, tiled kernels, subsequent developments have built on these principles:

  • Kernel-fused and asynchrony-optimized variants (FlashAttention-2, FlashAttention-3) further overlap data transfer and computation for specific GPU microarchitectures.
  • Hardware implementations (e.g., with fused exponential-multiply units, hardware-friendly simplifications) optimize the basic kernel for ASIC and FPGA accelerators.
  • Compiler-driven and LLM-generated approaches (e.g., FlexAttention, QiMeng-Attention) abstract away hardware-specific implementation, generalizing the core FlashAttention principles to a wider set of architectures and variants.
  • Extensions for quantized and low-precision computation (FP8, INT8), as well as generalizations to mask-rich and block-sparse paradigms (FlashMask, Flash Sparse Attention), demonstrate the flexibility and long-term viability of the IO-aware, block-aggregated design.

7. Summary Table: Core Characteristics of Flash-Attention Kernels

| Feature | Standard Attention | FlashAttention Kernel | Block-Sparse FlashAttention |
| --- | --- | --- | --- |
| Memory IO | $O(Nd + N^2)$ | $O(N^2 d^2 / M)$ | $O(Nd + N^2 d^2 s / M)$ |
| Intermediate $S$ | Materialized in HBM | Never materialized | Never materialized; masked blocks skipped |
| Mask support | Dense, basic | Dense, basic / block-sparse | Block-sparse masks |
| Precision | FP32/BF16 | FP32/BF16, extended to FP8/INT8 | As for FlashAttention |
| Empirical wall-clock speedup | Baseline | 1.1–3×+ | Up to >2× on some benchmarks |

8. Future Directions

The framework established by FlashAttention kernels catalyzes a broad ongoing research agenda: highly efficient tiling and aggregation schemes for a diversity of attention types, compatibility with quantized and low-precision memory layouts, and algorithm–hardware co-design for scalable Transformers on both existing and future accelerator architectures. The core algebraic and IO-aware principles established in (Dao et al., 2022) now undergird much of the broader ecosystem of GPU-optimized attention operators in modern LLM systems.

References

  • Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems 35 (NeurIPS 2022).