FlashAttention: Memory-Efficient Attention

Updated 29 November 2025
  • FlashAttention is a memory-hierarchy-aware algorithm that optimizes Transformer self-attention by fusing compute and memory-access improvements.
  • It employs tiling of Q, K, and V into blocks that fit in on-chip SRAM, minimizing I/O between fast cache and slower high-bandwidth memory.
  • It delivers robust efficiency across GPUs, ASICs, and vector processors, with up to 3–4× speedups, while preserving the numerical stability of exact softmax attention.

The FlashAttention algorithm is a memory-hierarchy-aware implementation of exact softmax attention for Transformer architectures. Its fundamental contribution is the fusion of compute and memory-access optimizations: tiling the computation through the GPU's fast but small on-chip memory (SRAM) to minimize I/O against the large but comparatively slow high-bandwidth memory (HBM). FlashAttention is information-theoretically optimal with respect to I/O complexity for a wide range of accelerator architectures and demonstrates robust efficiency across GPU generations, hardware ASICs, and custom vector processors (Saha et al., 12 Feb 2024).

1. Self-Attention Formulation and I/O Bottleneck

Self-attention in Transformers computes, for a sequence of length $N$ and head dimension $d$,

$$O = \mathrm{softmax}(QK^\top)\,V = D^{-1}\exp(QK^\top)\,V, \qquad D = \mathrm{diag}\bigl(\exp(QK^\top)\mathbf{1}\bigr),$$

with $Q, K, V \in \mathbb{R}^{N \times d}$. A naïve algorithm must construct the full $N \times N$ attention matrix, leading to $\Theta(N^2 d)$ compute complexity and, crucially, $\Theta(N^2)$ memory traffic as intermediate results must be read from and written to HBM. These memory transfers are the dominant contributor to wall-clock time on modern accelerators.
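
For concreteness, here is a minimal NumPy sketch of the naïve computation (function and variable names are illustrative, not from any cited codebase); the explicitly materialized $N \times N$ matrix `P` is exactly the intermediate whose HBM traffic FlashAttention eliminates:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference softmax attention that materializes the full N x N score matrix."""
    S = Q @ K.T                              # (N, N) scores: Theta(N^2) memory
    S = S - S.max(axis=1, keepdims=True)     # subtract row max for stability
    P = np.exp(S)
    P = P / P.sum(axis=1, keepdims=True)     # row-wise softmax
    return P @ V                             # (N, d) output
```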

2. FlashAttention Tiling and Streaming Strategy

FlashAttention sidesteps the quadratic I/O bottleneck by tiling $Q$, $K$, and $V$ into blocks of size $B$ that fit into on-chip SRAM of total size $M$. For each block:

  • A $B \times d$ tile of $Q$ and a $d \times B$ tile of $K^\top$ are loaded into SRAM.
  • Their product forms an in-cache $B \times B$ slice of the attention matrix; the exponential and row-sum updates (i.e., the online softmax) are performed immediately, so no intermediate $N \times N$ matrix is ever written to slow memory.
  • The same tiling is applied in the $\exp(QK^\top)V$ multiply.

Each element of $Q$, $K$, and $V$, as well as each partial product, is streamed exactly once between HBM and SRAM. This structure fully leverages cache locality and minimizes redundant memory access.
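
The blocked loop below is a minimal NumPy sketch of this streaming scheme, written for a single head with a tile size `B` standing in for whatever would actually fit in SRAM; it reproduces exact attention while only ever holding $B \times B$ score blocks:

```python
import numpy as np

def tiled_attention(Q, K, V, B=64):
    """Blocked attention in the spirit of FlashAttention: the N x N score matrix
    is never formed; per-row running max m and normalizer l are rescaled as
    new key/value tiles are streamed in."""
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)              # running row-wise maximum
    l = np.zeros(N)                      # running row-wise normalizer
    for j in range(0, N, B):             # stream K/V tiles ("HBM -> SRAM")
        Kj, Vj = K[j:j + B], V[j:j + B]
        for i in range(0, N, B):         # for each Q tile
            S = Q[i:i + B] @ Kj.T                        # B x B score block
            m_new = np.maximum(m[i:i + B], S.max(axis=1))
            P = np.exp(S - m_new[:, None])
            scale = np.exp(m[i:i + B] - m_new)           # rescale old statistics
            l[i:i + B] = l[i:i + B] * scale + P.sum(axis=1)
            O[i:i + B] = O[i:i + B] * scale[:, None] + P @ Vj
            m[i:i + B] = m_new
    return O / l[:, None]                # defer normalization to the end
```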

3. Information-Theoretic I/O Complexity

FlashAttention attains the following I/O complexity upper bound for attention computation:

$$Q_{\mathrm{IO}}(N,d,M) = O\!\left(\frac{N^2 d^2}{M}\right),$$

where $N$ is the sequence length, $d$ the head dimension, and $M$ the cache size. The derivation selects tile size $B \approx \sqrt{M}$, partitions the attention matrix into $(N/B)^2$ blocks, and computes the total I/O via the number of required loads and stores per tile.

A matching lower bound is established:

$$Q_{\mathrm{IO}}(N,d,M) = \Omega\!\left(\frac{N^2 d^2}{M}\right), \qquad M \geq d^2,$$

by reduction to a matrix-compression communication-complexity argument: any algorithm restricted to a two-level memory with SRAM of size $M$ cannot produce the full set of $N^2$ dot products in fewer I/O operations, due to strict information limits. The lower bound holds even if fast matrix-multiplication algorithms (Strassen, Coppersmith–Winograd) are used; arithmetic speed gains do not translate into a reduced I/O bottleneck.

For $M < d^2$, the optimal regime is

$$Q_{\mathrm{IO}}(N,d,M) = \Theta\!\left(\frac{N^2 d}{\sqrt{M}}\right),$$

with standard $\sqrt{M}$-tiling applied to the relevant sub-matrices.
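
As a rough, hedged illustration of what these bounds mean in practice (element counts only, all constants dropped, and an assumed on-chip capacity `M`), one can compare the naïve $\Theta(N^2)$ score-matrix traffic to the tiled bound:

```python
def io_estimates(N, d, M):
    """Order-of-magnitude HBM traffic in elements (constants omitted).

    naive : reading/writing the full N x N score matrix
    tiled : N^2 d^2 / M when M >= d^2, else N^2 d / sqrt(M)
    """
    naive = N * N
    tiled = N * N * d * d / M if M >= d * d else N * N * d / M ** 0.5
    return naive, tiled

# Example with assumed figures: N = 8192, d = 64, M = 100_000 elements of SRAM
naive, tiled = io_estimates(8192, 64, 100_000)
print(f"naive ~{naive:.1e} vs tiled ~{tiled:.1e} elements, ~{naive / tiled:.0f}x less traffic")
```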

4. Numerically Stable Online Softmax

FlashAttention implements an online softmax in tandem with the tiled computation. For a single query $\mathbf{q}$, iterating over key–value pairs $(\mathbf{k}_i, \mathbf{v}_i)$, the updates are

$$s_i = \mathbf{q} \cdot \mathbf{k}_i, \qquad m_i = \max(m_{i-1}, s_i),$$
$$\ell_i = \ell_{i-1}\, e^{m_{i-1} - m_i} + e^{s_i - m_i},$$
$$\mathbf{o}_i = \mathbf{o}_{i-1}\, e^{m_{i-1} - m_i} + \mathbf{v}_i\, e^{s_i - m_i},$$

with final normalization $\mathrm{Attn}(\mathbf{q},K,V) = \mathbf{o}_N / \ell_N$. This “max-plus-logsumexp” mechanism maintains strict numerical stability, preventing overflow and underflow, which is vital for vectorized implementations and hardware compatibility (Alexandridis et al., 20 May 2025).
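
A direct per-query transcription of this recurrence into NumPy (illustrative only, not a kernel) looks as follows; on the first step $e^{m_0 - m_1}$ evaluates to zero since $m_0 = -\infty$:

```python
import numpy as np

def online_softmax_attention(q, K, V):
    """Single-query attention via the streaming softmax recurrence above."""
    m = -np.inf                          # running maximum m_i
    l = 0.0                              # running normalizer ell_i
    o = np.zeros(V.shape[1])             # running unnormalized output o_i
    for k_i, v_i in zip(K, V):
        s_i = float(q @ k_i)
        m_new = max(m, s_i)
        scale = np.exp(m - m_new)        # exp(-inf) == 0.0 on the first step
        l = l * scale + np.exp(s_i - m_new)
        o = o * scale + v_i * np.exp(s_i - m_new)
        m = m_new
    return o / l                         # Attn(q, K, V) = o_N / ell_N
```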

5. Hardware Implementation and Architectural Variants

GPU Kernels

On the NVIDIA Hopper architecture, FlashAttention-2 has been implemented and benchmarked as a fully fused CUDA kernel built on CUTLASS primitives, with the online softmax fused into the GEMM operations, the Tensor Memory Accelerator (TMA) used for asynchronous data loading, and warpgroup matrix-multiply-accumulate (WGMMA) scheduling. Empirically, this yields 20–50% higher FLOPs/s than prior Ampere-optimized kernels, scaling efficiently up to 700 TFLOP/s for large sequence lengths (Bikshandi et al., 2023).

Dedicated ASIC Accelerators

FlashAttention-inspired hardware has introduced ExpMul fused operators, which merge the exponential and the vector multiplication into a single pipeline. Such designs, synthesized at 28 nm, achieve average area reductions of 28.8% and power savings of 17.6% relative to prior separate exponential-and-multiply architectures, with no measurable degradation in inference accuracy (Alexandridis et al., 20 May 2025). Alternative designs (H-FA, FLASH-D) employ log-domain arithmetic and hidden sigmoid-based normalization to further reduce hardware cost and avoid explicit softmax division (Alexandridis et al., 31 Oct 2025, Alexandridis et al., 20 May 2025).

Systolic Arrays

SystolicAttention maps the entire fused attention loop (both matmuls and the softmax) onto the systolic-array fabric, augmenting the array with upward data paths and split PE units for piecewise-linear exponential approximation. This design sustains 1.77×–4.83× higher realized attention FLOPs/s than state-of-the-art NeuronCore-v2/TPUv5e accelerators, with only a 10% area overhead (Lin et al., 15 Jul 2025).

Vector Processors

FlashAttention-2 vectorization for RISC-V leverages all-software exponential approximations and tile layouts optimized for cache locality and register-file width. Experiments show speedups of up to 31× over a scalar reference for practical LLM inference, without ISA extensions and while preserving the numerical exactness of the core algorithm (Titopoulos et al., 8 Oct 2025).
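
The routine below sketches the general flavor of such a software exponential: range reduction to $2^k \cdot 2^f$ plus a short polynomial for $2^f$. The coefficients here are a plain truncated Taylor expansion (worst-case relative error around 0.6%); the cited RISC-V port may use different, tuned approximations:

```python
import numpy as np

def exp_approx(x):
    """Approximate exp(x) via exp(x) = 2^k * 2^f with k integer, f in [0, 1)."""
    x = np.asarray(x, dtype=np.float64)
    z = x * 1.4426950408889634               # x / ln(2)
    k = np.floor(z)
    f = z - k                                # fractional part in [0, 1)
    # Truncated Taylor series of 2^f = exp(f * ln 2); real kernels typically
    # use tuned minimax coefficients for tighter error bounds.
    p = 1.0 + f * (0.6931472 + f * (0.2402265 + f * 0.0555041))
    return np.ldexp(p, k.astype(np.int32))   # p * 2^k

# Example: maximum relative error over a typical softmax argument range
xs = np.linspace(-20.0, 0.0, 10001)
print(np.max(np.abs(exp_approx(xs) - np.exp(xs)) / np.exp(xs)))
```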

6. Extensions and Algorithmic Flexibility

Sparse FlashAttention generalizes the tiling and streaming paradigm to dynamic sparse attention patterns, including key/query dropping, bucketed attention, and hash-based masking. These variants retain the quadratic worst-case arithmetic cost but reduce real runtime and memory traffic in proportion to the degree of block skipping, yielding 1.8×–3.3× speedups in practical training at no loss in perplexity or model accuracy (Pagliardini et al., 2023).
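
A hedged sketch of how block skipping composes with the tiled loop from Section 2 (the `block_mask` layout and names are assumptions for illustration, not the interface of the cited kernels):

```python
import numpy as np

def block_sparse_attention(Q, K, V, block_mask, B=64):
    """Tiled attention that skips (query-block, key-block) pairs masked out in
    block_mask; skipped tiles incur neither FLOPs nor memory traffic."""
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)
    l = np.zeros(N)
    for jb, j in enumerate(range(0, N, B)):
        Kj, Vj = K[j:j + B], V[j:j + B]
        for ib, i in enumerate(range(0, N, B)):
            if not block_mask[ib, jb]:
                continue                                 # skip this tile pair
            S = Q[i:i + B] @ Kj.T
            m_new = np.maximum(m[i:i + B], S.max(axis=1))
            P = np.exp(S - m_new[:, None])
            scale = np.exp(m[i:i + B] - m_new)
            l[i:i + B] = l[i:i + B] * scale + P.sum(axis=1)
            O[i:i + B] = O[i:i + B] * scale[:, None] + P @ Vj
            m[i:i + B] = m_new
    return O / np.where(l > 0, l, 1.0)[:, None]          # guard fully-masked rows
```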

FlashAttention has further evolved in FlashAttention-2 and FlashAttention-3:

  • FlashAttention-2 improves parallelism and intra-block work partitioning to reach 50–73% of theoretical FLOP utilization on modern GPUs and up to 225 TFLOP/s in end-to-end GPT-style model training (Dao, 2023).
  • FlashAttention-3 exploits asynchrony between the tensor cores and the memory engines, interleaves block-wise softmax and GEMM operations, and incorporates block-wise FP8 quantization, driving attention throughput on H100 GPUs to 1.2 PFLOP/s and reducing FP8 numerical error by 2.6× (Shah et al., 11 Jul 2024).

7. Practical Implications and Integration

Realization of near-optimal I/O and high utilization requires:

  • Sufficient SRAM cache ($M \geq d^2$) per attention head to enable the $O(N^2 d^2 / M)$ regime.
  • Careful selection of tile sizes ($B \approx \sqrt{M}$) matched to hardware register capacity and memory bandwidth (a sizing heuristic is sketched after this list).
  • Fusion of softmax and matmul operations within each block so that intermediate $N \times N$ slices never touch slow memory.
  • In hardware, exp-mul fusion operators, systolic-array enhancements, and vectorized software primitives that simplify critical-path logic, reduce area and power, and enable scalable multi-query parallelism (Saha et al., 12 Feb 2024, Alexandridis et al., 20 May 2025, Lin et al., 15 Jul 2025, Titopoulos et al., 8 Oct 2025).
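
As a small worked example of the tile-size bullet above, the heuristic below picks a power-of-two $B$ under an assumed SRAM budget; the byte figures and the number of resident tiles are illustrative assumptions, not vendor specifications:

```python
def pick_tile_size(sram_bytes, d, dtype_bytes=2, resident_tiles=4):
    """Largest power-of-two B such that ~resident_tiles tiles of shape B x d
    (Q, K, V, O blocks) fit in the assumed on-chip budget, capped near sqrt(M)."""
    M = sram_bytes // dtype_bytes                       # capacity in elements
    B = 16
    while resident_tiles * (2 * B) * d <= M and 2 * B <= int(M ** 0.5):
        B *= 2
    return B

# Example: 192 KiB of shared memory per SM (assumed), fp16 tiles, head dim 128
print(pick_tile_size(192 * 1024, d=128))                # -> 128
```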

FlashAttention has become foundational for scaling long-context Transformers, enabling empirical speedups of up to 3–4×, memory footprints that grow linearly rather than quadratically with sequence length, and direct acceleration on new memory-centric hardware and vector processors.


Key Papers

| Variant | Architecture | Performance/Complexity |
|---|---|---|
| FlashAttention | GPU (SRAM+HBM) | $O(N^2 d^2/M)$ I/O, provably optimal |
| FlashAttention-2 | GPU (CUTLASS/Hopper) | 50–73% of GEMM FLOPs, 225 TFLOP/s (Dao, 2023; Bikshandi et al., 2023) |
| ExpMul / FLASH-D / H-FA | ASIC/FPGA, log-domain | 22–29% area and 18–23% power savings (Alexandridis et al., 20 May 2025; Alexandridis et al., 31 Oct 2025) |
| SystolicAttention | Custom systolic array | 1.77×–4.83× higher utilization (Lin et al., 15 Jul 2025) |
| Sparse FlashAttention | GPU + block masking | 1.8–3.3× speedup, exact (Pagliardini et al., 2023) |
| RISC-V vectorized | Vector processor | 31× over scalar, fully exact (Titopoulos et al., 8 Oct 2025) |

All quantitative claims and technical workflow steps are sourced directly from the referenced publications.
