
FlashAttention: Efficient Tiled Self-Attention

Updated 23 January 2026
  • FlashAttention is an IO-optimal, tiled attention method that reorganizes self-attention to reduce memory traffic by streaming data in optimized blocks.
  • It accelerates Transformer training and inference on GPUs by fusing kernels and eliminating the quadratic memory usage typical in standard attention implementations.
  • Advanced variants and hardware optimizations, such as FlashAttention-2, FP8 support, and systolic array techniques, further enhance its performance and adaptability.

FlashAttention (FA) is a class of IO-optimal, tiled, and fusion-based attention algorithms and kernels, designed to compute exact softmax self-attention on modern hardware with dramatically reduced memory traffic and much higher throughput compared to conventional quadratic-memory attention implementations. By avoiding N×N buffer materialization and efficiently exploiting the fast on-chip memory hierarchy of GPUs and dedicated accelerators, these kernels deliver substantial speedups in Transformer training and inference, especially for long-context and large-model workloads. FlashAttention forms the computational backbone of widely used Transformer libraries and hardware designs and has inspired numerous extensions in quantization, hardware specialization, token pruning, bias handling, and compiler frameworks.

1. Principles and Computational Structure

FlashAttention reorganizes the standard self-attention operation

$O = \mathrm{softmax}(QK^\top)V, \qquad Q, K, V \in \mathbb{R}^{N \times d}$

by introducing algorithmic tiling and kernel fusion strategies to minimize expensive off-chip memory (HBM) accesses. Instead of materializing the full $N \times N$ score, mask, and softmax matrices, FlashAttention streams Q, K, and V in blocks that fit within on-chip SRAM or GPU shared memory. The inner kernel maintains only per-block statistics: rowwise running maxima $m_i$ and normalization terms $\ell_i$, performing the softmax normalization in an online, numerically stable fashion that enables accumulation and merging across blocks.
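The key property of these statistics is that two partial softmax states can be merged into the state a single global pass would have produced. A minimal NumPy sketch of this merge (an illustration of the bookkeeping, not the GPU kernel):

```python
import numpy as np

def partial_stats(scores):
    """Softmax statistics for one block: running max m and normalizer ell."""
    m = scores.max()
    ell = np.exp(scores - m).sum()
    return m, ell

def merge(m1, l1, m2, l2):
    """Merge two partial states; rescaling by e^(m_old - m_new) keeps it stable."""
    m = max(m1, m2)
    ell = l1 * np.exp(m1 - m) + l2 * np.exp(m2 - m)
    return m, ell

scores = np.array([1.0, 3.0, -2.0, 0.5, 4.0, 2.0])
m_a, l_a = partial_stats(scores[:3])   # statistics of the first block
m_b, l_b = partial_stats(scores[3:])   # statistics of the second block
m, ell = merge(m_a, l_a, m_b, l_b)     # identical to one global pass
```

Because the merge is associative, blocks can be processed in any streaming order without ever holding the full score row in memory.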

The computation for each query block $Q_i$ and key/value block $K_j, V_j$ proceeds as:

  • Compute $S_{ij} = Q_i K_j^\top$ in on-chip memory.
  • Update the running max: $m_i \leftarrow \max(m_i, \mathrm{rowmax}(S_{ij}))$.
  • Accumulate the normalizer: $\ell_i \leftarrow \ell_i \, e^{m_i^{\mathrm{old}} - m_i} + \mathrm{rowsum}(e^{S_{ij} - m_i})$.
  • Update the output: $\tilde{O}_i \leftarrow e^{m_i^{\mathrm{old}} - m_i} \tilde{O}_i + e^{S_{ij} - m_i} V_j$.
  • After all key/value blocks are processed, rescale: $O_i = \tilde{O}_i / \ell_i$.

This "IO-aware" reorganization reduces memory traffic from O(N2)O(N^2) to O(Nd)O(N\,d), since each block is loaded only once, and no large intermediate matrices leave on-chip memory (Dao et al., 2022).

2. Algorithmic Families and Modern Variants

Since its introduction, FlashAttention has evolved into multiple high-performance generations and hardware-specific variants:

  • FlashAttention-2 employs improved work partitioning (parallelizing over block-rows in the forward pass and block-columns in the backward pass, adopting warp-level "split-Q" tiling, and reducing non-GEMM FLOP counts), yielding 2–4× speedups over v1 and reaching 50–73% of A100 peak FLOPs (Dao, 2023).
  • FlashAttention-3 leverages Hopper GPU features: asynchronous Tensor Cores, the TMA engine, scheduling asynchrony between memory and compute, and block-wise FP8 quantization. By overlapping GEMM and softmax phases, specializing warps, and applying incoherent block-wise quantization, it reaches 75% of H100 peak in FP16 and over 90% of peak in FP8, with minimal accuracy loss (Shah et al., 2024).
  • INT-FlashAttention extends FlashAttention to INT8 quantization, supporting fully INT8 Q, K, and V and achieving up to 72% faster inference with major memory savings on Ampere GPUs (Chen et al., 2024).
  • FLASH-D and H-FA introduce alternative kernel formulations for hardware, reducing area and power by hiding or fusing the softmax division and exponentiation steps via, respectively, a sigmoid recurrence and fixed-point logarithmic-domain arithmetic (Alexandridis et al., 20 May 2025, Alexandridis et al., 31 Oct 2025).

Additionally, the algorithm has been specialized for multiple targets, including vector-SIMD (RISC-V) processors, systolic arrays, and compiler-automated fusion frameworks (Titopoulos et al., 8 Oct 2025, Lin et al., 15 Jul 2025, You et al., 3 Nov 2025).

3. Memory Complexity and IO-Optimality

Key to FlashAttention’s efficacy is IO-optimality: minimizing reads and writes between slow HBM and fast on-chip SRAM. Comparing HBM access counts:

  • Standard attention: $\Theta(Nd + N^2)$ HBM reads and writes.
  • FlashAttention (for SRAM size $S$): $\Theta(N^2 d^2 / S)$ HBM traffic, approaching the information-theoretic lower bound for all $S \in [d, Nd]$.

For long sequences, this reduction is critical. For instance, in GPT-2 and BERT benchmarks, FlashAttention achieves up to 3× end-to-end speedup and enables context lengths up to 64K on commodity GPUs (Dao et al., 2022).
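Plugging hypothetical but representative sizes into the two bounds above illustrates the scale of the saving (constant factors are omitted, so treat the ratio as indicative only):

```python
# Hypothetical sizes: sequence length N, head dimension d, on-chip SRAM of S elements.
N, d, S = 65536, 64, 256 * 1024

standard_io = N * d + N * N        # Theta(N d + N^2) HBM accesses
flash_io = N * N * d * d // S      # Theta(N^2 d^2 / S) HBM accesses

print(f"standard: {standard_io:.2e}  flash: {flash_io:.2e}  "
      f"ratio: {standard_io / flash_io:.0f}x")
```

With these numbers the tiled schedule moves on the order of 64× fewer elements through HBM, and the advantage grows with sequence length N.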

4. Algorithmic Extensions and Token Pruning

FlashAttention’s blocking and online-merging interface enables generic extension to multiple computational motifs:

  • Block-sparse attention (e.g., for sparse mask patterns): only nonzero blocks are computed and streamed, directly reducing IO and compute. Empirically, for large models and masking patterns with sparsity $s \approx 1/\sqrt{N}$, subquadratic IO and time complexity are achievable (Dao et al., 2022).
  • FlashMask: For arbitrary attention masks, FlashMask replaces dense $N \times N$ masks with a column-wise sparse encoding (four $N$-length arrays), maintaining $O(N)$ mask memory and enabling block-level masking skips. This results in up to 3.2× kernel speedups and support for context lengths up to 544K tokens (Wang et al., 2024).
  • Pruning and compression: Representation Shift provides a model-agnostic token significance metric, allowing on-the-fly pruning integrated with the FlashAttention fused kernel, resulting in up to 5.5× speedup and negligible accuracy degradation on retrieval and QA benchmarks (Choi et al., 1 Aug 2025).
  • FlashBias: Reduces IO and preserves speed for attention layers with bias by exploiting low-rank structure in bias matrices, using factorization and input extension to avoid streaming the dense $N \times N$ bias (Wu et al., 17 May 2025).
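The block-level skipping behind block-sparse attention and FlashMask can be illustrated by extending the tiled loop to consult a per-block mask and skip blocks that are entirely masked out. This is a simplified sketch under assumed block-aligned masks, not either paper's actual encoding:

```python
import numpy as np

def block_sparse_attention(Q, K, V, block_mask, block=2):
    """Tiled attention that skips key/value blocks marked False in block_mask.
    Illustrative sketch of block-level skipping, not the real fused kernel."""
    N, d = Q.shape
    O = np.zeros((N, d))
    for bi, i0 in enumerate(range(0, N, block)):
        Qi = Q[i0:i0 + block]
        m = np.full(len(Qi), -np.inf)
        ell = np.zeros(len(Qi))
        acc = np.zeros((len(Qi), d))
        for bj, j0 in enumerate(range(0, N, block)):
            if not block_mask[bi, bj]:
                continue                       # fully masked block: no IO, no FLOPs
            S = Qi @ K[j0:j0 + block].T
            m_new = np.maximum(m, S.max(axis=1))
            scale = np.exp(m - m_new)
            P = np.exp(S - m_new[:, None])
            ell = ell * scale + P.sum(axis=1)
            acc = acc * scale[:, None] + P @ V[j0:j0 + block]
            m = m_new
        O[i0:i0 + block] = acc / ell[:, None]
    return O

# Block-causal pattern: each query block attends to key blocks at or before it.
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
block_mask = np.tril(np.ones((3, 3), dtype=bool))
out = block_sparse_attention(Q, K, V, block_mask)
```

Skipped blocks cost neither memory traffic nor FLOPs, which is why sparsity translates directly into wall-clock savings in the tiled formulation.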

5. Hardware Specialization and Precision Optimization

FlashAttention-inspired algorithms have been migrated and further optimized for dedicated hardware:

  • SystolicAttention: Fuses all FlashAttention steps within a single 2D systolic array, eliminating reliance on external vector/scalar units and achieving ~4.8× utilization gains over commercial accelerator cores (TPUv5e, NeuronCore-v2), with ≈10% area overhead (Lin et al., 15 Jul 2025).
  • FPGA/ASIC pipeline specialization: Fused operators for $e^{x}V$ (ExpMul), fixed-point log-domain arithmetic (H-FA), and sigmoid-based division hiding (FLASH-D) provide 20–29% area and power reductions versus separate floating-/fixed-point pipelines, with no accuracy or throughput loss (Alexandridis et al., 20 May 2025, Alexandridis et al., 31 Oct 2025).
  • Vector architectures: Vectorized FlashAttention achieves ~30× speedup over scalar code on RISC-V vector processors, using fast approximate exponentials implemented solely with baseline vector instructions and no ISA extensions (Titopoulos et al., 8 Oct 2025).

These results collectively demonstrate FlashAttention’s hardware amenability and the value of arithmetic simplification.
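The fast approximate exponentials mentioned above are typically bit-manipulation schemes in the spirit of Schraudolph's classic trick. A NumPy sketch of the idea (an illustration with commonly used float32 constants, not the exact scheme of the cited RISC-V kernels):

```python
import numpy as np

def fast_exp(x):
    """Schraudolph-style approximate e^x: write a linear function of x directly
    into the float32 bit pattern, so the exponent field does the 2^x work."""
    a = 2**23 / np.log(2)                      # scales x into exponent-bit units
    b = 127 * 2**23 - 486411                   # exponent bias, shifted to reduce error
    i = (a * np.asarray(x, np.float64) + b).astype(np.int32)
    return i.view(np.float32)                  # reinterpret the bits as float32

x = np.linspace(-5.0, 5.0, 101)
rel_err = np.abs(fast_exp(x) - np.exp(x)) / np.exp(x)
```

The relative error stays within a few percent over this range, while the kernel needs only a multiply, an add, and a bit reinterpretation, all expressible with baseline vector instructions.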

6. Precision Scaling and Quantization

Advanced FlashAttention variants exploit the IO and tiling design to support efficient low-precision attention:

  • FP16/FP8/INT8 support: FlashAttention-3 and INT-FlashAttention fully support quantized activations via block-wise and token-level scaling. INT-FlashAttention in particular provides symmetric, linear, per-token INT8 quantization, fully compatible with the fused tile-wise kernel, delivering 72% faster inference and up to 82% lower quantization error versus FP8 baselines (Shah et al., 2024, Chen et al., 2024).
  • Accuracy engineering: FlashAttention-3 reduces FP8 attention error by 2.6× via incoherent random rotation and block quantization, matching or exceeding standard FP16/FP8 accuracy for large-scale LLMs (Shah et al., 2024).
  • Downscaling to INT4/INT2: The same pipeline can be generalized to lower precision (block-wise INT4/INT2), providing further memory compression with modest error tradeoffs (Chen et al., 2024).
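Per-token symmetric INT8 quantization of the kind described above can be sketched as follows. This is a generic illustration of quantized score computation; scale handling inside the real fused kernel is more involved:

```python
import numpy as np

def quantize_per_token(X):
    """Symmetric linear INT8 quantization with one scale per token (row)."""
    scale = np.abs(X).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(X / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_scores(Qq, q_scale, Kq, k_scale):
    """Q K^T with INT8 inputs: exact integer GEMM, then per-token rescale."""
    S_int = Qq.astype(np.int32) @ Kq.astype(np.int32).T   # exact in INT32
    return S_int * (q_scale @ k_scale.T)                   # back to float scores

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8)).astype(np.float32)
K = rng.standard_normal((4, 8)).astype(np.float32)
Qq, qs = quantize_per_token(Q)
Kq, ks = quantize_per_token(K)
S_approx = int8_scores(Qq, qs, Kq, ks)
S_exact = Q @ K.T
```

Because the scales factor out of the integer GEMM, the expensive inner product runs entirely in INT8/INT32, and only a cheap per-tile rescale touches floating point.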

7. Compiler Automation and Ecosystem Integration

FlashAttention has influenced the design of compiler-driven frameworks for automatic kernel fusion:

  • FlashLight: A PyTorch compiler-native extension that automatically converts general attention code in Python into fused, tile-wise, FlashAttention-style kernels. It supports all variants expressible in template systems (e.g., FlexAttention) and more, delivering 5–10× speedups for data-dependent attention schemes, with no programmer kernel engineering (You et al., 3 Nov 2025).
  • Backwards compatibility and extensibility: FlashAttention is now the standard in major large-model and long-context training stacks, supports all standard masking (causal, arbitrary), is compatible with bias and quantization variants, and generalizes across language, vision, and generative models (Dao, 2023, Dao et al., 2022, Shah et al., 2024).


This body of work establishes FlashAttention as both a foundational algorithmic technique and a catalyst for subsequent advancements in efficient, scalable attention computation across software and hardware.
