
FlashAttention Optimizations

Updated 26 March 2026
  • FlashAttention Optimizations are techniques that restructure Transformer attention computations using on-chip tiling and fused operations to minimize the quadratic memory traffic of standard implementations.
  • They leverage advanced parallelism, work partitioning, and algebraic fusion to sustain high GPU occupancy and reduce latency across diverse hardware platforms.
  • These methods also extend to low-precision arithmetic, sparse and quantized attention, and specialized accelerator mappings, achieving significant speedups and energy efficiency improvements.

FlashAttention refers to a family of IO- and compute-aware exact attention algorithms for Transformers, designed to exploit the memory hierarchy of modern accelerators. Over the past few years, an extensive body of research has dissected, extended, and hardware-optimized FlashAttention, yielding algorithmic and architectural advances that approach theoretical throughput, minimize latency, and broaden applicability to diverse attention variants and hardware substrates.

1. Foundational Algorithm and IO-Awareness

The core insight behind FlashAttention is that quadratic memory traffic, not just arithmetic FLOPs, dominates the cost of scaled dot-product attention at typical sequence lengths and GPU configurations. Standard attention implementations require O(N^2) reads/writes to high-bandwidth memory (HBM) because they explicitly materialize the S = QK^⊤ and P = softmax(S) matrices. FlashAttention restructures the computation to tile Q, K, V into small blocks that fit in on-chip SRAM, streams each tile through shared memory in a way that fuses matmul and softmax, and avoids ever writing intermediate O(N^2) results out to HBM.

A typical FlashAttention kernel makes O(N^2/T + NT) HBM accesses, with T the tile size, attaining IO-complexity optimality for O(Nd) SRAM (d the head dimension) (Dao et al., 2022). The tiles accumulate softmax normalization statistics and partial output sums "online" during streaming. This enables long-context training and inference within practical memory constraints and underpins the widespread adoption of FlashAttention in LLMs and generative models.
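
The tiled, online-softmax computation can be sketched in NumPy (a didactic model, not a GPU kernel: real implementations fuse this loop into a single kernel and keep each tile in SRAM; the function name and tile size are illustrative):

```python
import numpy as np

def flash_attention_forward(Q, K, V, tile=64):
    """Tiled exact attention with online softmax (FlashAttention-style sketch).

    Streams K/V in blocks of `tile` rows; never materializes the full
    N x N score matrix, only an N x tile block at a time.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)             # unnormalized output accumulator
    m = np.full(N, -np.inf)          # running row-max of scores
    l = np.zeros(N)                  # running softmax denominator
    for j in range(0, N, tile):      # stream K/V tiles through "SRAM"
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        S = (Q @ Kj.T) * scale       # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        c = np.exp(m - m_new)        # rescale previously accumulated stats
        P = np.exp(S - m_new[:, None])
        l = l * c + P.sum(axis=1)
        O = O * c[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]            # final deferred normalization
```

With tile=32 and N=128 this reproduces the reference softmax(QK^⊤/√d)V to floating-point precision while only ever holding an N×tile block of scores.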

2. Parallelism, Fusion, and Tiling Strategies

Continued work has systematically improved the parallelism and hardware utilization of FlashAttention, especially under the constraints of specific accelerator architectures. FlashAttention-2 introduces three major refinements (Dao, 2023):

  • Work Partitioning: The sequence is partitioned into row- and column-tiles. Each thread block is mapped to a (batch, head, row-tile) triple, spreading compute across many parallel units and sustaining high GPU occupancy for large N.
  • Warp-Level "Split-Q" Parallelism: Within a thread block, each warp handles a subset of query rows, allowing all warps to independently stream through K, V tiles and accumulate online softmax statistics and outputs, eliminating cross-warp reductions and shared-memory communication.
  • Algebraic Fusion to Minimize Non-Matmul FLOPs: Non-matmul FLOPs for elementwise operations and softmax normalization are halved by moving as much work as possible into high-throughput matmul units (Tensor Cores) and delaying the final normalization step, maximizing arithmetic intensity.

Advanced implementations further advocate autotuning tile sizes and register/shared-memory usage (Bikshandi et al., 2023; Lin et al., 15 Jul 2025) and recognize that, for best throughput, register pressure must be carefully balanced against occupancy, often preferring 64×128 tiles for FP16/FP32 on NVIDIA Hopper H100 GPUs. Fusing the complete QK^⊤ → softmax → PV computation into a single CUDA or hardware pipeline is now standard.
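
The work-partitioning scheme above can be modeled as an enumeration of the launch grid (function and parameter names here are illustrative, not any library's API):

```python
from itertools import product

def fa2_grid(batch, heads, seq_len, row_tile=128):
    """FlashAttention-2-style work partitioning sketch: one thread block
    per (batch, head, row-tile) triple, so the number of independent
    work units grows with sequence length and keeps the GPU occupied
    even at small batch sizes."""
    n_row_tiles = (seq_len + row_tile - 1) // row_tile  # ceil division
    return list(product(range(batch), range(heads), range(n_row_tiles)))
```

For example, batch=1 and heads=8 at N=8192 with 128-row tiles already yields 512 independent thread blocks, enough to fill an H100's SMs without relying on batch-level parallelism.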

3. Hardware Microarchitectures and Accelerator Extensions

A major research thrust addresses mapping FlashAttention efficiently onto custom accelerators beyond NVIDIA GPUs. Multiple directions have been developed:

  • Fused Systolic Arrays (FSA): To address the mismatch between the interleaved matmul and softmax of FlashAttention and standard systolic arrays, which prefer large, uninterrupted matrix multiplies, SystolicAttention proposes an enhanced array with upward data paths, elementwise CMP units for running row-max updates, and in-PE piecewise-linear exp2 approximation (Lin et al., 15 Jul 2025). This architecture preserves the exact floating-point operation order for stability and overlaps all FlashAttention stages at per-element granularity, yielding 1.77× to 4.83× higher utilization than state-of-the-art TPU/NeuronCore designs for long-sequence inference.
  • 3D-Stacked Spatial Accelerators: 3D-Flow spatially partitions the FlashAttention pipeline across vertically stacked PE tiers, with cycle-level TSV (through-silicon via) links enabling direct register-to-register streaming between stages (matmul, row-reduce, exponentiate, value-multiply) (Yu et al., 11 Feb 2026). This removes the SRAM buffer round-trips endemic to 2D deployments, reduces on-chip energy by up to 93%, and delivers 1.4–7.6× speedups (OPT, QWEN workloads) over 2D or cache-intensive 3D baselines.
  • Fused ExpMul Operators: Custom hardware that fuses exponentiation and vector multiplication, i.e., computes e^x · V in a single datapath (via logarithmic quantization and fixed-point exponent handling), substantially reduces area (by 28.8%) and power (by 17.6%) over discrete exp and multiply units in synthesized 28nm ASICs (Alexandridis et al., 20 May 2025). This "ExpMul" operator allows full streaming without costly look-up tables or pipeline stalls.
  • Hybrid Logarithmic Computation: H-FA migrates softmax normalization and value multiplication to log-domain fixed-point arithmetic, using binary logarithms and the Mitchell approximation to implement all core products, sums, and quotients via shift and add (Alexandridis et al., 31 Oct 2025). FlashAttention’s dot-product remains in floating-point for accuracy, but the combined log-domain kernel reduces hardware area and power by over 25%, with negligible accuracy impact.
  • Division-Free and Sigmoid-Parametrized Softmax: FLASH-D recasts softmax normalization as a recurrence over sigmoid activations, algebraically eliminating per-step division and dual-exp pipelines, and requiring only a single bounded sigmoid/PWL-log in place of the heavy special-function unit (Alexandridis et al., 20 May 2025).

These microarchitecture optimizations are often validated via RTL implementations, cycle-accurate simulations, layout, and power benchmarking in 28nm–16nm nodes.
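
The log-domain shift-and-add arithmetic behind H-FA can be sketched with Mitchell's approximation; the following is a software model under stated assumptions, not the synthesized datapath (hardware would use a leading-one detector and barrel shifter where this sketch uses `frexp`/`ldexp`):

```python
import math

def mitchell_log2(x):
    """Mitchell's approximation: for x = 2^E * (1 + f) with f in [0, 1),
    log2(x) ≈ E + f, i.e. the mantissa bits reused directly as the
    fractional part of the logarithm."""
    m, e = math.frexp(x)              # x = m * 2^e, m in [0.5, 1)
    return (e - 1) + (2.0 * m - 1.0)  # E = e - 1, f = 2m - 1

def mitchell_antilog2(s):
    """Inverse map: 2^s ≈ 2^floor(s) * (1 + frac(s)), shift-and-add only."""
    E = math.floor(s)
    return math.ldexp(1.0 + (s - E), E)

def mitchell_mul(a, b):
    """Approximate a * b by adding Mitchell logs; a single multiply has
    a worst-case relative error of roughly 11%, which is why the
    dot-product itself stays in floating-point."""
    return mitchell_antilog2(mitchell_log2(a) + mitchell_log2(b))
```

Products, quotients, and exponentials all reduce to fixed-point additions and shifts in this domain, which is the source of the reported area and power savings; the approximation is exact whenever operands are powers of two.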

4. Low-Precision, Quantized, and Specialized GPU Implementations

To approach peak device throughput, several optimizations target low-precision arithmetic, asynchrony, and kernel fusion:

  • Quantization: FP8 kernels in FlashAttention-3 fully leverage Hopper Tensor Core support, with block-level dynamic quantization and incoherent processing (random orthogonal transforms), achieving up to 1.2 PFLOPs/s and 2.6× lower numerical error than baseline FP8 attention (Shah et al., 2024). INT-FlashAttention extends this to INT8 activations, proposing token-level scaling and calibration for Q/K and tensor-level scaling for V, with DP4A-accelerated GEMMs and quantized softmax weights. On A100/Ampere, this yields 72% higher inference speed than FP16 and up to 82% lower quantization error than FP8 (Chen et al., 2024).
  • Vectorization on Non-GPU Architectures: Efficient vectorized FlashAttention kernels on RISC-V vector processors replace scalar code with per-block updates, use low-instruction-cost exponentials (via floating-point bit tricks and Blinn's method), and implement 2D tiling to maximize locality (Titopoulos et al., 8 Oct 2025). Kernel throughput scales linearly with vector length up to L1/L2 limits, delivering up to 30× speedups over scalar baselines.
  • Pipelining and Asynchrony: FlashAttention-3 and -4 systematically exploit hardware asynchrony. On Hopper, FA-3 uses warp-group specialization to decouple Tensor Core (WGMMA) compute from TMA data loads, letting each warp-group run asynchronously and overlapping communication with computation (Shah et al., 2024). FA-4, optimized for Blackwell (B200), introduces a multi-warp-group "ping-pong" pipeline exploiting the fast tensor cores relative to slower SMEM/exp units, employs software-emulated exponentials (via FMA polynomials) and aggressive TMEM partitioning, and reduces global atomics in the backward pass via 2-CTA MMA. Up to 71% of hardware peak (1613 TFLOPs/s BF16) is achieved, surpassing cuDNN and Triton (Zadouri et al., 5 Mar 2026).
  • Compiler-Level Fusion: Systems such as FlashLight (a PyTorch/TorchInductor integration) automatically fuse user attention code into single Triton/CUDA kernels, exploiting the same schedule and fusion strategies as hand-tuned FlashAttention-2/3/4 and supporting a wide array of attention variants without static templates (You et al., 3 Nov 2025). FlashLight consistently matches or outperforms static kernel libraries (e.g., FlexAttention), even for complex dynamic masks, row/column gating, and Evoformer-style blocks.
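
The token-level scaling idea can be sketched in NumPy (a simplified model in the spirit of INT-FlashAttention; the calibration procedure and DP4A instruction path are assumptions abstracted away here):

```python
import numpy as np

def quantize_per_token(X):
    """Symmetric INT8 quantization with one scale per token (row),
    sketching the token-level scaling described for Q/K: each token's
    dynamic range is preserved independently."""
    scale = np.abs(X).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)   # avoid division by zero
    Xq = np.clip(np.rint(X / scale), -128, 127).astype(np.int8)
    return Xq, scale

def int8_scores(Qq, q_scale, Kq, k_scale):
    """Integer GEMM accumulated in int32, then rescaled to float --
    the way a DP4A-style pipeline would produce S = QK^T from the
    quantized inputs and per-token scales."""
    acc = Qq.astype(np.int32) @ Kq.astype(np.int32).T
    return acc.astype(np.float64) * (q_scale * k_scale.T)
```

The per-row scales mean an outlier token inflates only its own quantization step rather than the whole tensor's, which is the main accuracy advantage over tensor-level scaling.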

5. Sparse, Masked, and Variant-Aware Extensions

FlashAttention’s tiled and fused structure enables efficient support for rich forms of mask and sparsity:

  • FlashMask (Wang et al., 2024): Introduces a column-wise continuous-interval encoding of masks via four O(N) vectors (LTS, LTE, UTS, UTE), permitting efficient block-level pruning in the kernel for arbitrary mask types. This yields O(N) mask storage and up to 3.22× end-to-end speedup over dense masking, consistently outperforming flexible/block-based competitors (e.g., FlexAttention).
  • Block-Sparse/Top-k Pruning: Block-Sparse FlashAttention (BSFA) applies per-tile gating by comparing the max entry of each Q_i K_j^⊤ tile against pre-calibrated per-head, per-block thresholds, skipping approximately 50% of PV GEMMs and V-loads without retraining or approximation (Ohayon et al., 7 Dec 2025). Empirical results show speedups of 1.10–1.24× while preserving accuracy above 99%.
  • General Structured and Hash-Based Sparsity: By extending tile-skipping logic and index bookkeeping, FlashAttention supports key/query dropping, hash-based attention (e.g., LSH/Reformer), and dynamic masking. Depending on the pattern and fraction retained, up to 3.3× speedup is realized for long contexts, with nonlinear scaling according to the block or bucket sparsity (Pagliardini et al., 2023).
  • Jagged (Variable-Length) Sequences: For non-uniform inputs, e.g., in recommendation systems with variable-length categorical features, Jagged FlashAttention processes "jagged tensors" (offset-indexed layouts) and exposes per-sample lengths directly to the fused kernel. Eliminating padding-induced compute and memory waste leads to up to 9× speedups and 22× memory reductions over dense attention, while preserving compatibility with high-throughput GPU tiling infrastructure (Xu et al., 2024).
  • Low-Rank Bias Integration: FlashBias embeds additive attention bias matrices, common in models with position- or context-specific bias, via low-rank factorization and concatenation to Q and K. Optimal IO and compute efficiency is attained when the bias rank R ≪ d, with empirical 2× throughput improvements and 50% memory savings in vision and AlphaFold-type workloads (Wu et al., 17 May 2025).
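
The tile-gating idea behind these sparsity extensions can be sketched on top of the online-softmax loop (a BSFA-style model: the single global threshold `tau` stands in for the paper's calibrated per-layer/head/block thresholds, which are assumptions here):

```python
import numpy as np

def block_sparse_attention(Q, K, V, tile=16, tau=None):
    """Tile-skipping sketch: compute each score tile, but skip the
    exp/PV work when its max entry falls below a threshold, i.e. when
    the tile's post-softmax contribution is negligible."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    m = np.full(N, -np.inf)
    l = np.zeros(N)
    n_skipped = 0
    for j in range(0, N, tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        S = (Q @ Kj.T) * scale
        if tau is not None and S.max() < tau:   # gate the whole tile
            n_skipped += 1
            continue                             # no exp, no PV GEMM
        m_new = np.maximum(m, S.max(axis=1))
        c = np.exp(m - m_new)
        P = np.exp(S - m_new[:, None])
        l = l * c + P.sum(axis=1)
        O = O * c[:, None] + P @ Vj
        m = m_new
    return O / l[:, None], n_skipped
```

Skipping is safe only when the threshold guarantees a tile's post-softmax weight is negligible, which is what the per-(layer, head, block) calibration in BSFA is for; with `tau=None` the loop reduces to exact dense attention.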

6. Analytical Models and Diagrammatic Performance Engineering

To structure, verify, and generalize these optimizations, diagrammatic and performance-model-driven approaches have been developed. Notably, performance models derived from circuit diagrams relabel axes according to SRAM/SMEM-register partitioning, extract DRAM bandwidth costs as a function of tile size and memory hierarchy, and predict achieved FLOPs utilization across quantized hierarchies (Abbott et al., 2024). This perspective synthesizes the originally heuristic kernel design process into analytically grounded, extensible recipes that directly map streaming, tiling, quantization, and memory-layout choices to expected speedup or resource usage.

Such models are validated empirically, with predicted 4–6× DRAM bandwidth reductions, and accurately forecast regime changes as new hardware generations emerge (e.g., Blackwell's increased Tensor Core capacity outpacing SMEM or exponential-unit scaling, forcing further algorithm/hardware co-design).
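
Such models reduce to simple traffic formulas; a minimal sketch using the O(N^2/T + NT) access count from Section 1 (constants and the d-dependence of SRAM capacity are dropped, so the ratios are only illustrative):

```python
def hbm_accesses(N, d, T):
    """IO-model sketch: standard attention materializes the N x N score
    matrix (N^2 + N*d element traffic), while a FlashAttention kernel
    makes on the order of N^2/T + N*T HBM accesses for tile size T."""
    standard = N * N + N * d
    flash = N * N // T + N * T
    return standard, flash

def best_tile(N, d, candidates=(16, 32, 64, 128, 256)):
    """Pick the tile size minimizing modeled flash-side traffic; the
    model's optimum sits near T = sqrt(N), before SRAM limits bind."""
    return min(candidates, key=lambda T: hbm_accesses(N, d, T)[1])
```

For N=4096, d=64 the model picks T=64 and predicts a roughly 30× traffic reduction; real kernels land lower because SRAM capacity, register pressure, and constant factors cap the usable tile size, which is exactly what the diagrammatic models account for.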

7. Summary Table: Major FlashAttention Optimizations

| Optimization | Technique Summary | Notable Results | Reference |
|---|---|---|---|
| IO-aware tiling | SRAM-resident tiling, online softmax fusion, streaming Q/K/V | 2–6× speedup | (Dao et al., 2022) |
| Parallelism & partitioning | Split-Q warp scheduling, per-row parallelism, algebraic fusion of normalization | 2× over FA-1, 73% of A100 peak | (Dao, 2023) |
| Systolic array fusion | CMP units, in-PE exp, upward dataflow, per-element overlap | 1.77–4.83× higher utilization | (Lin et al., 15 Jul 2025) |
| 3D vertical pipeline | Register pipeline via TSVs, per-tier fusion, bubble-free cycle-level mapping | 80–93% energy cut, up to 7.6× speedup | (Yu et al., 11 Feb 2026) |
| ExpMul / log-domain HW | Fused e^x·v operator, or log-domain accumulators | 25–29% area/power reduction | (Alexandridis et al., 20 May 2025; 31 Oct 2025) |
| Asynchrony / warp-group pipeline | Producer/consumer warp specialization, two-stage pipelining, FP8 block quantization | 1.5–2× (H100), 1.2 PFLOPs/s | (Shah et al., 2024) |
| Block-level sparsity | Thresholded S_ij at tile granularity, calibrated per (layer, head, block) | 1.10–1.24× speedup, >99% accuracy | (Ohayon et al., 7 Dec 2025) |
| Mask/jagged extensions | O(N) continuous-interval masks, offset-based jagged tensors, block-skip logic | 3.2× speedup (64K context), ~linear mask memory | (Wang et al., 2024; Xu et al., 2024) |
| INT8/FP8 quantization | Per-token/per-block scaling, DP4A INT8 pipelined GEMMs | 72% faster vs FP16, 82% lower error vs FP8 | (Chen et al., 2024) |
| Compiler fusion | PyTorch-to-Triton/CUDA lowering, algebraic fusion, autotiling | 1.1–5.2× over FlexAttention/torch.compile | (You et al., 3 Nov 2025) |
