
FlashAttention Kernels Overview

Updated 30 March 2026
  • FlashAttention Kernels are optimized attention operators that fuse IO-aware tiling and on-chip computations to dramatically reduce high-bandwidth memory traffic.
  • They leverage advanced hardware mapping techniques, such as TensorCores and asynchronous pipelining, to boost throughput by 3–7× and reduce latency.
  • Extensions to FlashAttention include quantization, sparsity, and flexible masking, enabling scalable and efficient Transformer models for long-context applications.

FlashAttention Kernels are a class of highly optimized attention operators designed for efficient execution of Transformer self-attention and its variants on modern hardware. They fuse the memory-intensive matrix operations and nonlinearities of attention into single, tiled kernels, minimizing high-bandwidth memory (HBM) traffic by exploiting the GPU's memory hierarchy. Since their introduction, FlashAttention kernels have become the foundation of high-throughput, long-context Transformer models and are now extended across quantization, sparsity, algorithm–hardware co-design, and flexible variant support.

1. IO-Aware Tiling and Core Algorithmic Principles

The original FlashAttention kernel is designed around IO-aware computation. Instead of the naïve strategy of materializing the N×N attention scores and applying softmax and value projections in separate steps, FlashAttention processes tiles of the query (Q), key (K), and value (V) matrices that fit in the GPU's on-chip SRAM (shared memory). Tiling enables each Q, K, V tile to be loaded from HBM only once, and all intermediate attention score (S = QK^\top), softmax, and output (O = softmax(S)V) computations are completed on-chip. The algorithm maintains per-tile running row-wise maxima (m) and sums (ℓ) to ensure a numerically stable online softmax and fuses all three main steps (QK^\top, softmax, PV) into a single kernel (Dao et al., 2022).

FlashAttention thus avoids the O(N^2) HBM reads and writes required by conventional implementations, achieving nearly optimal I/O complexity of O(N^2 d^2 / M), where M is the available SRAM. In practice, this reduces memory bottlenecks, improves throughput by 3–7×, and unlocks Transformer models with long context windows.
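To make the recurrence concrete, the following NumPy sketch implements the tiled online-softmax update (running maxima m and sums ℓ, with rescaling by e^{m_old − m_new}) over streamed K/V tiles. The tile size, function name, and the final sanity check are illustrative choices, not the fused CUDA kernel itself.

```python
import numpy as np

def flash_attention_reference(Q, K, V, tile_k=64):
    """Tiled attention with online softmax (NumPy sketch of the recurrence,
    not the fused CUDA kernel)."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(V)                      # running (unnormalized) output
    m = np.full(N, -np.inf)                   # running row-wise max
    l = np.zeros(N)                           # running row-wise softmax sum

    for start in range(0, N, tile_k):         # stream one K/V tile at a time
        Kt = K[start:start + tile_k]          # these tiles would sit in SRAM
        Vt = V[start:start + tile_k]
        S = (Q @ Kt.T) * scale                # scores for this tile only

        m_new = np.maximum(m, S.max(axis=1))  # update running max
        alpha = np.exp(m - m_new)             # rescale factor for old stats
        P = np.exp(S - m_new[:, None])        # tile-local exponentials

        l = alpha * l + P.sum(axis=1)         # update running sum
        O = alpha[:, None] * O + P @ Vt       # accumulate unnormalized output
        m = m_new

    return O / l[:, None]                     # final softmax normalization

# Sanity check against the naive O(N^2)-memory implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_reference(Q, K, V), ref)
```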

2. Fused Kernel Architectures and Hardware Mapping

FlashAttention’s expansion includes several kernel and hardware mapping innovations:

  • TensorCore and DP4A-based GEMMs: On NVIDIA hardware, FlashAttention kernels leverage TensorCores for half-precision (FP16/BF16) GEMMs and, in quantized variants (e.g., INT8/INT4), DP4A-style integer GEMMs, fully exploiting the hardware's matrix-multiply units (Chen et al., 2024).
  • CUTLASS and CuTe-DSL: On Hopper (H100) and Blackwell (B100/B200) GPUs, high-performance fused kernels are generated using CUTLASS or CuTe-DSL abstractions, which allow expressing memory layouts, asynchronous transfers (TMA), warpgroup MMAs (WGMMA), and ping-pong accumulator scheduling. In FlashAttention-4, multi-warpgroup scheduling exploits fully asynchronous MMA operations and TMEM (tensor memory) for pipelined matmul–softmax–matmul fusion, with up to 1.6 PFLOP/s on B200 (Zadouri et al., 5 Mar 2026, Bikshandi et al., 2023).
  • Custom Exponential and Multiplication Units: To further accelerate kernel execution, hardware units for fused exponential and vector multiplication ("ExpMul") have been designed, which reduce area and power by folding e^x V into a bit-manipulation on FP exponent fields (Alexandridis et al., 20 May 2025). Vectorized implementations for RISC-V processors use low-cost radix-2 exponential approximations to avoid scalar instruction bottlenecks (Titopoulos et al., 8 Oct 2025).
  • Quantized FlashAttention: INT-FlashAttention introduces a fully INT8-tiled attention pipeline. All Q, K, V activations are quantized per-token or per-tensor, and scale factors are tracked in HBM. Online softmax bookkeeping remains in FP32 for numerical precision. INT8 GEMMs halve memory footprints and, on Ampere GPUs, reduce attention kernel latency by up to 72% relative to FP16, with up to 82% lower quantization error than FP8 block-quantized baselines (Chen et al., 2024).
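As a rough illustration of the INT8 pipeline in the last bullet, the NumPy sketch below quantizes Q and K per token with symmetric scales, emulates the integer GEMM with int32 accumulation, and dequantizes the scores in FP32. The scale handling, 8-bit range, and error check are simplified assumptions, not INT-FlashAttention's actual kernel.

```python
import numpy as np

def quantize_per_token(X, bits=8):
    """Symmetric per-token (per-row) quantization: returns int8 values and
    FP32 scales such that X ~= q * scale[:, None]."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for INT8
    scale = np.maximum(np.abs(X).max(axis=1) / qmax, 1e-8)
    q = np.clip(np.rint(X / scale[:, None]), -qmax, qmax).astype(np.int8)
    return q, scale.astype(np.float32)

def int8_attention_scores(Qq, q_scale, Kq, k_scale):
    """Integer GEMM (emulated here with int32 accumulation) followed by FP32
    dequantization of the attention scores."""
    S_int = Qq.astype(np.int32) @ Kq.astype(np.int32).T   # int32 accumulator
    return S_int.astype(np.float32) * q_scale[:, None] * k_scale[None, :]

rng = np.random.default_rng(0)
Q = rng.standard_normal((128, 64)).astype(np.float32)
K = rng.standard_normal((128, 64)).astype(np.float32)

Qq, qs = quantize_per_token(Q)
Kq, ks = quantize_per_token(K)
S_int8 = int8_attention_scores(Qq, qs, Kq, ks) / np.sqrt(64)
S_fp32 = (Q @ K.T) / np.sqrt(64)
print("max abs score error:", np.abs(S_int8 - S_fp32).max())
```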

3. Advanced Memory Hierarchy and Scheduling

Advanced variants exploit GPU cache and memory subsystems:

  • L2 Cache Optimization: On large-scale GPUs with sizable L2 caches (e.g., NVIDIA GB10), the L1 cache is often bypassed during streaming K/V tile loads. Sawtooth Wavefront Reordering alternates the K/V scan direction per Q tile to minimize L2 thrashing, halving L2 cache misses and boosting throughput by up to 60% (Zhu et al., 22 Jan 2026); a scheduling sketch follows this list.
  • Pipeline and Warp Specialization: FlashAttention-4 employs fully asynchronous MMAs, three-way warpgroup specialization, and overlapping TMA transfers. The pipeline stages Q, K, V, and output tiles through shared and tensor memory buffers, with the softmax concurrently executing on the previous tile while the next is matmul-accumulated (Zadouri et al., 5 Mar 2026).
  • 2-CTA MMA Mode and DSMEM: To further reduce shared memory traffic and atomic adds, pairs of CTAs cooperate on MMAs, splitting tile boundaries and sharing operands via DSMEM, as in the dQ step of FlashAttention-4's backward kernel.
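A minimal Python sketch of the sawtooth scan order described in the first bullet above: consecutive Q tiles traverse the K/V tiles in opposite directions, so each Q tile begins with the K/V tiles most recently resident in L2. The function name and tile counts are illustrative; the real scheduler also handles wavefront grouping and cache-bypass hints.

```python
def sawtooth_schedule(num_q_tiles, num_kv_tiles):
    """Yield the (q_tile, kv_tile) visit order with the K/V scan direction
    alternating per Q tile, so consecutive Q tiles reuse the K/V tiles that
    are still hot in L2 (illustrative sketch only)."""
    order = []
    for q in range(num_q_tiles):
        kv_range = range(num_kv_tiles)
        if q % 2 == 1:                        # odd Q tiles scan K/V backwards
            kv_range = reversed(kv_range)
        order.extend((q, kv) for kv in kv_range)
    return order

# For 3 Q tiles and 4 K/V tiles the scan forms a sawtooth:
# q0: 0 1 2 3   q1: 3 2 1 0   q2: 0 1 2 3
print(sawtooth_schedule(3, 4))
```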

4. Extension to Masking, Sparsity, and Attention Variants

  • Block-Sparse and Structured Masking: FlashAttention kernels can be extended to block-sparse attention via tile-masked skipping, avoiding unnecessary QK^\top matmuls for fully masked regions. FlashMask introduces a column-wise interval mask representation that achieves O(N) memory usage and enables efficient masking in the kernel inner loop (a tile-classification sketch follows this list). For typical LLM alignment tasks, FlashMask attains 1.7–3.2× speedups over dense-masked FlashAttention and outpaces FlexAttention in kernel TFLOP/s (Wang et al., 2024).
  • Unified Sparse Attention: FlashOmni encodes arbitrary sparsity patterns (feature caching and block-sparse skipping) into compact uint8 symbols, driving both the sparse attention and GEMM-Q/GEMM-O stages from a unified multi-purpose CUDA kernel. FlashOmni achieves near-linear speedups tied to the sparsity ratio, delivering end-to-end acceleration in large Diffusion Transformers while preserving visual quality (Qiao et al., 29 Sep 2025).
  • Sparse Grouped-Query Kernels: Flash Sparse Attention restructures Native Sparse Attention's kernel for small grouped-query counts, inverting the kernel launch order and decoupling online softmax statistics from partial output accumulation. This unlocks practical speedups for popular LLMs, with up to 3.5× lower kernel latency and 1.09–1.25× end-to-end speedup over NSA (Yan et al., 25 Aug 2025).
  • Flexible Kernel Programming Models: FlexAttention and Flashlight provide programming abstractions that enable users to express arbitrary masking or bias logic in Python, which compilers (TorchDynamo, TorchInductor, Triton) lower to efficient fused kernels. Flashlight supports arbitrary attention patterns, fusing loop-based max/sum reductions and flexible tiling, with performance competitive with or superior to hand-tuned template-based kernels (You et al., 3 Nov 2025, Dong et al., 2024).
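The tile-skipping logic behind column-interval masking (first bullet in this list) can be sketched as follows: each key column carries one masked row interval [start, end), and every (Q tile, K tile) pair is classified as fully skippable, fully visible, or partially masked before the inner loop runs. The single-interval encoding and function names are simplifications of FlashMask's actual representation.

```python
import numpy as np

def classify_tile(q_lo, q_hi, mask_start, mask_end):
    """Classify a (Q tile, K tile) pair given per-column masked row intervals
    [mask_start[j], mask_end[j]) for the columns in the K tile. Returns
    'skip' (all scores masked), 'full' (no score masked), or 'partial'
    (apply the mask element-wise inside the kernel inner loop)."""
    cols_fully_mask = (mask_start <= q_lo) & (mask_end >= q_hi)
    cols_no_overlap = (mask_end <= q_lo) | (mask_start >= q_hi)
    if cols_fully_mask.all():
        return "skip"          # whole tile masked: no QK^T matmul needed
    if cols_no_overlap.all():
        return "full"          # whole tile visible: dense path, no masking
    return "partial"

# Causal mask expressed as column intervals: column j masks query rows [0, j).
N, tile = 256, 64
mask_start = np.zeros(N, dtype=np.int64)
mask_end = np.arange(N, dtype=np.int64)
for q_lo in range(0, N, tile):
    labels = [classify_tile(q_lo, q_lo + tile,
                            mask_start[k:k + tile], mask_end[k:k + tile])
              for k in range(0, N, tile)]
    print("Q tile", q_lo // tile, labels)
```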

5. Algorithmic and Numerical Modifications

  • Online Softmax and Numerical Stability: FlashAttention and descendants use online max-plus softmax reduction to avoid overflow and underflow, retaining only row-wise maxima/sums or their logarithmic equivalents. In quantized kernels (e.g., INT-FlashAttention), all softmax reductions, max, and sum are carried out in FP32 even though matmuls are INT8 (Chen et al., 2024).
  • Softmax Division and Kernel Simplicity: FLASH-D shows softmax normalization can be hidden in a recurrent elementwise sigmoid, obviating explicit running-max tracking and sum-division, simplifying both kernel logic and hardware datapath. This reduces ASIC area and power by 23% and 20%, respectively, with identical throughput (Alexandridis et al., 20 May 2025).
  • Emulated Exponential and Conditional Scaling: On hardware with slow exponential units (Blackwell), software-emulated exponentials use Horner's rule for degree-3 polynomials via FMA units; integer parts are handled by FP bit-manipulation (Zadouri et al., 5 Mar 2026). Conditional scaling applies the full rescale factor e^{m_{j-1} - m_j} only if the running-max difference exceeds a hardware-tuned threshold, eliminating most per-block rescale multiplies.
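A NumPy sketch of the emulated exponential in the last bullet: e^x is rewritten as 2^(x·log2 e), the fractional part is approximated by a degree-3 polynomial evaluated with Horner's rule, and the integer part is folded into the FP32 exponent bits. The polynomial coefficients and the FP32-only scope are illustrative assumptions, not the kernel's exact constants.

```python
import numpy as np

LOG2E = np.float32(1.4426950408889634)
# Degree-3 fit of 2^f on [0, 1); coefficients are illustrative, not the
# kernel's actual constants.
C3, C2, C1, C0 = (np.float32(c) for c in (0.0792043, 0.2241698, 0.6965648, 1.0))

def exp_emulated(x):
    """Approximate e^x = 2^(x * log2 e): degree-3 Horner polynomial for the
    fractional part, FP32 exponent-field manipulation for the integer part."""
    y = x.astype(np.float32) * LOG2E
    n = np.floor(y)                             # integer part of the exponent
    f = (y - n).astype(np.float32)              # fractional part in [0, 1)
    # Horner's rule: three fused multiply-adds on the GPU.
    p = ((C3 * f + C2) * f + C1) * f + C0       # ~ 2^f, in [1, 2)
    # Multiply by 2^n by adding n directly into the biased exponent bits of p.
    bits = p.view(np.int32) + n.astype(np.int32) * (1 << 23)
    return bits.view(np.float32)

x = np.linspace(-10, 10, 7).astype(np.float32)
print(np.abs(exp_emulated(x) - np.exp(x)) / np.exp(x))   # relative error
```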

6. Empirical Performance and Scaling Behavior

Empirical results confirm the scalability and efficiency of FlashAttention kernels across hardware and application domains:

  • INT-FlashAttention on RTX 4090 achieves up to 72% lower attention latency than FP16 FlashAttention and up to 82% smaller quantization error than an FP8 block-quantized baseline (Chen et al., 2024).
  • FlashAttention-4 on B200 attains a 1.3× speedup over cuDNN and 2.7× over Triton, sustaining 1613 TFLOP/s (71% of peak) using large tiles and asynchronous pipelining (Zadouri et al., 5 Mar 2026).
  • FlashMask achieves 230–240 TFLOP/s on A100 (BFloat16), with up to +60% kernel throughput versus FlexAttention (Wang et al., 2024).
  • Hardware accelerators with ExpMul units reduce area by 28.8% and power by 17.6% versus baseline floating-point FlashAttention cores, at parity for latency and accuracy (Alexandridis et al., 20 May 2025).
  • Sawtooth Wavefront Reordering on GB10 reduces L2 misses by 50–67% and increases throughput by up to 60%, portable across recent NVIDIA architectures (Zhu et al., 22 Jan 2026).

7. Compatibility, Portability, and Future Directions

  • Quantized and Mixed-Precision Compatibility: FlashAttention-style kernels support INT8, INT4, and mixed-precision (e.g., “half-INT8” with Q,K INT8 and V FP16) via per-token or per-block scaling logic. Extensions to finer-grained V quantization can further improve accuracy (Chen et al., 2024).
  • Mask and Variant Support: FlexAttention, Flashlight, and FlashMask frameworks enable easy composition of mask types, positional bias, and even paged-cache indirections, significantly broadening the range of attention patterns usable with memory- and compute-optimized kernels (You et al., 3 Nov 2025, Dong et al., 2024, Wang et al., 2024).
  • Hardware and Algorithmic Co-design: The FlashAttention kernel family continues to evolve with new GPU and accelerator architectures. Kernel–pipeline co-design for Blackwell-class GPUs, use of distributed shared memory (DSMEM), and programmable arithmetic units (polynomial exponentials, fused exp-muls) exemplify this trend (Zadouri et al., 5 Mar 2026, Alexandridis et al., 20 May 2025).
  • Scalability and Application-specific Extensions: From super-resolution transformers with large windows (using rank-factorized implicit bias for FlashAttention compatibility) (Lee et al., 6 Mar 2026) to diffusion models with multi-granularity block-sparse execution (Qiao et al., 29 Sep 2025), FlashAttention kernels remain pivotal for next-generation large-scale, long-context, and application-tailored Transformer models.

FlashAttention kernels, through IO-aware tiling, fused numerical pipelines, and adaptability to quantization, sparsity, and variant programming, are now the mainstay of high-performance Transformer attention on both current and emerging inference and training hardware (Dao et al., 2022, Chen et al., 2024, Alexandridis et al., 20 May 2025, Zadouri et al., 5 Mar 2026, Zhu et al., 22 Jan 2026, Wang et al., 2024, Qiao et al., 29 Sep 2025, Dong et al., 2024, You et al., 3 Nov 2025, Titopoulos et al., 8 Oct 2025, Lee et al., 6 Mar 2026, Yan et al., 25 Aug 2025, Bikshandi et al., 2023).
