Warpgroup Matrix-Multiply-Accumulate (WGMMA)

Updated 3 May 2026

WGMMA is an advanced asynchronous GEMM primitive that leverages warpgroup-level execution (128 threads) on NVIDIA Hopper Tensor Cores for efficient large-tile matrix-multiply-accumulate operations.
It employs deep pipelining with explicit data movement from shared memory and fine-grained synchronization (using mbarrier) to optimally overlap computation and memory latency.
WGMMA significantly enhances throughput in both dense and sparse GEMM applications, enabling speedups in kernels such as SpMM and attention mechanisms in large language models.

Warpgroup Matrix-Multiply-Accumulate (WGMMA) is an advanced asynchronous General Matrix-Multiply (GEMM) primitive introduced on NVIDIA Hopper (SM90) Tensor Cores. It enables efficient, large-tile, warpgroup-cooperative matrix multiplications at high throughput, dramatically enhancing the performance of kernels that rely on dense or sparse GEMM, such as sparse matrix-matrix multiplication (SpMM) and attention mechanisms in LLMs (Liu et al., 20 Apr 2026, Bikshandi et al., 2023).

1. Definition and Instruction Architecture

WGMMA comprises a family of asynchronous matrix-multiply-accumulate instructions executed at the warpgroup level, specifically leveraging 128 threads—grouped as four contiguous warps. Each instruction, for example wgmma.mma_async.m<M>n<N>k<K>.f32.bf16.bf16, computes a $M \times N \times K$ tensor-core GEMM where inputs are typically in BF16 or FP16 and the accumulator is FP32. Unlike previous generations (Ampere WMMA), which required chaining multiple warp-sized (32 thread) operations to cover larger tiles, WGMMA can perform a full $64\times64\times16$ operation in a single atomic invocation (Liu et al., 20 Apr 2026, Bikshandi et al., 2023).

Operands may be sourced directly from shared memory (SMEM), reducing register pressure and facilitating high throughput. Execution is explicitly asynchronous, requiring ordering primitives (wgmma.fence, wgmma.commit_group, wgmma.wait_group) to manage compute–memory concurrency. This deep pipelining enables WGMMA to overlap math with explicit data movement orchestrated by the Tensor Memory Accelerator (TMA), hiding latency and maximizing resource utilization (Liu et al., 20 Apr 2026).

2. Mathematical Tiling and Data Movement

The mathematical operation for SpMM with WGMMA is expressed as:

$C_{\text{block}} = \sum_{t=0}^{k/16-1} A^{(t)} B^{(t)},$

where $A^{(t)} \in \mathbb{R}^{64 \times 16}$ and $B^{(t)} \in \mathbb{R}^{16 \times BN}$ are tiles loaded per K-slice iteration (with $k_0=16$ ), $C$ is the output tile, and $BN$ is selected to divide the output columns evenly (Liu et al., 20 Apr 2026). This formulation enables highly efficient partitioning and pipelining, a core requirement for maximizing GEMM throughput at scale.

WGMMA operations in attention (as in FlashAttention-2) generalize to

$C_{ij} \mathrel{+}{=} \sum_{p=0}^{bK-1} A_{ip} B_{jp} \quad (0 \leq i < bM,\, 0 \leq j < bN)$

with operand and tile shapes selectable at launch (e.g., $bM, bN \in \{64,128\}$ , $64\times64\times16$ 0), adapting to both register budgeting and softmax fusion strategies (Bikshandi et al., 2023).

3. Producer–Consumer Pipeline and Asynchronous Tiling

To fully exploit asynchrony and hide global memory latency, high-performance kernels (notably AsyncSparse BCSR) implement a producer–consumer pipeline using WGMMA in conjunction with TMA. The control logic organizes computation around a circular buffer in shared memory (e.g., 3 stages for BCSR), with one warpgroup dedicated to TMA data loads (producer), while the remaining warpgroups concurrently consume tiles via WGMMA (Liu et al., 20 Apr 2026). Lightweight synchronization objects (mbarrier, with phase bits) replace coarse grained __syncthreads() to ensure precise resource handover with minimal overhead.

This producer-consumer organization ensures that:

TMA loads for tile $64\times64\times16$ 1 are issued ahead of WGMMA computation for tile $64\times64\times16$ 2.
Consumers execute WGMMA on ready tiles while the producer loads the next K-slice.
Synchronization between producer and consumers is managed using mbarrier.wait() and mbarrier.arrive().
The pipeline enables overlap: for each tile, as soon as WGMMA begins on tile $64\times64\times16$ 3, TMA can begin loading tile $64\times64\times16$ 4.

A representative pseudocode illustrates this flow, explicitly identifying producer/consumer warpgroups, shared memory tiling, register allocation, and synchronization points as used in AsyncSparse BCSR kernels (Liu et al., 20 Apr 2026):

$C_{\text{block}} = \sum_{t=0}^{k/16-1} A^{(t)} B^{(t)},$ 9 (Liu et al., 20 Apr 2026)

4. Tile Parameterization, Data Layout, and Resource Management

Tile shapes and operand layouts are explicitly parameterized to maximize occupancy and minimize register spills. In SpMM, a pragmatic choice is $64\times64\times16$ 5, $64\times64\times16$ 6, with $64\times64\times16$ 7 selected for the target workload. In FlashAttention-2, $64\times64\times16$ 8 is selected based on the head dimension to balance register pressure: larger tiles improve performance but can cause register spills that hurt throughput (Bikshandi et al., 2023).

Operand layouts for SMEM are tightly coupled to GMMA atom layouts, with K-major 128-byte swizzling applied by TMA to eliminate bank conflicts (Bikshandi et al., 2023). Register allocation is dynamically managed via per-warpgroup setmaxnreg.inc/dec.sync, permitting multiple CTAs per SM when under the register file limit (e.g., 384 threads/CTA = 3 warpgroups). Producer warpgroups are register-light, while consumer warpgroups hold per-thread accumulators (e.g., for a $64\times64\times16$ 9 tile, each thread holds 32 FP32 values) (Liu et al., 20 Apr 2026).

5. Kernel Fusion, CUTLASS Integration, and Practical Implementation

WGMMA is critical for fusing multiple GEMM operations within a single CUDA kernel, especially where intermediate data dependencies incur significant memory traffic on prior GPUs. In the FlashAttention-2 implementation, back-to-back WGMMA dispatches are used: GEMM-I computes $C_{\text{block}} = \sum_{t=0}^{k/16-1} A^{(t)} B^{(t)},$ 0, followed by in-register softmax, and GEMM-II computes $C_{\text{block}} = \sum_{t=0}^{k/16-1} A^{(t)} B^{(t)},$ 1, all within a kernel using the CUTLASS layout/tensor abstractions, removing all intermediate global memory writes for $C_{\text{block}} = \sum_{t=0}^{k/16-1} A^{(t)} B^{(t)},$ 2 and $C_{\text{block}} = \sum_{t=0}^{k/16-1} A^{(t)} B^{(t)},$ 3 matrices (Bikshandi et al., 2023).

The pipeline is tightly controlled:

TMA loads next operand (e.g., $C_{\text{block}} = \sum_{t=0}^{k/16-1} A^{(t)} B^{(t)},$ 4) for GEMM-II while current GEMM-I executes.
Operand partitioning, register/local memory reshaping, and layout alignment are handled by CUTLASS primitives (partition_fragment_A/B/C, ReshapeTStoTP).
Only one warpgroup per CTA is used to avoid register file exhaustion.

Example invocation with CUTLASS:

auto smemLayoutQ = tile_to_shape(GMMA::Layout_K_SW128_Atom<MmaA>{}, tileShapeQ);
Tensor gQ = make_tensor(ptrQ, gmemLayoutQ);
auto tmaLoadQ = make_tma_copy(SM90_TMA_LOAD{}, gQ, smemLayoutQ, tileShapeQ, Int\<1>{});

using TiledMma0 = /* ... */;
TiledMma0 tiledMma0;

__global__ void fused_fmha_kernel(/* … */) {
  cfk::copy(tQgQ(_,0), tQsQ(_,0), tmaLoadQ, tma_mbar[0]);
  // ...
  cfk::gemm_bar_wait(tiledMma0, tSrQ, tSrK, tSrS, tma_mbar[0]);
}

(Bikshandi et al., 2023)

6. Performance Characteristics and System-Level Impact

Empirical ablation studies on SpMM (BF16, SuiteSparse, $C_{\text{block}} = \sum_{t=0}^{k/16-1} A^{(t)} B^{(t)},$ 5) quantify the performance contribution of WGMMA and its associated asynchrony:

Optimization	Throughput (TFLOPS)	Relative to Scalar	Description
opt0 (scalar)	0.12	0.08× cuSPARSE	Scalar FMA on CUDA core
opt1 (+WGMMA)	0.63	0.42×	Add WGMMA only
opt2 (+TMA)	2.37	1.56×	Add TMA overlapping
opt3 (+special)	6.44	4.31×	Add warp specialization

In end-to-end SuiteSparse benchmarks, the WCSR kernel leveraging WGMMA achieves 23.5 TFLOPS (4.86× cuSPARSE, 2.40× FlashSparse) at density $C_{\text{block}} = \sum_{t=0}^{k/16-1} A^{(t)} B^{(t)},$ 6 and $C_{\text{block}} = \sum_{t=0}^{k/16-1} A^{(t)} B^{(t)},$ 7. In LLM inference, for Qwen2.5-7B with 90% block sparsity, BCSR kernels achieve 1.58–1.98× speedup on the FFN projection versus dense cuBLAS. End-to-end prefill, after integrating both sparse attention (MInference) and sparse FFN, produces up to 2.66× system-level speedup at 64K tokens (Liu et al., 20 Apr 2026).

FlashAttention-2 kernels that utilize WGMMA on Hopper observe a 20–50% FLOPs/s boost over legacy Ampere-optimized implementations. The threefold improvement in GEMM throughput—enabled by large tiles and the 128-thread warpgroup—increases memory and compute utilization, provided tile shapes are chosen to avoid register oversubscription (Bikshandi et al., 2023).

7. Research Significance and Practical Considerations

WGMMA marks a critical shift in GPU programming for high-performance GEMM workloads:

Asynchronous, warpgroup-level dispatch with large tiles enables deep overlap between compute and data movement.
Bank-conflict-free layouts via hardware swizzling, and fine-grained mbarrier synchronization, facilitate higher SM occupancy.
Register management and kernel fusion enabled by WGMMA (and supported by CUTLASS and CuTe) allow single-kernel implementations of operations that previously required costly intermediate storage.
Performance scaling, even at low matrix density ( $C_{\text{block}} = \sum_{t=0}^{k/16-1} A^{(t)} B^{(t)},$ 8), establishes WGMMA as foundational to sparse and dense operations supporting contemporary LLM and scientific workloads.

A plausible implication is that, as workloads shift towards increasingly irregular or structured sparse data patterns and large-scale model inference, leveraging WGMMA-style primitives with explicit asynchrony, hardware-accelerated data movement (TMA), and cooperative threading will become standard in high-performance GPU kernels (Liu et al., 20 Apr 2026, Bikshandi et al., 2023).

Markdown Report Issue Upgrade to Chat

References (2)

AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures (2026)

A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library (2023)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Warpgroup Matrix-Multiply-Accumulate (WGMMA).