TMA-Adaptive FP8 Grouped GEMM
- The paper introduces a dynamic TMA descriptor pool that eliminates padding overhead in FP8 grouped GEMM on NVIDIA Hopper GPUs.
- It employs a dual-phase TMA load/store mechanism to handle variable matrix sizes while maintaining strict memory alignment without runtime allocation.
- Experimental results demonstrate up to 20.4% speedup and 23.8% memory savings, crucial for efficient low-precision MoE model training and inference.
TMA-Adaptive FP8 Grouped GEMM is a kernel-level optimization for low-precision matrix multiplication on NVIDIA Hopper GPUs, which eliminates the padding overhead typically required for grouped general matrix-matrix multiplication (GEMM) using FP8 precision. By introducing a logarithmic-sized pool of preconfigured Tensor Memory Accelerator (TMA) descriptors and dual-phase memory load/store operations, this method dynamically adapts to variable group matrix dimensions without incurring extra memory or compute associated with conventional padding, while preserving strict alignment constraints imposed by Hopper’s hardware (Su et al., 7 Aug 2025).
1. Limitations of Conventional FP8 Grouped GEMM with Padding
Traditional FP8 grouped GEMM implementations, such as DeepGEMM, mandate that each expert group’s input and output matrices be padded to align their row count to a fixed multiple (commonly 128). This is necessitated by two hardware constraints: (1) TMA descriptors are static and not designed for group-wise variation in row counts, and (2) Hopper’s TMA enforces alignment requirements of 16 bytes for global memory addresses and 128 bytes for shared memory addresses during multidimensional transfers. As a consequence, each group can incur up to 127 padded rows per matrix, resulting in wasteful memory consumption (up to 23.8%) and computational slowdowns of up to 20% in extreme cases. The computational and bandwidth inefficiency becomes particularly acute as the number of groups increases and as per-group sizes shrink, due to the compounded effect of redundant reads, writes, and zero-element computation [(Su et al., 7 Aug 2025), Section 1, Figure 1].
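As a concrete illustration of the padding cost (a hypothetical worst case, not a figure from the paper): with a 128-row alignment target, a group’s padded row count is $\tilde{M}_g = \lceil M_g / 128 \rceil \cdot 128$, so a group holding $M_g = 129$ valid rows is stored and computed as 256 rows, 127 of which are zeros that must still be read, multiplied, and written back.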
2. Structure and Function of the TMA Descriptor Pool
TMA-Adaptive FP8 Grouped GEMM introduces an efficient method for handling arbitrary per-group matrix sizes through a logarithmic pool of TMA descriptors. For a tile shape with row dimension $\text{BLOCK\_M}$ (128 in the reference configuration), the descriptor pool is initialized at kernel launch:

$$\mathcal{D} = \{\, \text{desc}(2^{k}) \mid k = 0, 1, \dots, \log_2 \text{BLOCK\_M} \,\} \quad \text{(Section 2.2, Equation (1))}$$

For each group $g$ during execution, the residual row count is computed as:

$$r_g = M_g \bmod \text{BLOCK\_M} \quad \text{(Section 2.2, Equation (2))}$$

The optimal descriptor for the residual is then dynamically selected via a single lookup into $\mathcal{D}$ at index $\lfloor \log_2 r_g \rfloor$, ensuring minimal overhead and memory traffic independent of group size. This construction guarantees comprehensive coverage for all possible residual sizes with merely $\log_2(\text{BLOCK\_M}) + 1$ descriptors (8 for $\text{BLOCK\_M} = 128$) [(Su et al., 7 Aug 2025), Section 2.2].
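A minimal sketch of how such a logarithmic descriptor pool and residual lookup could be organized is shown below; `TmaDesc`, `make_tma_desc`, and the pool layout are illustrative placeholders rather than the paper’s actual descriptor interface, and only the indexing logic follows Equations (1) and (2).

```cuda
// Illustrative sketch (not the paper's code): a constant-memory pool with one
// descriptor per power-of-two row count, plus the per-group residual lookup.
#include <cuda_runtime.h>

constexpr int BLOCK_M   = 128;
constexpr int POOL_SIZE = 8;                  // log2(BLOCK_M) + 1 descriptors for 1,2,4,...,128 rows

struct TmaDesc { unsigned char bytes[128]; }; // opaque placeholder for a TMA descriptor
TmaDesc make_tma_desc(int rows);              // placeholder: builds a descriptor covering `rows` rows

__constant__ TmaDesc g_desc_pool[POOL_SIZE];  // persistent pool, built once at launch

// Host side: one descriptor per power-of-two residual size (Equation (1)).
void init_desc_pool() {
    TmaDesc host_pool[POOL_SIZE];
    for (int k = 0; k < POOL_SIZE; ++k)
        host_pool[k] = make_tma_desc(1 << k); // 2^k rows; 2^(POOL_SIZE-1) == BLOCK_M
    cudaMemcpyToSymbol(g_desc_pool, host_pool, sizeof(host_pool));
}

// Device side: select the descriptor for a group's residual r_g = M_g % BLOCK_M (Equation (2)).
__device__ const TmaDesc& select_desc(int res /* 1 <= res < BLOCK_M */) {
    int k = 31 - __clz(res);                  // floor(log2(res)), as reported in the paper
    return g_desc_pool[k];
}
```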
3. Dual-Phase Load–Store Mechanism and Dynamic Descriptor Selection
Each group’s residual matrix portion is handled in exactly two TMA operations, regardless of the residual size $r_g$:
- The first phase (Phase A) copies the largest possible power-of-two block of $p = 2^{\lfloor \log_2 r_g \rfloor}$ rows from shared to global memory, starting at the group’s residual offset.
- The second phase (Phase B) covers the remainder by mapping the last $p$ shared rows onto the end of the group’s global matrix portion.
This ensures full coverage without holes or out-of-bounds accesses, as the overlapping region of $2p - r_g$ rows resolves any boundary cases. The method relies on prebuilt descriptors, avoiding any runtime allocation. The compute phase itself (FP8 TensorCore GEMM with tiling across $M$, $N$, and $K$) proceeds as standard, interleaved with the adapted TMA data transfers [(Su et al., 7 Aug 2025), Section 2.2, Appendix B, Figure 2].
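The row-range arithmetic of the two phases can be sketched as follows; `tma_store_rows` is a placeholder standing in for the descriptor-driven TMA store, and only the offset computation reflects the scheme described above.

```cuda
// Sketch of the dual-phase residual store (placeholder API, illustrative only).
__device__ void tma_store_rows(int smem_row, int gmem_row, int rows);   // placeholder TMA store

__device__ void dual_phase_store(int res /* residual rows r_g, 1 <= r_g < BLOCK_M */,
                                 int gmem_res_row /* first residual row of this group in global memory */) {
    int p = 1 << (31 - __clz(res));          // largest power of two <= res (Phase A/B block size)

    // Phase A: shared rows [0, p) -> global rows [gmem_res_row, gmem_res_row + p)
    tma_store_rows(/*smem_row=*/0, /*gmem_row=*/gmem_res_row, /*rows=*/p);

    // Phase B: shared rows [res - p, res) -> the last p global rows of the group.
    // The two ranges overlap by 2p - res rows, so together they cover exactly res rows
    // with no holes and no out-of-bounds accesses.
    tma_store_rows(/*smem_row=*/res - p, /*gmem_row=*/gmem_res_row + res - p, /*rows=*/p);
}
```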
4. Alignment-Compliant Memory Management
Strict adherence to Hopper TMA alignment restrictions is achieved through two key mechanisms (Section 2.3, Appendix A):
- Global Memory: Every matrix row stride is enforced to be a multiple of 16 bytes. If the starting address of a residual block is misaligned (not divisible by 16), extra upstream rows are fetched until an aligned row is reached:

$$\text{row}_{\text{start}}' = \max\{\, r \le \text{row}_{\text{start}} : (\text{base} + r \cdot \text{stride}) \bmod 16 = 0 \,\} \quad \text{(Equation (3))}$$
- Shared Memory: Each block keeps its shared-memory tile stride a multiple of 128 bytes, so the destination address of every transfer is always 128-byte aligned. Both phases of TMA therefore always land on legal 128-byte boundaries, irrespective of the value of $r_g$.
These mechanisms obviate the need for zero-padding on both the global and shared memory levels [(Su et al., 7 Aug 2025), Section 2.3, Appendix A].
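The two alignment rules can be expressed as short checks, sketched below under the assumption of 1-byte FP8 elements; the helper names are illustrative, not taken from the paper.

```cuda
// Illustrative alignment helpers for the constraints above (1-byte FP8 elements,
// so a row's byte stride equals its element count). Names are placeholders.
#include <cstdint>
#include <cassert>

constexpr int GMEM_ALIGN = 16;    // Hopper TMA global-memory address alignment (bytes)
constexpr int SMEM_ALIGN = 128;   // Hopper TMA shared-memory address alignment (bytes)

// Global memory: walk the start row upstream until its address is 16-byte aligned;
// the extra upstream rows fetched this way are ignored by the consumer.
int aligned_start_row(std::uintptr_t base, int row_stride_bytes, int start_row) {
    int row = start_row;
    while (row > 0 && (base + std::uintptr_t(row) * row_stride_bytes) % GMEM_ALIGN != 0)
        --row;
    return row;
}

// Shared memory: keeping the tile stride a multiple of 128 bytes guarantees every
// per-row TMA destination lands on a legal 128-byte boundary, for any residual size.
void check_smem_tile(std::uintptr_t smem_base, int smem_stride_bytes) {
    assert(smem_base % SMEM_ALIGN == 0);
    assert(smem_stride_bytes % SMEM_ALIGN == 0);
}
```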
5. Implementation Details on NVIDIA Hopper Architecture
TMA-Adaptive FP8 Grouped GEMM is deployed as a CUDA kernel tailored for NVIDIA H800 GPUs, using CUDA 12.6 and PyTorch 2.6.0. Each group maps to a single threadblock, often organized as 4 × 2 warps (8 warps) to utilize warp-group TMA. The descriptor pool is persistently stored as a constant-memory array. The runtime logarithm (for descriptor selection) is computed efficiently with the `31 - __clz(res)` intrinsic. TMA API invocations include:
- `tmaDescCreate(&desc, ...)` for descriptor setup
- `tmaMemcpyAsync(...)` for memory transfers
- `__syncwarp()` and `tmaWaitAll()` for synchronization and ordering
FP8 TensorCore GEMM uses 1×128 and 128×128 scaling for compute [(Su et al., 7 Aug 2025), Section 3.1, Table 1].
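A minimal launch-configuration sketch of the threadblock/warp mapping described above follows; the kernel name and argument list are placeholders and the kernel body is elided.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: one threadblock per expert group, 4 x 2 warps (256 threads)
// cooperating on the group's tiles via warp-group TMA, as described above.
__global__ void grouped_gemm_fp8_kernel(/* group metadata, inputs, outputs, scales ... */) {
    // kernel body elided in this sketch
}

void launch_grouped_gemm(int num_groups, cudaStream_t stream) {
    dim3 grid(num_groups);        // each group maps to a single threadblock
    dim3 block(8 * 32);           // 8 warps = 256 threads (4 x 2 warp layout)
    grouped_gemm_fp8_kernel<<<grid, block, 0, stream>>>();
}
```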
6. Experimental Evaluation and Performance Outcomes
Evaluation against the DeepGEMM baseline (with explicit per-group padding to the 128-row boundary) covered a comprehensive workload sweep: varying $N$ and $K$, multiple group counts, and total row counts $M$ up to 65k, with random per-group sizes (Appendix C). Key empirical findings include:
- Acceleration: Measured speedup over the baseline ranged from 1.7% to 20.4%, with stronger effects for smaller per-group sizes and larger group counts; speedup correlated positively with the number of groups and negatively with per-group matrix size.
- Memory Reduction: DRAM usage for the row-padded matrices was reduced by up to 23.8% (notably in the 32-group configurations). Memory savings correlated inversely with per-group size and positively with group count.
- Numerical Equivalence: The result matrices exactly matched baseline results after removal of padded rows, confirming strict numerical fidelity of the dual-phase TMA approach (bitwise identical results for valid entries).
Summary table of key performance results:
| Metric | Observed Value | Correlation |
|---|---|---|
| Speedup | 1.7% – 20.4% | positive with group count, negative with per-group size |
| Memory Saving | up to 23.8% | negative with per-group size, positive with group count |
| Numerical Error | Bitwise identical | – |
[(Su et al., 7 Aug 2025), Section 3, Figure 1]
7. Applications in Low-Precision MoE Training and Inference
Grouped GEMM with a variable row count $M_g$ per expert is fundamental for Mixture-of-Experts (MoE) architectures, particularly in recent LLMs where sequences are dynamically routed into specialist subnets. Key deployment scenarios include:
- Inference: Dynamic batching produces widely differing per-expert row counts across requests.
- Training: Pipeline- and tensor-parallel strategies generate residual-sized (non-multiple-of-128) matrices per group.
By removing the need for host- or kernel-side padding (which in the worst cases consumed up to 2000 GB/s of DRAM bandwidth), TMA-Adaptive FP8 Grouped GEMM provides the following operational advantages:
- Reduction in end-to-end inference latency due to fewer extra matrix rows.
- Lower DRAM pressure, enabling increased maximum batch sizes or larger expert counts per GPU.
- Drop-in compatibility with extant FP8 GEMM libraries, requiring only replacement of the grouped GEMM kernel with no changes to upstream host routing logic.
This architecture-compliant, zero-padding solution advances the state-of-the-art in low-precision MoE model training and inference for NVIDIA Hopper platforms, offering consistent improvements in both throughput (1.7%–20.4%) and memory utilization (up to 23.8%) without compromising accuracy [(Su et al., 7 Aug 2025), Section 4].