HGEMM: Half-Precision General Matrix Multiply
- HGEMM is a computational technique that performs matrix multiplication using FP16 arithmetic, typically coupled with FP32 accumulation for enhanced precision.
- It leverages specialized hardware like NVIDIA Tensor Cores, ARM/AMD NPUs, and FPGAs to maximize throughput and reduce energy usage.
- Methodologies such as tile-based decomposition, iterative refinement, and RL-guided kernel tuning address numerical stability and optimize performance.
Half-precision General Matrix Multiply (HGEMM) refers to the computation of matrix products in which input matrices are represented in IEEE-754 half-precision format (FP16). HGEMM accelerates general matrix multiply workloads by leveraging specialized hardware and algorithmic techniques to exploit the reduced memory footprint and increased throughput possible with FP16 arithmetic, while addressing the numerical challenges inherent to low-precision representations.
1. Mathematical Model and Error Bounds
The canonical HGEMM operation evaluates $C \leftarrow \alpha AB + \beta C$ with $A \in \mathbb{F}_{16}^{m \times k}$, $B \in \mathbb{F}_{16}^{k \times n}$, and $C \in \mathbb{F}_{16}^{m \times n}$ or optionally $C \in \mathbb{F}_{32}^{m \times n}$ for enhanced accumulation stability (Su et al., 2 Dec 2025). Representative tile-centric mixed-precision frameworks formalize the per-tile computation as:

$$C_{ij} \leftarrow \beta\, C_{ij} + \alpha \sum_{p} A_{ip} B_{pj},$$

where the tile products $A_{ip} B_{pj}$ are formed from FP16 operands and accumulated in FP32. The numerical error is bounded, to first order, by

$$|\hat{C} - C| \;\le\; \big(2\,u_{16} + k\,u_{32} + \mathcal{O}(u^2)\big)\, |A|\,|B|,$$

where $u_{16} = 2^{-11} \approx 4.9\times10^{-4}$ and $u_{32} = 2^{-24} \approx 6.0\times10^{-8}$ denote the FP16 and FP32 unit roundoffs, and accumulation in FP32 ensures error growth resembles single-precision GEMM except for an additional term of order $u_{16}$ from rounding the operands to FP16 (Zhang et al., 20 Aug 2025, Markidis et al., 2018). For large tiles, this penalty is amortized by tiling and task decomposition.
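As a concrete illustration of why FP32 accumulation matters, using the unit roundoffs defined above and an inner dimension of $k = 4096$ (a worked example, not drawn from the cited papers):

$$k\,u_{16} = 2^{12}\cdot 2^{-11} = 2, \qquad k\,u_{32} = 2^{12}\cdot 2^{-24} \approx 2.4\times10^{-4},$$

so a worst-case bound with purely FP16 accumulation is vacuous, whereas FP32 accumulation keeps the growth term at the single-precision level.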
2. Hardware Architectures and Programming Models
HGEMM is enabled by architectural features that optimize for FP16 throughput and by dedicated matrix-multiply instruction sets:
- NVIDIA GPU Tensor Cores: Each Volta, Ampere, or later Streaming Multiprocessor includes multiple Tensor Cores, each issuing 4×4×4 FP16×FP16→FP32 matrix multiply-accumulate operations at warp granularity. Programming interfaces include the WMMA API, the CUTLASS templated library, and cuBLAS GEMM routines. cuBLAS supports HGEMM by enabling the Tensor Core computation mode and correctly mapping FP16 input/output pointers and FP32 accumulators (Markidis et al., 2018, Su et al., 2 Dec 2025); a minimal cuBLAS call is sketched after this list.
- ARM/AMD/AI NPUs: BLIS FP16 kernels, rocBLAS/HIP, and FP16-specialized cube units (e.g., on the Ascend 910A), with tile buffering and ping-pong double-buffered scheduling to maximize compute occupancy (Zhang et al., 20 Aug 2025, Xue et al., 31 Jul 2025).
- FPGA/ASIC Generators: Parameterized hardware kernels (Ledoux et al., 2023) allow precise adjustment of tile sizes, pipeline depths, accumulator widths, and datapath packing for efficient FP16 computation.
In all cases, memory layout and data alignment (e.g., to 256-byte boundaries) are crucial for maximizing vector and Tensor Core utilization (Zhang et al., 20 Aug 2025, Markidis et al., 2018).
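A minimal sketch of the cuBLAS path, assuming CUDA 11+ and column-major device buffers; matrix sizes, data fill, and error handling are illustrative only:

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

// HGEMM via cublasGemmEx: C(FP32) = alpha * A(FP16) * B(FP16) + beta * C(FP32),
// with FP32 accumulation requested through CUBLAS_COMPUTE_32F so that Tensor
// Cores are used where available.
int main() {
    const int m = 4096, n = 4096, k = 4096;
    __half *dA, *dB;   // FP16 operands, column-major
    float  *dC;        // FP32 output / accumulator
    cudaMalloc(&dA, sizeof(__half) * m * k);
    cudaMalloc(&dB, sizeof(__half) * k * n);
    cudaMalloc(&dC, sizeof(float)  * m * n);
    // ... populate dA, dB, dC (omitted) ...

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;   // FP32 scalars, matching the compute type

    cublasStatus_t st = cublasGemmEx(
        handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
        &alpha,
        dA, CUDA_R_16F, m,   // A: FP16, leading dimension m
        dB, CUDA_R_16F, k,   // B: FP16, leading dimension k
        &beta,
        dC, CUDA_R_32F, m,   // C: FP32, leading dimension m
        CUBLAS_COMPUTE_32F,  // accumulate in FP32
        CUBLAS_GEMM_DEFAULT);
    if (st != CUBLAS_STATUS_SUCCESS) printf("cublasGemmEx failed: %d\n", (int)st);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Passing CUDA_R_16F for the C type instead keeps the output in FP16 while still accumulating internally in FP32, matching the optional output precision described in Section 1.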
3. Precision Recovery, Stability, and Iterative Refinement
HGEMM can suffer significant rounding error due to FP16’s 10-bit mantissa and limited dynamic range (maximum finite value ±65504), especially in ill-conditioned or wide-dynamic-range matrix multiplies. Leading approaches mitigate this via:
- FP32 Accumulation: Sum products in FP32 even when inputs are FP16; this reduces roundoff and catastrophic cancellation (Zhang et al., 20 Aug 2025, Markidis et al., 2018).
- Residual Correction and Double Matrix Refinement: Perform additional GEMM passes to recover the precision lost in the FP32→FP16 conversion. For an FP32 input matrix $A$, compute $A_1 = \mathrm{fp16}(A)$ and $A_2 = \mathrm{fp16}(A - A_1)$; split $B$ analogously into $B_1, B_2$, and accumulate $A_1B_1 + A_1B_2 + A_2B_1$ in FP32 (Markidis et al., 2018). This reduces error by up to 10× at the cost of additional kernel calls (see the sketch after this list).
- Operand Decomposition: Map each FP32 operand $x$ to $x \approx x_{\mathrm{hi}} + s^{-1} x_{\mathrm{lo}}$, a pair of FP16 numbers with a scaling factor $s$ applied to the low-order part so that additional mantissa bits are recovered; $s$ is analyzed and tuned to balance underflow against overflow (Xue et al., 31 Jul 2025).
- Iterative Refinement: For linear solvers, an HGEMM-based factorization can be followed by residual computation and higher-precision correction, restoring full single/double-precision accuracy in 1–2 sweeps (Zhang et al., 20 Aug 2025).
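A minimal sketch of the split-and-correct pattern described above, assuming column-major FP32 source data already on the device and an existing cuBLAS handle; the kernel, launch configuration, and three-pass ordering are illustrative:

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Split an FP32 array into a high part (its FP16 rounding) and a low part
// (the FP16 rounding of the residual), so that x ≈ hi + lo carries extra
// mantissa bits across the two halves.
__global__ void split_fp32_to_fp16(const float* x, __half* hi, __half* lo, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        __half h = __float2half(x[i]);
        hi[i] = h;
        lo[i] = __float2half(x[i] - __half2float(h));
    }
}

// Accumulate A1*B1 + A1*B2 + A2*B1 into the FP32 result C (m x n, inner
// dimension k, column-major); the lowest-order product A2*B2 is omitted as
// negligible. Each pass is an HGEMM with FP32 accumulation.
void hgemm_residual_corrected(cublasHandle_t handle,
                              const __half* A1, const __half* A2,
                              const __half* B1, const __half* B2,
                              float* C, int m, int n, int k) {
    const float one = 1.0f, zero = 0.0f;
    const __half* As[3]   = { A1, A1, A2 };
    const __half* Bs[3]   = { B1, B2, B1 };
    const float* betas[3] = { &zero, &one, &one };  // first pass overwrites C
    for (int p = 0; p < 3; ++p) {
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                     &one,
                     As[p], CUDA_R_16F, m,
                     Bs[p], CUDA_R_16F, k,
                     betas[p],
                     C, CUDA_R_32F, m,
                     CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    }
}
```

The split kernel is run once per operand (e.g., `split_fp32_to_fp16<<<(m*k + 255) / 256, 256>>>(A32, A1, A2, m*k)`, with `A32` denoting the FP32 source), after which the three GEMM passes recover most of the lost mantissa at roughly three times the kernel-call cost.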
4. Tile-based Algorithmic Strategies and Scheduling
Tile/block decomposition underpins most HGEMM frameworks.
- 2D Block-cyclic Tiling: Partition $A$, $B$, and $C$ into $N_b \times N_b$ tiles; deploy parameterized DAGs in runtime systems such as PaRSEC. Tile precision flags drive data movement and kernel selection, enabling adaptive mixed-precision execution (Zhang et al., 20 Aug 2025); a kernel-level view of the tiling pattern is sketched at the end of this section.
- Pipeline and Buffering: L1-aware blocking and double-buffer pipelines are used to overlap memory transfers and compute, increasing sustained throughput (e.g., by ping-pong buffering on Ascend 910A) (Xue et al., 31 Jul 2025).
- RL-guided Kernel Generation: CUDA-L2 utilizes LLMs combined with reinforcement learning to automatically search the HGEMM kernel space, optimizing parameters such as block/tile sizes, pipeline stages, swizzle strides, buffering, and epilogue style. This approach systematically identifies non-intuitive configurations yielding significant speedups over static libraries (Su et al., 2 Dec 2025).
Best practices include sizing tiles to saturate on-chip memory, precision tagging in the setup phase, alignment to maximize memory throughput, dynamic fallback to higher precision for poorly-conditioned tiles, and thorough validation against known error models.
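To make the blocking pattern concrete, here is a minimal shared-memory tiled kernel with FP16 operands and FP32 accumulation; it uses scalar arithmetic rather than the WMMA/Tensor Core path, a fixed 16×16 tile, and assumes dimensions divisible by the tile size, so it sketches the tiling and accumulation structure rather than a tuned pipeline:

```cuda
#include <cuda_fp16.h>

#define TILE 16  // tile edge; production kernels tune this per architecture

// Naive tiled HGEMM: C (FP32, row-major, M x N) += A (FP16, M x K) * B (FP16, K x N).
// Each thread block computes one TILE x TILE tile of C, staging A and B tiles in
// shared memory and accumulating the dot products in FP32 registers.
// Assumes M, N, K are multiples of TILE for brevity.
__global__ void hgemm_tiled(const __half* A, const __half* B, float* C,
                            int M, int N, int K) {
    __shared__ __half As[TILE][TILE];
    __shared__ __half Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;  // FP32 accumulator

    for (int t = 0; t < K; t += TILE) {
        // Stage one tile of A and one tile of B into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();

        // Multiply-accumulate the staged tiles in FP32.
        for (int p = 0; p < TILE; ++p)
            acc += __half2float(As[threadIdx.y][p]) * __half2float(Bs[p][threadIdx.x]);
        __syncthreads();
    }
    C[row * N + col] += acc;
}
```

Launched as `hgemm_tiled<<<dim3(N/TILE, M/TILE), dim3(TILE, TILE)>>>(A, B, C, M, N, K)`; production kernels layer double buffering, larger tiles, and Tensor Core MMA instructions on top of this skeleton.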
5. Performance Results and Comparative Analysis
HGEMM offers marked improvements in throughput (FLOPS) and energy efficiency over FP32 GEMM.
| Architecture | HGEMM, FP16 in / FP32 accum. (TFLOPS) | FP32 GEMM (TFLOPS) | Energy Efficiency |
|---|---|---|---|
| NVIDIA A100 | 38 (Tensor Core) | 19 (CUDA Core) | ≈40% less energy per FLOP |
| Tesla V100 | up to 83 (Tensor Core) | ~15 (CUDA Core) | — |
| Ascend 910A (NPU) | 65.3 (double-buffered) | 73.7 (B3 native) | — |
| FPGA (64 PEs @ 200 MHz) | 25.6 (FP16) | 12.8 (FP32) | 1.9× more efficient |
HGEMM maintains ∼95% of achievable roofline bandwidth on A100 (Zhang et al., 20 Aug 2025). RL-tuned HGEMM kernels (CUDA-L2) yield +22% speedup over torch.matmul, +19.2% over cuBLAS, and up to +28.7% in server/inference mode (Su et al., 2 Dec 2025). For well-conditioned random matrices, relative error is on the order of $10^{-3}$–$10^{-4}$ in direct FP16 and falls to the single-precision baseline ($\sim 10^{-7}$) or better when decomposition, scaling, or refinement techniques are applied (Xue et al., 31 Jul 2025, Zhang et al., 20 Aug 2025, Markidis et al., 2018).
6. Numerical Accuracy and Suitability by Application Domain
FP16 GEMM is well-suited to deep neural network workloads and inference pipelines that tolerate rounding error and reduced dynamic range; empirically, many CNNs/MLPs incur <0.5% accuracy drop on ImageNet (Ledoux et al., 2023). For scientific/HPC workloads requiring stricter reproducibility and stability, FP32 accumulation and iterative refinement are generally necessary (Zhang et al., 20 Aug 2025, Markidis et al., 2018). Term-wise and structure-aware accumulation schemes can exceed FP32 accuracy in specific low-exponent regimes due to controlled summation order (Xue et al., 31 Jul 2025). Adaptive precision selection based on tile-local condition number is effective for balancing speed and reliability in mixed workloads (Zhang et al., 20 Aug 2025); a schematic of such a per-tile selector is sketched below.
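A host-side sketch of per-tile precision selection; the cited frameworks use fast condition-number sketches, whereas here a crude dynamic-range proxy and a hypothetical threshold stand in for that estimate, purely to illustrate the decision structure:

```cuda
#include <cmath>
#include <cfloat>

enum class TilePrecision { FP16, FP32 };

// Illustrative tile-precision selector. A dynamic-range proxy (max/min nonzero
// magnitude) substitutes for a real condition-number sketch; FP16's maximum
// finite value guards against overflow. Threshold is a hypothetical tuning knob.
TilePrecision choose_tile_precision(const float* tile, int nElems,
                                    float threshold = 1.0e3f) {
    float maxAbs = 0.0f, minAbs = FLT_MAX;
    for (int i = 0; i < nElems; ++i) {
        float a = std::fabs(tile[i]);
        if (a == 0.0f) continue;
        maxAbs = std::fmax(maxAbs, a);
        minAbs = std::fmin(minAbs, a);
    }
    if (maxAbs == 0.0f)   return TilePrecision::FP16;  // all-zero tile: trivially safe
    if (maxAbs > 65504.0f) return TilePrecision::FP32; // would overflow FP16
    return (maxAbs / minAbs > threshold) ? TilePrecision::FP32
                                         : TilePrecision::FP16;
}
```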
7. Implementation Guidelines and Future Directions
Key recommendations for HGEMM implementation include:
- Always accumulate in FP32 or higher, even for FP16 inputs, to control worst-case error growth (Zhang et al., 20 Aug 2025, Markidis et al., 2018).
- Configure tile sizes to maximize shared memory/cache usage; empirical choices in the range 1024–2048 yield optimal throughput (Zhang et al., 20 Aug 2025).
- Align FP16 operand tiles on 256-byte boundaries (Markidis et al., 2018).
- Employ fused MACs to avoid intermediate rounding (Ledoux et al., 2023).
- Automate kernel parameterization and tuning using reinforcement learning and LLM guidance for performance-critical applications (Su et al., 2 Dec 2025).
- When full accuracy is required, follow HGEMM with FP64 residual computation and iterative refinement (Zhang et al., 20 Aug 2025); a toy refinement loop is sketched after this list.
- Select tile-wise precision dynamically in mixed workflows, relying on fast condition-number sketches (Zhang et al., 20 Aug 2025).
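A toy, host-only sketch of the residual/refinement loop above, in which "half precision" is emulated by rounding every intermediate through FP16 and the residual is evaluated in FP64; the test system, unpivoted solver, and fixed sweep count are illustrative and do not reflect an HGEMM-backed factorization:

```cuda
#include <cuda_fp16.h>
#include <cstdio>
#include <vector>

// Emulate half precision on the host by rounding a value through FP16.
static float h(float x) { return __half2float(__float2half(x)); }

// Toy Gaussian elimination (no pivoting) with every intermediate rounded to
// FP16, standing in for a half-precision factorization/solve.
static void solveLow(std::vector<float> A, std::vector<float> b,
                     std::vector<double>& x, int n) {
    for (int k = 0; k < n; ++k)
        for (int i = k + 1; i < n; ++i) {
            float m = h(A[i*n+k] / A[k*n+k]);
            for (int j = k; j < n; ++j) A[i*n+j] = h(A[i*n+j] - h(m * A[k*n+j]));
            b[i] = h(b[i] - h(m * b[k]));
        }
    for (int i = n - 1; i >= 0; --i) {
        float s = b[i];
        for (int j = i + 1; j < n; ++j) s = h(s - h(A[i*n+j] * (float)x[j]));
        x[i] = h(s / A[i*n+i]);
    }
}

int main() {
    const int n = 3;
    // Small, well-conditioned test system (illustrative values).
    std::vector<double> A = {4, 1, 0,  1, 3, 1,  0, 1, 2}, b = {1, 2, 3};
    std::vector<double> x(n), r(n), d(n);
    std::vector<float>  Af(A.begin(), A.end()), bf(b.begin(), b.end());

    solveLow(Af, bf, x, n);                   // initial low-precision solve
    for (int sweep = 0; sweep < 2; ++sweep) { // 1-2 sweeps typically suffice
        for (int i = 0; i < n; ++i) {         // FP64 residual r = b - A x
            r[i] = b[i];
            for (int j = 0; j < n; ++j) r[i] -= A[i*n+j] * x[j];
        }
        std::vector<float> rf(r.begin(), r.end());
        solveLow(Af, rf, d, n);               // low-precision correction solve
        for (int i = 0; i < n; ++i) x[i] += d[i];
    }
    printf("refined x = %.7f %.7f %.7f\n", x[0], x[1], x[2]);
    return 0;
}
```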
RL-driven approaches such as CUDA-L2 reveal that systematic exploration of kernel parameter spaces continues to yield non-trivial performance improvements even on highly optimized platforms, pointing toward future systems that combine hardware awareness with adaptive kernel generation (Su et al., 2 Dec 2025). Ongoing research seeks further automation for FP32/INT8 kernels, extension to Hopper/Blackwell architectures, and integration of real-time application feedback (Su et al., 2 Dec 2025).