Mixed-Precision BFP16–BF16 GEMM

Updated 27 November 2025

Mixed-Precision BFP16–BF16 GEMM is a matrix multiplication approach that uses BF16 for activations/outputs and BFP16 for weights to balance numerical fidelity with resource efficiency.
It leverages offline weight packing, swizzling, and asymmetric tile buffering to maximize arithmetic intensity and throughput in memory- and compute-bound environments.
Empirical results on NVIDIA GPUs and AMD AIE demonstrate 11–13× speedup over FP32 and up to 24.3 TFLOPS throughput, highlighting its potential for high-performance AI workloads.

Mixed-precision BFP16–BF16 GEMM refers to general matrix multiplication in which matrix operands are stored and processed using bfloat16 (BF16) and block floating point 16 (BFP16) numerical formats, with mixed usage tailored for high-throughput, resource-efficient AI inference and training. This approach leverages hybrid exponent/mantissa representations and architecture-specific tiling and buffering strategies—such as asymmetric tile buffering (ATB)—to maximize arithmetic intensity and throughput, especially in memory- and compute-bandwidth constrained environments.

1. Numerical Formats and Motivation

Mixed-precision arithmetic is essential for modern AI workloads to balance numerical fidelity against compute/memory resource constraints. BF16 provides 1 sign bit, 8 exponent bits, and 7 mantissa bits (2 bytes/element); it offers FP32-like range with reduced hardware footprint. BFP16 encodes a shared 8-bit exponent for every block of 8 elements and 1 byte mantissa per element (9 bytes/8 elements = 1.125 bytes/element), efficiently compressing weights for inference while maintaining dynamic range within each subblock.

In BFP16–BF16 GEMM workloads:

Activations $(A)$ and outputs $(C)$ are stored in BF16.
Weights $(B)$ are stored in BFP16.
Accumulation typically occurs in BF16 (or optionally FP32 for numerical stability).

This reduces memory traffic, exploits architecture-native MAC units (e.g., AMD XDNA2 AIE's 8×8 BFP16 MAC, 512 MACs/cycle), and retains training/inference accuracy (Zhang et al., 21 Aug 2025, Wang et al., 20 Nov 2025).

2. Offline Weight Packing and Reconstruction

The BFP16 packing scheme is applied offline at model load time. The FP32 weight matrix $W\in\mathbb{R}^{M\times K}$ is partitioned into $(R\times C)$ blocks (e.g., 16×16). Within each block $B$ :

Compute the block exponent $E_B = \max_{i,j} \lfloor \log_2 |W_{ij}| \rfloor$ .
Quantize each $W_{ij}$ to a $p$ -bit mantissa: $m_{ij} = \text{round}(W_{ij} \cdot 2^{p-1 - E_B})$ , $m_{ij} \in \{-2^{p-1},\dots,2^{p-1}-1\}$ .
Store as BFP16 $(E_B, m_{ij})$ .
Apply hardware-optimal “swizzling” with cp.async+ldmatrix to avoid coalescing and shared-memory conflicts.
Fragments are written so two kernel instructions can fully load a packed tile.

Online, BFP16 tiles are prefetched (cp.async), loaded to registers (ldmatrix), and expanded by integer-to-float (I2F) conversion per lane. Tensor-core mma.sync then performs tilewise GEMM in mixed precision. Double/triple buffering and instruction-level parallelism (mma/I2F/LD-ST pipelining on different tiles) maximize resource utilization (Zhang et al., 21 Aug 2025).

3. Tiling and Buffering Strategies

Baseline GEMM implementations use symmetric tile buffering—buffered input tile size along $M$ matches output tile size. For $A\in\mathbb{R}^{M\times K}$ , $B\in\mathbb{R}^{K\times N}$ , $C\in\mathbb{R}^{M\times N}$ , standard tiles take dimensions $(T_M, T_K, T_N)$ with buffer allocations:

Double-buffer $A$ , $B$
Single-buffer $C$

Arithmetic intensity (AI) for symmetric buffer: $\mathrm{AI} = \frac{2}{\frac{a}{T_N} + \frac{b}{T_M} + \frac{c}{K}}$ where $a$ , $b$ , $c$ are per-element sizes for $A$ , $B$ , $C$ (e.g., $a,c=2$ bytes for BF16, $b=1.125$ bytes for BFP16) (Wang et al., 20 Nov 2025).

Asymmetric tile buffering (ATB) decouples $A$ and $C$ tile sizes: buffer fewer rows of $A$ ( $T_{M_A}$ ) than of $C$ ( $T_{M_C}\ge T_{M_A}$ ), parameterizing the asymmetry ratio $\rho = T_{M_C}/T_{M_A} \ge 1$ . For ATB:

Larger $C$ tiles increase AI under the same buffer capacity.
The buffer constraint becomes: $\frac{2a}{\rho}T_{M_C}T_K + 2bT_KT_N + cT_{M_C}T_N \le S$ where $S$ is scratchpad capacity.

This suggests that choosing large $\rho$ and small $T_K$ enables higher AI and throughput under a fixed on-chip memory budget (Wang et al., 20 Nov 2025).

4. Performance Modeling and Optimization Trade-offs

End-to-end throughput is bounded by:

Memory-bound: $\mathrm{AI}_{array}\cdot \mathrm{BW}_{offchip}$
Compute-bound: $\mathrm{Eff}_{core}\cdot \mathrm{Perf}_{core}^{peak}\cdot n_{cores}$

$\mathrm{AI}_{array}$ post-L2→L1 broadcast is: $\mathrm{AI}_{\rm array} = \frac{2}{\frac{a}{8T_N^{L1}} + \frac{b}{4T_{M_C}^{L1}} + \frac{c}{K}}$

Core efficiency ( $\mathrm{Eff}_{core}$ ) combines microkernel efficiency ( $\mathrm{Eff}_{micro}$ ) and kernel-launch overhead $\delta \approx 50$ cycles: $\mathrm{Eff}_{\rm core} = \frac{1}{\frac{1}{\mathrm{Eff}_{\rm micro}} + \frac{\delta\rho\mathrm{Perf}_{\rm core}^{peak}}{2T_{M_C}^{L1}T_N^{L1}T_K^{L1}}}$ Maximizing $\mathrm{AI}_\rho$ (memory-bound) favors small $T_K$ , large $\rho$ . Maximizing $\mathrm{Eff}_{core}$ (compute-bound) favors large $T_K$ , small $\rho$ to lengthen steady-state compute and amortize kernel launch. The optimal configuration balances these competing effects (Wang et al., 20 Nov 2025).

5. Hardware Implementations and Microkernel Optimization

On AMD XDNA2™ AI Engine (AIE):

Each core features an 8×8 BFP16 MAC unit (512 MACs/cycle), 7-way VLIW, two vector-load units, one vector-store unit, 64 KB L1, plus larger L2 banks.
Tiling is performed at off-chip $\rightarrow$ L2 (symmetric), L2 $\rightarrow$ L1 (symmetric), and L1 $\rightarrow$ register (asymmetric).
Microkernel optimizations include register-level tiling, accumulator chaining, software-pipelined VMAC issues approaching pipeline depth ( $P=3$ ), double-buffered register loads, and input-sharing across chain clusters.

Manual scheduling and tiling achieves microkernel efficiency $\mathrm{Eff}_{micro}$ up to $0.63$ (of $1.84$ TFLOPS peak/core) (Wang et al., 20 Nov 2025).

On current GPU tensor-core architectures (NVIDIA H100, A100, L40S, RTX 4090), BFP16 weight packing paired with BF16 activations enables the streamlined data flow:

Model weights packed and swizzled offline for direct tensor-core consumption
Inference kernel double/triple buffers and tightly pipelines cp.async, ldmatrix, I2F expansion, and mma.sync
No run-time padding or in-register shuffling required (Zhang et al., 21 Aug 2025)

6. Comparative Performance and Experimental Results

TurboMind demonstrates BFP16–BF16 GEMM on four NVIDIA GPUs (single GEMM call for $32768\times32768$ matrices):

GPU	FP32×FP32	FP16×FP16	BFP16–BF16	Speedup: FP32	Speedup: FP16
4090	235 ms	38 ms	22 ms	×10.7 (91%)	×1.7 (42%)
L40S	198 ms	30 ms	17 ms	×11.6 (91%)	×1.8 (43%)
A100	185 ms	28 ms	16 ms	×11.6 (92%)	×1.8 (43%)
H100	162 ms	24 ms	13 ms	×12.5 (92%)	×1.8 (46%)

Compared to FP32, mixed-precision BFP16–BF16 yields approximately 11–13× speedup (90–92% latency reduction). Relative to FP16, it provides an additional 1.7–1.8× speedup by eliminating dequant overhead and doubling arithmetic intensity. These results align with roofline predictions for memory-bound kernels (Zhang et al., 21 Aug 2025).

On AMD XDNA2™ AIE:

MLIR-AIE baseline: 4.8 TFLOPS ($32$ cores) for BF16–BFP16.
ATB with $(T^{L1}=128,64,128, \rho=4)$ : 24.3 TFLOPS ( $4.54\times$ speedup), AI $_{array}$ rises from 216 to 410 op/B.
Peak throughput of 31.3 TFLOPS for all-BFP16 config and sustainment of $>19$ TFLOPS on small matrices.
Unmodified MLIR-AIE achieves only 0.32 TFLOPS/core; optimized ATB kernels reach 0.92 TFLOPS/core at $(T_K=64, \rho=4)$ (Wang et al., 20 Nov 2025).

7. Broader Implications and Future Directions

ATB is a general optimization applicable to any GEMM kernel constrained by on-chip buffer capacity—including NPUs, CPUs with scratchpad-like L1, and emerging many-core accelerators. The analytical performance model (combining arithmetic intensity and kernel/launch overhead) provides principled guidance on tile design, which may be incorporated into automated tile-search frameworks (e.g., AutoTVM, CUTLASS autotuning).

A plausible implication is that as AI models and hardware scale, mixed-precision strategies like BFP16–BF16 with ATB can remain effective bottleneck mitigators, especially in bandwidth-constrained regimes.

No direct experimental or architectural support for BF16 mixed-precision GEMM exists in (Lei et al., 23 Apr 2024), focusing solely on integer-based kernels. Current leading approaches for BFP16–BF16 GEMM leverage both software-level optimization (e.g., TurboMind’s packing and fused pipelining) and hardware architectural features (native BFP16 MAC units, custom tiling, streaming protocols) for maximal throughput and efficiency (Zhang et al., 21 Aug 2025, Wang et al., 20 Nov 2025).

Future work can integrate asymmetric tile buffering and block-format packing into industry-standard libraries, promoting these optimizations for both research and deployment contexts.