
Mixed-Precision Matrix Multipliers

Updated 10 December 2025
  • Mixed-precision matrix multipliers are computational engines that use low-precision formats for the multiply operands and higher-precision formats for accumulation, improving performance while preserving numerical stability.
  • Configurable architectures such as the FEDP unit and M4BRAM fuse multiple formats into shared pipelines, enabling flexible precision settings for deep learning, HPC, and scientific applications.
  • Recent advances include adaptive runtime programmability and novel decomposition schemes, such as the Ozaki approach, that raise throughput while managing resource trade-offs and numerical accuracy.

A mixed-precision matrix multiplier is a computational engine or algorithm that performs matrix multiplication in which the operands (input matrices and/or the accumulator) use different data representations, typically low-precision floating-point (FP16, BF16, FP8) or integer (INT8, UINT4) formats for the multiply operands and higher-precision floating-point or integer formats (FP32, INT32) for accumulation. Mixed-precision multipliers are critical in high-performance computing, deep learning, and scientific computation, where they increase throughput and energy efficiency by exploiting hardware that supports multiple precision formats, while retaining numerical stability by accumulating results at higher precision.

1. Fundamental Architectures and Datapath Designs

The core architectural concepts for mixed-precision matrix multipliers typically involve a fusion of multipliers and accumulators supporting flexible operand widths and formats. The “Configurable Mixed-Precision Fused Dot Product” (FEDP) unit exemplifies this: it integrates floating-point (FP16, BF16, FP8, BF8) and integer (INT8, UINT4) multipliers with unified FP32/INT32 accumulation in a single four-stage pipeline (Rout et al., 19 Nov 2025).

Pipeline stages:

  • Multiply & Exponent Search: Low-precision multipliers (e.g., Wallace tree) process packed inputs; exponent search logic finds the shared exponent in O(1) time for FP formats.
  • Alignment: Products are aligned to the maximum exponent (FP) or sign-extended (INT) to a shared intermediate width for accumulation.
  • Accumulation (MOD-4 CSA): All products plus a streamed addend are summed in a carry-save accumulator, optimized to minimize latency and hardware resources.
  • Normalize & Round/Final Concatenate: FP results undergo leading-zero counting, normalization, and round-to-nearest-even; INT results are concatenated to form 32-bit sums.

This fused architecture eliminates redundant logic (e.g., exponents, adders), avoids external arbitration between separate MAC units, and can be tiled (e.g., 2×2 FEDPs per TCU for 4×4 MMA) for large-scale matrix multiplication engines (Rout et al., 19 Nov 2025).
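
The floating-point path through the pipeline stages above can be modeled in a few lines of NumPy. The sketch below is a deliberately simplified software analogue, not the FEDP RTL: the function name, the FP16-only input path, and the 24-bit accumulator fraction are assumptions chosen for clarity. Products of the low-precision inputs are aligned once to the maximum exponent, accumulated as integers to mimic carry-save addition, and normalized and rounded a single time at the end.

```python
import numpy as np

def fused_dot_fp16(a, b, acc_frac_bits=24):
    """Software model of a fused dot product: FP16 inputs, one alignment to the
    maximum product exponent, integer accumulation, one final round to FP32."""
    a16 = np.asarray(a, dtype=np.float16)
    b16 = np.asarray(b, dtype=np.float16)
    # Stage 1: multiply low-precision operands (exact in FP64) and find the
    # shared (maximum) exponent across all products.
    prods = a16.astype(np.float64) * b16.astype(np.float64)
    max_exp = int(np.frexp(prods)[1].max())
    # Stage 2: align each product to the shared exponent as a fixed-point
    # integer with acc_frac_bits fraction bits (truncation on alignment).
    shift = acc_frac_bits - max_exp
    fixed = np.ldexp(prods, shift).astype(np.int64)
    # Stage 3: accumulate all aligned products as integers (carry-save stand-in).
    acc = int(fixed.sum())
    # Stage 4: single normalization/rounding into the FP32 output.
    return np.float32(np.ldexp(float(acc), -shift))

x = np.random.rand(16)
y = np.random.rand(16)
print(fused_dot_fp16(x, y), np.float32(np.dot(x, y)))   # close, not bit-identical
```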

On FPGAs, “M4BRAM” demonstrates bit-serial/bit-parallel MACs in block RAMs, supporting weight precisions of 2/4/8 bits and activation precisions from 2 to 8 bits. Multiple BRAM processing elements (BPEs) operate in parallel, accelerating mixed-precision DNN inference (Chen et al., 2023).
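
A bit-serial multiply-accumulate of the kind performed inside such BRAM processing elements can be modeled as repeated shift-and-add over the activation bits. The snippet below is a schematic Python illustration, not the M4BRAM microarchitecture; the 4-bit unsigned widths and the function name are assumptions.

```python
def bit_serial_mac(weights, activations, act_bits=4):
    """Bit-serial dot product: activations are consumed one bit per 'cycle',
    so an N-bit activation costs N shift-and-add passes over the weights."""
    acc = 0
    for bit in range(act_bits):                 # one pass per activation bit
        partial = 0
        for w, a in zip(weights, activations):
            if (a >> bit) & 1:                  # current bit of this activation
                partial += w                    # add the (bit-parallel) weight
        acc += partial << bit                   # weight the pass by 2**bit
    return acc

ws, acts = [3, 5, 7, 2], [9, 1, 6, 15]          # 4-bit unsigned operands
assert bit_serial_mac(ws, acts) == sum(w * a for w, a in zip(ws, acts))
```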

SIMD and systolic-array architectures have also incorporated mixed-precision operation. Asymmetric-operand SIMD instructions (e.g., 8-bit × 4-bit multiplies into 16-bit accumulators) can double MAC throughput and reduce memory bandwidth relative to symmetric bit-width designs (Gope et al., 2020).
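
The bandwidth advantage of asymmetric operands comes from packing two 4-bit weights per byte so that each loaded byte feeds two MAC lanes. The following NumPy sketch mirrors that idea with hypothetical pack/unpack helpers and a widened accumulation; it does not reproduce any actual ISA encoding.

```python
import numpy as np

def pack_uint4_pairs(w):
    """Pack pairs of unsigned 4-bit weights into single bytes (low nibble first)."""
    w = np.asarray(w, dtype=np.uint8)
    assert w.size % 2 == 0 and w.max() < 16
    return (w[0::2] | (w[1::2] << 4)).astype(np.uint8)

def simd_mac_8x4(activations_i8, packed_w, acc_dtype=np.int32):
    """Multiply 8-bit activations by unpacked 4-bit weights with wide accumulation.
    A 16-bit accumulator suffices per lane for short dot products; int32 is used
    here so this Python model never overflows."""
    lo = (packed_w & 0x0F).astype(np.int16)          # first weight of each pair
    hi = (packed_w >> 4).astype(np.int16)            # second weight of each pair
    w = np.empty(2 * packed_w.size, dtype=np.int16)
    w[0::2], w[1::2] = lo, hi
    a = np.asarray(activations_i8, dtype=np.int16)   # widen before multiplying
    return (a * w).astype(acc_dtype).sum()

packed = pack_uint4_pairs([3, 15, 7, 0])
print(simd_mac_8x4([10, -2, 5, 127], packed))        # 10*3 - 2*15 + 5*7 + 127*0 = 35
```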

2. Supported Precision Formats and Configuration

Modern mixed-precision multipliers support a wide spectrum of input and accumulation formats, typically parameterized for extensibility:

  • Input Precision: FP16 (1-5-10), BF16 (1-8-7), FP8 (1-4-3) and BF8 (1-5-2), INT8 (signed), UINT4 (unsigned).
  • Accumulator Precision: FP32 (1-8-23), INT32 (32-bit two’s complement).

FEDP and M4BRAM datapaths are parameterized: exponent, mantissa, and accumulator widths are adaptable to new custom formats, including posit or novel floating-point slices, with format-select logic and easily adjusted muxing (Rout et al., 19 Nov 2025, Chen et al., 2023).
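
On the software-model side, such parameterization can be captured by a small format descriptor from which datapath widths are derived rather than hard-coded. The sketch below is purely illustrative; the class and table names are not taken from the cited designs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FpFormat:
    name: str
    exp_bits: int        # exponent field width
    man_bits: int        # stored mantissa (fraction) width

    @property
    def total_bits(self) -> int:
        return 1 + self.exp_bits + self.man_bits   # sign + exponent + mantissa

    @property
    def bias(self) -> int:
        return (1 << (self.exp_bits - 1)) - 1

# Formats named in the text, given as sign-exponent-mantissa widths.
FORMATS = {
    "FP16": FpFormat("FP16", 5, 10),
    "BF16": FpFormat("BF16", 8, 7),
    "FP8":  FpFormat("FP8", 4, 3),
    "BF8":  FpFormat("BF8", 5, 2),
    "FP32": FpFormat("FP32", 8, 23),
}

# Example derived parameter: a product of two FP16 significands needs
# 2 * (man_bits + 1) = 22 bits before alignment and rounding.
f = FORMATS["FP16"]
print(f.total_bits, f.bias, 2 * (f.man_bits + 1))    # -> 16 15 22
```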

Runtime programmability is essential, especially for matrix engines serving DNN workloads, LLM inference, or HPC scientific kernels. The flexibility to switch precisions for per-layer, per-head, or per-tile computations is a cornerstone feature (Zhang et al., 20 Aug 2025, Zhang et al., 21 Aug 2025).

3. Algorithmic and Software Support

Matrix-multiply libraries such as BLIS implement comprehensive mixed-datatype (storage and computation) support by decoupling domain-mixing from precision-mixing. Packing stages perform typecasting of operands to the computation precision, allowing the microkernel to remain simple—only two microkernels per compute precision suffice (real and complex). This orthogonal decomposition prevents combinatorial intractability even as the number of domain- and precision-combinations grows (e.g., 128 total permutations) (Zee et al., 2019).
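
The decoupling can be mimicked in a few lines of Python: operands are typecast to the computation precision during a packing step, after which a single GEMM "microkernel" per compute precision handles every input-type combination. This is a conceptual sketch of the idea, with np.matmul standing in for the microkernel; it is not BLIS's packing code.

```python
import numpy as np

def pack(operand, comp_dtype):
    """Packing stage: typecast (and, in a real library, also reorder) the
    operand into the computation precision expected by the microkernel."""
    return np.ascontiguousarray(operand, dtype=comp_dtype)

def gemm_mixed(A, B, comp_dtype=np.float64, out_dtype=None):
    """Mixed-datatype GEMM: arbitrary input dtypes, one compute precision,
    optional typecast on output. One 'microkernel' per compute precision
    covers every input combination."""
    C = np.matmul(pack(A, comp_dtype), pack(B, comp_dtype))
    return C if out_dtype is None else C.astype(out_dtype)

A = np.random.rand(64, 32).astype(np.float32)   # single-precision input
B = np.random.rand(32, 48)                      # double-precision input
C = gemm_mixed(A, B, comp_dtype=np.float64, out_dtype=np.float32)
```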

Adaptive frameworks for HPC and AI, such as the tile-centric, hardware-aware GEMM-MP, assign precisions at the tile or block level. These algorithms allow tiles to operate at the lowest precision that satisfies a prescribed error tolerance, mixing DP and SP (and potentially FP16 or INT8) across the matrix, with runtime orchestration (e.g., via PaRSEC) for optimal hardware utilization (Zhang et al., 20 Aug 2025).
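
The tile-level decision can be sketched as follows: each tile product is assigned the cheapest precision whose simple worst-case rounding bound, estimated from the tile norms and the unit roundoff, stays under the user's tolerance. This is a hedged, single-node illustration of the idea, not the PaRSEC-orchestrated implementation; the bound, tile size, and candidate list are assumptions.

```python
import numpy as np

# Candidate compute precisions and their unit roundoffs, cheapest first.
CANDIDATES = [("fp16", np.float16, 2.0**-11),
              ("fp32", np.float32, 2.0**-24),
              ("fp64", np.float64, 2.0**-53)]

def pick_precision(A_tile, B_tile, k, tol):
    """Choose the lowest precision whose crude error bound k*u*|A|*|B| <= tol."""
    scale = np.linalg.norm(A_tile, np.inf) * np.linalg.norm(B_tile, np.inf)
    for name, dtype, u in CANDIDATES:
        if k * u * scale <= tol:
            return name, dtype
    return "fp64", np.float64

def tiled_gemm_mp(A, B, tile=64, tol=1e-6):
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n), dtype=np.float64)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                At, Bt = A[i:i+tile, p:p+tile], B[p:p+tile, j:j+tile]
                _, dt = pick_precision(At, Bt, At.shape[1], tol)
                # Multiply in the selected precision, accumulate in FP64.
                C[i:i+tile, j:j+tile] += (At.astype(dt) @ Bt.astype(dt)).astype(np.float64)
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((256, 256)), rng.standard_normal((256, 256))
print(np.abs(tiled_gemm_mp(A, B, tol=0.1) - A @ B).max())   # loose tol: tiles drop to FP32
```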

In emerging LLM inference engines (e.g., TurboMind), matrix multipliers exploit offline-packed, hardware-aligned weights and runtime GEMM kernels, supporting INT4, INT8, and FP16 activations and weights. Bitwidth selection per tile or attention head is optimized to balance throughput versus accuracy loss (Zhang et al., 21 Aug 2025).
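
Offline packing of low-bit weights is, conceptually, a reshuffle plus bit-packing so that the runtime kernel can unpack with cheap shifts and masks. The snippet below is a generic NumPy illustration of INT4 weight packing and unpacking, not TurboMind's actual hardware-aligned layout.

```python
import numpy as np

def pack_int4(w_q):
    """Pack signed 4-bit quantized weights (values in [-8, 7]) two per byte."""
    w = np.asarray(w_q, dtype=np.int8)
    assert w.size % 2 == 0 and w.min() >= -8 and w.max() <= 7
    nib = (w & 0x0F).astype(np.uint8)            # two's-complement nibbles
    return nib[0::2] | (nib[1::2] << 4)

def unpack_int4(packed):
    """Unpack to int8, restoring the sign of each 4-bit value."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    out = np.empty(2 * packed.size, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return np.where(out > 7, out - 16, out)      # sign-extend each nibble

w = np.array([-8, 7, 3, -1], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(w)), w)
```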

4. Performance, Resource Utilization, and Scaling

The performance of mixed-precision matrix multipliers is driven by pipeline fill rate, operand bit-widths, and architectural bandwidth:

  • FEDP: 4-cycle pipeline, 306.6 MHz (FPGA, Alveo U55C), 9.812 GFLOPS per FEDP in 4-thread-per-warp configuration, tiling up to ~157 GFLOPS per sub-core (Rout et al., 19 Nov 2025).
  • M4BRAM: 2.16× speedup on ImageNet DNNs at <0.5% accuracy loss, 1.43× higher throughput than prior compute-in-BRAM schemes, 1.98× better performance-per-area than DSP-based FPGAs (Chen et al., 2023).
  • BLIS: Mixed-datatype GEMM overhead is <5% for large matrices; achieves 96–98% of peak single/double-precision throughput on both Intel and ARM multicore platforms (Zee et al., 2019).
  • GEMM-MP (tile-centric): Linear scaling with the precision mix: on Fugaku, the 0D:100S configuration (all tiles in single precision) achieves 8.0 Tflop/s (2× vs. all-DP). On A100, 0D:100S matches SP-peak throughput (Zhang et al., 20 Aug 2025).
  • TurboMind: Mixed-precision pipelines provide up to 156% higher throughput versus baseline mixed-precision frameworks across multiple GPU generations (Zhang et al., 21 Aug 2025).

Resource trade-offs are often observed: implementing mixed-precision units in soft logic increases LUT and FF usage on FPGAs compared to hard DSP slices, but DSP block usage can be eliminated entirely, a crucial consideration when mixing kernels on the same device (Rout et al., 19 Nov 2025, Chen et al., 2023). In the bit-serial M4BRAM, BRAM overhead becomes the bottleneck at high replication factors, while the degree of parallelism (Nw×Ni) is the dominant scaling factor for throughput.

5. Numerical Behavior, Rounding, and Accumulator Semantics

The numerical semantics of mixed-precision matrix multiplies are governed by the properties of the input converter, alignment logic, accumulator width, and final rounding strategy:

  • Accumulator Structure: Internal accumulator bit-widths are larger than output formats to preserve extra bits for alignment and carry; e.g., binary16 inputs have effective accumulator widths of 14 bits on Ampere/Lovelace tensor cores (10 input + 3 carry + 1 alignment) (Khattak et al., 3 Sep 2025).
  • Rounding Modes: Most hardware employs deferred normalization and (for binary32 output) truncation both within and across block-FMA units, while binary16 output uses round-to-nearest-even. The specifics vary by vendor and architecture but are now well-characterized by systematic, device-agnostic test methodologies that probe rounding, normalization, and accumulator width for all combinations of input/output format (Khattak et al., 3 Sep 2025, Valpey et al., 21 Feb 2025).
  • Error Analysis: Algorithms that mix low- and high-precision operations must account for bits lost to limited accumulator width and rounding. SMT formalization confirms the need for, e.g., 3 extra carry bits for 5-term MMA accumulators in Volta/Turing and 4 for the 9-term accumulators in Ampere (Valpey et al., 21 Feb 2025). Correction algorithms (e.g., Markidis-style residual schemes, sketched after this list) can recover full FP32 accuracy from FP16 tensor cores only if they match the hardware’s accumulator semantics precisely.
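
The residual idea referenced above can be demonstrated without tensor-core hardware: split each FP32 operand into an FP16 value plus an FP16 residual, take three low-precision products, and accumulate them at FP32. The NumPy sketch below emulates the scheme by upcasting the FP16 operands so that accumulation happens in FP32; it makes no claim about any specific GPU's accumulator semantics.

```python
import numpy as np

def split_fp16(X32):
    """Split an FP32 matrix into an FP16 head and an FP16 residual."""
    hi = X32.astype(np.float16)
    lo = (X32 - hi.astype(np.float32)).astype(np.float16)
    return hi, lo

def gemm_fp16_corrected(A32, B32):
    """Markidis-style correction: three FP16xFP16 products accumulated in FP32
    approximate the full FP32 result (the small lo*lo term is dropped)."""
    Ah, Al = split_fp16(A32)
    Bh, Bl = split_fp16(B32)
    f32 = lambda X: X.astype(np.float32)     # emulate FP32 accumulation
    return f32(Ah) @ f32(Bh) + f32(Ah) @ f32(Bl) + f32(Al) @ f32(Bh)

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 128)).astype(np.float32)
B = rng.standard_normal((128, 128)).astype(np.float32)
plain = (A.astype(np.float16) @ B.astype(np.float16)).astype(np.float32)
ref = A.astype(np.float64) @ B.astype(np.float64)
err = lambda C: np.abs(C - ref).max() / np.abs(ref).max()
print(err(plain), err(gemm_fp16_corrected(A, B)))   # corrected error is far smaller
```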

6. Algorithmic Advances and Performance-Accuracy Tradeoffs

Innovations such as the Ozaki scheme and integer-slice (block) decompositions enable high-accuracy matrix products using only low-precision kernel calls:

  • Ozaki Scheme: High-precision matrices are split into D low-precision chunks; all pairwise products are computed in fast S-bit arithmetic (e.g., INT8 or FP16), with results aggregated in high precision (FP64, double-double, triple-double). This achieves speedups of 2–10× versus native full-precision GEMM, up to moderate (e.g., 212-bit) precision, provided the number of chunks D is tuned to the operand scaling (Utsugiri et al., 2023, Abdelfattah et al., 12 Jun 2025).
  • Block-Integer GEMM: Recasting FP matrix multiplication as a sum of exact block integer products, with a tunable slice count for accuracy, allows exploitation of fast MMA units on GPUs. For poorly scaled matrices, the required slice count grows, potentially offsetting the performance gains (Abdelfattah et al., 12 Jun 2025). A sketch of the slicing principle follows this list.
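
A minimal NumPy illustration of the slicing principle: each FP64 operand is split into a few scaled integer slices, all pairwise slice products are computed exactly in integer arithmetic (standing in for fast INT8/FP16 tensor-core GEMMs), and the partial results are recombined in FP64. This is a didactic sketch under assumed slice widths and counts, not a tuned Ozaki-scheme implementation; in particular, it uses one global scale per matrix rather than per-row/column scaling.

```python
import numpy as np

SLICE_BITS = 10                  # bits carried by each integer slice
NUM_SLICES = 4                   # slices per operand (tunes accuracy vs. cost)

def split_int_slices(X):
    """Split X into integer slices S_0..S_{k-1} and a power-of-two scale s with
    X ~= s * sum_i S_i * 2**(-(i + 1) * SLICE_BITS); every step is exact in FP64."""
    s = np.ldexp(1.0, int(np.ceil(np.log2(np.abs(X).max()))))
    R = X / s                                    # |R| <= 1; dividing by 2**e is exact
    slices = []
    for _ in range(NUM_SLICES):
        R = np.ldexp(R, SLICE_BITS)              # bring the next SLICE_BITS above the point
        S = np.trunc(R)
        slices.append(S.astype(np.int64))
        R = R - S                                # exact residual, |R| < 1
    return slices, s

def gemm_sliced(A, B):
    """Sum all pairwise integer slice products (exact) with their slice weights."""
    SA, sa = split_int_slices(A)
    SB, sb = split_int_slices(B)
    C = np.zeros((A.shape[0], B.shape[1]))
    for i, Si in enumerate(SA):
        for j, Sj in enumerate(SB):
            exact = Si @ Sj                      # exact integer matrix product
            C += np.ldexp(exact.astype(np.float64), -(i + j + 2) * SLICE_BITS)
    return sa * sb * C

rng = np.random.default_rng(1)
A, B = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
print(np.abs(gemm_sliced(A, B) - A @ B).max())   # roughly 2**(-NUM_SLICES*SLICE_BITS) * scale
```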

Bit-serial and asymmetric SIMD engines further generalize the approach: by supporting elementwise multiplication of, e.g., 8-bit × 4-bit operands into 16-bit accumulators, these designs enable dense lane packing and efficient bandwidth utilization, achieving 2–4× MAC throughput improvements in DNN and accelerator contexts (Gope et al., 2020).

7. Application Domains and Practical Implementation

Mixed-precision matrix multipliers are deployed in a wide range of computational settings:

  • Deep Learning Inference and Training: DNNs exploit mixed precision (e.g., INT8 or FP16 multiplies accumulated in INT32/FP32) for improved kernel performance and reduced memory traffic. FPGA and GPU accelerators implement specialized engines (e.g., TCUs, M4BRAM).
  • LLM Serving: Mixed-precision GEMM pipelines with format-selective packing and per-head or per-layer bitwidth assignment are central to low-latency inference for large models, as realized in TurboMind (Zhang et al., 21 Aug 2025).
  • Scientific Computing: HPC workloads and scientific solvers use tile-wise and block-wise assignment of precision to maintain accuracy in ill-conditioned regions. Extended-H-matrix approaches dynamically select FP32 or FP64 for factorized sub-blocks to minimize compute and bandwidth without degrading solver convergence (Ooi et al., 2019, Zhang et al., 20 Aug 2025); a norm-based selection policy is sketched after this list.
  • Easy Extensibility: Parameterization of logic and runtime precision assignment are now established best practices: adding a new format or changing assignment granularity entails minimal rework (e.g., a single register change or parameter update in the FEDP/TurboMind pipelines) (Rout et al., 19 Nov 2025, Zhang et al., 21 Aug 2025).
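
As a final illustration, the per-block FP32/FP64 choice can be driven by a simple norm-based policy: a sub-block whose relative contribution is small enough is stored and processed in single precision. The sketch below is a generic NumPy illustration of such a policy under an assumed tolerance, not the extended-H-matrix algorithm itself.

```python
import numpy as np

def choose_block_dtype(block, global_norm, tol=1e-7):
    """Keep a sub-block in FP64 unless its relative contribution is so small
    that FP32 roundoff cannot perturb the overall result beyond tol."""
    rel = np.linalg.norm(block) / global_norm
    return np.float32 if rel * np.finfo(np.float32).eps <= tol else np.float64

rng = np.random.default_rng(2)
A = rng.standard_normal((256, 256))
A[:64, :64] *= 1e6                       # one dominant block will stay in FP64
gnorm = np.linalg.norm(A)
blocks = {(i, j): A[i:i+64, j:j+64].astype(
              choose_block_dtype(A[i:i+64, j:j+64], gnorm))
          for i in range(0, 256, 64) for j in range(0, 256, 64)}
print(sum(b.nbytes for b in blocks.values()), A.nbytes)   # mixed storage vs. all-FP64
```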

References:

(Rout et al., 19 Nov 2025, Chen et al., 2023, Zhang et al., 20 Aug 2025, Zhang et al., 21 Aug 2025, Khattak et al., 3 Sep 2025, Valpey et al., 21 Feb 2025, Zee et al., 2019, Utsugiri et al., 2023, Abdelfattah et al., 12 Jun 2025, Gope et al., 2020, Ooi et al., 2019).
