Ozaki Scheme I for High-Precision GEMM

Updated 26 June 2026

Ozaki Scheme I is a decomposition-based algorithm that splits high-precision matrix multiplication into low-precision slices, enabling efficient aggregation of cross-products.
It methodically decomposes each operand into S-bit slices and uses optimized GEMM routines, with high-precision accumulation ensuring minimal rounding errors.
The algorithm leverages modern hardware advantages such as GPU tensor cores to achieve significant performance gains compared to traditional high-precision methods.

Ozaki Scheme I is a decomposition-based algorithm for computing high-precision matrix products using only multiple calls to highly optimized, lower-precision matrix multiplication routines. By slicing each operand into lower-precision “digits” or “slices,” the scheme expresses the exact or nearly exact high-precision result as a sum of low-precision cross-products, allowing the aggregation of modern hardware’s drastic performance advantage for small-precision GEMM units to solve high-precision scientific problems efficiently.

1. Principle and Mathematical Formulation

Given matrices $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$ , both stored in a high (long) precision (mantissa $L$ bits), the core idea is to decompose each entry into sums of $S$ -bit slices: $A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},$ with each $A^{(i)}, B^{(j)}$ exactly representable in a short-precision format (mantissa $S \ll L$ ).

The matrix product is then evaluated as

$C = A B = \sum_{i=1}^D \sum_{j=1}^D A^{(i)} B^{(j)} = \sum_{i,j} {\rm GEMM}_S(A^{(i)}, B^{(j)}),$

where each ${\rm GEMM}_S$ is performed in $S$ -bit precision, typically with highly efficient vendor BLAS or hardware tensor core routines. The accumulation into $B \in \mathbb{R}^{k \times n}$ 0 is performed in high-precision arithmetic (exact in $B \in \mathbb{R}^{k \times n}$ 1 bits or higher).

The parameter $B \in \mathbb{R}^{k \times n}$ 2 is the number of slices per operand. For full-precision results, it must satisfy

$B \in \mathbb{R}^{k \times n}$ 3

ensuring that products of slices do not overlap and each product is computed without rounding error in $B \in \mathbb{R}^{k \times n}$ 4-bit arithmetic (Utsugiri et al., 2023).

2. Decomposition Algorithm (Slicing)

Any high-precision scalar or matrix entry $B \in \mathbb{R}^{k \times n}$ 5 is recursively split:

$B \in \mathbb{R}^{k \times n}$ 6
For $B \in \mathbb{R}^{k \times n}$ $B \in R^{k \times n}$ 7:
- $B \in \mathbb{R}^{k \times n}$ 8 round or truncate $B \in \mathbb{R}^{k \times n}$ 9 to $L$ 0 bits (short-precision)
- $L$ 1 (computed in high precision)
$L$ 2

This process is repeated for all entries. The blockwise (matrix) version applies the same logic row- or column-wise.

For hardware integer units (e.g., INT8 tensor cores), slices correspond to blocks of bits in the mantissa, often with a scaling factor per row/column to ensure exponent alignment. The decomposition can also leverage two-level blocking for memory efficiency (Mukunoki, 1 Aug 2025, Ootomo et al., 2023, Lu et al., 24 Jun 2026).

3. Computational Workflow and Pseudocode

The generic Ozaki Scheme I pseudocode is:

$A^{(i)}, B^{(j)}$ 7 (Utsugiri et al., 2023, Schwarz et al., 16 Nov 2025).

In fused-kernel (GPU) implementations, all cross-products and their accumulation are performed in a single kernel, eliminating off-chip round-trips for partial sums and maximizing arithmetic intensity (Lu et al., 24 Jun 2026).

4. Error Analysis and Parameter Choice

The approximation error comprises:

Truncation error from residual slices, bounded by $L$ 3.
Low-precision GEMM error from each slice product; if the cross-product is exact (i.e., no rounding because of sufficiently small inputs and large enough accumulator/range), this is zero.
Accumulation error from summing $L$ 4 terms in high precision.

The forward error satisfies: $L$ 5 Choosing $L$ 6 as above ensures the error is within the target precision's machine epsilon.

For hardware with limited accumulator precision (e.g., INT32 in INT8 x INT8 → INT32 TCs), the slice width is constrained by accumulator size, and the number of slices $L$ 7 must respect overflow bounds (Lu et al., 24 Jun 2026, Ootomo et al., 2023). Exponent-span-based estimators further refine $L$ 8 for diverse input distributions (Schwarz et al., 16 Nov 2025).

5. Hardware Mapping, Extensions, and Implementational Variants

Scenario	Slicing Format	Kernel Type	Accumulation	Notes
FP64 → FP16/FP8 gemms	Float (e.g. 16b)	cuBLAS/TC FP16/FP8	FP64	Biased for hardware throughput
FP64 → INT8 TCs	Integer (8b)	IMMA (INT8×INT8→INT32)	FP64	Exponent alignment for range
MPFR arbitrary precision	Float (e.g. 64b)	High-precision SW BLAS	MPFR	Arbitrarily many slices
Fused GPU kernel	INT8/FP8 (8b)	Persistent/block kernel	FP32/FP64	On-chip accumulation, optimized for minimal memory traffic (Lu et al., 24 Jun 2026)

Slice decomposition employs scaling strategies for range, and unsigned-integer slice encoding further reduces overhead (Schwarz et al., 16 Nov 2025). On modern GPU hardware, the triangular product schedule and in-register accumulation enable high utilization, with sustained performance up to 80%–90% of INT8 peak (Lu et al., 24 Jun 2026). FPGA, CPU (MKL), and custom quantum circuit frameworks employ similar schemes (Ootomo et al., 2023).

6. Complexity and Performance

Let $L$ 9 denote matrix dimension, $S$ 0 the high-precision mantissa bits, $S$ 1 the slice precision, and $S$ 2 the speedup factor of $S$ 3-bit GEMM over $S$ 4-bit GEMM. Then:

Classical high-precision GEMM: $S$ 5 $S$ 6-bit operations
Strassen: $S$ 7
Ozaki I: $S$ 8 calls to $S$ 9 $A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},$ 0-bit GEMMs, plus $A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},$ 1 slicing/accumulation

The total leading cost is $A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},$ 2 for square matrices. For moderate $A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},$ 3 (e.g., $A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},$ 4), and $A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},$ 5 in the low thousands, Ozaki I frequently outperforms high-precision Strassen and direct software GEMMs, sometimes by factors exceeding $A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},$ 6 (Utsugiri et al., 2023, Kouya, 2023).

On GPUs with fused persistent-kernel implementations, Scheme I achieves throughput of up to $A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},$ 7 Top/s (Hopper, $A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},$ 8 INT8 peak) and $A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},$ 9 Top/s (Blackwell, $A^{(i)}, B^{(j)}$ 0) (Lu et al., 24 Jun 2026), with up to $A^{(i)}, B^{(j)}$ 1 (FP8TC) acceleration over hardware FP64 (Mukunoki, 1 Aug 2025).

7. Applicability, Extensions, and Limitations

Ozaki Scheme I generalizes to:

Complex matrices: Admits the 3M/4M strategies, combining slicing with minimal real GEMMs (Kouya, 2023).
LU decomposition and other BLAS: Enables efficient batched blocks for high-precision panel updates (Kouya, 2023).
FFT emulation: Underpins high-precision FFT by splitting input vectors/convolutions and summing low-precision transforms, including integer-based NTT/CRT for error-free accumulation (Kawakami et al., 31 Mar 2026).

Limitations:

For very high bit-widths ( $A^{(i)}, B^{(j)}$ 2) or extremely large $A^{(i)}, B^{(j)}$ 3, $A^{(i)}, B^{(j)}$ 4 scaling becomes prohibitive, so Strassen or direct approaches dominate.
Fused-kernel memory footprint grows with $A^{(i)}, B^{(j)}$ 5; on-chip buffer budgets may force kernel tiling for large $A^{(i)}, B^{(j)}$ 6 on current GPU architectures (Lu et al., 24 Jun 2026).
For small input sizes, decomposition and pre/post-processing overheads can exceed the GEMM cost.
The method assumes reliable higher-precision accumulation on the host hardware; if not available, accuracy is not guaranteed.

Practical variants integrate automatic slice tuning (via exponent-span estimators), unsigned/signed residue tracking, and host-transparent fallback policies to ensure correctness across diverse input matrices (Schwarz et al., 16 Nov 2025).

References

(Utsugiri et al., 2023) Acceleration of Multiple Precision Matrix Multiplication using Ozaki scheme
(Kouya, 2023) Acceleration of complex matrix multiplication using arbitrary precision floating-point arithmetic
(Mukunoki, 1 Aug 2025) DGEMM without FP64 Arithmetic -- using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme
(Ootomo et al., 2023) DGEMM on Integer Matrix Multiplication Unit
(Lu et al., 24 Jun 2026) EmuGEMM: Fused Tensor Core Kernels for Precision Emulation in Matrix Multiplication
(Schwarz et al., 16 Nov 2025) Guaranteed DGEMM Accuracy While Using Reduced Precision Tensor Cores Through Extensions of the Ozaki Scheme
(Kawakami et al., 31 Mar 2026) Computing FFTs at Target Precision Using Lower-Precision FFTs
(Terao et al., 22 Feb 2026) Forward Error-Oriented Iterative Refinement for Eigenvectors of a Real Symmetric Matrix