Ozaki Scheme II: High-Precision Matrix Multiply

Updated 3 July 2026

Ozaki-II is a framework that uses the Chinese Remainder Theorem to reconstruct high-precision matrix products from multiple low-precision GEMM operations.
It decomposes the multiplication into scalable modular computations, enabling accelerated performance on hardware optimized for INT8/FP8 arithmetic.
Practical implementations show robust accuracy with exponential error decay and reduced memory traffic, making it key for HPC and AI-optimized architectures.

The Ozaki-II scheme is a matrix multiplication emulation framework that leverages the Chinese Remainder Theorem (CRT) to reconstruct high-precision (e.g., IEEE-754 double-precision) results from multiple low-precision (typically INT8 or FP8) matrix-multiply-accumulate operations. This approach transforms the challenge of achieving both high accuracy and high throughput—in the context of modern hardware architectures dominated by fast low-precision units—into a problem of parallel, modular arithmetic computation followed by exact reconstruction and rescaling. Originally developed for accelerating high-precision numerical workloads on low-precision compute engines, Ozaki-II has become foundational for contemporary high-performance computing (HPC) and numerical linear algebra on AI-optimized accelerators.

1. Mathematical Foundations and Algorithm Structure

Ozaki-II decomposes the target general matrix-matrix multiplication (GEMM), $C = AB$ , where both $A$ and $B$ are high-precision matrices, into a set of modular subproblems:

Scaling and Truncation: The inputs $A \in \mathbb{F}_b^{m \times k}$ and $B \in \mathbb{F}_b^{k \times n}$ are scaled by diagonal matrices (with exponents as integer powers of two) to convert them into bounded integer matrices $A'$ and $B'$ :

$A' = \operatorname{trunc}(\operatorname{diag}(2^{\mu}) A),\quad B' = \operatorname{trunc}(B \operatorname{diag}(2^{\nu}))$

The scaling factors $\mu, \nu$ are chosen such that $|A'_{ij}|, |B'_{ij}|$ are small enough for all modular products to be performed exactly in INT8 or FP8 with INT32/FP32 accumulation, and so that $A$ 0, where $A$ 1 is the CRT modulus (Uchino et al., 30 Jan 2026, Ozaki et al., 10 Apr 2025, Uchino et al., 11 Mar 2026).

Modular Product and GEMM: For each of $A$ 2 pairwise-coprime moduli $A$ 3 (for INT8) or up to 513 (for FP8), the integer matrices are reduced modulo $A$ 4:

$A$ 5

Then, the modular products $A$ 6 are computed using low-precision GEMM (INT8×INT8→INT32 or FP8×FP8→FP32) (Uchino et al., 30 Jan 2026, Uchino et al., 11 Mar 2026).

CRT Reconstruction: The full high-precision product is reconstructed via the Chinese Remainder Theorem—typically with precomputed modular inverses—to obtain the unique integer result modulo $A$ 7:

$A$ 8

with $A$ 9 (Uchino et al., 30 Jan 2026, Ozaki et al., 10 Apr 2025).

Inverse Rescaling: The output is rescaled back to floating-point:

$B$ 0

All errors originate in the integer truncation step; the modular product and reconstruction are exact provided the uniqueness criterion is satisfied (Uchino et al., 30 Jan 2026).

2. Accuracy, Error Analysis, and Parameter Selection

Deterministic error bounds for Ozaki-II are directly linked to the number of modular products $B$ 1 and the dynamic range of the scaled input matrices. The leading error term decays exponentially with $B$ 2, as the CRT modulus $B$ 3 grows like $B$ 4 (if $B$ 5):

$B$ 6

where $B$ 7, and the terms $B$ 8 incorporate input exponent spans and rounding errors arising in scaling and reference matmuls (Uchino et al., 30 Jan 2026).

To achieve relative error $B$ 9, one needs: $A \in \mathbb{F}_b^{m \times k}$ 0 For single-precision ( $A \in \mathbb{F}_b^{m \times k}$ 1), $A \in \mathbb{F}_b^{m \times k}$ 2– $A \in \mathbb{F}_b^{m \times k}$ 3 is sufficient; for double-precision ( $A \in \mathbb{F}_b^{m \times k}$ 4), $A \in \mathbb{F}_b^{m \times k}$ 5– $A \in \mathbb{F}_b^{m \times k}$ 6 is typical.

The exponent-range terms affect the error pre-factors but do not alter the exponential scaling in $A \in \mathbb{F}_b^{m \times k}$ 7. Thus, matrices with large dynamic range (wide exponent distribution) require more moduli only for extreme accuracy demands (Uchino et al., 30 Jan 2026, Kawakami et al., 28 Jun 2026). This behavior is confirmed experimentally, e.g., a relative error $A \in \mathbb{F}_b^{m \times k}$ 8 is attainable with $A \in \mathbb{F}_b^{m \times k}$ 9 even for exponent spreads as large as $B \in \mathbb{F}_b^{k \times n}$ 0 (Uchino et al., 30 Jan 2026).

3. Algorithmic Modes and Scaling Variants

Ozaki-II supports multiple modes for scaling factor determination:

Accurate Mode: Performs an auxiliary INT8 (or FP8) GEMM to estimate norm bounds tightly. This mode ensures the tightest possible bounds on integer operands but incurs an $B \in \mathbb{F}_b^{k \times n}$ 1 cost for the additional matmul (Kawakami et al., 28 Jun 2026).
Fast Mode (Cauchy–Schwarz Bound): Avoids the reference GEMM, using analytic bounds (row/column $B \in \mathbb{F}_b^{k \times n}$ 2 norms) computed in $B \in \mathbb{F}_b^{k \times n}$ 3. However, the original fast mode lacks scale invariance, potentially causing CRT failures for rescaled inputs. An improved, scale-invariant fast mode has been derived to guarantee CRT safety without extra cost and is the recommended default (Kawakami et al., 28 Jun 2026).
Hybrid/FP8-Specific Variants: For FP8 units, moduli up to 513 are supported using a Karatsuba-style two-fold integer splitting or a modular-reduction trick for perfect-square moduli. The number of required modular GEMMs is $B \in \mathbb{F}_b^{k \times n}$ 4 for fast mode and $B \in \mathbb{F}_b^{k \times n}$ 5 for accurate mode, with $B \in \mathbb{F}_b^{k \times n}$ 6 as low as 12–13 for full FP64 emulation (Uchino et al., 11 Mar 2026).

Mode	Overhead	Typical $B \in \mathbb{F}_b^{k \times n}$ 7 for FP64	Accuracy guarantee	Remarks
Accurate	$B \in \mathbb{F}_b^{k \times n}$ 8	10–13	Tightest bound on $B \in \mathbb{F}_b^{k \times n}$ 9	One extra reference GEMM
Fast (orig.)	$A'$ 0	10–13	Not scale-invariant	May fail for rescales
Fast (improved)	$A'$ 1	10–13	Scale-invariant, CRT-safe	No extra overhead

For complex matrices, Ozaki-II uses either full block-matrix embedding (doubling the inner dimension) or, preferably, the “3M” Karatsuba trick: computing three real Ozaki-II GEMMs per modulus to reconstruct the complex product (Uchino et al., 9 Dec 2025, Lu et al., 24 Jun 2026).

4. Practical Implementations, Optimization, and Performance

On GPU architectures with INT8 or FP8 matrix-multiply hardware (e.g., NVIDIA Hopper, Blackwell), Ozaki-II enables significant speedups over native FP64 GEMM:

INT8/FP8 implementations achieve throughput of $A'$ 2– $A'$ 3 PFLOPS for FP32 and FP64 accuracy on NVIDIA RTX 4090 and B200, compared to $A'$ 4– $A'$ 5 PFLOPS for peak native arithmetic (Uchino et al., 30 Jan 2026, Uchino et al., 11 Mar 2026, Lu et al., 24 Jun 2026).
Complex GEMM via Ozaki-II achieves $A'$ 6– $A'$ 7 speedup over native cuBLAS ZGEMM in double and single precision (Uchino et al., 9 Dec 2025, Lu et al., 24 Jun 2026).
Memory traffic is greatly reduced by in-register modular reductions and fused kernel designs. EmuGEMM, for example, eliminates global-memory round-trips for all intermediate INT32 results and reuses a single accumulator for 3M complex products (Lu et al., 24 Jun 2026).
Scalability and robustness: The algorithm is well-suited to batched/big matrix workloads, with workspace and memory complexity linear in the problem size and number of moduli (Uchino et al., 11 Mar 2026).

Performance is limited primarily by the number of modular GEMMs ( $A'$ 8), the reconstruction phase (negligible for large tile sizes), and memory bandwidth. The latency of the CRT/Garner reconstruction can be amortized or eliminated via batched, register-fused approaches (Matsuoka, 28 May 2026, Lu et al., 24 Jun 2026). Ozaki-II emulation can match or exceed the so-called “memory roof” for FP64 workloads even when hardware FP64 throughput is drastically reduced by architectural decisions.

5. Application Domains and Extensions

Ozaki Scheme II is broadly adopted wherever high-precision GEMM is required but hardware supports high-throughput only for low-precision matrix engines. Key applications encompass:

HPC Kernels: Emulated DGEMM, SpMV, GEMV, stencils, and FFT. In particular, the Ozaki-Bailey FFT reformulation maintains FP64 accuracy for 3D-FFTs on GPUs lacking high-throughput FP64 vector units (Matsuoka, 28 May 2026).
Mixed-precision scientific workloads: Serving as a default BLAS kernel or update routine in QR/LU factorizations, achieving full FP64 accuracy and grading guarantees (i.e., Demmel–Higham “grade A” tests) (Lu et al., 24 Jun 2026, Uchino et al., 9 Dec 2025).
Arbitrary-precision emulation: Ozaki-II generalizes to MPFR/MPBLAS environments, where “block splitting” approaches trade $A'$ 9 low-precision GEMMs for a single large high-precision GEMM (effective to $B'$ 0– $B'$ 1 bits) (Kouya, 2023).
Emerging hardware: On platforms where native INT8 throughput is reduced (NVIDIA Blackwell Ultra, Rubin R200), the FP8 variant—with Karatsuba and square-modulus optimizations—achieves comparable or superior performance and accuracy (Uchino et al., 11 Mar 2026, Matsuoka, 28 May 2026).

6. Key Insights, Limitations, and Best Practices

Comparison to Ozaki-I: Whereas Ozaki Scheme I requires $B'$ 2 “slice” products for full FP64 accuracy (e.g., 121 FP8 matmuls for $B'$ 3 slices), Ozaki-II achieves the same with $B'$ 4 modular products (e.g., 36 FP8 matmuls for $B'$ 5). This reduction is particularly significant for accelerators with a large $B'$ 6 (Ozaki et al., 10 Apr 2025, Uchino et al., 11 Mar 2026).
Scaling and Uniqueness Condition: Ensuring $B'$ 7 is crucial. Practical implementations provide both analytic (fast, scale-invariant) and reference-matmul (accurate) approaches, with the former recommended for throughput and robustness (Kawakami et al., 28 Jun 2026).
FP8-specific tricks: For FP8 tensor-core hardware, a hybrid algorithm using Karatsuba splits and modular reduction for perfect squares is required to lift the modulus size, minimizing the total count of GEMMs and DRAM footprint (Uchino et al., 11 Mar 2026).
Limitations and Edge Cases: For extremely ill-conditioned inputs or very small $B'$ 8, the CRT overhead and workspace can become non-negligible. INT32 overflow must be avoided; for large $B'$ 9, blocking in the $A' = \operatorname{trunc}(\operatorname{diag}(2^{\mu}) A),\quad B' = \operatorname{trunc}(B \operatorname{diag}(2^{\nu}))$ 0-dimension may be necessary (Kawakami et al., 28 Jun 2026).
Reproducibility and Memory: Ensuring bitwise-identical results with modular arithmetic is straightforward. However, peak workspace can be substantial for large $A' = \operatorname{trunc}(\operatorname{diag}(2^{\mu}) A),\quad B' = \operatorname{trunc}(B \operatorname{diag}(2^{\nu}))$ 1 or big matrix sizes; blocking and pipelining are employed in practice (Uchino et al., 11 Mar 2026).

7. Outlook and Impact on the Hardware–Software Co-Design Landscape

Ozaki-II is increasingly regarded as the canonical bridge from high-precision numerical science to AI-optimized accelerators. On architectures where FP64 silicon is de-emphasized, emulation via Ozaki-II restores memory-wall and compute-bound performance for all canonical dense and sparse linear algebra kernels, making FP8 and INT8 sufficient for production HPC workloads (Matsuoka, 28 May 2026, Matsuoka, 28 May 2026). The underlying Tensor–Memory Equilibrium model formalizes the attainable ceiling given the number of modular GEMMs, memory traffic, and CRT reduction latency. Further research directions include deeper reductions in workspace for FP8 modular arithmetic, extensions to batched and sparse kernels, and improved automation of scaling factor selection to guarantee CRT recovery while minimizing overhead (Kawakami et al., 28 Jun 2026, Uchino et al., 11 Mar 2026).

The adoption of Ozaki-II has led to the obsolescence of the assumption that hardware FP64 is mandatory for double-precision scientific computing. It has enabled software-based recovery of peak throughput on emerging AI and scientific architectures and unlocked new lines of algorithm–hardware co-design. The approach is now foundational in both academic BLAS research and production HPC libraries.