
Ozaki-II Scheme: Emulating FP64 on Low-Precision Hardware

Updated 10 December 2025
  • Ozaki-II Scheme is a framework that emulates high-precision (FP64) matrix multiplication using low-precision kernels through decomposition techniques like CRT and slice-based expansion.
  • It leverages integer modular methods and quantized slicing to perform multiple low-precision GEMMs that, when reconstructed, yield results matching high-precision accuracy.
  • The scheme enhances performance on modern hardware, achieving speedups up to 16× and supporting applications in quantum chemistry, signal processing, and large-scale simulations.

The Ozaki-II Scheme is a framework for emulating high-precision matrix multiplication using only low-precision matrix multiplication kernels, such as integer or reduced-precision floating-point units. It is specifically designed to leverage the high computational throughput offered by modern AI accelerators and specialized hardware, allowing double-precision (FP64) or even higher-precision operations to be executed on hardware natively supporting only low-precision arithmetic. The central innovation is a decomposition and reconstruction strategy—using either integer modular techniques (Chinese Remainder Theorem, CRT) or low-precision slices—to ensure the output matches the precision and accuracy of true high-precision matrix multiplications, while greatly improving performance over direct high-precision implementations (Ozaki et al., 10 Apr 2025, Mukunoki, 1 Aug 2025, Schwarz et al., 16 Nov 2025, Uchino et al., 9 Dec 2025).

1. Key Concepts and Historical Context

The classical matrix-matrix multiplication (GEMM) is a cornerstone of numerical linear algebra. Standard BLAS implementations optimize GEMM for the highest accuracy and throughput permitted by the hardware. However, the transition of hardware architectures to favor low-precision matrix engines—such as INT8 or FP8 Tensor Cores—creates an acute mismatch for traditional scientific computing, which demands double-precision accuracy. The original Ozaki scheme ("Ozaki-I") addressed this by decomposing input matrices into multiple low-precision floating-point slices, performing all cross-products, and reconstructing the high-precision result. The Ozaki-II scheme, introduced by Ozaki, Uchino, and Imamura, generalizes and surpasses this paradigm by exploiting integer modular decompositions and CRT-based reconstruction, leading to superior scaling properties and hardware portability (Ozaki et al., 10 Apr 2025, Uchino et al., 9 Dec 2025).

2. Algorithmic Principles and Formulation

The Ozaki-II framework encompasses two principal methodologies:

  • CRT-Driven Modular Decomposition: Input matrices are first scaled into integer form using diagonal scaling matrices (with power-of-two scaling for exact reversibility). Each integer matrix is reduced modulo a set of pairwise-coprime moduli $\{m_1,\dots,m_s\}$, chosen to fit the target low-precision input width (e.g., $m_i \leq 256$ for INT8). Each pair of reduced matrices undergoes a matrix multiplication (GEMM) modulo $m_i$. The set of outputs $\{C^{(i)}\}$ is then combined using the Chinese Remainder Theorem to reconstruct the integer product, after which inverse scaling is applied to recover the floating-point result (Ozaki et al., 10 Apr 2025, Uchino et al., 9 Dec 2025):

$$C = D^{-1}\,X\,E^{-1},\qquad X \equiv \sum_{i=1}^s C^{(i)} M_i y_i \pmod M,$$

where $M_i = M/m_i$, $y_i \equiv M_i^{-1} \pmod{m_i}$, $M = \prod_i m_i$, and $C^{(i)}$ is the modulo-$m_i$ GEMM product.

  • Slice-Based Low-Precision Expansion: Alternatively, for floating-point centric hardware (FP8/FP16 Tensor Cores), the inputs are expressed as sums of shifted, quantized blocks (“slices”) such that the accumulation of per-slice products and subsequent rescaling exactly replicates the high-precision output. This approach is carefully balanced to prevent any slice overflow while minimizing the decomposition overhead (Mukunoki, 1 Aug 2025, Schwarz et al., 16 Nov 2025).

Both routes can be further optimized via data-dependent heuristics, such as the Exponent-Span-Capacity (ESC) estimator, which predicts the minimal safe decomposition parameters for double-precision recovery (Schwarz et al., 16 Nov 2025).
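The CRT recombination formula above can be sketched in a few lines. The following minimal NumPy illustration (not the papers' implementation) uses ordinary integer matmuls as stand-ins for INT8 Tensor Core GEMMs and, for simplicity, nonnegative residues rather than the centered range:

```python
import numpy as np
from math import prod

def crt_matmul(A, B, moduli):
    """Exact integer matrix product via per-modulus GEMMs plus CRT.

    Requires every entry of A @ B to be below prod(moduli) for a unique
    reconstruction (entries here are nonnegative)."""
    M = prod(moduli)
    X = np.zeros((A.shape[0], B.shape[1]), dtype=object)
    for m in moduli:
        Mi = M // m                     # M_i = M / m_i
        yi = pow(Mi, -1, m)             # y_i = M_i^{-1} mod m_i
        Ci = (A % m) @ (B % m) % m      # stand-in for a low-precision GEMM
        X = (X + Ci.astype(object) * (Mi * yi)) % M
    return X

rng = np.random.default_rng(0)
A = rng.integers(0, 1000, (4, 5))
B = rng.integers(0, 1000, (5, 3))
moduli = [251, 253, 255, 256]           # pairwise coprime, each <= 256
assert np.array_equal(crt_matmul(A, B, moduli), A @ B)
```

Each per-modulus GEMM only ever sees entries below 256, which is exactly what allows the real scheme to run these products on INT8 hardware.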

3. Detailed Workflow and Pseudocode

A high-level outline of the Ozaki-II CRT-based algorithm is as follows (Ozaki et al., 10 Apr 2025, Uchino et al., 9 Dec 2025):

  1. Scaling: Define diagonal scaling matrices $D$, $E$ (powers of two) so that $A' = \mathrm{trunc}(D A)$ and $B' = \mathrm{trunc}(B E)$ are integer matrices. The product then satisfies $C = D^{-1} (A'B') E^{-1}$ up to the truncation error.
  2. Reduction modulo coprime moduli: Select $s$ pairwise-coprime integers $m_1,\dots,m_s$. Form $A^{(i)} = A' \bmod m_i$ and $B^{(i)} = B' \bmod m_i$, mapping entries to $[-m_i/2, m_i/2]$.
  3. Low-precision GEMM products: Compute $C^{(i)} = A^{(i)} B^{(i)}$ for each $i$ by a low-precision GEMM kernel, ensuring no overflow.
  4. CRT reconstruction: Accumulate $Z = \sum_{i=1}^s C^{(i)} M_i y_i$, and reduce $X = Z \bmod M$.
  5. Inverse scaling: Recover $C = D^{-1} X E^{-1}$ in the desired floating-point representation.
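The five steps can be strung together into a runnable sketch. This NumPy version is illustrative only: `t` (the bit width retained per scaled entry) and the modulus set are demo-sized, the modular GEMMs run as int64 matmuls rather than on INT8 units, and the test inputs are chosen to be exactly representable so the emulated product matches the FP64 GEMM bit for bit:

```python
import numpy as np
from math import prod

def ozaki2_gemm(A, B, moduli, t=16):
    """Sketch of the five-step CRT workflow (demo parameters, not tuned)."""
    # Step 1: power-of-two row/column scaling, then truncation to integers.
    d = 2.0 ** (t - np.frexp(np.abs(A).max(axis=1, keepdims=True))[1])
    e = 2.0 ** (t - np.frexp(np.abs(B).max(axis=0, keepdims=True))[1])
    Ai = np.trunc(A * d).astype(np.int64)
    Bi = np.trunc(B * e).astype(np.int64)
    # Steps 2-4: modular reduction, per-modulus GEMMs, CRT recombination.
    M = prod(moduli)
    X = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for m in moduli:
        Mi = M // m
        yi = pow(Mi, -1, m)                # y_i = M_i^{-1} mod m_i
        Ci = (Ai % m) @ (Bi % m) % m       # low-precision GEMM stand-in
        X = (X + Ci * (Mi * yi)) % M
    X = np.where(X > M // 2, X - M, X)     # map back to signed residues
    # Step 5: inverse scaling.
    return X.astype(np.float64) / d / e

# Inputs exactly representable in few bits, so the emulation is exact here.
rng = np.random.default_rng(1)
A = rng.integers(-2**12, 2**12, (8, 50)).astype(np.float64) / 2**10
B = rng.integers(-2**12, 2**12, (50, 6)).astype(np.float64) / 2**10
moduli = [256, 255, 253, 251, 247, 241]    # pairwise coprime, <= 256
assert np.array_equal(ozaki2_gemm(A, B, moduli), A @ B)
```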

For slice-based expansions, the steps involve partitioning inputs into fixed-point slices with controlled bitwidth, performing all slice-wise GEMMs, and rescaling/accumulating to recover the output (Mukunoki, 1 Aug 2025).
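A simplified sketch of this slice-based route follows, with float64 matmuls standing in for the FP16/FP8 tensor-core kernels and demo-sized `num_slices` and slice width `w` (the real schemes tune these, and this sketch takes all cross-slice products rather than any reduced set). The leading slice carries the sign and the remaining slices are nonnegative, echoing the unsigned-slicing refinement described later:

```python
import numpy as np

def split_slices(Ai, num_slices, w):
    """Split an integer matrix into num_slices w-bit slices, high to low."""
    slices, R = [], Ai.copy()
    for j in range(num_slices):
        shift = w * (num_slices - 1 - j)
        S = R >> shift          # floor division; leading slice keeps the sign
        slices.append((S, shift))
        R -= S << shift         # remainder is nonnegative from here on
    return slices

def slice_gemm(A, B, num_slices=2, w=10):
    """Scale inputs to integers, split into shifted w-bit slices, take all
    cross-slice GEMMs, and rescale/accumulate into the output."""
    t = num_slices * w
    d = 2.0 ** (t - np.frexp(np.abs(A).max(axis=1, keepdims=True))[1])
    e = 2.0 ** (t - np.frexp(np.abs(B).max(axis=0, keepdims=True))[1])
    Ai = np.trunc(A * d).astype(np.int64)
    Bi = np.trunc(B * e).astype(np.int64)
    C = np.zeros((A.shape[0], B.shape[1]))
    for SA, shA in split_slices(Ai, num_slices, w):
        for SB, shB in split_slices(Bi, num_slices, w):
            C += (SA.astype(np.float64) @ SB.astype(np.float64)) * 2.0 ** (shA + shB)
    return C / d / e

rng = np.random.default_rng(2)
A = rng.integers(-2**12, 2**12, (6, 50)).astype(np.float64) / 2**10
B = rng.integers(-2**12, 2**12, (50, 4)).astype(np.float64) / 2**10
assert np.array_equal(slice_gemm(A, B), A @ B)
```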

Pseudocode for these procedures is explicitly presented in (Ozaki et al., 10 Apr 2025, Uchino et al., 9 Dec 2025).

4. Complexity, Error Analysis, and Parameter Selection

A distinguishing property of Ozaki-II is that the number of low-precision GEMMs scales linearly with the desired precision: if $s$ moduli are chosen, $s$ GEMMs are performed, compared to $O(k^2)$ GEMMs in Ozaki-I for $k$ slices. The choice of $s$ is dictated by the product range: $M = \prod_i m_i$ must exceed twice the maximal entry magnitude in $A'B'$ for a unique CRT reconstruction.

For exactness, it suffices that

$$|(A'B')_{jk}| < M/2 \quad \forall\, j, k,$$

with $m_i$ bounded by the hardware (e.g., $m_i \le 256$ for INT8).

In terms of numerical error, the operations inside each modulus are exact integer arithmetic; the only approximations stem from the initial and final scaling and any floating-point conversion, and because power-of-two scaling is exactly invertible, no rounding is introduced beyond the initial truncation. For slice-based schemes, the cumulative rounding error can be bounded by $O(s^2 u)$, where $u$ is the unit roundoff of the low-precision datatype and $s$ is adjusted according to the Exponent-Span-Capacity estimate to guarantee the target accuracy (e.g., FP64-level) (Schwarz et al., 16 Nov 2025).
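The uniqueness bound above also yields a back-of-the-envelope count of how many moduli are needed. This greedy helper and its modulus pool are illustrative (the papers' moduli sets may differ), but the arithmetic lands inside the $14$–$18$ range quoted later for FP64 on GPUs:

```python
def moduli_needed(t, k, pool):
    """Smallest number s of moduli (taken greedily from pool) such that
    M = m_1 * ... * m_s exceeds twice the worst-case entry of A'B', i.e.
    2 * k * 2**(2*t) for t-bit scaled entries and inner dimension k."""
    bound = 2 * k * 2 ** (2 * t)
    M, s = 1, 0
    for m in pool:
        M *= m
        s += 1
        if M > bound:
            return s
    raise ValueError("modulus pool too small for the requested range")

# Pairwise-coprime moduli fitting the INT8 input range:
pool = [256, 255, 253, 251, 247, 241, 239, 233, 229, 227, 223,
        211, 199, 197, 193, 191, 181, 179]
# Near-full 53-bit scaled mantissas, inner dimension k = 8192:
print(moduli_needed(t=53, k=8192, pool=pool))   # -> 16
```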

5. Hardware Implementation and Performance

Ozaki-II is designed to optimally employ low-precision hardware such as INT8 or FP8 Tensor Cores available on NVIDIA (GH200, B200, H100, RTX 4090, etc.) and modern AMD GPUs, or wide-issue FP64 SIMD units on CPUs. Empirical benchmarks consistently demonstrate substantial speedups:

  • On NVIDIA GH200 and RTX 4090, FP64 emulation with Ozaki-II achieves 7.4–9.8 TFLOPS (RTX 4090) and 56.6–80.2 TFLOPS (GH200), exceeding native FP64 rates by up to 16× on consumer GPUs (Ozaki et al., 10 Apr 2025).
  • For quadruple-precision emulation on CPUs, Ozaki-II outperforms traditional schemes by 1.6×–2.1× (Ozaki et al., 10 Apr 2025).
  • Complex matrix multiplication emulated via CRT-based Ozaki-II with Karatsuba real embedding achieves a 4.0×–6.5× speedup over native routines on NVIDIA B200 (Uchino et al., 9 Dec 2025).
  • On Blackwell architecture, full DGEMM accuracy (FP64) is reached at over 80 TFLOPS in FP8 on hardware whose native FP64 is two orders of magnitude slower (Mukunoki, 1 Aug 2025).

A summary of performance results, including best-case speedups under different hardware and precision regimes, is shown in the following table (Ozaki et al., 10 Apr 2025, Uchino et al., 9 Dec 2025, Schwarz et al., 16 Nov 2025):

| Hardware | Mode | Speedup (vs native FP64) |
|---|---|---|
| RTX 4090 | Ozaki-II (INT8) | 16× |
| GH200 | Ozaki-II (INT8) | 1.3×–1.4× |
| B200 | Complex GEMM | 4.0×–6.5× |
| Blackwell GB200 | Ozaki-II (FP8) | 2.3× |
| RTX 6000 Pro | Ozaki-II (FP8) | 13.2× |

6. Variants, Extensions, and Practical Guidelines

Significant extensions to Ozaki-II enhance both breadth and efficiency:

  • Unsigned integer slicing: All but the sign-carrying slice are packed as unsigned, maximizing the number of effective mantissa bits per slice and reducing the total slice/GEMM count (Schwarz et al., 16 Nov 2025).
  • Automatic Dynamic Precision (ADP): ESC-based heuristics estimate the minimum safe number of slices required for FP64 accuracy at run time, reducing overhead and guaranteeing reliability across diverse matrix inputs and workloads (Schwarz et al., 16 Nov 2025).
  • Complex GEMM emulation: Via Karatsuba arithmetic and the CRT, complex single- and double-precision multiplication is achieved with modular INT8 kernels, sustaining high accuracy and throughput (Uchino et al., 9 Dec 2025).
  • Blocking optimization: Partitioning the kk-dimension ("k-blocking") for slice-based decomposition reduces memory use and improves cache/tensor core utilization (Mukunoki, 1 Aug 2025).
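The k-blocking idea in the last bullet can be sketched generically; here `gemm` is a placeholder for any emulated low-precision product (such as the sketches above), and the panel width `kb` is an illustrative tuning knob:

```python
import numpy as np

def blocked_gemm(A, B, kb=64, gemm=np.matmul):
    """k-blocking: process the inner dimension in panels of width kb and
    accumulate per-panel products into C, so each panel's decomposed
    slices stay small enough to remain resident in cache/shared memory."""
    k = A.shape[1]
    C = np.zeros((A.shape[0], B.shape[1]))
    for p in range(0, k, kb):
        C += gemm(A[:, p:p+kb], B[p:p+kb, :])
    return C

rng = np.random.default_rng(3)
A, B = rng.standard_normal((32, 300)), rng.standard_normal((300, 16))
assert np.allclose(blocked_gemm(A, B), A @ B)
```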

Practical deployment involves trade-offs: Ozaki-II is preferable on hardware with fast integer Tensor Cores or when requiring precision beyond double (e.g., quadruple). Ozaki-I may be preferable if using FP16/FP32 Tensor Cores and only moderate extra precision is needed. The number of moduli ($s$) is the main tunable parameter, with typical values $14 \leq s \leq 18$ for FP64 on GPUs and $20 \leq s \leq 25$ for quadruple precision (Ozaki et al., 10 Apr 2025).

7. Applications and Impact

Ozaki-II methods are now a core component in high-performance scientific computing, particularly where hardware-imposed precision bottlenecks would otherwise constrain accuracy or throughput. They have demonstrated compelling results in quantum chemistry, quantum circuit simulation, signal processing, accelerated linear algebra, and any domain reliant on large-scale GEMM operations (Ootomo et al., 2023, Uchino et al., 9 Dec 2025). Integration within linear algebra libraries (e.g., cuSOLVER) enables transparent emulation of FP64 accuracy on AI-focused hardware, cementing the Ozaki-II scheme’s role as a foundational building block for forthcoming exascale scientific workloads (Schwarz et al., 16 Nov 2025).
