Ozaki Scheme I for High-Precision GEMM
- Ozaki Scheme I is a decomposition-based algorithm that splits high-precision matrix multiplication into low-precision slices, enabling efficient aggregation of cross-products.
- It methodically decomposes each operand into S-bit slices and uses optimized GEMM routines, with high-precision accumulation ensuring minimal rounding errors.
- The algorithm leverages modern hardware advantages such as GPU tensor cores to achieve significant performance gains compared to traditional high-precision methods.
Ozaki Scheme I is a decomposition-based algorithm for computing high-precision matrix products using only multiple calls to highly optimized, lower-precision matrix multiplication routines. By slicing each operand into lower-precision “digits” or “slices,” the scheme expresses the exact or nearly exact high-precision result as a sum of low-precision cross-products, allowing the aggregation of modern hardware’s drastic performance advantage for small-precision GEMM units to solve high-precision scientific problems efficiently.
1. Principle and Mathematical Formulation
Given matrices and , both stored in a high (long) precision (mantissa bits), the core idea is to decompose each entry into sums of -bit slices: with each exactly representable in a short-precision format (mantissa ).
The matrix product is then evaluated as
where each is performed in -bit precision, typically with highly efficient vendor BLAS or hardware tensor core routines. The accumulation into 0 is performed in high-precision arithmetic (exact in 1 bits or higher).
The parameter 2 is the number of slices per operand. For full-precision results, it must satisfy
3
ensuring that products of slices do not overlap and each product is computed without rounding error in 4-bit arithmetic (Utsugiri et al., 2023).
2. Decomposition Algorithm (Slicing)
Any high-precision scalar or matrix entry 5 is recursively split:
- 6
- For 7:
- 8 round or truncate 9 to 0 bits (short-precision)
- 1 (computed in high precision)
- 2
This process is repeated for all entries. The blockwise (matrix) version applies the same logic row- or column-wise.
For hardware integer units (e.g., INT8 tensor cores), slices correspond to blocks of bits in the mantissa, often with a scaling factor per row/column to ensure exponent alignment. The decomposition can also leverage two-level blocking for memory efficiency (Mukunoki, 1 Aug 2025, Ootomo et al., 2023, Lu et al., 24 Jun 2026).
3. Computational Workflow and Pseudocode
The generic Ozaki Scheme I pseudocode is:
7 (Utsugiri et al., 2023, Schwarz et al., 16 Nov 2025).
In fused-kernel (GPU) implementations, all cross-products and their accumulation are performed in a single kernel, eliminating off-chip round-trips for partial sums and maximizing arithmetic intensity (Lu et al., 24 Jun 2026).
4. Error Analysis and Parameter Choice
The approximation error comprises:
- Truncation error from residual slices, bounded by 3.
- Low-precision GEMM error from each slice product; if the cross-product is exact (i.e., no rounding because of sufficiently small inputs and large enough accumulator/range), this is zero.
- Accumulation error from summing 4 terms in high precision.
The forward error satisfies: 5 Choosing 6 as above ensures the error is within the target precision's machine epsilon.
For hardware with limited accumulator precision (e.g., INT32 in INT8 x INT8 → INT32 TCs), the slice width is constrained by accumulator size, and the number of slices 7 must respect overflow bounds (Lu et al., 24 Jun 2026, Ootomo et al., 2023). Exponent-span-based estimators further refine 8 for diverse input distributions (Schwarz et al., 16 Nov 2025).
5. Hardware Mapping, Extensions, and Implementational Variants
| Scenario | Slicing Format | Kernel Type | Accumulation | Notes |
|---|---|---|---|---|
| FP64 → FP16/FP8 gemms | Float (e.g. 16b) | cuBLAS/TC FP16/FP8 | FP64 | Biased for hardware throughput |
| FP64 → INT8 TCs | Integer (8b) | IMMA (INT8×INT8→INT32) | FP64 | Exponent alignment for range |
| MPFR arbitrary precision | Float (e.g. 64b) | High-precision SW BLAS | MPFR | Arbitrarily many slices |
| Fused GPU kernel | INT8/FP8 (8b) | Persistent/block kernel | FP32/FP64 | On-chip accumulation, optimized for minimal memory traffic (Lu et al., 24 Jun 2026) |
Slice decomposition employs scaling strategies for range, and unsigned-integer slice encoding further reduces overhead (Schwarz et al., 16 Nov 2025). On modern GPU hardware, the triangular product schedule and in-register accumulation enable high utilization, with sustained performance up to 80%–90% of INT8 peak (Lu et al., 24 Jun 2026). FPGA, CPU (MKL), and custom quantum circuit frameworks employ similar schemes (Ootomo et al., 2023).
6. Complexity and Performance
Let 9 denote matrix dimension, 0 the high-precision mantissa bits, 1 the slice precision, and 2 the speedup factor of 3-bit GEMM over 4-bit GEMM. Then:
- Classical high-precision GEMM: 5 6-bit operations
- Strassen: 7
- Ozaki I: 8 calls to 9 0-bit GEMMs, plus 1 slicing/accumulation
The total leading cost is 2 for square matrices. For moderate 3 (e.g., 4), and 5 in the low thousands, Ozaki I frequently outperforms high-precision Strassen and direct software GEMMs, sometimes by factors exceeding 6 (Utsugiri et al., 2023, Kouya, 2023).
On GPUs with fused persistent-kernel implementations, Scheme I achieves throughput of up to 7 Top/s (Hopper, 8 INT8 peak) and 9 Top/s (Blackwell, 0) (Lu et al., 24 Jun 2026), with up to 1 (FP8TC) acceleration over hardware FP64 (Mukunoki, 1 Aug 2025).
7. Applicability, Extensions, and Limitations
Ozaki Scheme I generalizes to:
- Complex matrices: Admits the 3M/4M strategies, combining slicing with minimal real GEMMs (Kouya, 2023).
- LU decomposition and other BLAS: Enables efficient batched blocks for high-precision panel updates (Kouya, 2023).
- FFT emulation: Underpins high-precision FFT by splitting input vectors/convolutions and summing low-precision transforms, including integer-based NTT/CRT for error-free accumulation (Kawakami et al., 31 Mar 2026).
Limitations:
- For very high bit-widths (2) or extremely large 3, 4 scaling becomes prohibitive, so Strassen or direct approaches dominate.
- Fused-kernel memory footprint grows with 5; on-chip buffer budgets may force kernel tiling for large 6 on current GPU architectures (Lu et al., 24 Jun 2026).
- For small input sizes, decomposition and pre/post-processing overheads can exceed the GEMM cost.
- The method assumes reliable higher-precision accumulation on the host hardware; if not available, accuracy is not guaranteed.
Practical variants integrate automatic slice tuning (via exponent-span estimators), unsigned/signed residue tracking, and host-transparent fallback policies to ensure correctness across diverse input matrices (Schwarz et al., 16 Nov 2025).
References
- (Utsugiri et al., 2023) Acceleration of Multiple Precision Matrix Multiplication using Ozaki scheme
- (Kouya, 2023) Acceleration of complex matrix multiplication using arbitrary precision floating-point arithmetic
- (Mukunoki, 1 Aug 2025) DGEMM without FP64 Arithmetic -- using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme
- (Ootomo et al., 2023) DGEMM on Integer Matrix Multiplication Unit
- (Lu et al., 24 Jun 2026) EmuGEMM: Fused Tensor Core Kernels for Precision Emulation in Matrix Multiplication
- (Schwarz et al., 16 Nov 2025) Guaranteed DGEMM Accuracy While Using Reduced Precision Tensor Cores Through Extensions of the Ozaki Scheme
- (Kawakami et al., 31 Mar 2026) Computing FFTs at Target Precision Using Lower-Precision FFTs
- (Terao et al., 22 Feb 2026) Forward Error-Oriented Iterative Refinement for Eigenvectors of a Real Symmetric Matrix