Papers
Topics
Authors
Recent
Search
2000 character limit reached

Ozaki Scheme I for High-Precision GEMM

Updated 26 June 2026
  • Ozaki Scheme I is a decomposition-based algorithm that splits high-precision matrix multiplication into low-precision slices, enabling efficient aggregation of cross-products.
  • It methodically decomposes each operand into S-bit slices and uses optimized GEMM routines, with high-precision accumulation ensuring minimal rounding errors.
  • The algorithm leverages modern hardware advantages such as GPU tensor cores to achieve significant performance gains compared to traditional high-precision methods.

Ozaki Scheme I is a decomposition-based algorithm for computing high-precision matrix products using only multiple calls to highly optimized, lower-precision matrix multiplication routines. By slicing each operand into lower-precision “digits” or “slices,” the scheme expresses the exact or nearly exact high-precision result as a sum of low-precision cross-products, allowing the aggregation of modern hardware’s drastic performance advantage for small-precision GEMM units to solve high-precision scientific problems efficiently.

1. Principle and Mathematical Formulation

Given matrices ARm×kA \in \mathbb{R}^{m \times k} and BRk×nB \in \mathbb{R}^{k \times n}, both stored in a high (long) precision (mantissa LL bits), the core idea is to decompose each entry into sums of SS-bit slices: A=i=1DA(i),B=j=1DB(j),A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)}, with each A(i),B(j)A^{(i)}, B^{(j)} exactly representable in a short-precision format (mantissa SLS \ll L).

The matrix product is then evaluated as

C=AB=i=1Dj=1DA(i)B(j)=i,jGEMMS(A(i),B(j)),C = A B = \sum_{i=1}^D \sum_{j=1}^D A^{(i)} B^{(j)} = \sum_{i,j} {\rm GEMM}_S(A^{(i)}, B^{(j)}),

where each GEMMS{\rm GEMM}_S is performed in SS-bit precision, typically with highly efficient vendor BLAS or hardware tensor core routines. The accumulation into BRk×nB \in \mathbb{R}^{k \times n}0 is performed in high-precision arithmetic (exact in BRk×nB \in \mathbb{R}^{k \times n}1 bits or higher).

The parameter BRk×nB \in \mathbb{R}^{k \times n}2 is the number of slices per operand. For full-precision results, it must satisfy

BRk×nB \in \mathbb{R}^{k \times n}3

ensuring that products of slices do not overlap and each product is computed without rounding error in BRk×nB \in \mathbb{R}^{k \times n}4-bit arithmetic (Utsugiri et al., 2023).

2. Decomposition Algorithm (Slicing)

Any high-precision scalar or matrix entry BRk×nB \in \mathbb{R}^{k \times n}5 is recursively split:

  • BRk×nB \in \mathbb{R}^{k \times n}6
  • For BRk×nB \in \mathbb{R}^{k \times n}7:
    • BRk×nB \in \mathbb{R}^{k \times n}8 round or truncate BRk×nB \in \mathbb{R}^{k \times n}9 to LL0 bits (short-precision)
    • LL1 (computed in high precision)
  • LL2

This process is repeated for all entries. The blockwise (matrix) version applies the same logic row- or column-wise.

For hardware integer units (e.g., INT8 tensor cores), slices correspond to blocks of bits in the mantissa, often with a scaling factor per row/column to ensure exponent alignment. The decomposition can also leverage two-level blocking for memory efficiency (Mukunoki, 1 Aug 2025, Ootomo et al., 2023, Lu et al., 24 Jun 2026).

3. Computational Workflow and Pseudocode

The generic Ozaki Scheme I pseudocode is:

A(i),B(j)A^{(i)}, B^{(j)}7 (Utsugiri et al., 2023, Schwarz et al., 16 Nov 2025).

In fused-kernel (GPU) implementations, all cross-products and their accumulation are performed in a single kernel, eliminating off-chip round-trips for partial sums and maximizing arithmetic intensity (Lu et al., 24 Jun 2026).

4. Error Analysis and Parameter Choice

The approximation error comprises:

  • Truncation error from residual slices, bounded by LL3.
  • Low-precision GEMM error from each slice product; if the cross-product is exact (i.e., no rounding because of sufficiently small inputs and large enough accumulator/range), this is zero.
  • Accumulation error from summing LL4 terms in high precision.

The forward error satisfies: LL5 Choosing LL6 as above ensures the error is within the target precision's machine epsilon.

For hardware with limited accumulator precision (e.g., INT32 in INT8 x INT8 → INT32 TCs), the slice width is constrained by accumulator size, and the number of slices LL7 must respect overflow bounds (Lu et al., 24 Jun 2026, Ootomo et al., 2023). Exponent-span-based estimators further refine LL8 for diverse input distributions (Schwarz et al., 16 Nov 2025).

5. Hardware Mapping, Extensions, and Implementational Variants

Scenario Slicing Format Kernel Type Accumulation Notes
FP64 → FP16/FP8 gemms Float (e.g. 16b) cuBLAS/TC FP16/FP8 FP64 Biased for hardware throughput
FP64 → INT8 TCs Integer (8b) IMMA (INT8×INT8→INT32) FP64 Exponent alignment for range
MPFR arbitrary precision Float (e.g. 64b) High-precision SW BLAS MPFR Arbitrarily many slices
Fused GPU kernel INT8/FP8 (8b) Persistent/block kernel FP32/FP64 On-chip accumulation, optimized for minimal memory traffic (Lu et al., 24 Jun 2026)

Slice decomposition employs scaling strategies for range, and unsigned-integer slice encoding further reduces overhead (Schwarz et al., 16 Nov 2025). On modern GPU hardware, the triangular product schedule and in-register accumulation enable high utilization, with sustained performance up to 80%–90% of INT8 peak (Lu et al., 24 Jun 2026). FPGA, CPU (MKL), and custom quantum circuit frameworks employ similar schemes (Ootomo et al., 2023).

6. Complexity and Performance

Let LL9 denote matrix dimension, SS0 the high-precision mantissa bits, SS1 the slice precision, and SS2 the speedup factor of SS3-bit GEMM over SS4-bit GEMM. Then:

  • Classical high-precision GEMM: SS5 SS6-bit operations
  • Strassen: SS7
  • Ozaki I: SS8 calls to SS9 A=i=1DA(i),B=j=1DB(j),A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},0-bit GEMMs, plus A=i=1DA(i),B=j=1DB(j),A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},1 slicing/accumulation

The total leading cost is A=i=1DA(i),B=j=1DB(j),A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},2 for square matrices. For moderate A=i=1DA(i),B=j=1DB(j),A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},3 (e.g., A=i=1DA(i),B=j=1DB(j),A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},4), and A=i=1DA(i),B=j=1DB(j),A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},5 in the low thousands, Ozaki I frequently outperforms high-precision Strassen and direct software GEMMs, sometimes by factors exceeding A=i=1DA(i),B=j=1DB(j),A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},6 (Utsugiri et al., 2023, Kouya, 2023).

On GPUs with fused persistent-kernel implementations, Scheme I achieves throughput of up to A=i=1DA(i),B=j=1DB(j),A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},7 Top/s (Hopper, A=i=1DA(i),B=j=1DB(j),A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},8 INT8 peak) and A=i=1DA(i),B=j=1DB(j),A = \sum_{i=1}^D A^{(i)}, \qquad B = \sum_{j=1}^D B^{(j)},9 Top/s (Blackwell, A(i),B(j)A^{(i)}, B^{(j)}0) (Lu et al., 24 Jun 2026), with up to A(i),B(j)A^{(i)}, B^{(j)}1 (FP8TC) acceleration over hardware FP64 (Mukunoki, 1 Aug 2025).

7. Applicability, Extensions, and Limitations

Ozaki Scheme I generalizes to:

  • Complex matrices: Admits the 3M/4M strategies, combining slicing with minimal real GEMMs (Kouya, 2023).
  • LU decomposition and other BLAS: Enables efficient batched blocks for high-precision panel updates (Kouya, 2023).
  • FFT emulation: Underpins high-precision FFT by splitting input vectors/convolutions and summing low-precision transforms, including integer-based NTT/CRT for error-free accumulation (Kawakami et al., 31 Mar 2026).

Limitations:

  • For very high bit-widths (A(i),B(j)A^{(i)}, B^{(j)}2) or extremely large A(i),B(j)A^{(i)}, B^{(j)}3, A(i),B(j)A^{(i)}, B^{(j)}4 scaling becomes prohibitive, so Strassen or direct approaches dominate.
  • Fused-kernel memory footprint grows with A(i),B(j)A^{(i)}, B^{(j)}5; on-chip buffer budgets may force kernel tiling for large A(i),B(j)A^{(i)}, B^{(j)}6 on current GPU architectures (Lu et al., 24 Jun 2026).
  • For small input sizes, decomposition and pre/post-processing overheads can exceed the GEMM cost.
  • The method assumes reliable higher-precision accumulation on the host hardware; if not available, accuracy is not guaranteed.

Practical variants integrate automatic slice tuning (via exponent-span estimators), unsigned/signed residue tracking, and host-transparent fallback policies to ensure correctness across diverse input matrices (Schwarz et al., 16 Nov 2025).

References

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Ozaki Scheme I.