Split Integer Emulation of FP64
- Split integer emulation of FP64 decomposes double-precision numbers into low-precision components, computes on them with specialized low-precision hardware, and algorithmically reconstructs FP64-accurate results.
- It employs techniques such as table lookup, slicing with GEMM, and CRT-based methods to manage precision and reduce memory overhead in large-scale computations.
- The approach delivers notable performance and energy efficiency gains on modern GPUs, while addressing challenges like error accumulation and limited dynamic range.
Split integer emulation of FP64 refers to a class of numerical methods in which double-precision floating-point arithmetic (IEEE 754, 64-bit) is emulated by decomposing FP64 values and operations into lower-precision integer or floating-point computations, then recombining the results using algorithmic procedures. These techniques exploit high-throughput, low-precision hardware units—such as INT8/FP8 Tensor Cores on modern GPUs—to efficiently perform large-scale computations that require FP64 accuracy. The result is often substantial performance and energy benefits, with application to scientific computing, dense matrix operations, and numerical linear algebra.
1. Foundational Principles
The central idea of split integer FP64 emulation is to represent each FP64 number (and, by extension, FP64 operations) as a structured combination of low-precision segments—typically 8- or 16-bit integers or floating-point blocks—which are processed independently using hardware-accelerated routines. The emulation relies on error-free transformations or controlled-sum algorithms, such that the final recombination step restores the full FP64 precision.
A canonical form for splittings is

$$x \approx \sum_{i=1}^{s} x_i \, 2^{\sigma_i},$$

where the $x_i$ are the split components (e.g., INT8 values) and each $2^{\sigma_i}$ is a power-of-two scaling factor. For two numbers $x$ and $y$, products can be expanded as

$$x y \approx \sum_{i=1}^{s} \sum_{j=1}^{s} x_i \, y_j \, 2^{\sigma_i + \sigma_j},$$

with each $x_i y_j$ computed in low precision and later accumulated using high-precision or error-free techniques (Luszczek et al., 28 Sep 2025, Ootomo et al., 2023).
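To make the split-and-recombine step concrete, here is a minimal NumPy sketch that splits a scalar into truncated integer components with power-of-two scales and recovers a product from the pairwise partial products. The names `split_fp64` and `split_product` are illustrative; a real implementation would evaluate the partial products on INT8 units, not in Python.

```python
import numpy as np

def split_fp64(x, num_slices=4, bits=8):
    """Illustrative splitting: x ~ sum_i q_i * s_i, with integer q_i fitting
    in `bits` signed bits and power-of-two scales s_i."""
    _, e = np.frexp(x)                 # x = m * 2**e with 0.5 <= |m| < 1
    comps, scales, r = [], [], float(x)
    for i in range(num_slices):
        s = 2.0 ** (e - (bits - 1) * (i + 1))
        q = np.trunc(r / s)            # |q| <= 127 for bits=8
        comps.append(int(q)); scales.append(s)
        r -= q * s                     # exact in FP64 (error-free residual)
    return comps, scales

def split_product(xc, xs, yc, ys):
    """Recombine the pairwise partial products x_i * y_j."""
    return sum((xi * yj) * (si * sj)
               for xi, si in zip(xc, xs) for yj, sj in zip(yc, ys))

x, y = 1.0 / 3.0, np.pi
approx = split_product(*split_fp64(x), *split_fp64(y))
print(abs(approx - x * y))             # error shrinks as num_slices grows
```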
2. Split Emulation Methodologies
2.1 Table-Lookup Emulation for Subsets
For data files storing numbers with limited significant digits, a subset of FP64 values is exactly representable in a compact 32-bit form. The technique involves storing the sign, exponent, and high-order mantissa bits and reconstructing the remaining mantissa bits from a lookup table parameterized by selected bitfields (Neal, 2015). Encoding is a direct copy of the upper 32 bits; decoding is via a fast table lookup using a subset of mantissa/exponent bits:

$$\operatorname{decode}(h) = (h \ll 32) \,\vert\, T[\operatorname{idx}(h)],$$

where $h$ is the stored 32-bit word, $\operatorname{idx}(h)$ selects the index bitfields, and $T$ is the lookup table. This enables rapid, memory-efficient storage and fast reconstruction of FP64 values for large arrays, particularly in scientific data with limited precision.
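A minimal sketch of the encode/decode path follows. This is not Neal's exact bitfield layout: `INDEX_BITS`, `build_table`, `encode`, and `decode` are hypothetical names, and the toy subset assumes no index collisions.

```python
import struct

def f2u(x):  # FP64 -> 64-bit pattern
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def u2f(u):  # 64-bit pattern -> FP64
    return struct.unpack('<d', struct.pack('<Q', u))[0]

INDEX_BITS = 16  # assumed width of the table index

def build_table(subset):
    """Map selected bits of the upper word -> the low 32 mantissa bits
    they imply (assumes the subset makes this a well-defined function)."""
    table = {}
    for x in subset:
        u = f2u(x)
        idx = (u >> 32) & ((1 << INDEX_BITS) - 1)
        table[idx] = u & 0xFFFFFFFF
    return table

def encode(x):
    return f2u(x) >> 32               # direct copy of the upper 32 bits

def decode(hi, table):
    lo = table[hi & ((1 << INDEX_BITS) - 1)]
    return u2f((hi << 32) | lo)

subset = [0.5, 0.1, 123.25]           # stand-ins for a representable subset
t = build_table(subset)
assert all(decode(encode(x), t) == x for x in subset)
```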
2.2 Slicing and GEMM-based Emulation
For dense matrix multiplication, the Ozaki scheme and its variants ("shared-place" or "multi-slice") systematically split the input matrices $A$, $B$ into $s$ components each, mapping the mantissa via bit-level or fixed-point transformations. Each slice is processed using low-precision units (e.g., INT8 Tensor Cores), with the products scaled and summed according to block exponents (Ootomo et al., 2023). The generic computation is

$$AB \approx \sum_{i=1}^{s} \sum_{j=1}^{s} \left(2^{\alpha_i} \circ A_i\right)\left(2^{\beta_j} \circ B_j\right),$$

where $\circ$ denotes element-wise scaling and $s$ is selected based on accumulator round-off and inner-product length.
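A simplified NumPy sketch of the slicing idea follows. It uses per-row/per-column shared exponents, and plain FP64 matmul stands in for the INT8 Tensor-Core GEMMs; `slice_matrix` and `sliced_gemm` are illustrative names, not a published API.

```python
import numpy as np

def slice_matrix(A, num_slices=4, bits=7):
    """Split A into integer slices with a shared power-of-two scale per row,
    so that A ~ sum_i scales[i] * slices[i] (assumes no all-zero rows)."""
    e = np.floor(np.log2(np.max(np.abs(A), axis=1, keepdims=True))) + 1
    slices, scales, R = [], [], A.copy()
    for i in range(num_slices):
        s = 2.0 ** (e - bits * (i + 1))
        Q = np.trunc(R / s)            # entries fit in signed 8 bits for bits=7
        slices.append(Q); scales.append(s)
        R = R - Q * s                  # exact residual in FP64
    return slices, scales

def sliced_gemm(A, B):
    As, sa = slice_matrix(A)           # row-wise shared exponents for A
    Bs, sb = slice_matrix(B.T)         # column-wise for B, via its transpose
    C = np.zeros((A.shape[0], B.shape[1]))
    for Ai, si in zip(As, sa):
        for Bj, sj in zip(Bs, sb):
            # On GPUs, Ai @ Bj.T would run on INT8 Tensor Cores with INT32
            # accumulation; here the integer matmul is exact in FP64.
            C += si * (Ai @ Bj.T) * sj.T
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
print(np.max(np.abs(sliced_gemm(A, B) - A @ B)))   # small; shrinks with slices
```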
2.3 Modular and CRT-based Methods
Ozaki Scheme II generalizes emulation via the Chinese Remainder Theorem (CRT). Matrices are first scaled to integers (via diagonal matrices $D_A$, $D_B$), then reduced modulo pairwise coprime integers $m_1, \dots, m_t$. Modular products are computed using INT8 or other supported tensor core units. CRT is used to reconstruct the full product modulo $M = \prod_{k=1}^{t} m_k$:

$$C \equiv \sum_{k=1}^{t} C_k \, M_k \, y_k \pmod{M},$$

where $C_k \equiv AB \pmod{m_k}$, $M_k = M / m_k$, and $y_k \equiv M_k^{-1} \pmod{m_k}$; the final step inverts the original scaling to map results back to the FP64 domain (Ozaki et al., 10 Apr 2025).
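The sketch below demonstrates the CRT mechanics on small integer matrices, with three byte-sized prime moduli standing in for the moduli of Ozaki Scheme II; it assumes the inputs have already been scaled to integers.

```python
import numpy as np
from math import prod

def crt_matmul(A, B, moduli=(251, 241, 239)):
    """Multiply integer matrices modulo pairwise coprime moduli, then
    reconstruct the exact product via the CRT (illustrative sketch)."""
    M, n = prod(moduli), A.shape[1]
    # Uniqueness condition: the CRT range must cover +/- the true products.
    assert M > 2 * int(np.max(np.abs(A))) * int(np.max(np.abs(B))) * n
    C = np.zeros((A.shape[0], B.shape[1]), dtype=object)
    for m in moduli:
        Cm = ((A % m) @ (B % m)) % m      # would run on INT8 Tensor Cores
        Mk = M // m
        yk = pow(Mk, -1, m)               # modular inverse (Python 3.8+)
        C = (C + Cm.astype(object) * Mk * yk) % M
    # Map back from [0, M) to signed results.
    return np.where(C > M // 2, C - M, C)

rng = np.random.default_rng(1)
A = rng.integers(-50, 50, size=(4, 4))
B = rng.integers(-50, 50, size=(4, 4))
assert np.array_equal(crt_matmul(A, B), A @ B)
```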
3. Hardware Mapping and Performance
Split FP64 emulation maps its computations onto dedicated hardware units as follows:
| Hardware Unit | Role in Emulation | Key Benefits |
|---|---|---|
| INT8/INT32 Tensor Core | Matrix multiplications for slices | Throughput, energy savings |
| FP8/FP16 Tensor Core | Alternative for slicing & GEMM | Wider dynamic range (FP8) |
| CPU SIMD INT ops | Quadruple or extended precision | Efficient for small/special sizes |
On GPUs such as NVIDIA Hopper or Blackwell, INT8-based emulation achieves FP64-equivalent DGEMM throughput significantly higher than vendor-supplied FP64 BLAS, e.g., 7.4–9.8 TFLOPS on NVIDIA RTX 4090 and 56.6–80.2 TFLOPS on GH200 (Ozaki Scheme II) (Ozaki et al., 10 Apr 2025). Integer-based approaches also reduce memory overhead through shared exponents and blocked operations, cutting working memory by 50–75% compared to conventional FP16/FP32 methods (Ootomo et al., 2023). Power consumption is reduced as well, owing to the lower energy cost per operation of INT8 units (Luszczek et al., 28 Sep 2025).
4. Numerical Properties and Challenges
Split integer emulation introduces nonstandard numerical characteristics relative to hardware FP64:
- Quantization and Range: INT8/FP8 segments have limited dynamic range. The splitting strategy must prevent overflow, and scaling must preserve significant digits under worst-case input spread (Luszczek et al., 28 Sep 2025).
- Accumulation Errors: Summing partial products in low-precision can lead to cancellation, especially for ill-conditioned problems. High-precision (often FP64) accumulation or block-wise compensation is required.
- Error Propagation: Multiple slices and recombinations complicate the tracking of rounding error. Error-free or compensated summation helps mitigate the accumulation of numerical errors (see the sketch after this list).
- Modulo Constraints (CRT-based): The modular product space must exceed twice the dynamic range of intermediate results to ensure uniqueness in CRT recovery: $\prod_k m_k > 2 \max_{i,j} |c_{ij}|$ (Ozaki et al., 10 Apr 2025).
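To illustrate the compensated-accumulation point above, here is a minimal sketch using Knuth's TwoSum error-free transformation with Kahan–Babuška-style error accumulation; partial products of very different magnitude, as arise when recombining slices, are summed without losing the small terms.

```python
def two_sum(a, b):
    """Error-free transformation: s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bv = s - a          # b as actually represented in s
    av = s - bv         # a as actually represented in s
    return s, (a - av) + (b - bv)

def compensated_sum(xs):
    """Accumulate with a running compensation term that recovers the bits
    lost to rounding and cancellation."""
    s, c = 0.0, 0.0
    for x in xs:
        s, e = two_sum(s, x)
        c += e
    return s + c

parts = [1.0, 1e-16, 1e-16, -1.0]
print(sum(parts), compensated_sum(parts))   # naive: 0.0, compensated: 2e-16
```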
Benchmark experiments on well-chosen input ranges confirm FP64-level accuracy in practice, provided splitting and scaling are adapted to the input’s exponent distribution (Luszczek et al., 28 Sep 2025).
5. Algorithmic Variants and Optimization Strategies
Several algorithmic choices enhance emulation effectiveness:
- Bitwise Slicing vs. Block-Float: Bitwise splitting isolates mantissa digits; block-float strategies use per-block exponents to compress redundant representation, maximizing bits per slice (Ootomo et al., 2023); a sketch follows this list.
- Blocking and Partitioning: Inner-product-wise blocking, e.g., dividing the inner (k) dimension of GEMM, reduces redundant slicing and controls the number of GEMM calls (Mukunoki, 1 Aug 2025).
- Number of Slices/Moduli: Ozaki Scheme II reduces the number of GEMMs by using modular arithmetic (16 GEMMs for high precision vs. 28–35 in earlier schemes with 7–8 slices) (Ozaki et al., 10 Apr 2025).
- Hybrid Precision: FP8-based slicing (E4M3) with FP32 accumulation can reduce the slice count with less overhead compared to INT8-only approaches (Mukunoki, 1 Aug 2025).
- Fusion and Pipeline: Pipeline fusion of multiplication and recombination operations matches the parallel processing model of high-end GPUs and exploits their memory bandwidth (Luszczek et al., 28 Sep 2025).
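As a sketch of the block-float idea from the first bullet above (illustrative function names; only a single slice is emitted, whereas a full scheme would produce several slices per block):

```python
import numpy as np

def block_float_quantize(x, block=8, bits=8):
    """Quantize a vector with one shared exponent per block of `block`
    elements; the largest entry in a block uses all `bits-1` magnitude
    bits of a signed integer."""
    x = x.reshape(-1, block)
    e = np.floor(np.log2(np.max(np.abs(x), axis=1, keepdims=True))) + 1
    scale = 2.0 ** (e - (bits - 1))
    lim = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -lim, lim)
    return q.astype(np.int8), scale

def block_float_dequantize(q, scale):
    return (q * scale).ravel()

x = np.random.default_rng(2).standard_normal(32)
q, s = block_float_quantize(x)
print(np.max(np.abs(block_float_dequantize(q, s) - x)))  # ~scale/2 per block
```

Sharing one exponent across a block spends almost every stored bit on mantissa digits, which is why block-float slicing packs more significand information per INT8 slice than per-element formats.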
6. Applications and Impact
Split integer FP64 emulation is deployed in domains that require high arithmetic precision but can profit from the hardware acceleration designed for ML or AI workloads:
- Dense Matrix Computations: High-performance linear algebra (DGEMM), dense linear solvers, and decompositional factorizations are accelerated by INT8 emulation, particularly beneficial for quantum chemistry and simulation workloads (Ootomo et al., 2023, Ozaki et al., 10 Apr 2025, Luszczek et al., 28 Sep 2025).
- Scientific Computing Libraries: Integration with BLAS/LAPACK or statistical languages (e.g., R) enables transparent handling of large arrays, compressing storage by up to 50% where the subset representation applies (Neal, 2015).
- Hybrid HPC/ML Workloads: Scenarios requiring both energy efficiency and occasional FP64 computations—such as embedded or edge platforms performing scientific inference—benefit from the throughput and energy characteristics (Luszczek et al., 28 Sep 2025).
- Benchmarking and Standardization: Use of these techniques in High-Performance Linpack (HPL) tests demonstrates competitive or superior kW-per-GFLOP rates (Ozaki et al., 10 Apr 2025).
7. Limitations and Future Directions
Despite performance and efficiency gains, split integer FP64 emulation presents several limitations:
- Subset Coverage: Table-lookup schemes (e.g., for compression) only support a subset of FP64; values outside the subset must revert to full representation (Neal, 2015).
- Conversion Overhead: Slicing, scaling, and CRT-based conversion steps can dominate cost for small matrices or when kernel launch overheads are high (Ozaki et al., 10 Apr 2025).
- Numerical Edge Cases: Wide exponent distributions, ill-conditioned matrices, or pathological input patterns can degrade accuracy or increase slice/modulus requirements (Ootomo et al., 2023, Luszczek et al., 28 Sep 2025).
- Special Values Handling: Interpretation of NaN, NA, and inf requires bespoke conventions to avoid ambiguity in compressed representations (Neal, 2015).
- Scalability: While large-scale problems amortize conversion overheads and benefit maximally from hardware throughput, applications with submatrix-level or rapidly varying precision requirements may require further research on dynamic or adaptive splitting strategies.
Promising future research directions include optimization of format conversion, extension to triple/quadruple precision via modular or composition techniques, and improved error analysis, as well as further hardware co-design for AI accelerators featuring INT8/FP8 Tensor Cores (Ozaki et al., 10 Apr 2025, Mukunoki, 1 Aug 2025).