Mixed-Precision Arithmetic

Updated 30 January 2026
  • Mixed-precision arithmetic is the use of multiple floating-point formats (e.g., FP16, FP32, FP64) to optimize performance and energy efficiency while managing rounding errors.
  • It allocates lower precision for memory-intensive tasks and higher precision for critical computations, employing techniques such as iterative refinement and dynamic error analysis.
  • Empirical results show up to 4× memory reduction and 10–15% faster training and simulations, highlighting significant efficiency gains in modern hardware architectures.

Mixed-precision arithmetic refers to the use of two or more floating-point precisions within an algorithm, a codebase, or a hardware system, allowing computation, communication, and storage to be selectively performed at different numeric granularities. The overarching goal is to maximize computational throughput, minimize memory and energy usage, and maintain reliability by exploiting the abundant low-precision hardware found in modern scientific and machine learning architectures, while rigorously controlling round-off and loss of fidelity. Mixed-precision techniques have found wide application in deep learning, numerical linear algebra, scientific simulations, and hardware acceleration, leveraging formats such as IEEE FP16, bfloat16, FP32, FP64, and integer and fixed-point emulation. Error analysis, precision tuning, hardware mapping, and algorithm redesign are critical components in this domain, driving significant research advances.

1. Floating-Point Formats and Mixed-Precision Motivation

Mixed-precision arithmetic relies on the distinct mathematical and hardware properties of floating-point representations:

  • IEEE Floating-Point Formats: Commonly employed formats include binary16 (FP16, 16 bits, u = 2^{-10}), binary32 (FP32, 32 bits, u = 2^{-23}), binary64 (FP64, 64 bits, u = 2^{-52}), and bfloat16 (16 bits, 8-bit exponent, u = 2^{-7}) (Chen et al., 3 Mar 2025, Micikevicius et al., 2017). Integer and fixed-point formats (e.g., INT8) are increasingly relevant for inference and extreme performance scenarios (Wu et al., 2024).
  • Precision vs. Throughput Trade-off: Lower precision computation is generally 2–16× faster and more energy- and bandwidth-efficient than higher precision. Modern accelerators such as NVIDIA Tensor Cores and AMD Matrix Cores offer peak performance in FP16, bfloat16, INT8, and sometimes TF32 (Abdelfattah et al., 2020, Abdel-Aziz et al., 2021).
  • Rationale for Mixing Precisions: Not all computations require full precision. For example, in neural network training, forward/backward passes can utilize FP16 for memory savings, but small parameter updates necessitate FP32 to avoid underflow and ensure convergence (Micikevicius et al., 2017, Lewandowski et al., 2023). In numerical simulation, large-scale matrix operations or preconditioners can operate at FP32 or FP16, only reverting to FP64 for critical reductions or accumulations (Chen et al., 3 Mar 2025, Maynard et al., 2018).
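The unit round-offs quoted above, and the underflow hazard motivating FP32 master weights, can be checked directly with numpy (a minimal sketch; numpy has no native bfloat16 dtype, so its value is stated rather than queried):

```python
import numpy as np

# Machine epsilon for each IEEE format matches the u values quoted above
# (u = 2^{-t}, with t the number of stored significand bits).
for name, dtype, u in [("FP16", np.float16, 2.0**-10),
                       ("FP32", np.float32, 2.0**-23),
                       ("FP64", np.float64, 2.0**-52)]:
    assert np.finfo(dtype).eps == u
    print(f"{name}: u = {np.finfo(dtype).eps:.3e}")

# bfloat16 (no native numpy dtype): 8-bit exponent, 7 stored mantissa
# bits, hence u = 2^{-7} but an FP32-sized dynamic range.
print(f"bfloat16: u = {2.0**-7:.3e}")

# Why FP32 master copies matter in training: a small value that is
# perfectly representable in FP32 underflows to zero in FP16, whose
# smallest subnormal is 2^{-24}.
print(np.float16(1e-8))  # -> 0.0
```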

2. Error Analysis: Deterministic and Probabilistic Bounds

Ensuring accuracy in mixed-precision arithmetic demands robust error analysis:

  • Unit Round-off Model: For any operation, fl(x ∘ y) = (x ∘ y)(1 + δ) with |δ| ≤ u (the unit round-off of the current format) (Bhola et al., 2024, Chen et al., 3 Mar 2025).
  • Cumulative Error Growth: In sequences of n operations, deterministic bounds grow linearly: |computed − exact|/|exact| ≤ n·u (Bhola et al., 2024), but probabilistic rounding effects yield tighter bounds scaling as √n·u for i.i.d. errors (Bhola et al., 2024).
  • Mixed-Precision Kernels: For fused multiply-add (FMA) and mixed-precision FMA (MPFMA), the backward error ζ = u_acc + u_out + u_acc·u_out controls the relative error. Tensor-core GEMMs combine low-precision inputs with higher-precision accumulation, with forward error bounded by γ_{n−1}^{u_acc} + γ_q^{u_out} + γ_{n−1}^{u_acc}·γ_q^{u_out} (Bhola et al., 2024, Yang et al., 2019).
  • Algorithmic Safeguards: Techniques such as maintaining master copies in higher precision, stochastic rounding during updates, blocking and compensated summations, and selecting critical reductions for high precision arithmetic are standard to curb catastrophic cancellation and accumulation errors (Lewandowski et al., 2023, Chen et al., 3 Mar 2025, Micikevicius et al., 2017, Ackmann et al., 2021).
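The gap between naive accumulation and one of the safeguards above, compensated summation, is easy to observe in FP32 (an illustrative numpy experiment, not drawn from the cited papers):

```python
import numpy as np

def kahan_sum(xs):
    """Compensated (Kahan) summation in FP32: carries a low-order
    correction term so the error stays near the unit round-off rather
    than growing with n."""
    s = np.float32(0.0)
    c = np.float32(0.0)          # compensation for lost low-order bits
    for x in xs:
        y = x - c
        t = np.float32(s + y)
        c = np.float32(t - s) - y  # recovers what the addition rounded away
        s = t
    return s

rng = np.random.default_rng(0)
xs = rng.random(100_000).astype(np.float32)
exact = float(np.sum(xs.astype(np.float64)))   # FP64 reference

naive = np.float32(0.0)
for x in xs:
    naive = np.float32(naive + x)              # plain FP32 accumulation

print(abs(naive - exact) / exact)          # error grows with n
print(abs(kahan_sum(xs) - exact) / exact)  # error stays near u_32
```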

3. Algorithmic Approaches and Workflow Design

Mixed-precision strategies are tailored according to application domain and performance/accuracy requirements:

  • Iterative Refinement: Solve Ax = b by factorizing A in low precision, compute residuals and corrections in high precision, and iterate until convergence. Classical IR requires κ(A) < 1/u_low; GMRES-IR variants extend to more ill-conditioned problems via flexible Krylov subspace corrections (Oktay et al., 2021, 0808.2794, Maynard et al., 2018, Abdelfattah et al., 2020).
  • Deep Learning Optimizers: Traditional mixed-precision training stores FP16 copies for computation and FP32 "master" weights for accumulation. The "memory efficient mixed-precision optimizer" eliminates FP32 master copies, retaining only FP16 weights and packed extra bits to capture lost precision, with all optimizer updates executed in a fused backward kernel. This method reduces peak memory by 20–25% and accelerates training by 10–15% at equivalent accuracy (Lewandowski et al., 2023).
  • Precision Tuning and Tooling: POP and Anton tools systematically allocate precision per variable and operation, formulating an ILP over error propagation constraints, program labels, and conversion costs; the optimal assignment minimizes maximum bit-width, total operator footprint, or conversion overhead, subject to accuracy targets (Khalifa et al., 2022, Darulova et al., 2017).
  • Domain-Specific Mixed-Precision: Model reduction, interpolation, and tensor decomposition codes employ low precision for dominant QR or MGS factorizations, reserving high precision for final skeleton assembly, ensuring that round-off does not dominate intrinsic modeling error (Dunton et al., 2020, Brower et al., 6 Oct 2025). CFD, spectral, and climate codes profile and instrument each kernel with stochastic analysis tools (Verificarlo, MCA) to define stable precision reduction boundaries, maintaining high precision for global reductions and critical preconditioner steps (Chen et al., 3 Mar 2025, Ackmann et al., 2021, Chen et al., 2024).
  • Hybrid Direct-Sparse Solvers: Algorithms for sparse LDU factorization with mixed precision partition the matrix into moderate blocks (factorized in low precision) and hard blocks (kept in high precision), exploiting block-GCR preconditioning to deliver high accuracy at negligible extra cost beyond the small Schur complements (Suzuki, 2022).
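The iterative-refinement pattern above can be sketched in a few lines of numpy (illustrative only: a real implementation would compute the low-precision LU factors once and reuse them; here an FP32 `np.linalg.solve` stands in for that cheap solve):

```python
import numpy as np

def iterative_refinement(A, b, iters=10, tol=1e-12):
    """Classical iterative refinement: the expensive solves run in FP32,
    while residuals and the solution update are kept in FP64."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                    # residual in FP64
        d = np.linalg.solve(A32, r.astype(np.float32))   # correction in FP32
        x = x + d.astype(np.float64)                     # update in FP64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
    return x

rng = np.random.default_rng(1)
A = rng.random((200, 200)) + 200 * np.eye(200)   # well conditioned
b = rng.random(200)
x = iterative_refinement(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))  # relative residual near FP64 round-off
```

Each pass shrinks the error by roughly a factor of κ(A)·u_32, so a well-conditioned system reaches FP64-level residuals in a handful of cheap low-precision solves.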

4. Hardware Architectures and Implementation

Mixed-precision arithmetic is tightly interwoven with hardware design and dataflow:

  • Systolic and SIMD Dual-Mode Arrays: Frameworks such as TATAA enable runtime switching between int8 systolic matrix multiply and bfloat16 SIMD vector math with sub-microsecond overhead, enabling transformer-specific acceleration with only minimal accuracy loss (Wu et al., 2024).
  • Temporal Decomposition: Integer-based MAC engines realize higher precision operations (e.g., FP16) by splitting operands into multiple slices (low-bit chunks), accumulating results across cycles, and adjusting alignment; optimized hardware restricts mantissa size and alignment logic to empirically observed bounds, drastically lowering silicon overhead (Abdel-Aziz et al., 2021).
  • GPU Tensor Core and Fixed-point Emulation: Recent work leverages INT8 units on NVIDIA Blackwell to emulate FP64 via multi-slice decomposition (Ozaki scheme), dynamically dialing in required mantissa depth for chemical accuracy in quantum chemistry DMRG computations; performance matches or exceeds native FP64 for 113-electron test systems (Brower et al., 6 Oct 2025).
  • Precision-Programmable Frameworks: Code-generation environments such as OPS and OpenSBLI accept user-supplied dictionaries of datatype assignments for each array group, fully automating the precision layout and explicit casting in the generated C/C++/Fortran (Siklósi et al., 27 May 2025).
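The multi-slice decomposition idea can be shown in miniature with numpy: split FP32 inputs into two FP16 slices whose pairwise products are exact in FP32, then accumulate in FP32. This is a toy two-slice analogue of the tensor-core pattern (low-precision inputs, wider accumulation), not the actual Ozaki-scheme implementation:

```python
import numpy as np

def split_fp16(x32):
    """Split an FP32 array into two FP16 slices, x32 ~= hi + lo.
    Each slice carries ~11 significand bits, so two slices recover
    ~22 of FP32's 24 bits."""
    hi = x32.astype(np.float16)
    lo = (x32 - hi.astype(np.float32)).astype(np.float16)
    return hi, lo

def fp16_sliced_dot(a32, b32):
    """Dot product built from FP16 slice products accumulated in FP32.
    Each fp16*fp16 product has at most 22 significand bits, so it is
    exact when formed in FP32; the tiny lo*lo term is dropped."""
    a_hi, a_lo = split_fp16(a32)
    b_hi, b_lo = split_fp16(b32)
    acc = np.float32(0.0)
    for p, q in [(a_hi, b_hi), (a_hi, b_lo), (a_lo, b_hi)]:
        acc += np.dot(p.astype(np.float32), q.astype(np.float32))
    return acc

rng = np.random.default_rng(2)
a = rng.standard_normal(256).astype(np.float32)
b = rng.standard_normal(256).astype(np.float32)
print(abs(fp16_sliced_dot(a, b) - np.dot(a, b)))  # close to FP32 accuracy
```

Hardware schemes extend the same idea to more slices (and to INT8 slices with exact integer accumulation) to dial in an arbitrary effective mantissa width.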

5. Empirical Performance and Practical Impact

  • Memory and Bandwidth Savings: Mixed-precision configurations yield up to 2–4× memory reduction and 2–3× bandwidth decrease for CFD and turbulent flow simulations, directly proportional to the fraction of data in low precision (Siklósi et al., 27 May 2025, Lewandowski et al., 2023).
  • Energy Efficiency: Nekbone and Neko CFD codes demonstrate 1.3–2.4× reductions in energy-to-solution when moving to mixed precision, reflecting both compute and communication improvements (Chen et al., 3 Mar 2025, Chen et al., 2024).
  • Training Speed and Model Fidelity: In deep learning, AMP and optimized FP16+extra-bit optimizers deliver 10–15% faster training and 20–25% peak memory savings on ResNet-18/CIFAR-10 and T5-large/GLUE, without negative impact on accuracy (Lewandowski et al., 2023). Similar results are observed for CNNs, RNNs, GANs, and speech/translation models (Micikevicius et al., 2017).
  • Numerical Simulation and Solver Convergence: Mixed precision in atmospheric dynamical solvers, preconditioned Krylov solvers, and sparse direct factorization leads to speedups (1.7–2×), reduced memory footprint (25–30%), and preservation of critical forecast metrics and accuracy (Maynard et al., 2018, Ackmann et al., 2021, Suzuki, 2022).
  • Error Control in Scientific Computation: Interpolative decompositions, quantum chemistry tensor network contractions, and Krylov solvers reach double-precision class accuracy provided round-off is managed beneath key singular value or MPS bond dimension thresholds (Dunton et al., 2020, Brower et al., 6 Oct 2025, Solinas et al., 28 Jan 2026).

6. Precision Allocation Strategies, Tools, and Best Practices

  • Static and Dynamic Analysis: Tools such as POP (Precision OPtimizer) and Anton automate assignment of bit-width per label via ILP and range analysis, optimizing program representation for hardware cost subject to user-specified accuracy at program exit (Khalifa et al., 2022, Darulova et al., 2017).
  • Monte Carlo Arithmetic (MCA) and Stochastic Rounding: MCA (RR/full modes) and stochastic rounding inside critical update kernels detect instability and mitigate systematic rounding bias, which is essential in high-fidelity simulation and optimization (Chen et al., 3 Mar 2025, Lewandowski et al., 2023).
  • Workflow for Application Codes:
    • Profile kernels via roofline or performance model to identify memory/compute-bound routines.
    • Instrument each with precision emulation and rounding-error analysis toolkits.
    • Prune candidates with unacceptable error bounds or stochastic instability.
    • Implement mixed-precision via explicit variable assignment, minimal casts at domain boundaries, and validation against reference results.
    • Benchmark in terms of time, energy, and residual accuracy across the full spectrum of runs (Chen et al., 3 Mar 2025, Chen et al., 2024).
  • Recommendations for Solver and Simulation Codes:
    • Maintain accumulators, state variables, and global reductions in higher precision if at risk of orthogonality/cancellation loss.
    • Deploy low precision for local arithmetic, preconditioners, and inner-loop matrix-vector ops.
    • For iterative solvers, adjust convergence criteria to the unit round-off of the lowest precision employed, or fall back to higher precision if residual stagnation is detected.
    • For distributed simulation, dynamically select message/buffer precision for communication- and memory-intensive routines.
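The fall-back recommendation can be sketched with a simple Jacobi solver (illustrative numpy code under simplifying assumptions: stagnation detection is reduced to a fixed FP32 residual target, where a production solver would monitor the residual history):

```python
import numpy as np

def jacobi_mixed(A, b, tol=1e-12, max_iter=1000):
    """Jacobi iteration run first in FP32, then promoted to FP64 once
    the residual nears FP32 round-off -- a toy version of the
    'fall back to higher precision on stagnation' strategy."""
    x = np.zeros_like(b, dtype=np.float32)
    for dtype, stop in [(np.float32, 1e-5), (np.float64, tol)]:
        Ad, bd = A.astype(dtype), b.astype(dtype)
        D = np.diag(Ad)              # Jacobi uses only the diagonal
        x = x.astype(dtype)          # promote the iterate when switching
        bnorm = np.linalg.norm(bd)
        for _ in range(max_iter):
            r = bd - Ad @ x
            if np.linalg.norm(r) <= stop * bnorm:
                break                # this precision's target reached
            x = x + r / D            # Jacobi update x <- x + D^{-1} r
    return x

rng = np.random.default_rng(3)
n = 100
A = rng.random((n, n)) + n * np.eye(n)   # diagonally dominant: Jacobi converges
b = rng.random(n)
x = jacobi_mixed(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))  # below tol
```

Most of the matrix-vector work happens in the cheap FP32 phase; the FP64 phase only polishes the last several digits, mirroring the cost profile of iterative refinement.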

7. Open Challenges and Future Directions

  • Unified Software and API Standards: The field is moving towards templated mixed-precision BLAS, SpMV, solvers, and MPI routines, with backend support for auto-precision selection and hardware-tailored mappings (Abdelfattah et al., 2020).
  • Advanced Number Systems: Interest in posit formats, block and tapered precision, and stochastic rounding to enable wider dynamic range and improved energy/performance trade-offs.
  • Auto-tuning and Machine Learning for Precision Assignment: Reinforcement learning and statistical modeling to optimize per-kernel or per-variable precision for time, energy, and accuracy budgets.
  • Compression in Memory and Communication: Hardware support for ZFP and lossy coding is increasingly deployed for MPI/FFT and distributed CG solvers, accelerating exascale computation in limited bandwidth environments (Abdelfattah et al., 2020).
  • Rigorous Error Analysis for Sparse, Factorization, and Krylov Methods: Theoretical work on stability for multiprecision block factorizations, Householder QR, GMRES, and block preconditioners remains an active area, substantiated by ongoing research on rounding behavior in block-FMA architectures and probabilistic error models (Yang et al., 2019, Bhola et al., 2024).
  • Extending Hardware and Libraries for Multi-format Utilization: FPGA, ASIC, and GPU libraries continue to relax format constraints, allowing flexible scheduling, precision multiplexing, and temporal arithmetic decomposition for both inference and training workloads at scale (Wu et al., 2024, Abdel-Aziz et al., 2021, Brower et al., 6 Oct 2025).