Mixed-Precision Decomposition Strategies

Updated 17 March 2026

Mixed-precision decomposition is a strategy that combines low- and high-precision arithmetic to optimize performance and accuracy in computational algorithms.
It exploits hardware efficiencies by using reduced-precision operations for bandwidth-bound tasks and reserving full precision for accuracy-critical steps.
Empirical studies report speedups from 1.3× to 20× and significant energy and memory savings in applications ranging from dense factorizations to deep learning.

Mixed-precision decomposition is a design and algorithmic strategy in which different stages of a matrix or tensor decomposition are carried out in arithmetic of different precisions, typically pairing reduced-precision computation for the most expensive or bandwidth-bound operations with high-precision computation for accuracy-critical steps or final results. This approach exploits the hardware efficiency of lower-precision units (FP16, INT8, BF16, etc.) while retaining the stability and accuracy of full-precision (FP32, FP64) numerical algorithms. Mixed-precision decomposition has enabled significant improvements in performance, energy efficiency, and memory utilization across dense linear algebra, scientific computing, randomized numerical linear algebra, tensor factorization, and deep learning (Abdel-Aziz et al., 2021, Yang et al., 2022, Gao et al., 2022, Dunton et al., 2020, Luszczek et al., 28 Sep 2025, Carson et al., 2024, Zhang et al., 2022, Carson et al., 12 Mar 2026, Xu et al., 1 May 2025).

1. Architectural Strategies: Temporal and Spatial Decomposition

Mixed-precision decomposition at the hardware and microarchitecture level is exemplified by temporal decomposition of floating-point inner products, such as FP16×FP16 accumulation on a low-bit-width MAC (multiply–accumulate) unit. Rather than instantiating a full-width floating-point FMA, each operand's mantissa $M$ is split at bit $k$ into a high part $M^H$ and a low part $M^L$, so that $M = M^H \cdot 2^k + M^L$ (e.g., for FP16, the 11-bit significand is split into high and low portions that fit an 8-bit datapath), and the product is reconstructed from low-bit partial products:

$M_x \cdot M_y = M_x^H M_y^H \cdot 2^{2k} + (M_x^H M_y^L + M_x^L M_y^H) \cdot 2^k + M_x^L M_y^L$

Each term is computed on a compact k×k multiplier and temporally scheduled. Exponent alignment is managed in hardware via a small (empirically ≤8-bit) barrel shifter. Profiling across DNN inference shows partial products are well-bounded and exponent range is much less than worst case; this reduces required bit-width for both partial products and alignment logic (Abdel-Aziz et al., 2021).
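The reconstruction identity above can be checked with a small software model. The sketch below is illustrative only: the function name `split_product` and the default `k = 8` are assumptions made for this example, and the code models the arithmetic identity rather than the scheduling, exponent alignment, or bit-width trimming of the cited hardware design.

```python
def split_product(mx: int, my: int, k: int = 8) -> int:
    """Rebuild mx * my from partial products of k-bit-split mantissas."""
    mask = (1 << k) - 1
    xh, xl = mx >> k, mx & mask      # high and low parts of M_x
    yh, yl = my >> k, my & mask      # high and low parts of M_y
    hh = xh * yh                     # M_x^H * M_y^H
    cross = xh * yl + xl * yh        # the two cross terms
    ll = xl * yl                     # M_x^L * M_y^L
    # Shift each partial product to its weight and add them up.
    return (hh << (2 * k)) + (cross << k) + ll


# Two arbitrary 11-bit significands (FP16 has 10 stored bits plus a hidden one).
mx, my = 0b10110011101, 0b11100010011
assert split_product(mx, my) == mx * my
```

In hardware the four partial products would be issued over successive cycles on one small multiplier; here they are simply computed one after another.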

Impact: Area reductions of 25–46% and power-efficiency improvements of 40–63% are reported for FP16 and INT8 workloads, respectively, relative to designs with wide multipliers and full-range shifters (Abdel-Aziz et al., 2021).

2. Algorithmic Mixed-Precision for Matrix and Tensor Decomposition

2.1. Low-Rank and Interpolative Decomposition

Matrix interpolative decomposition (ID) can be accelerated by executing the computationally intensive column-pivoted QR decomposition in reduced precision (half or single), and then reconstructing the final skeleton matrix in double precision (Dunton et al., 2020). This approach exploits the fact that approximation error from low-precision arithmetic may be dwarfed by the intrinsic low-rank truncation error, provided the singular value decay is sufficient and the chosen low precision satisfies

$u_L \ll \sqrt{1 + k(n-k)}\,\sigma_{k+1}(A_D)$

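where $u_L$ is the unit roundoff of the low precision, $k$ is the target rank, and $\sigma_{k+1}(A_D)$ is the first discarded singular value of the double-precision matrix $A_D$. Numerical experiments in model reduction show that near double-precision accuracy is achievable unless half precision is combined with very slow singular value decay. Performing the QR step in low precision yields significant speed and memory savings.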
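A minimal sketch of this split is given below, assuming NumPy only. Because NumPy has no column-pivoted QR, the column selection is done with a greedy pivoted Gram-Schmidt loop in `float32` as a stand-in for it; the function names and the least-squares reconstruction are assumptions made for illustration, not the procedure of Dunton et al.

```python
import numpy as np


def select_columns_low_precision(A, k, dtype=np.float32):
    """Greedy column pivoting (modified Gram-Schmidt) carried out in `dtype`."""
    R = np.array(A, dtype=dtype)               # low-precision working copy
    available = np.ones(R.shape[1], dtype=bool)
    picked = []
    for _ in range(k):
        norms = np.einsum("ij,ij->j", R, R)    # squared residual column norms
        norms = np.where(available, norms, -1.0)
        j = int(np.argmax(norms))
        if norms[j] <= 0.0:                    # numerically rank deficient
            break
        picked.append(j)
        available[j] = False
        q = R[:, j] / np.sqrt(norms[j])
        R -= np.outer(q, q @ R)                # deflate the chosen direction
    return picked


def mixed_precision_id(A, k, dtype=np.float32):
    """Pick skeleton columns in low precision, fit coefficients in float64."""
    A = np.asarray(A, dtype=np.float64)
    cols = select_columns_low_precision(A, k, dtype)
    C = A[:, cols]                             # skeleton columns, full precision
    T, *_ = np.linalg.lstsq(C, A, rcond=None)  # A is approximated by C @ T
    return cols, T


rng = np.random.default_rng(0)
A = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 40))
cols, T = mixed_precision_id(A, k=5)
rel_err = np.linalg.norm(A[:, cols] @ T - A) / np.linalg.norm(A)
```

Only the choice of columns is exposed to low-precision rounding here; the skeleton itself is taken from the original double-precision data.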

2.2. Blockwise and Iterative Mixed-Precision for Sparse and Dense Factorizations

In large sparse LDU/LU factorizations, a hybrid scheme factors all well-conditioned blocks in low precision, only switching to high precision for the final small Schur complement, which is formed using a preconditioned block-GCR solver (Suzuki, 2022). This arrangement delivers speedups of 2–6× while retaining the accuracy of full high-precision factorization, as the difficult (ill-conditioned or kernel-revealing) portion is isolated and handled with high accuracy.

For dense factorizations (e.g., LU, Cholesky), recent work has demonstrated the utility of integer emulation of higher-precision floating-point (FP64) arithmetic (split-int emulation via INT8 "limbs" and accumulation) for matrix–matrix updates on tensor-core units. The frontal panels are kept in true FP64, while the bulk of arithmetic is offloaded to highly parallel INT8 GEMMs, yielding up to 1.7× speedup on flagship GPUs and throughput competitive with native FP32 (Luszczek et al., 28 Sep 2025).
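The idea of replacing one wide multiplication by several INT8 GEMMs can be sketched in NumPy as follows. This is a simplified fixed-point model, not the scheme of the cited work: the limb width, the number of limbs (three, giving about 21 bits, far short of FP64), the per-row and per-column power-of-two scaling, and the function names are all assumptions chosen to keep the example short.

```python
import numpy as np

LIMB_BITS = 7    # magnitude bits carried by one signed INT8 limb
NUM_LIMBS = 3    # 21 fixed-point bits in total (illustrative only)


def to_limbs(X, axis):
    """Scale X by powers of two along `axis`, quantize, split into INT8 limbs."""
    total_bits = LIMB_BITS * NUM_LIMBS
    # Power-of-two scale so every scaled entry has magnitude below 1.
    _, exp = np.frexp(np.max(np.abs(X), axis=axis, keepdims=True))
    scale = np.ldexp(1.0, exp)
    fixed = np.rint(X / scale * 2.0**total_bits).astype(np.int64)
    fixed = np.clip(fixed, -(2**total_bits - 1), 2**total_bits - 1)
    sign, mag = np.sign(fixed), np.abs(fixed)
    limbs = [(sign * ((mag >> (LIMB_BITS * p)) & 0x7F)).astype(np.int8)
             for p in range(NUM_LIMBS)]
    return limbs, scale


def emulated_gemm(A, B):
    """Approximate A @ B using only INT8-input, INT32-accumulate products."""
    a_limbs, a_scale = to_limbs(A, axis=1)    # one scale per row of A
    b_limbs, b_scale = to_limbs(B, axis=0)    # one scale per column of B
    acc = np.zeros((A.shape[0], B.shape[1]))
    for p, Ap in enumerate(a_limbs):
        for q, Bq in enumerate(b_limbs):
            part = Ap.astype(np.int32) @ Bq.astype(np.int32)   # exact integers
            acc += np.ldexp(part.astype(np.float64), LIMB_BITS * (p + q))
    # Undo the fixed-point scaling of both operands.
    return acc * a_scale * b_scale / 4.0**(LIMB_BITS * NUM_LIMBS)


rng = np.random.default_rng(0)
A = rng.standard_normal((16, 32))
B = rng.standard_normal((32, 8))
max_abs_err = np.max(np.abs(emulated_gemm(A, B) - A @ B))
```

Each additional limb adds accuracy but multiplies the number of integer GEMMs, which is the cost trade-off behind wider decompositions.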

3. Mixed-Precision Techniques for Eigenvalue and SVD Computations

3.1. Low-Precision Initialization with High-Precision Refinement

Mixed-precision Jacobi SVD and eigenvalue algorithms construct an initial decomposition in low precision (e.g., single), then transform and refine the result in double (or higher) precision. After forming the low-precision SVD or EVD, a high-precision orthogonalization (e.g., modified Gram–Schmidt or Newton–Schulz polar decomposition) enforces orthogonality, and a few sweeps of high-precision Jacobi rotations rapidly drive residuals to machine precision (Zhang et al., 2022, Gao et al., 2022, Zhou, 30 Aug 2025). For Schur decompositions, Newton-refinement computes a correction by solving a Sylvester equation in low precision and applies a high-precision reorthogonalization, yielding quadratic convergence and minimizing high-precision operations to a handful of BLAS-3 kernels (Bujanović et al., 2022).
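For the symmetric eigenvalue case, the low-precision start followed by high-precision cleanup can be sketched as below. This is a minimal illustration under stated assumptions: `np.linalg.eigh` in `float32` stands in for the low-precision solver, Newton–Schulz is used for re-orthogonalization, and the function names and sweep counts are choices made for this example rather than those of the cited papers.

```python
import numpy as np


def newton_schulz(V, iters=2):
    """Re-orthogonalize a nearly orthogonal V in float64."""
    I = np.eye(V.shape[1])
    for _ in range(iters):
        V = 0.5 * V @ (3.0 * I - V.T @ V)
    return V


def jacobi_sweep(B, V):
    """One cyclic sweep of Jacobi rotations on symmetric B, accumulated into V."""
    n = B.shape[0]
    for p in range(n - 1):
        for q in range(p + 1, n):
            if B[p, q] == 0.0:
                continue
            tau = (B[q, q] - B[p, p]) / (2.0 * B[p, q])
            t = (1.0 if tau >= 0 else -1.0) / (abs(tau) + np.hypot(1.0, tau))
            c = 1.0 / np.hypot(1.0, t)
            s = t * c
            J = np.array([[c, s], [-s, c]])     # rotation that zeroes B[p, q]
            B[:, [p, q]] = B[:, [p, q]] @ J
            B[[p, q], :] = J.T @ B[[p, q], :]
            V[:, [p, q]] = V[:, [p, q]] @ J


def refined_eigh(A, sweeps=2):
    """float32 eigenvectors, then float64 orthogonalization and Jacobi sweeps."""
    _, V32 = np.linalg.eigh(A.astype(np.float32))   # cheap low-precision start
    V = newton_schulz(V32.astype(np.float64))       # restore orthogonality
    B = V.T @ A @ V                                 # nearly diagonal in float64
    for _ in range(sweeps):
        jacobi_sweep(B, V)
    return np.diag(B).copy(), V


rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((12, 12)))
A = Q @ np.diag(np.arange(1.0, 13.0)) @ Q.T
A = 0.5 * (A + A.T)
w, V = refined_eigh(A)
residual = np.linalg.norm(A @ V - V * w) / np.linalg.norm(A)
```

Because the starting basis is already accurate to roughly single precision, the off-diagonal part of `B` is small and the Jacobi sweeps converge in very few passes.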

3.2. Thin SVD and Gram-Jacobi Methods

For thin/tall-and-skinny matrices, mixed-precision Gram-Jacobi SVD computes the Gram matrix $A^T A$ in high precision (to control squaring errors), conducts the eigenvalue decomposition using a low-precision Jacobi solver, and recovers the left singular vectors via a low-precision matrix–vector multiply followed by high-precision scaling. Under moderate condition numbers, this achieves up to 10–12× node-level and ~2× distributed speedup, while singular value errors remain within $O(u_H + u_L)$, where $u_H$ and $u_L$ are the unit roundoffs of the high and low precisions, respectively (Carson et al., 12 Mar 2026).
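The three stages can be sketched in NumPy as follows. The sketch assumes `float64` as the high precision and `float32` as the low precision, and it substitutes `np.linalg.eigh` for the low-precision Jacobi eigensolver; the function name `gram_svd` is hypothetical.

```python
import numpy as np


def gram_svd(A):
    """Thin SVD via the Gram matrix: float64 Gram, float32 eigensolve."""
    A = np.asarray(A, dtype=np.float64)
    G = A.T @ A                                     # Gram matrix, high precision
    lam, V = np.linalg.eigh(G.astype(np.float32))   # low-precision eigensolve
    order = np.argsort(lam)[::-1]                   # largest eigenvalue first
    lam, V = lam[order], V[:, order]
    sigma = np.sqrt(np.maximum(lam, 0.0)).astype(np.float64)
    W = A.astype(np.float32) @ V                    # low-precision product A V
    U = W.astype(np.float64) / sigma                # high-precision column scaling
    return U, sigma, V.astype(np.float64)


rng = np.random.default_rng(0)
U0, _ = np.linalg.qr(rng.standard_normal((100, 6)))
V0, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = U0 @ np.diag(np.linspace(4.0, 1.0, 6)) @ V0.T   # condition number 4
U, sigma, V = gram_svd(A)
rel_err = np.linalg.norm(U * sigma @ V.T - A) / np.linalg.norm(A)
```

The division by `sigma` assumes no singular value is numerically zero, which matches the moderate-condition-number regime described above.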

3.3. Orthogonalization-Free Projection (OFRR)

Recent advances dispense with explicit orthogonalization of trial subspaces, computing reduced pencil matrices $U^T A U$ and $U^T U$ in low precision and solving the small generalized eigenproblem in high precision (Xu et al., 1 May 2025). Provided the non-orthogonal basis remains well-conditioned, the error remains controlled; this leverages the high throughput of low-precision arithmetic for basis construction and projection.
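A minimal sketch of the projection step is shown below, assuming a symmetric $A$, `float32` as the low precision, and `float64` for the small generalized eigenproblem. The reduction through a Cholesky factor of $U^T U$ is one standard way to solve that small problem with NumPy alone; the function name and this solver choice are assumptions, not the cited authors' implementation.

```python
import numpy as np


def ritz_values_nonorthogonal(A, U):
    """Rayleigh-Ritz with a non-orthogonal basis U; projections in float32."""
    A32, U32 = A.astype(np.float32), U.astype(np.float32)
    H = (U32.T @ (A32 @ U32)).astype(np.float64)    # U^T A U, low precision
    M = (U32.T @ U32).astype(np.float64)            # U^T U,   low precision
    H, M = 0.5 * (H + H.T), 0.5 * (M + M.T)         # remove rounding asymmetry
    # Solve H y = theta M y in float64 using M = L L^T.
    L = np.linalg.cholesky(M)
    X = np.linalg.solve(L, H)                       # L^{-1} H
    S = np.linalg.solve(L, X.T).T                   # L^{-1} H L^{-T}
    return np.linalg.eigvalsh(0.5 * (S + S.T))


rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((40, 40)))
A = Q @ np.diag(np.arange(1.0, 41.0)) @ Q.T
C = np.eye(4) + np.triu(0.5 * np.ones((4, 4)), 1)   # well-conditioned mixing
U = Q[:, -4:] @ C                                   # non-orthogonal basis
theta = ritz_values_nonorthogonal(A, U)             # close to 37, 38, 39, 40
```

The Cholesky factorization fails if the low-precision $U^T U$ is not positive definite, which is the practical symptom of a basis that has become too ill-conditioned for this approach.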

4. Mixed-Precision in Hierarchical and Structured Matrices

In hierarchical decompositions such as HODLR (Hierarchical Off-Diagonal Low Rank) formats, off-diagonal generators (U, V factors) can be stored at a precision dynamically adapted per level, as dictated by

$u_k \leq \frac{\varepsilon}{2^{k/2}\xi_k}$

where $\varepsilon$ is the compression error, $\xi_k$ is the relative norm of the blocks at level $k$, and $u_k$ is the unit roundoff of the precision used at that level. The global error is provably bounded by a modest multiple of $\varepsilon$, even when precision is lowered aggressively for coarser levels (Carson et al., 2024). This strategy reduces memory/storage requirements by up to 2× and maintains stability for matvec and factorization as long as the working precision is commensurate with the target accuracy.
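The per-level bound $u_k \leq \varepsilon / (2^{k/2}\xi_k)$ translates directly into a selection rule: at each level, pick the cheapest format whose unit roundoff satisfies it. The sketch below reads $\varepsilon$ as the target compression tolerance and $\xi_k$ as the relative block norm at level $k$; the format table, the function name, and the sample values of $\xi_k$ are assumptions for illustration.

```python
# Unit roundoffs of common formats (half the machine epsilon).
UNIT_ROUNDOFF = {
    "bf16": 2.0**-8,
    "fp16": 2.0**-11,
    "fp32": 2.0**-24,
    "fp64": 2.0**-53,
}


def level_precisions(eps, xi):
    """Cheapest format per level; xi[k-1] is the relative block norm at level k."""
    # Try formats from the largest unit roundoff (cheapest) to the smallest.
    order = sorted(UNIT_ROUNDOFF, key=UNIT_ROUNDOFF.get, reverse=True)
    chosen = []
    for k, xi_k in enumerate(xi, start=1):
        bound = eps / (2.0 ** (k / 2.0) * xi_k)
        fmt = next((f for f in order if UNIT_ROUNDOFF[f] <= bound), "fp64")
        chosen.append(fmt)
    return chosen


# Hypothetical relative block norms for three levels.
print(level_precisions(1e-6, [1e-4, 1e-2, 1.0]))  # ['bf16', 'fp32', 'fp32']
```

If no format in the table meets the bound, the sketch falls back to `fp64`, which signals that the requested tolerance is not reachable at that level with the listed formats.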

5. Mixed-Precision in Tensor Decomposition and Randomized Linear Algebra

5.1. Blockwise and Adaptive Bitwidth Decomposition

In CP tensor decomposition, a two-stage mixed-precision SGD alternates low-precision (e.g., INT2/INT4/INT8) block-wise gradient updates for rapid global descent (SignSGD), followed by higher-precision local refinement. Hardware-efficient implementation uses FP16 for sensitive intermediates (e.g., Khatri-Rao products) and ultra-low-bit formats for main GEMM steps. Theoretical analysis confirms local strong convexity and global convergence, with empirical speedups of 2–5× and negligible degradation in solution accuracy (Yang et al., 2022).

5.2. Mixed-Precision Random Projection

Randomized SVD and HOSVD pipelines can safely store and deploy random projection matrices (Gaussian sketches) in reduced precision (FP16), as long as the error in the sketch variance is negligible. Matrix multiplications between FP32 input and FP16 random matrices are implemented with specialized mixed GEMM kernels (e.g., SHGEMM on NVIDIA Tensor Cores), approaching or surpassing SGEMM's throughput. End-to-end, these techniques yield up to 1.75× speedups while maintaining SVD approximation errors indistinguishable from FP32 baselines (Ootomo et al., 2023).
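A NumPy sketch of a randomized SVD with an FP16-stored Gaussian sketch is given below. NumPy has no mixed FP32×FP16 GEMM, so the sketch matrix is upcast to `float32` at the point of use; this models the storage saving but not the kernel-level speedup. The function name, oversampling parameter, and seed handling are assumptions for illustration.

```python
import numpy as np


def rsvd_fp16_sketch(A, rank, oversample=8, seed=0):
    """Randomized SVD whose Gaussian test matrix is stored in float16."""
    rng = np.random.default_rng(seed)
    A32 = np.asarray(A, dtype=np.float32)
    n_cols = rank + oversample
    omega16 = rng.standard_normal((A32.shape[1], n_cols)).astype(np.float16)
    Y = A32 @ omega16.astype(np.float32)     # range sketch (upcast at use)
    Q, _ = np.linalg.qr(Y)                   # orthonormal basis of the sketch
    Ub, s, Vt = np.linalg.svd(Q.T @ A32, full_matrices=False)
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank]


rng = np.random.default_rng(1)
A = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 100))
U, s, Vt = rsvd_fp16_sketch(A, rank=10)
rel_err = np.linalg.norm(U * s @ Vt - A) / np.linalg.norm(A)
```

Rounding the Gaussian entries to FP16 perturbs the test matrix only slightly, and the range finder needs the sketch to be sufficiently generic rather than exactly Gaussian, which is why the approximation quality is largely unaffected.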

6. Practical Guidance, Performance, and Implementation Trade-offs

Mixed-precision decomposition achieves its performance and efficiency goals only when the following conditions are met:

The bulk of arithmetic is offloaded to low-precision units, with critical algorithmic steps (e.g., basis orthogonalization, Schur complement solves, residuals) maintained in high precision, often using a three-precision model (residual in "ultra" precision, updates in working precision, bulk solves in low precision) (Ge et al., 24 Dec 2025).
Stability and forward/backward error are supported by theoretical analysis—e.g., ensuring ill-conditioned portions of a matrix are handled in high precision or maintaining strong convexity (Yang et al., 2022, Ge et al., 24 Dec 2025).
Adaptive selection of working and storage precision per hierarchical level or block (e.g., HODLR, block LU) allows for further memory and compute savings without loss in global error guarantees (Carson et al., 2024).
Convergence acceleration and "mixed-precision effect" can be achieved even in uniformly quantized or binary frameworks via algorithms that emulate mixed-precision at the channel/weight-group level at negligible hardware overhead, e.g., through multipoint quantization (Liu et al., 2020).
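The three-precision model in the first condition can be illustrated with iterative refinement for a linear system. In this sketch the factorization is a `float32` QR (used here because it can be reused with NumPy alone; an LU factorization would be the more common choice), the solution is updated in `float64`, and the residual is formed in `np.longdouble`, which is wider than `float64` only on some platforms. All names are hypothetical.

```python
import numpy as np


def refine_solve(A, b, iters=5):
    """Iterative refinement: float32 factor, float64 update, longdouble residual."""
    A = np.asarray(A, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    Q, R = np.linalg.qr(A.astype(np.float32))        # factor once, low precision
    A_hi, b_hi = A.astype(np.longdouble), b.astype(np.longdouble)
    x = np.zeros_like(b)                             # working precision iterate
    for _ in range(iters):
        # Residual in the highest available precision.
        r = (b_hi - A_hi @ x.astype(np.longdouble)).astype(np.float64)
        # Correction from the low-precision factors.
        d = np.linalg.solve(R, Q.T @ r.astype(np.float32))
        x = x + d.astype(np.float64)
    return x


rng = np.random.default_rng(0)
n = 30
A = rng.standard_normal((n, n)) + n * np.eye(n)      # well conditioned
x_true = rng.standard_normal(n)
x = refine_solve(A, A @ x_true)
rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
```

Each pass shrinks the error by a factor on the order of the condition number times the low-precision unit roundoff, so the scheme only converges when that product is well below one.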

Empirical results consistently report speedups ranging from 1.3–2× for SVD, LU, and CP decompositions on generic x86 and GPU platforms up to 10–20× for low-precision-optimized kernels on modern accelerators (Carson et al., 12 Mar 2026, Abdel-Aziz et al., 2021, Luszczek et al., 28 Sep 2025, Ootomo et al., 2023). Energy and memory savings can reach factors of 2–4×, with matching final accuracy provided that precisions are selected carefully and the accuracy-critical steps are kept in high precision.

7. Future Directions and Limitations

Active avenues include extending split-integer emulation to wider-range floating point via 3- or 4-way decomposition (at the cost of additional GEMMs), deploying mixed-precision schemes in more general block-structured or sparse factorizations, and systematic data-driven parameterization (e.g., α selection for ADI methods via kernel regression) for automated tuning (Luszczek et al., 28 Sep 2025, Ge et al., 24 Dec 2025). Limitations remain for highly ill-conditioned problems, very small matrix sizes (where conversion and other overheads dominate), and situations where low-precision units lack sufficient dynamic range or hardware support.

References:

(Abdel-Aziz et al., 2021, Yang et al., 2022, Gao et al., 2022, Dunton et al., 2020, Luszczek et al., 28 Sep 2025, Carson et al., 2024, Zhang et al., 2022, Carson et al., 12 Mar 2026, Xu et al., 1 May 2025, Suzuki, 2022, Zhou, 30 Aug 2025, Ootomo et al., 2023, Ge et al., 24 Dec 2025, Liu et al., 2020, Li et al., 2020)