Multistage Mixed Precision IR

Updated 11 May 2026

MSIR is a numerical linear algebra strategy that leverages low, intermediate, and high precision computations to balance computational cost with accuracy.
It stages precision, beginning with low-precision factorizations and escalating to higher-precision residual computations, so that convergence extends to systems more ill-conditioned than a single low precision could handle.
Applications include solving linear systems, matrix equations, and least-squares problems, yielding significant performance, memory, and energy efficiency gains.

A multistage mixed precision iterative refinement (MSIR) framework refers to an algorithmic paradigm that exploits several floating-point precision levels at distinct stages of a numerical linear algebra workflow, optimizing the balance between computational cost and attainable numerical accuracy. MSIR methods strategically start with low-precision computations for the most expensive operations and escalate to higher precisions in subsequent refinement phases to reach working-precision accuracy. This approach is particularly effective when deployed on modern hardware with native support for low-precision arithmetic, where the performance, energy, and memory savings are substantial. MSIR principles now underpin state-of-the-art algorithms for the solution of linear systems, matrix equations (e.g., Sylvester and Lyapunov), least-squares problems, and matrix decompositions, and have been rigorously analyzed in recent research (Dmytryshyn et al., 5 Mar 2025, Khan et al., 2023, Ge et al., 8 Jan 2025, Oktay et al., 2021, Carson et al., 2024, Dunton et al., 2020, Carson et al., 2024, Carson et al., 2024, Carson et al., 2022).

1. MSIR Algorithmic Structure and Precision Hierarchy

MSIR is characterized by a deliberate, stage-wise allocation of computations to multiple precisions:

Low precision: Used for factorization, matrix reductions (e.g., Schur or QR), sketching, or preconditioner construction, leveraging fast but inexact formats such as FP16 or FP32.
Intermediate/working precision: The precision in which iterates, corrections, solution updates, and possibly basis-vector orthogonalizations are computed, typically FP64 (double precision).
High (residual) precision: Applied to residual computations and the scalar quantities derived from them (e.g., residual norms, backward-error estimates) that drive convergence monitoring. In extremely ill-conditioned regimes this can require quadruple precision to ensure correct termination.

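The division into two, three, or up to five arithmetic precisions (as in state-of-the-art GMRES-based IR with adaptive sparse approximate inverse preconditioning (Khan et al., 2023, Carson et al., 2022)) is central to the flexibility and performance of MSIR. Precision assignments are tailored per stage. In the notation below, u_f, u, u_g, u_p, and u_r denote the unit roundoffs of the factorization, working, GMRES, preconditioner-application, and residual precisions, and u_s denotes the precision of preconditioner solves in least-squares variants. A typical assignment is: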

Stage	Typical Operation	Precision
Factorization/Sketch	QR, Schur, SPAI construction, etc.	u_f (low)
Correction Solves	Triangular solves, GMRES, LSQR, etc.	u, u_g, u_p
Residual Computation	Residuals r = b − Ax, norms, backward errors	u_r (high)
Solution Updates	Iterates, corrections	u
Preconditioner Apply	Bucketed or split preconditioners	u_s, or per-bucket precisions for bucketed SPAI

This hierarchy supports a dynamic workflow that transitions from highly efficient, low-precision phases (dominant in flop count) to higher-precision refinement steps, with the goal of attaining error levels commensurate with the working precision.
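The unit roundoffs behind these symbols can be read off NumPy's `finfo`. The sketch below assumes the FP16/FP32/FP64 combination as one possible three-precision configuration. The `STAGE_DTYPES` mapping and the `unit_roundoff` helper are hypothetical illustrations, not an assignment prescribed by any of the cited papers.

```python
import numpy as np

# Hypothetical stage-to-format assignment for a three-precision run.
STAGE_DTYPES = {
    "u_f (factorization)": np.float16,
    "u (working)": np.float32,
    "u_r (residual)": np.float64,
}


def unit_roundoff(dtype):
    """Unit roundoff u = eps / 2 for IEEE binary formats with round-to-nearest."""
    return float(np.finfo(dtype).eps) / 2


for stage, dtype in STAGE_DTYPES.items():
    print(f"{stage:<20} {np.dtype(dtype).name:<8} u = {unit_roundoff(dtype):.3e}")
```

The printed values (about 4.9e-4, 6.0e-8, and 1.1e-16) are the quantities that enter bounds such as κ(A) < u_f^{-1}.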

2. MSIR in Direct and Iterative Linear Solvers

Canonical MSIR solvers for linear systems (Ax = b) and associated preconditioned iterative methods can be grouped as follows:

Standard IR (SIR): Uses low-precision factorizations and corrective solves, limited to condition numbers κ(A) < u_f^{-1}.
GMRES-IR: The correction equation is solved by GMRES preconditioned with the low-precision LU factors, applying the preconditioned operator in working or higher precision. This handles larger κ(A) than SIR at a higher cost per refinement step.
Adaptive Multistage Schedulers: Detect stalling via cheaply computable convergence monitors (e.g., normwise contraction ratios, stagnation in corrections) and switch to a stronger solver variant or raise the factorization precision only as needed (Oktay et al., 2021).

The iterative refinement process relies on an initial low-precision solution, high-precision residual computation, and iterative correction steps, with the possibility of substituting preconditioners (e.g., LU, SPAI, or bucketed adaptive-precision SPAI (Khan et al., 2023)) to further optimize the trade-off between memory/storage and solver convergence.
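The basic refinement loop can be sketched in NumPy. This is a minimal two-precision instance of standard IR (SIR), assuming FP32 for the factorization and correction solves (u_f) and FP64 for residuals and updates (u = u_r). The helper names `lu_factor_lowp`, `lu_solve_lowp`, and `standard_ir` are hypothetical. The LU is hand-written so that the factorization genuinely runs in FP32 arithmetic.

```python
import numpy as np


def lu_factor_lowp(A, dtype=np.float32):
    """LU with partial pivoting carried out entirely in `dtype` (the u_f stage).

    Returns packed factors (unit-lower L below the diagonal, U on and above)
    and the row permutation applied to A.
    """
    LU = np.array(A, dtype=dtype)  # copy of A rounded to the low precision
    n = LU.shape[0]
    perm = np.arange(n)
    for k in range(n - 1):
        p = k + int(np.argmax(np.abs(LU[k:, k])))  # partial pivoting
        if p != k:
            LU[[k, p]] = LU[[p, k]]
            perm[[k, p]] = perm[[p, k]]
        LU[k + 1:, k] /= LU[k, k]
        LU[k + 1:, k + 1:] -= np.outer(LU[k + 1:, k], LU[k, k + 1:])
    return LU, perm


def lu_solve_lowp(LU, perm, rhs):
    """Forward and back substitution performed in the precision of LU."""
    y = np.asarray(rhs, dtype=LU.dtype)[perm]  # fancy indexing returns a copy
    n = LU.shape[0]
    for i in range(n):  # L y = P rhs, with L unit lower triangular
        y[i] -= LU[i, :i] @ y[:i]
    for i in range(n - 1, -1, -1):  # U x = y
        y[i] = (y[i] - LU[i, i + 1:] @ y[i + 1:]) / LU[i, i]
    return y


def standard_ir(A, b, max_iter=20):
    """Two-precision SIR: FP32 factorization/corrections, FP64 residuals/updates."""
    A = np.asarray(A, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    n = A.shape[0]
    u = np.finfo(np.float64).eps / 2
    LU, perm = lu_factor_lowp(A)                       # O(n^3) work in low precision
    x = lu_solve_lowp(LU, perm, b).astype(np.float64)  # initial low-precision solve
    norm_A = np.linalg.norm(A, np.inf)
    for it in range(max_iter + 1):
        r = b - A @ x                                  # residual in u_r = FP64
        eta = np.linalg.norm(r, np.inf) / (
            norm_A * np.linalg.norm(x, np.inf) + np.linalg.norm(b, np.inf))
        if eta <= n * u or it == max_iter:             # normwise backward error test
            return x, it
        d = lu_solve_lowp(LU, perm, r).astype(np.float64)  # correction solve in u_f
        x += d                                         # update in working precision
```

Replacing the `lu_solve_lowp` correction by a GMRES solve preconditioned with the same factors would turn this loop into a GMRES-IR sketch; the residual and update structure stays the same.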

3. MSIR for Matrix Equations: The Sylvester Equation Paradigm

The multistage mixed-precision IR algorithm for the Sylvester equation (A X + X B = C) (Dmytryshyn et al., 5 Mar 2025) exemplifies the MSIR approach:

Low-precision Schur Reductions: Compute A ≈ Q_A T_A Q_A^* and B ≈ Q_B T_B Q_B^* in low precision (u_low), at a cost of O(m^3 + n^3) flops.
Initial Low-precision Triangular Solve: Transform the right-hand side to D = Q_A^* C Q_B and solve the triangular Sylvester equation T_A Y_0 + Y_0 T_B = D to O(u_low) accuracy.
Iterative Refinement in Working Precision: Refine Y_i by a stationary iteration whose residuals are computed in working (high) precision. Convergence is guaranteed if ∥Δ_A∥ + ∥Δ_B∥ < sep_F(T_A, −T_B), where Δ_A and Δ_B are the perturbations incurred by the low-precision Schur forms.
Recovery Phase: Since Q_A and Q_B are only unitary to O(u_low), two remedies are proposed: re-orthonormalization in working precision or explicit high-precision inversion.
Cost Model: The break-even point at which MSIR outperforms a fully high-precision solve is derived explicitly. With k the number of IR steps and ρ = cost_low/cost_high the relative cost of low- to high-precision arithmetic, the scheme is favored when k ≤ k^*(ρ).
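The three stages can be traced in a small NumPy sketch. To stay within NumPy it assumes A and B are symmetric positive definite, so their Schur forms are diagonal, `numpy.linalg.eigh` can stand in for the Schur reduction, and the triangular solve reduces to elementwise division. NumPy may compute the decomposition internally in double precision; rounding the factors to FP32 nonetheless leaves them with O(u_low) errors, which is the effect the refinement must correct. The function `sylvester_msir` and its parameters are hypothetical.

```python
import numpy as np


def sylvester_msir(A, B, C, low=np.float32, tol=1e-13, max_iter=30):
    """Multistage IR sketch for A X + X B = C with symmetric A and B.

    For symmetric matrices the Schur form is diagonal: T_A = diag(lam),
    T_B = diag(mu), so eigh stands in for the Schur reduction.
    """
    A, B, C = (np.asarray(M, dtype=np.float64) for M in (A, B, C))
    # Stage 1: reductions stored in `low`; Q_A, Q_B are orthogonal only to O(u_low).
    lam, QA = np.linalg.eigh(A.astype(low))
    mu, QB = np.linalg.eigh(B.astype(low))
    lam, QA, mu, QB = (M.astype(np.float64) for M in (lam, QA, mu, QB))
    denom = lam[:, None] + mu[None, :]  # diagonal "triangular" Sylvester operator

    def approx_solve(R):
        # D = Q_A^T R Q_B, solve T_A Y + Y T_B = D elementwise, map back.
        # Q^T is only an approximate inverse of the low-precision Q.
        return QA @ ((QA.T @ R @ QB) / denom) @ QB.T

    X = approx_solve(C)  # Stage 2: initial solve, accurate to about u_low
    norm_C = np.linalg.norm(C)
    history = []
    for _ in range(max_iter):
        R = C - A @ X - X @ B  # Stage 3: residual in working precision
        history.append(np.linalg.norm(R) / norm_C)
        if history[-1] <= tol:
            break
        X = X + approx_solve(R)  # stationary refinement step
    return X, history
```

For nonsymmetric A and B, a genuine Schur decomposition (for example `scipy.linalg.schur`) and a triangular Sylvester solve by Bartels–Stewart back substitution would replace the eigendecomposition and the elementwise division.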

This structure is typical of MSIR for general matrix equations and underscores the crucial role of both low-precision reductions and high-precision refinement.

4. MSIR in Least-Squares, Sketching, and Decomposition

Recent MSIR schemes extend beyond square systems to overdetermined (least-squares) problems and low-rank matrix decompositions:

Mixed Precision Sketching and Preconditioning: Stage 1 performs randomized sketching of A in low precision, Stage 2 computes the QR factorization of the sketch in a potentially higher (though still not working) precision, and Stages 3–4 reach a working-precision solution via an adjusted LSQR and iterative refinement, both using the R factor of the sketch as a right preconditioner (Carson et al., 2024).
Multistage IR for LS and WLS: For least-squares and weighted least-squares problems, multistage schemes use different precisions for the QR factorization (u_f), preconditioner solves (u_s), and solution updates and residuals (u) (Carson et al., 2024). Different IR formulations (normal equations, semi-normal equations, augmented system/multi-LS) exhibit complementary convergence properties depending on the problem's condition number and residual magnitude (Carson et al., 2024).
Interpolative Decompositions (ID): In model reduction, the MSIR approach uses low-precision (FP16/FP32) pivoted QR for column selection, with final assembly in double precision (Dunton et al., 2020). The error analysis shows that an ID as accurate as one computed entirely in double precision is achievable provided σ_{k+1}/σ_1 ≫ u_L, where u_L is the unit roundoff of the low precision used for column selection.
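A compact sketch of the sketch-and-precondition pipeline, under several assumptions: a dense Gaussian sketch with 4n rows formed in FP32, the QR of the sketch done in FP64 rather than an intermediate precision, CGLS used in place of LSQR (the two are mathematically equivalent in exact arithmetic but differ in finite precision), and the final iterative-refinement stage omitted. The helper `sketch_precond_lstsq` is hypothetical, and `np.linalg.solve` is applied to the triangular R only for brevity.

```python
import numpy as np


def sketch_precond_lstsq(A, b, oversample=4, tol=1e-14, max_iter=200, seed=0):
    """Sketch-and-precondition least squares (hypothetical helper).

    Stage 1: Gaussian sketch S A formed in FP32.
    Stage 2: QR of the sketch (FP64 here for simplicity); keep R.
    Stage 3: CGLS on min ||A R^{-1} y - b||, then x = R^{-1} y.
    """
    A = np.asarray(A, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    m, n = A.shape
    rng = np.random.default_rng(seed)
    s = oversample * n
    S = (rng.standard_normal((s, m)) / np.sqrt(s)).astype(np.float32)
    SA = S @ A.astype(np.float32)                       # low-precision sketch
    R = np.linalg.qr(SA.astype(np.float64), mode="r")  # right preconditioner

    def apply_M(v):   # A R^{-1} v
        return A @ np.linalg.solve(R, v)

    def apply_Mt(w):  # R^{-T} A^T w
        return np.linalg.solve(R.T, A.T @ w)

    y = np.zeros(n)
    r = b.copy()      # residual b - A R^{-1} y
    g = apply_Mt(r)   # gradient of the preconditioned problem
    p = g.copy()
    gamma = g @ g
    gamma0 = gamma
    for _ in range(max_iter):
        q = apply_M(p)
        alpha = gamma / (q @ q)
        y += alpha * p
        r -= alpha * q
        g = apply_Mt(r)
        gamma_new = g @ g
        if np.sqrt(gamma_new) <= tol * np.sqrt(gamma0):
            break
        p = g + (gamma_new / gamma) * p
        gamma = gamma_new
    return np.linalg.solve(R, y)
```

Because A R^{-1} is well conditioned for a sufficiently oversampled sketch, CGLS converges in a small number of iterations even when A itself is badly scaled; the FP32 rounding in the sketch perturbs only the preconditioner, not the problem being solved.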

5. Error Analysis, Convergence, and Switching Logic

MSIR algorithms are designed around tight finite-precision analyses and robust switching logic:

Convergence Bounds: For each IR variant, explicit thresholds on the condition number and unit roundoffs determine whether O(u) backward or forward errors are attainable. For example, standard SIR requires κ∞(A) < u_f^{-1}, while GMRES-based MSIR variants relax this to κ∞(A) < u^{-1/3} u_f^{-2/3} or better (Oktay et al., 2021).
Residual and Correction Monitors: Cheaply computable quantities, such as z_i = ∥d_{i+1}∥∞/∥x_i∥∞ (the relative size of the latest correction) and contraction factors v_i (ratios of successive correction norms), drive adaptive switches between algorithmic variants or upgrades in precision.
Preconditioner Adaptivity: In sparse systems, bucketed SPAI approaches partition the preconditioner by magnitude and assign adaptive precision, reducing memory but increasing GMRES iterations proportionally to the relaxation tolerance (Khan et al., 2023).
Automatic Escalation: MSIR starts with the lowest-cost configuration and climbs the ladder of solver strength and precision only upon observed stagnation or contraction failure.
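The switching logic can be illustrated with a small classifier over correction norms. The ladder of variants, the contraction threshold of 0.5, and the use of z ≤ u as the convergence test are illustrative assumptions rather than the exact criteria of the cited schedulers; `check_progress`, `escalate`, and `LADDER` are hypothetical names, and `d_prev`, `d_curr` stand for ∥d_i∥∞ and ∥d_{i+1}∥∞.

```python
# Hypothetical escalation ladder, cheapest first.
LADDER = ("SIR", "GMRES-IR", "refactorize in higher precision")


def check_progress(d_prev, d_curr, x_norm, u, max_contraction=0.5):
    """Classify one refinement step from inf-norms of consecutive corrections.

    z = ||d_{i+1}|| / ||x_i||  (relative size of the newest correction)
    v = ||d_{i+1}|| / ||d_i||  (contraction factor between corrections)
    """
    z = d_curr / x_norm
    v = d_curr / d_prev if d_prev > 0 else 0.0
    if z <= u:
        return "converged"   # correction has dropped to working-precision level
    if v >= max_contraction:
        return "escalate"    # too little contraction: treat as stagnation
    return "continue"


def escalate(variant):
    """Return the next, stronger rung of the ladder."""
    i = LADDER.index(variant)
    if i + 1 >= len(LADDER):
        raise RuntimeError("already at the most robust variant")
    return LADDER[i + 1]
```

A driver would call `check_progress` after each correction and, on `"escalate"`, continue from the current iterate with `escalate(current_variant)`.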

6. Performance Trade-offs and Practical Impact

The practical benefits of MSIR are quantifiable and confirmed by extensive numerical experiments:

Wall-clock Speedup: In representative large-scale sparse problems, three-precision MSIR (e.g., FP16/FP32/FP64) yields 1.5–3× speedup over uniform-precision solvers, and up to 7–8× when the proportion of FP16 operations is maximized (Ge et al., 8 Jan 2025).
Memory and Bandwidth Savings: Storing preconditioners and factorizations in low precision results in substantial reductions (e.g., up to 60% for adaptive SPAI (Khan et al., 2023)).
Energy Efficiency: On hardware with fast FP16/FP32 units, such as NVIDIA Tensor Cores or similar accelerators, work shifted to lower precision can translate nearly proportionally into energy and time savings.
Convergence Robustness: Within the condition-number limits of its most robust stage, the adaptive, multistage nature of MSIR maintains convergence even in highly ill-conditioned regimes, invoking stronger (but costlier) kernels only as a last resort.

7. Theoretical and Practical Limits

All MSIR variants are subject to theoretically derived bounds that guide both algorithmic deployment and parameter tuning:

Condition Number Windows: Each combination of precisions and solver variants admits provable constraints, e.g., κ(A) < u^{-1/3} u_f^{-2/3} for five-precision GMRES-IR when the GMRES and preconditioner-application precisions equal the working precision (u_g = u_p = u).
Breakdown Points: Stalling in low precision due to overflows, underflows, or insufficient contraction triggers escalation in the MSIR stack.
Final Accuracy: Provided the singular values decay sufficiently (for low-rank decompositions) and preconditioner construction tolerances are set relative to the input problem and target working precision, MSIR approaches achieve results indistinguishable from all-double-precision workflows at a fraction of their cost.
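These windows can serve as an a priori filter when a condition-number estimate is available. The sketch below hard-codes only the two bounds quoted in this article, drops the constants and dimension-dependent factors that the actual analyses carry, and uses hypothetical names (`kappa_windows`, `cheapest_variant`); a real scheduler would combine such a check with the runtime monitors.

```python
def kappa_windows(u_f, u):
    """Condition-number limits quoted in the text, ignoring constants."""
    return {
        "SIR": 1.0 / u_f,
        "GMRES-IR (u_g = u_p = u)": u ** (-1.0 / 3.0) * u_f ** (-2.0 / 3.0),
    }


def cheapest_variant(kappa_est, u_f, u):
    """Pick the first (cheapest) variant whose window admits kappa_est."""
    for name, limit in kappa_windows(u_f, u).items():  # insertion order = cost order
        if kappa_est < limit:
            return name
    return None  # outside every window: raise u_f and refactorize
```

With u_f = 2^{-11} (FP16) and u = 2^{-53} (FP64), the windows are 2^{11} = 2048 and 2^{25} ≈ 3.4 × 10^7, so an estimate κ ≈ 10^6 would skip SIR and go directly to the GMRES-based variant.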

MSIR is therefore a foundational methodology for fast, accurate, and scalable numerical linear algebra on heterogeneous precision platforms, and its descendants represent the current best practice for exploiting both algorithmic and hardware diversity (Dmytryshyn et al., 5 Mar 2025, Khan et al., 2023, Oktay et al., 2021, Carson et al., 2024, Dunton et al., 2020, Ge et al., 8 Jan 2025, Carson et al., 2024, Carson et al., 2024, Carson et al., 2022).