GMRES-IR: Mixed-Precision Iterative Refinement
- GMRES-IR is a mixed-precision iterative solver that uses low-precision GMRES inner iterations with high-precision residual updates to achieve double-precision accuracy.
- It accelerates computations by performing heavy operations like sparse matrix-vector multiplications and preconditioning in fp32 while maintaining critical updates in fp64.
- Benchmark studies on modern hardware demonstrate significant speedups with only a few outer iterations needed to reach high accuracy, even for large, sparse systems.
GMRES-based iterative refinement (GMRES-IR) is a numerical solver paradigm for large sparse linear systems and related problems, leveraging the Generalized Minimal Residual (GMRES) algorithm as an inner solver within an iterative refinement loop. It is specifically designed to exploit mixed-precision or low-precision arithmetic (e.g., fp16/fp32), prevalent in modern accelerator hardware, to achieve double-precision (fp64) or high-accuracy solutions at a reduced computational and energy cost. The method’s central rationale is that only a subset of operations (notably residual computation and final solution updates) must be executed in high precision to ensure backward and forward stability, while the bulk of computationally intensive linear algebra kernels can be performed in lower precision, substantially accelerating the overall solve without compromising the final solution accuracy when proper algorithmic safeguards are provided (Loe et al., 2021).
1. Algorithmic Framework and Mathematical Foundation
GMRES-IR is structured as an outer iterative-refinement loop, with each iteration comprising a residual computation, an (approximate) solution of a linear system using GMRES (possibly preconditioned), and a high-precision update:
Let , .
- High-precision computation of the initial residual:
- For until convergence:
- Inner solve in lower precision (e.g., fp32):
$\text{Find } u_k \approx A^{-1} r_k \; \text{(via GMRES$(m)$ in fp32)}$
Correction update in high precision:
New residual in high precision:
Stopping criterion:
Here, all GMRES subspace operations, SpMV (sparse matrix-vector multiplication), inner products, and (explicit) preconditioning are performed in fp32, while the residual evaluation and the update are computed in fp64 (Loe et al., 2021, Loe et al., 2021).
2. Mixed-Precision Strategy and Theoretical Rationale
Performance gains on modern architectures are realized as fp32 operations typically yield 2–3× throughput compared to fp64. However, scientific computing often mandates fp64 accuracy. The iterative refinement theory (Higham, Moler; Carson & Higham) establishes that, provided the system condition satisfies (for single-precision inner solves), the outer iteration converges rapidly to double precision accuracy. Contemporary refinements show that only the outer residual computation and update require fp64; the inner linear system correction can be safely performed in fp32 without loss of final accuracy (Loe et al., 2021).
For large, moderately to well-conditioned , the accumulation of roundoff in the fp32 inner solve is corrected at each outer step by a recomputed fp64 residual; in practice, only a few outer iterations are needed to achieve double precision accuracy.
3. Implementation, Preconditioning, and Parallelism
GMRES-IR is architected for performance portability and high concurrency, with all computationally heavy kernels (SpMV, orthogonalization, preconditioning) executed in low precision. Preconditioners must be highly parallelizable: classical incomplete LU (ILU) may introduce scalability bottlenecks due to triangular solves; thus, block–Jacobi and polynomial preconditioning are preferred (Loe et al., 2021, Loe et al., 2021).
Block–Jacobi: Decompose 0 into block diagonal 1 blocks, invert each in fp32, and apply in parallel to enable full GPU occupancy.
Polynomial preconditioning: Employs a low-degree polynomial 2 (often a truncated GMRES polynomial), all in fp32, to approximate 3, yielding a sequence of fully parallel SpMVs.
All preconditioner applications are realized in fp32: casting fp64 vectors to fp32, applying the preconditioner, and casting back as necessary. GPU implementations (e.g., via Kokkos Kernels) further streamline memory transfers and enable fused operations.
4. Parameter Selection and Convergence
Key parameters influencing GMRES-IR performance include:
Restart length (4) for GMRES: Typical values are 5. Larger 6 reduces outer iterations but increases orthogonalization overhead; smaller 7 produces more refinement iterations (Loe et al., 2021).
Preconditioner strength: Block size 8 for block–Jacobi or polynomial degree 9 are chosen to minimize the total number of inner iterations, targeting values that keep iteration counts moderate (0–1) without bloating the cost of applying the preconditioner.
Refinement tolerance: Set to the target fp64 accuracy (e.g., 2).
Convergence is typically observed in 1–3 outer iterations, with the total fp32 iteration count closely tracking that of a full-fp64 GMRES(m) with only a minor overhead.
5. Performance Evaluation and Practical Outcomes
Extensive benchmarking on NVIDIA V100 GPUs with a Power9 host has demonstrated that GMRES-IR can deliver systemic performance gains (Loe et al., 2021, Loe et al., 2021):
Kernel-level speedups:
- SpMV: 3–4 over fp64 for matrices with 5 nonzeros per row.
- Orthogonalization (GEMV): 6–7 speedup.
- Full-solver speedups:
- Unpreconditioned GMRES(50): 8–9 faster.
- Polynomial preconditioning: up to 0 speedup.
- Block–Jacobi: 1 faster for CFD-type matrices (block size 42).
- Accuracy: In all experiments, the double precision residual norm 2 obtained via GMRES-IR matched full-fp64 GMRES down to round-off (e.g., 3–4), even for systems with 5–6. Provided 7, convergence is robust.
- Scalability: When the number of nonzeros per row increases (8), SpMV and thus overall speedup declines modestly but remains significant.
6. Algorithmic Variants and Extensions
6.1. SPAI-GMRES-IR and Adaptive-Precision Variants
Variants using adaptive or sparse approximate inverse (SPAI) preconditioning further tune precision at the preconditioner level. BSPAI-GMRES-IR stores SPAI components in variable precision “buckets,” enabling a potentially significant reduction in storage and computational cost, with only a modest increase in the number of GMRES iterations. Provided the lowest bucket precision matches the working precision, convergence and accuracy are essentially unchanged (Khan et al., 2023, Carson et al., 2022).
6.2. Integer Arithmetic GMRES-IR
An integer-arithmetic-based variant substitutes all inner GMRES computations with fixed-point arithmetic, relying on precise scaling and logical operand shifts to prevent overflow. The outer (floating-point) refinement ensures that accuracy remains competitive with pure floating-point solvers, as any drift in the low-precision GMRES is corrected at each iteration (Iwashita et al., 2020).
6.3. Krylov Subspace Recycling
For problems where multiple refinement steps are needed, recycling previous Krylov spaces (e.g., via GCRO-DR) across refinement iterations can reduce the computational effort per outer step, particularly for ill-conditioned matrices, by reusing spectral information (Oktay et al., 2022).
7. Applications and Demonstrated Scenarios
GMRES-IR and its mixed-precision extensions are immediately applicable in large-scale computational science and engineering workflows that demand both high accuracy and hardware efficiency. Key contexts include:
- Solution of large, sparse, nonsymmetric systems arising from PDE discretizations, model reduction, or CFD.
- Accelerated iterative refinement for dense and sparse overdetermined least-squares, including augmented system formulations and weighted LS problems, with mixed or low-precision QR preconditioners (Carson et al., 2024, Carson et al., 2024, Gao et al., 2024).
- Matrices encountered in practical settings, such as convection-diffusion or CFD, even for challenging condition numbers up to 9, provided proper parameter selection is observed.
Summary Table: Core GMRES-IR Workflow
| Step | Precision | Main Operations |
|---|---|---|
| Residual computation | High (fp64) | $\text{Find } u_k \approx A^{-1} r_k \; \text{(via GMRES$(m)$ in fp32)}$0 |
| Inner GMRES solve | Low (fp32) | Iterative correction, SpMV, Precon |
| Preconditioning | Low (fp32) | Block–Jacobi, polynomial, or SPAI |
| Solution update | High (fp64) | $\text{Find } u_k \approx A^{-1} r_k \; \text{(via GMRES$(m)$ in fp32)}$1 |
| Convergence check | High (fp64) | Residual norm test |
References
- "A Study of Mixed Precision Strategies for GMRES on GPUs" (Loe et al., 2021)
- "Experimental Evaluation of Multiprecision Strategies for GMRES on GPUs" (Loe et al., 2021)
- "Mixed Precision Iterative Refinement with Sparse Approximate Inverse Preconditioning" (Carson et al., 2022)
- "Mixed Precision Iterative Refinement with Adaptive Precision Sparse Approximate Inverse Preconditioning" (Khan et al., 2023)
- "An Integer Arithmetic-Based Sparse Linear Solver Using a GMRES Method and Iterative Refinement" (Iwashita et al., 2020)
- "Mixed Precision GMRES-based Iterative Refinement with Recycling" (Oktay et al., 2022)
- "Mixed precision sketching for least-squares problems and its application in GMRES-based iterative refinement" (Carson et al., 2024)
- "Mixed Precision FGMRES-Based Iterative Refinement for Weighted Least Squares" (Carson et al., 2024)
- "Mixed precision iterative refinement for least squares with linear equality constraints and generalized least squares problems" (Gao et al., 2024)