Primal-Dual Hybrid Gradient (PDHG) Solver
- PDHG is a first-order method developed for convex–concave saddle-point problems in large-scale optimization settings.
- It relies on matrix–vector multiplications and proximal evaluations to efficiently handle nonsmooth operators in applications like imaging, PDEs, and LP.
- Advanced techniques such as metric preconditioning, Anderson acceleration, and hybrid methods enhance PDHG’s convergence rate, scalability, and practical performance.
The Primal-Dual Hybrid Gradient (PDHG) solver is a first-order method for solving convex–concave saddle-point problems, with a structure particularly suited for large-scale and constrained optimization arising in imaging, partial differential equations (PDEs), linear programming (LP), statistical machine learning, and related domains. Its principal advantage is efficient handling of nonsmooth operators and composite functionals with only matrix–vector multiplications and proximal evaluations, enabling robust, matrix-free, and highly parallelizable implementations on modern computational hardware.
1. Algorithmic Structure and Canonical Formulations
The PDHG method addresses saddle-point problems of the form
where is the primal variable, the dual, a linear operator, and , proper, convex, lower-semicontinuous functionals. Via with , many key applications—in particular, total variation regularization, PDE-constrained optimization, and LP—fit this paradigm.
The classical PDHG iteration is
with step sizes and extrapolation parameter 0, and convergence guaranteed for 1 (Liu et al., 2024).
Specialization to LPs yields updates of the form
2
where 3, 4 are projections onto box or nonnegativity constraints (Applegate et al., 13 Jan 2025, Applegate et al., 2021).
For variational imaging and PDE time-stepping, the saddle-point structure often involves nonlocal or PDE-discretization operators (such as graph gradients or Laplacians), and the discrete update may require FFT-based inversions or graph operations (Zhu et al., 2016, Liu et al., 2023).
2. Preconditioning and Enhanced PDHG Variants
PDHG's base convergence rate can deteriorate under ill-conditioning, as commonly arises in high-dimensional PDEs and highly unbalanced linear operators. Several enhanced variants and preconditioning techniques have been proposed:
- Metric Preconditioning: The proximal steps in 5 and 6 are replaced by those in weighted, operator-induced inner products, leading to updates like:
7
The metric matrices are chosen based on the Gram structure of the underlying PDE or statistical model (e.g., 8, 9 in PDEs), and Krylov subspace solvers (e.g., MINRES) can be used to invert these operators efficiently in neural-network-parameterized settings (Liu et al., 2024).
- Diagonal and Online Preconditioning: In LPs and imaging, diagonal matrices 0, 1 are learned (online or offline) to equilibrate the effect of each coordinate. Online schemes employ loss-based feedback, e.g., measuring Lagrangian drops per step, and adjust 2 via projected subgradient steps, sometimes normalized for stability and updated only every few iterations to reduce overhead. Empirical benchmarks show 10–20% reductions in iteration counts and solve times (Lu et al., 21 Jun 2025).
- Balanced and Block Bregman Proximal Terms: The triple-Bregman TBDA algorithm generalizes PDHG by embedding three Bregman divergences, supporting two dual and one primal subproblems per iteration, and allowing more aggressive step sizes (up to a 3 factor improvement over standard PDHG in key settings) (Yu et al., 8 Jun 2025).
- Anderson Acceleration: PDHG can be wrapped using Anderson acceleration on the fixed-point operator as in AA-PDHG and FAA-PDHG. Safeguards and filtering (angle and length criteria) ensure boundedness of the update operator and global convergence. Numerical results show order-of-magnitude speedups over vanilla PDHG, especially near optimality in degenerate settings (Zhou et al., 11 Aug 2025).
3. Convergence Theory and Rates
The base ergodic convergence for PDHG is 4 in the primal-dual gap. Under strong convexity in 5 or 6, linear convergence is obtained. For linear operators 7 with full rank, and after finite identification of inactive constraints (notably, in LP, when the active set has stabilized), local sharpness constants replace the global Hoffman constant and govern an eventual linear convergence phase (Lu et al., 2023, Lu et al., 2024).
Key results:
- Two-Phase Convergence in LP: Initial sublinear identification phase (duration determined by a near-degeneracy metric), followed by linear convergence governed by the sharpness of the identified homogeneous system. Degeneracy does not impair asymptotic rate; only "near-degeneracy" affects the identification phase length (Lu et al., 2023).
- Restarted Halpern PDHG (rHPDHG/r²HPDHG): Using Halpern iteration and (optionally) its reflected version leads to globally accelerated linear convergence with parameter-free sharpness-adaptive restarting. The reflected variant achieves a further speedup by halving the epoch length, with empirical improvements of 20–30% over previous restarted approaches on industrial LPs (Lu et al., 2024).
- Infimal Sub-differential Size (IDS): IDS provides a geometry-aware progress metric for PDHG and can be efficiently estimated at each iterate. IDS enjoys monotonic decay and 8 last-iterate convergence, with metric subregularity conditions leading to linear convergence; it unifies convergence analyses across PDHG, ADMM, and the Proximal Point Method (Lu et al., 2022).
4. Implementation Strategies and Hardware Acceleration
PDHG's reliance on matrix-vector and proximal computations makes it exceptionally well-suited to parallel hardware and distributed memory.
- GPU Implementations: PDHG and its variants (including rHPDHG, preconditioned, and Anderson-accelerated forms) have efficient Julia and C++ implementations (e.g., cuPDLP.jl, PDLP for OR-Tools). Coalesced GPU kernels handle SpMV, projections, step size adaptation, and all KKT residuals within device memory to avoid PCIe stalls. Empirical results show that cuPDLP.jl matches or outperforms state-of-the-art simplex and barrier solvers (e.g., Gurobi) on benchmark LPs in both speed and solution count at practical tolerances (Lu et al., 2023, Applegate et al., 13 Jan 2025).
- Distributed and In-Memory Architectures: Multi-GPU scalability is achieved via 2D grid partitioning, nonzero-aware block randomization, and horizontally and vertically fused AllReduce collectives (NCCL). With preallocation, block balancing, and kernel fusion, nearly linear scaling is demonstrated up to billion-scale LPs (Li et al., 12 Jan 2026). Custom symmetric block-matrix designs enable RRAM in-memory acceleration, with up to 9 energy savings and comparable accuracy to GPU solvers in PDHG applications (Vo et al., 25 Sep 2025).
- Physics-Based and PDE Applications: For reaction-diffusion, Hamilton–Jacobi, and optimal transport PDEs, PDHG is used to solve single- or multi-step implicit discretizations, with O(0) per-iteration cost (<200 iterations for typical problems), FFT-based Laplacian preconditioning, and robust convergence at standard time steps. Preconditioned PDHG can be extended to adversarial neural PDE solvers, leveraging natural gradient steps and Krylov subspace methods for parameter-level updates (Liu et al., 2023, Meng et al., 2023, Liu et al., 2024).
5. Practical Enhancements and Applications
PDHG serves as a foundation for a broad class of enhancements. Noteworthy developments include:
- Diagonal and Online Preconditioning for LP: Both offline (e.g., Ruiz and Pock–Chambolle scaling) and online (feedback-driven) strategies reduce iteration counts substantially, and are seamlessly integrated into production codes (Lu et al., 21 Jun 2025).
- Hybrid PDHG–Interior Point Methods: PDHG can serve as a low-tolerance phase for warm-starting high-accuracy interior-point methods. This hybridization strikes a balance between the speed of first-order solvers and the accuracy of classical second-order methods; in industrial LPs, hybrid crossover frequently halves interior point iterations and approaches barrier method accuracy at 2–3× the PDHG baseline runtime (Rothberg, 3 Mar 2026).
- Imaging/Signal Processing: PDHG efficiently solves matrix-stacked, multi-term imaging objectives (e.g., directional TV, constrained DBT), with analytically determined step sizes via operator norm computations, supporting only two or three intuitive tuning knobs (Sidky et al., 24 Apr 2026). In graph-based nonlocal total variation and clustering, PDHG efficiently handles blockwise and quadratic relaxations and outperforms K-means and NMF-style baselines (Zhu et al., 2016).
- Stochastic PDHG: By substituting stochastic (mini-batch or single-sample) gradients in the forward step, PDHG yields optimal generalization and 1 or 2 last-iterate convergence rates under high-probability guarantees, outperforming stochastic ADMM-type competitors on large, real-world datasets (Qiao et al., 2018).
6. Empirical Performance and Limitations
PDHG (and its tuned variants) consistently achieves competitive or superior performance relative to interior-point, simplex, and ADMM-based methods on large, sparse, and ill-conditioned instances across benchmark LP and SDP test sets. For instance, PDLP solves LPs with up to 3 billion nonzeros to primal-dual gaps 4 in days, often beating barrier methods that exhaust RAM (Applegate et al., 13 Jan 2025). In in-memory analog and distributed scenarios, PDHG implementations realize orders-of-magnitude improvements in energy and time while maintaining FP64 accuracy (Vo et al., 25 Sep 2025, Li et al., 12 Jan 2026).
PDHG is robust to degeneracy in LPs (stagnation only occurs if variables are nearly, not exactly, degenerate) (Lu et al., 2023). Nevertheless, its asymptotic rate is dictated by operator conditioning and may slow for highly ill-posed or nearly-degenerate systems. Preconditioning, adaptive step sizes, and algorithmic acceleration techniques (e.g., Anderson acceleration, TBDA, rHPDHG, hybrid approaches) address much of this limitation in practice.
7. Summary Table: Key PDHG Extensions and Advances
| Enhancement / Variant | Problem Setting | Noted Benefits |
|---|---|---|
| Diagonal/Online Preconditioning | LP, imaging | 10-20% fewer iterations, speedup (Lu et al., 21 Jun 2025) |
| TBDA (triple Bregman) | General convex–concave saddle points | Allows larger stepsizes (Yu et al., 8 Jun 2025) |
| rHPDHG, r²HPDHG | LP | Optimal linear convergence; robust on GPU (Lu et al., 2024) |
| Anderson Acceleration (AA, FAA) | LP, general saddle | Order-of-magnitude speedups, global convergence (Zhou et al., 11 Aug 2025) |
| Hybrid PDHG–Interior Point | LP | High accuracy, lower walltime (Rothberg, 3 Mar 2026) |
| Multi-GPU, In-Memory PDHG | Huge-scale LP | Near-linear scaling, energy reduction (Vo et al., 25 Sep 2025, Li et al., 12 Jan 2026) |
| Stochastic PDHG | Empirical risk minimization | High-probability, optimal sample complexity (Qiao et al., 2018) |
| Imaging, PDE, clustering | Imaging, PDEs, clustering | Robust, 5 cost per iter, flexible constraints (Liu et al., 2023, Sidky et al., 24 Apr 2026, Zhu et al., 2016) |
PDHG and its modern variants represent a central family of first-order optimization algorithms with broad theoretical guarantees, scalable practical implementations, and robust performance across a spectrum of modern computational mathematics applications.