GPU-Accelerated PDLP Optimization
- GPU-accelerated PDLP is a parallelized primal-dual optimization approach that reformulates linear programs as saddle-point problems to leverage GPU efficiency.
- It utilizes sparse matrix–vector multiplications and custom GPU kernels, achieving significant speedups over traditional CPU-based solvers.
- Advanced features such as restarted schemes, Halpern-type acceleration, and adaptive step-sizing enhance convergence and scalability for large-scale applications.
GPU-accelerated Primal-Dual Linear Programming (PDLP) refers to a class of first-order optimization methods—especially the Primal-Dual Hybrid Gradient (PDHG) family—deployed on graphics processing units (GPUs) to solve large-scale linear programming (LP) problems. Utilizing the inherent parallelism and memory bandwidth of modern GPUs, recent developments such as cuPDLP, cuPDLP-C, cuPDLP+, and adaptations for both NVIDIA and AMD hardware have significantly advanced the scalability, robustness, and empirical performance of first-order LP solvers. This paradigm has shifted the longstanding computational landscape dominated by simplex and interior-point algorithms toward highly parallelizable, factorization-free approaches. The following sections survey the foundational principles, algorithmic structures, empirical benchmarks, implementation strategies, theoretical frameworks, and broader implications of GPU-Accelerated PDLP.
1. Algorithmic Foundations and Reformulation
The underlying algorithm in GPU-Accelerated PDLP is the Primal-Dual Hybrid Gradient (PDHG) method and its restarted/accelerated variants. The classical linear program is reformulated as a saddle-point problem to expose both primal and dual variables to iterative updates based on matrix-vector multiplications and projections, operations highly amenable to GPU parallelism. A canonical PDHG iteration (in its averaged or Halpern-accelerated form) is:

$$x^{k+1} = \operatorname{proj}_{X}\!\left(x^{k} - \tau\,(c - K^{\top} y^{k})\right), \qquad y^{k+1} = \operatorname{proj}_{Y}\!\left(y^{k} + \sigma\,\bigl(q - K(2x^{k+1} - x^{k})\bigr)\right),$$

with step-sizes $\tau = \eta/\omega$ and $\sigma = \eta\omega$, scaling parameter $\eta$ and primal weight $\omega$, and projectors $\operatorname{proj}_{X}$, $\operatorname{proj}_{Y}$ encoding the variable bounds and dual feasibility. The matrix $K$ concatenates the equality and inequality constraints; $q$ encodes their right-hand sides. Enhanced schemes such as cuPDLP+ employ reflected and restarted Halpern-type averages:

$$z^{k+1} = \frac{k+1}{k+2}\,\bigl(2T(z^{k}) - z^{k}\bigr) + \frac{1}{k+2}\,z^{0},$$

with $z = (x, y)$, $T$ the PDHG operator, and $z^{0}$ the anchor (the initial point or the most recent restart point). Fixed-point residuals, KKT error metrics, or the primal-dual gap serve as the backbone for monitoring convergence and triggering restarts.
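For concreteness, the update above can be written in a few lines of GPU-resident code. The following is a minimal, illustrative PyTorch sketch (not the cuPDLP implementation) for a standard-form LP $\min\, c^\top x$ s.t. $Kx = q$, $x \ge 0$, so that the primal projection is a nonnegativity clamp and the dual projection is the identity; the toy problem data are assumptions made for the example.

```python
import torch

def pdhg_step(x, y, K, c, q, eta, omega):
    """One PDHG iteration for min c^T x  s.t.  Kx = q, x >= 0, viewed as the
    saddle point  min_x max_y  c^T x - y^T (Kx - q).  Illustrative sketch only."""
    tau, sigma = eta / omega, eta * omega                     # primal / dual step-sizes
    x_new = torch.clamp(x - tau * (c - K.T @ y), min=0.0)     # projection onto x >= 0
    y_new = y + sigma * (q - K @ (2 * x_new - x))             # equality duals: no projection
    return x_new, y_new

# Tiny dense toy problem; in practice K is a sparse CSR tensor resident on the GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
K = torch.tensor([[1.0, 1.0, 1.0]], device=device)
q = torch.tensor([1.0], device=device)
c = torch.tensor([1.0, 2.0, 3.0], device=device)
x, y = torch.zeros(3, device=device), torch.zeros(1, device=device)
eta = 0.9 / torch.linalg.matrix_norm(K, 2)                   # eta < 1/||K||_2 for stability
for _ in range(1000):
    x, y = pdhg_step(x, y, K, c, q, eta, omega=1.0)          # iterates approach x = (1, 0, 0)
```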
2. GPU-Centric Optimizations and Kernel Design
A central tenet of GPU-Accelerated PDLP solvers is maximizing in-device computation, thus minimizing expensive host-device data transfers. The algorithm's dominance of sparse matrix–vector multiplications (SpMV) and vector updates allows nearly total mapping of computational flow to GPU. Key strategies include:
- Storage of the constraint matrix in compressed sparse row (CSR) format (a minimal sketch follows this list).
- Invocation of high-throughput GPU libraries, e.g., cusparseSpMV() with algorithm CUSPARSE_SPMV_CSR_ALG2 on NVIDIA and optimized PyTorch/ROCm backends on AMD (Hu et al., 22 Aug 2025).
- Custom kernel code for per-coordinate updates, reflecting the SIMD architecture and efficiently assigning warps/threads.
- Warm-starting techniques leverage solution continuity across related subproblems, as in their use within Feasibility Pump heuristics for MIP (Çördük et al., 23 Oct 2025).
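As a concrete illustration of the first two points, the sketch below builds a CSR constraint matrix with PyTorch's sparse-CSR support and applies it on the GPU; on NVIDIA hardware these products dispatch to cuSPARSE kernels, and the same code runs through a ROCm build on AMD. The matrix entries are placeholders, and building the transpose through a dense detour is a simplification for brevity.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# A small 2x3 constraint matrix K in CSR form (row pointers, column indices, values).
K = torch.sparse_csr_tensor(
    torch.tensor([0, 2, 3]),           # crow_indices: row i holds nonzeros [crow[i], crow[i+1])
    torch.tensor([0, 2, 1]),           # col_indices of the nonzeros
    torch.tensor([1.0, -1.0, 2.0]),    # nonzero values
    size=(2, 3),
).to(device)

# Solvers typically also keep K^T in CSR so that both K @ x and K^T @ y are
# row-parallel SpMV calls; here the transpose is built densely only for brevity.
Kt = K.to_dense().t().contiguous().to_sparse_csr()

x = torch.tensor([[1.0], [2.0], [3.0]], device=device)   # primal iterate (column vector)
y = torch.tensor([[0.5], [-0.5]], device=device)          # dual iterate

Kx = K @ x       # SpMV used in the dual update
Kty = Kt @ y     # SpMV used in the primal gradient step
```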
Empirical tuning of the step-size ($\eta$) and primal weight ($\omega$) on modern GPUs is critical; recent innovations include PID-controlled adaptive updates to $\omega$ and selection of a constant $\eta$ from an approximate spectral norm $\|K\|_2$ computed via power iteration (sketched below).
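A minimal sketch of the power-iteration estimate of $\|K\|_2$ follows; the iteration count is an illustrative choice, and `K` and `Kt` stand for the constraint matrix and its transpose from the sketch above.

```python
import torch

def spectral_norm_estimate(K, Kt, iters=50):
    """Estimate ||K||_2 by power iteration on K^T K; K and Kt (its transpose)
    may be dense or sparse CSR tensors on the GPU. Illustrative sketch only."""
    v = torch.randn(K.shape[1], 1, device=K.device)
    v = v / torch.linalg.vector_norm(v)
    for _ in range(iters):
        w = Kt @ (K @ v)                         # two SpMVs per power-iteration step
        v = w / torch.linalg.vector_norm(w)
    # ||K^T K v|| ~ lambda_max(K^T K) = sigma_max(K)^2 once v has converged
    return torch.sqrt(torch.linalg.vector_norm(Kt @ (K @ v)))

# A constant step-size can then be sized as, e.g., eta = 0.9 / spectral_norm_estimate(K, Kt),
# while omega is adapted online (see the PID-controlled update in Section 3).
```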
3. Algorithmic Enhancements: Restarts, Acceleration, and Adaptivity
Outperforming vanilla PDHG in practice requires multiple layers of heuristic and theoretical refinements:
- Restarted schemes: Progress is measured via the fixed-point residual $\|z^{k} - T(z^{k})\|$, the KKT error, or a normalized duality gap. When the reduction in this metric stagnates or fails to meet preset decay thresholds, a restart is triggered (Lu et al., 18 Jul 2025).
- Halpern-type acceleration and reflection: Halpern steps anchor the iterates to the initialization, while reflection steps (e.g., replacing $T$ with $2T - \mathrm{Id}$) double the contraction factor, substantially accelerating convergence toward high accuracy (Lu et al., 18 Jul 2025); see the sketch after this list.
- Preconditioning: Scaling via diagonal matrices (e.g., Ruiz equilibration, Chambolle–Pock scaling) is employed to improve matrix condition numbers, enhancing both accuracy and convergence speed (Lu et al., 2 Jun 2025).
- Projected updates and infeasibility detection: Custom projections enforce all bounds and feasibility constraints. Analysis of the successive step differences $z^{k+1} - z^{k}$ yields infeasibility certificates and characterizes behavior on unbounded instances (Lu et al., 2 Jun 2025).
- PID-controlled primal weight: A proportional-integral-derivative mechanism dynamically tunes $\omega$ to balance the rate of progress between the primal and dual spaces (Lu et al., 18 Jul 2025).
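The interplay of reflection, Halpern anchoring, and residual-driven restarts can be summarized in a short schematic loop. The sketch below wraps an abstract one-step PDHG operator `T` (such as `pdhg_step` above, acting on the stacked iterate $z=(x,y)$); the restart rule and decay threshold are illustrative simplifications rather than the exact cuPDLP+ criteria, and the PID update of $\omega$ is omitted.

```python
import torch

def restarted_reflected_halpern(T, z0, max_iters=10_000, decay=0.5):
    """Schematic restarted, reflected Halpern iteration around an operator T
    (one PDHG step on the stacked iterate z = (x, y)). Illustrative only."""
    anchor, z = z0, z0
    res_at_restart = float("inf")
    k = 0                                            # iterations since the last restart
    for _ in range(max_iters):
        Tz = T(z)
        residual = torch.linalg.vector_norm(Tz - z)  # fixed-point residual ||z - T(z)||
        if residual <= decay * res_at_restart:       # sufficient decay triggers a restart
            anchor, res_at_restart, k = z, residual, 0
        reflected = 2 * Tz - z                                       # reflection step: 2T - Id
        z = (k + 1) / (k + 2) * reflected + 1 / (k + 2) * anchor     # Halpern average toward anchor
        k += 1
    return z
```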
4. Empirical Performance and Benchmarking
Comprehensive benchmarks indicate that GPU-Accelerated PDLP solvers exhibit performance either comparable to or exceeding state-of-the-art commercial solvers (e.g., Gurobi, COPT) on large-scale instances. Key findings include:
| Solver | Hardware | Benchmark Set | # Solved (𝜖=10⁻⁴) | Relative Performance | Comments |
|---|---|---|---|---|---|
| cuPDLP.jl | NVIDIA H100 | MIPLIB 2017 LP Relax | ~all | On par with Gurobi | 3–20× faster than CPU PDLP for large cases |
| cuPDLP-C | NVIDIA H100 80GB | Mittelmann’s LP | 379/383 | 2–4× slower than COPT | 16.5 h (COPT) vs 916 s (cuPDLP-C) on 'zib03' |
| cuPDLP+ | NVIDIA, large scale | MIPLIB 2017 LP Relax | ~all | 2–4× faster than cuPDLP | Improvements strongest at high accuracy |
| AMD PDHG | AMD MI325X | SCED + Netlib | N/A | 36× over CPU baseline | PyTorch/ROCm implementation |
As problem dimensionality grows (e.g., in the number of constraint-matrix nonzeros), first-order GPU-accelerated solvers consistently outperform interior-point and simplex-based routines, whose matrix factorizations become the bottleneck (Lu et al., 2023; Lu et al., 2 Jun 2025; Hu et al., 22 Aug 2025). In mixed-integer programming (MIP), GPU-accelerated PDLP deployed within primal heuristics enables a more extensive search and faster convergence, yielding 221 feasible solutions and a 22% primal gap on presolved MIPLIB 2017 instances (Çördük et al., 23 Oct 2025).
5. Implementation Ecosystem and Hardware Portability
The cuPDLP series exemplifies modern engineering for large-scale mathematical programming:
- Programming languages: Julia (cuPDLP.jl), C (cuPDLP-C), Python with PyTorch/ROCm for portable AMD/NVIDIA support (Hu et al., 22 Aug 2025).
- Libraries: CUDA.jl and cusparse/cuBLAS for NVIDIA, with PyTorch tensor libraries offering ready ROCm support for AMD GPUs.
- Kernel strategies: Custom CUDA kernels, single-dimension thread blocks, and memory-resident sparse data structures.
- Presolve and integration: Advanced presolve routines are included in the C implementation (cuPDLP-C) to reduce problem size before the iterations begin (Lu et al., 2023).
Cross-platform compatibility is highlighted by successful implementations on both NVIDIA (cuPDLP, cuPDLP+, cuPDLP-C) and AMD (PDHG with PyTorch/ROCm) hardware, demonstrating scalable performance and software portability (Hu et al., 22 Aug 2025).
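A small illustration of this portability: the same PyTorch code path serves both vendors because ROCm builds of PyTorch expose AMD GPUs through the `torch.cuda` interface (HIP backend); the version check below is one common way to distinguish the builds.

```python
import torch

# On NVIDIA, torch.cuda is backed by CUDA/cuSPARSE; a ROCm build of PyTorch exposes
# AMD GPUs through the same torch.cuda interface, so the solver sketches above run
# unchanged on either vendor.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    print("GPU:", torch.cuda.get_device_name(0))
print("ROCm build:", getattr(torch.version, "hip", None) is not None)
```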
6. Theoretical Guarantees and Analytical Framework
The theoretical underpinnings of GPU-Accelerated PDLP are substantiated by:
- Operator splitting perspective: PDHG is interpreted as preconditioned Douglas–Rachford splitting on the monotone KKT operator, guiding its convergence and acceleration properties (Lu et al., 2 Jun 2025).
- Convergence rates:
  - Sublinear (ergodic): an $O(1/k)$ rate on the primal-dual gap of the averaged iterate.
  - Sublinear (last iterate): an $O(1/\sqrt{k})$ rate on the fixed-point residual and primal-dual gap for plain PDHG, improved to $O(1/k)$ under Halpern acceleration.
  - Linear (sharpness): Under sharpness conditions (e.g., following Hoffman’s lemma), the KKT residual is lower-bounded by a multiple of the distance to the solution set, and restarted variants attain $O(\log(1/\epsilon))$ iteration complexity (Lu et al., 2 Jun 2025); a schematic statement follows this list.
- Restarts and acceleration: Theoretical motivation links Halpern and reflected accelerations to improved contraction factors and confirms global linear convergence in the presence of problem sharpness.
- Infeasibility detection: Convergence of step differences to an infimal displacement certificate enables early detection of infeasible or unbounded instances (Lu et al., 2 Jun 2025).
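For reference, a schematic LaTeX statement of the sharpness condition and the resulting linear-rate complexity is given below; the constant $\alpha$, the use of the KKT residual as the sharpness surrogate, and the $\|K\|_2/\alpha$ dependence are notational conventions consistent with the restarted-PDHG literature rather than a verbatim result from the cited papers.

```latex
% Sharpness (Hoffman-type): the KKT residual dominates the distance to the optimal set Z*.
\exists\, \alpha > 0 :\qquad
  \alpha \,\operatorname{dist}\!\left(z,\ \mathcal{Z}^{*}\right) \;\le\; \mathrm{KKT}(z)
  \qquad \text{for all admissible } z .

% Consequence: restarted (reflected Halpern) PDHG contracts the distance to Z* by a
% constant factor per restart cycle, yielding linear convergence overall:
%   dist(z^k, Z*) <= eps  within  O( (||K||_2 / alpha) * log(1/eps) )  iterations.
```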
7. Extensions and Domain Impact
The GPU-Accelerated PDLP paradigm generalizes to broader problem classes, including:
- Quadratic Programming: Momentum-accelerated PDHG (PDHCG) and PDQP handle large-scale QPs with strong GPU scalability and fast empirical convergence (Lu et al., 2 Jun 2025).
- Semidefinite and Conic Programming: Low-rank variable factorization (Burer–Monteiro) and first-order ADMM methods (e.g., cuLoRADS, ALORA) enable the solution of very large SDPs.
- Nonlinear Programming: GPU-enabled condensed-space interior-point algorithms, as in MadNLP, minimize dependency on serial factorizations by restructuring nonlinear KKT systems into matrix-vector computations.
- Mixed Integer Programming: Integration with GPU-accelerated PDLP as projection or relaxation oracles within primal heuristics (Feasibility Pump, Fix-and-Propagate, Efficient Local Search) significantly boosts the number of feasible integer solutions and narrows primal gaps on challenging benchmarks (Çördük et al., 23 Oct 2025); a schematic sketch follows this list.
- Industrial Optimization: Applications in SCED (power systems engineering), supply chain, finance, and logistics directly benefit due to their large, sparse LP structure and need for real-time or near-real-time solutions (Hu et al., 22 Aug 2025).
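To make the MIP integration concrete, the sketch below shows the Feasibility Pump pattern around an LP projection oracle; `solve_lp_projection` is a hypothetical placeholder standing in for a warm-started GPU PDLP solve of the distance-minimizing LP, and the stopping rule is simplified relative to the cited heuristics.

```python
import torch

def feasibility_pump(solve_lp_projection, x_lp, int_mask, max_rounds=100):
    """Schematic Feasibility Pump: alternate roundings with LP projections.
    `solve_lp_projection(target)` is a placeholder for a (warm-started) GPU PDLP
    solve of  min ||x_I - target_I||_1  over the LP relaxation; `int_mask` marks
    the integer-constrained coordinates. Illustrative sketch only."""
    x = x_lp
    for _ in range(max_rounds):
        x_round = torch.where(int_mask, torch.round(x), x)   # round integer coordinates
        if torch.allclose(x[int_mask], x_round[int_mask]):
            return x_round                                    # LP point is already integral
        x = solve_lp_projection(x_round)                      # project back onto the LP relaxation
    return None                                               # budget exhausted, no feasible point
```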
Summary Table: Key GPU-Accelerated PDLP Developments
| Solver | Language | Hardware | Principal Innovations | Benchmark Highlights |
|---|---|---|---|---|
| cuPDLP.jl | Julia + CUDA | NVIDIA | Restarted PDHG, GPU-native SpMV, KKT-based restart | Gurobi-comparable, 3–20× vs CPU |
| cuPDLP-C | C + CUDA | NVIDIA | Advanced presolve, C-optimization, full KKT monitoring | Superior for very large LPs |
| cuPDLP+ | C/Julia + CUDA | NVIDIA | Reflected Halpern PDHG, PID update, FP-residual restart | 2–4× faster than cuPDLP |
| AMD PDHG | PyTorch + ROCm | AMD | Cross-platform, "fishnet casting," adaptive step-size | 36× speedup on MI325X vs CPU |
| FP/MIP-Heur. | CUDA/C++ | NVIDIA | GPU-ELS, probing cache, bulk rounding, parallel propagation | 221 feasible, 22% gap (MIPLIB2017) |
GPU-Accelerated PDLP constitutes a scalable, highly parallel alternative to traditional factorization-based LP solvers, delivering strong empirical and theoretical performance, especially on modern GPU hardware, and enabling practical solution of previously intractable large-scale problems across diverse domains.