cuPDLP+: GPU-Based LP Solver

Updated 25 July 2025
  • cuPDLP+ is a GPU-based first-order solver for large-scale linear programs that employs Halpern PDHG with reflection, adaptive restart criteria, and PID-controlled weight adjustments.
  • It leverages custom GPU kernels and efficient data management to deliver a 2× to 4× speedup compared to previous versions, particularly in high-accuracy regimes.
  • The solver is practically applied in operations research and network optimization, exploiting modern GPU architectures like NVIDIA H100 for scalable performance.

cuPDLP+ is an enhanced GPU-based first-order solver for large-scale linear programming (LP) that advances the state of the art in parallel LP optimization. Building on the cuPDLP family, cuPDLP+ integrates algorithmic advances in restarted Halpern primal-dual hybrid gradient (PDHG) methods with reflection, a novel adaptive restart criterion, and a PID-controlled primal weight schedule. These innovations are engineered for GPU architectures, resulting in significant empirical improvements: on standard MIPLIB LP relaxations, cuPDLP+ achieves a 2× to 4× speedup over prior versions, with its strongest gains observed in high-accuracy and presolve-enabled regimes (Lu et al., 18 Jul 2025).

1. Algorithmic Innovations in cuPDLP+

cuPDLP+ introduces several interlinked methodological enhancements targeted at accelerating convergence and improving robustness for large-scale, sparse LPs on GPUs:

a) Restarted Halpern PDHG with Reflection

cuPDLP+ replaces the conventional restarted averaged PDHG methodology with a Halpern-iterated and reflected PDHG variant. The Halpern scheme interpolates between the most recent PDHG iterate and a fixed anchor point, formally:

$$z^{(k+1)} = \frac{k+1}{k+2}\,\text{PDHG}(z^k) + \frac{1}{k+2}\,z^0$$

The reflected update generalizes this by using an operator of the form $(1+\gamma)\,\text{PDHG} - \gamma\,\text{id}$ with $\gamma \in [0,1]$, yielding:

$$z^{(k+1)} = \frac{k+1}{k+2}\big[(1+\gamma)\,\text{PDHG}(z^k) - \gamma z^k\big] + \frac{1}{k+2}\,z^0$$

Reflection enables the method to take longer steps, leveraging recent theoretical results for sharper linear convergence under suitable conditions. The integration of this update in the parallel GPU context exploits the hardware's strength in simple, repeated vector operations and cheap projections (Lu et al., 18 Jul 2025).
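
To make the combined update concrete, here is a minimal NumPy sketch, not the cuPDLP+ implementation (which runs custom GPU kernels from Julia): it assumes a hypothetical standard-form LP $\min\{c^\top x : Ax = b,\ x \ge 0\}$, and the function names pdhg_step and reflected_halpern_pdhg are invented for illustration.

```python
import numpy as np

def pdhg_step(x, y, A, b, c, tau, sigma):
    """One PDHG step for the saddle point min_{x >= 0} max_y  c'x + y'(b - Ax)."""
    x_new = np.maximum(x - tau * (c - A.T @ y), 0.0)  # gradient step + projection onto x >= 0
    y_new = y + sigma * (b - A @ (2.0 * x_new - x))   # dual step with primal extrapolation
    return x_new, y_new

def reflected_halpern_pdhg(A, b, c, eta, omega, gamma=1.0, iters=1000):
    """Core loop without restarts: Halpern interpolation toward an anchor z0,
    applied to the reflected operator (1 + gamma) * PDHG - gamma * id."""
    m, n = A.shape
    x, y = np.zeros(n), np.zeros(m)
    x0, y0 = x.copy(), y.copy()            # Halpern anchor z0 (reset at each restart)
    tau, sigma = eta / omega, eta * omega  # primal/dual stepsizes split by the weight omega
    for k in range(iters):
        xT, yT = pdhg_step(x, y, A, b, c, tau, sigma)
        # Reflection: take the longer step (1 + gamma) * PDHG(z) - gamma * z.
        xr = (1.0 + gamma) * xT - gamma * x
        yr = (1.0 + gamma) * yT - gamma * y
        # Halpern interpolation: weight (k+1)/(k+2) on the operator output, 1/(k+2) on z0.
        w = (k + 1.0) / (k + 2.0)
        x = w * xr + (1.0 - w) * x0
        y = w * yr + (1.0 - w) * y0
    return x, y
```

Note how the anchor $z^0$ enters every iterate with weight $1/(k+2)$; a restart, discussed next, simply resets that anchor to the current point.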

b) Adaptive Restart Criterion

cuPDLP+ utilizes a new restart rule based on the trajectory of a fixed-point error metric:

$$r(z) = \|z - \text{PDHG}(z)\|_P$$

where $P$ depends on the PDHG stepsize and the primal-dual weight. Restarts are triggered under several conditions:

  • Sufficient decay, where $r(z^{(n,k)})$ drops below a fraction of its epoch-starting value;
  • Necessary decay plus no local progress, where the error is low but non-monotonic;
  • Artificial epoch bound, when the iteration count within an epoch exceeds a fixed threshold.

This approach ensures that the anchor for Halpern updates remains contextually relevant, improving stability and robustness, especially for challenging or degenerate LPs (Lu et al., 18 Jul 2025); the sketch below illustrates the three tests.
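
A schematic version of these tests might look as follows; the function name, threshold names, and default values are illustrative placeholders, not the paper's constants.

```python
def should_restart(r_current, r_epoch_start, r_last, k_in_epoch,
                   beta_sufficient=0.2, beta_necessary=0.8, k_max=1000):
    """Hypothetical restart test in the spirit of cuPDLP+'s three criteria."""
    # (a) Sufficient decay: the fixed-point error fell well below its
    #     value at the start of the current epoch.
    if r_current <= beta_sufficient * r_epoch_start:
        return True
    # (b) Necessary decay + no local progress: the error is low overall
    #     but has stopped decreasing monotonically.
    if r_current <= beta_necessary * r_epoch_start and r_current > r_last:
        return True
    # (c) Artificial epoch bound: too many iterations within this epoch.
    if k_in_epoch >= k_max:
        return True
    return False
```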

c) PID-Controlled Primal Weight Adaptation

The PDHG update utilizes a primal weight parameter $\omega$ to scale primal relative to dual steps. cuPDLP+ introduces a proportional-integral-derivative (PID) controller for automatically tuning $\omega$, based on the observed imbalance between primal and dual progress:

$$e^n = \log \left( \frac{\sqrt{\omega^n}\,\|x^{(n,t)} - x^*\|_2}{\|y^{(n,t)} - y^*\|_2 / \sqrt{\omega^n}} \right)$$

The update for the primal weight is given by:

$$\log \omega^{n+1} = \log \omega^n - \left( K_P\, e^n + K_I \sum_{i=1}^{n} e^i + K_D \left( e^n - e^{n-1} \right) \right)$$

This dynamic, feedback-based adjustment improves the algorithm's balance and empirical rate of convergence (Lu et al., 18 Jul 2025).
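
A minimal sketch of such a controller follows, assuming the error signal defined above is evaluated at each restart; the class name, gains, and epsilon guard are illustrative placeholders, not the paper's tuned values.

```python
import math

class PrimalWeightPID:
    """Hypothetical PID controller for the primal weight omega."""

    def __init__(self, omega0=1.0, kp=0.1, ki=0.01, kd=0.05):
        self.log_omega = math.log(omega0)
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, primal_move, dual_move, eps=1e-12):
        """primal_move, dual_move: distances moved by x and y since the
        last restart; returns the new primal weight omega."""
        omega = math.exp(self.log_omega)
        # Log-imbalance between weighted primal and dual progress.
        e = math.log((math.sqrt(omega) * primal_move + eps)
                     / (dual_move / math.sqrt(omega) + eps))
        self.integral += e
        derivative = e - self.prev_error
        # PID correction applied in log space, mirroring the update rule above.
        self.log_omega -= (self.kp * e + self.ki * self.integral
                           + self.kd * derivative)
        self.prev_error = e
        return math.exp(self.log_omega)
```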

2. Implementation and GPU-Specific Design

cuPDLP+ is written in Julia and makes extensive use of the CUDA.jl interface, targeting NVIDIA GPUs such as the H100 SXM (80GB, CUDA 12.4 in benchmark settings). The implementation keeps all major data structures — including the constraint matrix (in compressed sparse row format) and iterates — fully resident on GPU memory. All core computational kernels, particularly sparse matrix–vector multiplication, vector updates, and projections, are implemented as custom GPU kernels, ensuring that the algorithm exploits massive data parallelism (Lu et al., 18 Jul 2025).

A constant stepsize $\eta = 0.99/\|A\|_2$ is chosen, with the spectral norm $\|A\|_2$ typically estimated via power iteration on the GPU. The stepsize and primal-dual scaling parameters are consistently propagated across all PDHG, Halpern, and reflection computations, minimizing parameter synchronization overhead.
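
The spectral-norm estimate can be obtained by power iteration on $A^\top A$; the following CPU-side NumPy sketch illustrates the idea (cuPDLP+ would run the analogous loop with GPU kernels), with the function name and tolerances chosen for illustration.

```python
import numpy as np

def estimate_spectral_norm(A, iters=100, tol=1e-6, seed=0):
    """Estimate ||A||_2 (largest singular value) by power iteration on A'A."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    sigma = 0.0
    for _ in range(iters):
        w = A.T @ (A @ v)                  # one power-iteration step on A'A
        norm_w = np.linalg.norm(w)
        if norm_w == 0.0:                  # A is (numerically) zero
            return 0.0
        sigma_new = np.sqrt(norm_w)        # eigenvalues of A'A are squared singular values
        v = w / norm_w
        if abs(sigma_new - sigma) <= tol * max(sigma_new, 1.0):
            return sigma_new
        sigma = sigma_new
    return sigma

# Stepsize as described above: eta = 0.99 / estimate_spectral_norm(A)
```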

3. Empirical Performance and Benchmarks

Comprehensive benchmarks on the MIPLIB 2017 LP relaxations (383 instances) reveal the practical effectiveness of cuPDLP+. Measured by both raw wall-clock time and shifted geometric mean (SGM10), cuPDLP+ achieves:

  • An overall 2× to 4× speedup relative to cuPDLP on the full benchmark.
  • In high-accuracy regimes ($\epsilon = 10^{-8}$, with Gurobi presolve), a speedup of 2.9× across all instances and up to 4.63× on hard cases (instances requiring more than 10 s to solve with either solver).
  • Significantly higher instance completion counts within given time limits at both moderate and high tolerances.

Performance gains are especially marked for large instances (those with high dimension or nonzero count), reflecting the scalability of the approach and the effective use of GPU memory bandwidth and compute throughput (Lu et al., 18 Jul 2025).
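
For reference, SGM10 presumably follows the standard benchmarking convention: a geometric mean of solve times shifted by 10 seconds, which damps the influence of very easy instances. A minimal sketch, assuming that convention:

```python
import math

def shifted_geometric_mean(times, shift=10.0):
    """Shifted geometric mean of solve times (SGM10 when shift = 10 s),
    computed in log space for numerical stability."""
    logs = [math.log(t + shift) for t in times]
    return math.exp(sum(logs) / len(logs)) - shift
```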

4. Comparison with Related Solvers

cuPDLP+ represents an evolution over the cuPDLP family and related GPU-based LP solvers. Compared to:

  • Previous releases such as cuPDLP (raPDHG, basic restart, fixed or manually tuned primal weight), cuPDLP+ provides superior stability, faster convergence, and more aggressive yet well-controlled progress, due to the combined effect of reflection, adaptive restarts, and PID feedback (Lu et al., 18 Jul 2025).
  • Other established GPU-based solvers (including HPR-LP and PDCS variants), cuPDLP+ demonstrates comparable or lower wall-clock times on both small and large LP instances, and it offers improved accuracy in high-precision regimes (Lu et al., 18 Jul 2025, Lin et al., 1 May 2025).
  • Commercial solvers (such as Gurobi and COPT) in regimes where first-order methods are competitive (moderate or lower accuracy, very large models), cuPDLP+ often matches or outperforms traditional simplex or interior-point solvers, especially where the latter's cost of matrix factorization becomes prohibitive (Lu et al., 2023, Lu et al., 2 Jun 2025).

5. Practical Applications and Scalability

cuPDLP+ is optimized for sparse, large-scale linear programs arising in fields such as operations research, network optimization, supply chain management, portfolio optimization, and scientific computing. The design principles — matrix–vector computations, zero or minimal CPU-GPU communication during optimization, and efficient restarts for non-smooth or ill-conditioned problems — make the method well suited for industrial and scientific workloads involving millions of variables and constraints (Lu et al., 2 Jun 2025, Lin et al., 1 May 2025).

Moreover, the solver’s performance scales with both available GPU memory and floating-point throughput. On state-of-the-art hardware (NVIDIA H100 GPU with 80GB HBM3 memory), the implementation achieves high throughput and is able to handle dense or highly structured LPs that may be out of reach for traditional second-order methods due to memory or computational bottlenecks (Lu et al., 2023, Lu et al., 18 Jul 2025).

6. Broader Context and Extensions

The theoretical and architectural design of cuPDLP+ generalizes to other first-order frameworks, including convex quadratic programming, conic programming, and, with suitable operator and projection modifications, large-scale semidefinite programming and nonlinear programming (Lin et al., 1 May 2025, Lu et al., 2 Jun 2025). The integration of Halpern-type acceleration, restart criteria, and control-theoretic parameter scheduling is a template that is being adopted in other contemporary GPU-based solvers.

Within the broader cuPDLP+ family (an editor's term for this lineage of enhanced GPU-based first-order LP solvers), ongoing research includes incorporating advanced online preconditioning (Lu et al., 21 Jun 2025), hybrid CPU–GPU algorithms, and specialization for allied problem classes, confirming the lasting impact of these algorithm-architecture codesign principles on scalable optimization.


cuPDLP+ is a state-of-the-art GPU-based linear programming solver that combines recent advances in first-order algorithm design with practical engineering for modern parallel hardware. Its performance and methodology exemplify current best practices at the intersection of numerical optimization, control theory, and high-performance GPU computation, and it has set a new benchmark for large-scale LP in both academic and applied settings (Lu et al., 18 Jul 2025).