GPU-Accelerated PDLP Optimization
- GPU-accelerated PDLP is a parallelized primal-dual optimization approach that reformulates linear programs as saddle-point problems to leverage GPU efficiency.
- It utilizes sparse matrix–vector multiplications and custom GPU kernels, achieving significant speedups over traditional CPU-based solvers.
- Advanced features such as restarted schemes, Halpern-type acceleration, and adaptive step-sizing enhance convergence and scalability for large-scale applications.
GPU-accelerated Primal-Dual Linear Programming (PDLP) refers to a class of first-order optimization methods—especially the Primal-Dual Hybrid Gradient (PDHG) family—deployed on graphics processing units (GPUs) to solve large-scale linear programming (LP) problems. Utilizing the inherent parallelism and memory bandwidth of modern GPUs, recent developments such as cuPDLP, cuPDLP-C, cuPDLP+, and adaptations for both NVIDIA and AMD hardware have significantly advanced the scalability, robustness, and empirical performance of first-order LP solvers. This paradigm has shifted the longstanding computational landscape dominated by simplex and interior-point algorithms toward highly parallelizable, factorization-free approaches. The following sections survey the foundational principles, algorithmic structures, empirical benchmarks, implementation strategies, theoretical frameworks, and broader implications of GPU-Accelerated PDLP.
1. Algorithmic Foundations and Reformulation
The underlying algorithm in GPU-Accelerated PDLP is the Primal-Dual Hybrid Gradient (PDHG) method and its restarted/accelerated variants. The classical linear program is reformulated as a saddle-point problem to expose both primal and dual variables to iterative updates based on matrix-vector multiplications and projections, operations highly amenable to GPU parallelism. A canonical PDHG iteration (in its averaged or Halpern-accelerated form) is:

$$x^{k+1} = \operatorname{proj}_{X}\!\left(x^{k} - \tau\,(c - K^{\top} y^{k})\right), \qquad y^{k+1} = \operatorname{proj}_{Y}\!\left(y^{k} + \sigma\,\bigl(q - K(2x^{k+1} - x^{k})\bigr)\right),$$

with step-sizes $\tau = \eta/\omega$ and $\sigma = \eta\omega$, scaling parameter $\eta$ and primal weight $\omega$, and projectors $\operatorname{proj}_{X}$, $\operatorname{proj}_{Y}$ encoding the variable bounds and dual feasibility. The matrix $K$ concatenates the equality and inequality constraints; $q$ encodes their right-hand sides. Enhanced schemes such as cuPDLP+ employ reflected and restarted Halpern-type averages:

$$z^{k+1} = \frac{k+1}{k+2}\,\bigl(2T(z^{k}) - z^{k}\bigr) + \frac{1}{k+2}\,z^{0},$$

with $z = (x, y)$, $T$ the PDHG operator, and $z^{0}$ the anchor (the initial point or the most recent restart point). Fixed-point residuals, KKT error metrics, or the primal-dual gap serve as the backbone for monitoring convergence and triggering restarts.
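For concreteness, the update above can be written in a few lines of GPU-resident code. The following is a minimal, illustrative PyTorch sketch (not the cuPDLP implementation) for a standard-form LP $\min\, c^\top x$ s.t. $Kx = q$, $x \ge 0$, so that the primal projection is a nonnegativity clamp and the dual projection is the identity; the toy problem data are assumptions made for the example.

```python
import torch

def pdhg_step(x, y, K, c, q, eta, omega):
    """One PDHG iteration for min c^T x  s.t.  Kx = q, x >= 0, viewed as the
    saddle point  min_x max_y  c^T x - y^T (Kx - q).  Illustrative sketch only."""
    tau, sigma = eta / omega, eta * omega                     # primal / dual step-sizes
    x_new = torch.clamp(x - tau * (c - K.T @ y), min=0.0)     # projection onto x >= 0
    y_new = y + sigma * (q - K @ (2 * x_new - x))             # equality duals: no projection
    return x_new, y_new

# Tiny dense toy problem; in practice K is a sparse CSR tensor resident on the GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
K = torch.tensor([[1.0, 1.0, 1.0]], device=device)
q = torch.tensor([1.0], device=device)
c = torch.tensor([1.0, 2.0, 3.0], device=device)
x, y = torch.zeros(3, device=device), torch.zeros(1, device=device)
eta = 0.9 / torch.linalg.matrix_norm(K, 2)                   # eta < 1/||K||_2 for stability
for _ in range(1000):
    x, y = pdhg_step(x, y, K, c, q, eta, omega=1.0)          # iterates approach x = (1, 0, 0)
```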
2. GPU-Centric Optimizations and Kernel Design
A central tenet of GPU-Accelerated PDLP solvers is maximizing in-device computation, thus minimizing expensive host-device data transfers. The algorithm's dominance of sparse matrix–vector multiplications (SpMV) and vector updates allows nearly total mapping of computational flow to GPU. Key strategies include:
- Storage of the constraint matrix in compressed sparse row (CSR) format (a minimal sketch follows this list).
- Invocation of high-throughput GPU libraries, e.g., cusparseSpMV() with algorithm CUSPARSE_SPMV_CSR_ALG2 on NVIDIA and optimized PyTorch/ROCm backends on AMD (Hu et al., 22 Aug 2025).
- Custom kernel code for per-coordinate updates, reflecting the SIMD architecture and efficiently assigning warps/threads.
- Warm-starting techniques leverage solution continuity across related subproblems, as in their use within Feasibility Pump heuristics for MIP (Çördük et al., 23 Oct 2025).
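As a concrete illustration of the first two points, the sketch below builds a CSR constraint matrix with PyTorch's sparse-CSR support and applies it on the GPU; on NVIDIA hardware these products dispatch to cuSPARSE kernels, and the same code runs through a ROCm build on AMD. The matrix entries are placeholders, and building the transpose through a dense detour is a simplification for brevity.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# A small 2x3 constraint matrix K in CSR form (row pointers, column indices, values).
K = torch.sparse_csr_tensor(
    torch.tensor([0, 2, 3]),           # crow_indices: row i holds nonzeros [crow[i], crow[i+1])
    torch.tensor([0, 2, 1]),           # col_indices of the nonzeros
    torch.tensor([1.0, -1.0, 2.0]),    # nonzero values
    size=(2, 3),
).to(device)

# Solvers typically also keep K^T in CSR so that both K @ x and K^T @ y are
# row-parallel SpMV calls; here the transpose is built densely only for brevity.
Kt = K.to_dense().t().contiguous().to_sparse_csr()

x = torch.tensor([[1.0], [2.0], [3.0]], device=device)   # primal iterate (column vector)
y = torch.tensor([[0.5], [-0.5]], device=device)          # dual iterate

Kx = K @ x       # SpMV used in the dual update
Kty = Kt @ y     # SpMV used in the primal gradient step
```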
Empirical tuning of the step-size ($\eta$) and primal weight ($\omega$) on modern GPUs is critical; recent innovations include PID-controlled adaptive updates to $\omega$ and selection of a constant $\eta$ from an approximate spectral norm $\|K\|_2$ computed via power iteration (sketched below).
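A minimal sketch of the power-iteration estimate of $\|K\|_2$ follows; the iteration count is an illustrative choice, and `K` and `Kt` stand for the constraint matrix and its transpose from the sketch above.

```python
import torch

def spectral_norm_estimate(K, Kt, iters=50):
    """Estimate ||K||_2 by power iteration on K^T K; K and Kt (its transpose)
    may be dense or sparse CSR tensors on the GPU. Illustrative sketch only."""
    v = torch.randn(K.shape[1], 1, device=K.device)
    v = v / torch.linalg.vector_norm(v)
    for _ in range(iters):
        w = Kt @ (K @ v)                         # two SpMVs per power-iteration step
        v = w / torch.linalg.vector_norm(w)
    # ||K^T K v|| ~ lambda_max(K^T K) = sigma_max(K)^2 once v has converged
    return torch.sqrt(torch.linalg.vector_norm(Kt @ (K @ v)))

# A constant step-size can then be sized as, e.g., eta = 0.9 / spectral_norm_estimate(K, Kt),
# while omega is adapted online (see the PID-controlled update in Section 3).
```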
3. Algorithmic Enhancements: Restarts, Acceleration, and Adaptivity
Outperforming vanilla PDHG in practice requires multiple layers of heuristic and theoretical refinements:
- Restarted schemes: Progress is measured via the fixed-point residual $\|z^{k} - T(z^{k})\|$, the KKT error, or a normalized duality gap. When the reduction in this metric stagnates or fails to meet preset decay thresholds, a restart is triggered (Lu et al., 18 Jul 2025).
- Halpern-type acceleration and reflection: Halpern steps anchor the iterates to the initialization, while reflection steps (e.g., replacing $T$ with $2T - \mathrm{Id}$) double the contraction factor, substantially accelerating convergence toward high accuracy (Lu et al., 18 Jul 2025); see the sketch after this list.
- Preconditioning: Scaling via diagonal matrices (e.g., Ruiz equilibration, Chambolle–Pock scaling) is employed to improve matrix condition numbers, enhancing both accuracy and convergence speed (Lu et al., 2 Jun 2025).
- Projected updates and infeasibility detection: Custom projections enforce all bounds and feasibility constraints. Analysis of the successive step differences $z^{k+1} - z^{k}$ yields infeasibility certificates and characterizes behavior on unbounded instances (Lu et al., 2 Jun 2025).
- PID-controlled primal weight: A proportional-integral-derivative mechanism dynamically tunes $\omega$ to balance the rate of progress between the primal and dual spaces (Lu et al., 18 Jul 2025).
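The interplay of reflection, Halpern anchoring, and residual-driven restarts can be summarized in a short schematic loop. The sketch below wraps an abstract one-step PDHG operator `T` (such as `pdhg_step` above, acting on the stacked iterate $z=(x,y)$); the restart rule and decay threshold are illustrative simplifications rather than the exact cuPDLP+ criteria, and the PID update of $\omega$ is omitted.

```python
import torch

def restarted_reflected_halpern(T, z0, max_iters=10_000, decay=0.5):
    """Schematic restarted, reflected Halpern iteration around an operator T
    (one PDHG step on the stacked iterate z = (x, y)). Illustrative only."""
    anchor, z = z0, z0
    res_at_restart = float("inf")
    k = 0                                            # iterations since the last restart
    for _ in range(max_iters):
        Tz = T(z)
        residual = torch.linalg.vector_norm(Tz - z)  # fixed-point residual ||z - T(z)||
        if residual <= decay * res_at_restart:       # sufficient decay triggers a restart
            anchor, res_at_restart, k = z, residual, 0
        reflected = 2 * Tz - z                                       # reflection step: 2T - Id
        z = (k + 1) / (k + 2) * reflected + 1 / (k + 2) * anchor     # Halpern average toward anchor
        k += 1
    return z
```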
4. Empirical Performance and Benchmarking
Comprehensive benchmarks indicate that GPU-Accelerated PDLP solvers exhibit performance either comparable to or exceeding state-of-the-art commercial solvers (e.g., Gurobi, COPT) on large-scale instances. Key findings include:
| Solver | Hardware | Benchmark Set | # Solved (𝜖=10⁻⁴) | Relative Performance | Comments |
|---|---|---|---|---|---|
| cuPDLP.jl | NVIDIA H100 | MIPLIB 2017 LP Relax | ~all | On par with Gurobi | 3–20× faster than CPU PDLP for large cases |
| cuPDLP-C | NVIDIA H100 80GB | Mittelmann’s LP | 379/383 | 2–4× slower than COPT | 16.5 h (COPT) vs 916 s (cuPDLP-C) on 'zib03' |
| cuPDLP+ | NVIDIA, large scale | MIPLIB 2017 LP Relax | ~all | 2–4× faster than cuPDLP | Improvements strongest at high accuracy |
| AMD PDHG | AMD MI325X | SCED + Netlib | N/A | 36× over CPU baseline | PyTorch/ROCm implementation |
As problem dimensionality grows (e.g., in the number of constraint-matrix nonzeros), first-order GPU-accelerated solvers consistently outperform interior-point and simplex-based routines, whose matrix factorizations become the bottleneck (Lu et al., 2023; Lu et al., 2 Jun 2025; Hu et al., 22 Aug 2025). In mixed-integer programming (MIP), GPU-accelerated PDLP deployed within primal heuristics enables a more extensive search and faster convergence, yielding 221 feasible solutions and a 22% primal gap on presolved MIPLIB 2017 instances (Çördük et al., 23 Oct 2025).
5. Implementation Ecosystem and Hardware Portability
The cuPDLP series exemplifies modern engineering for large-scale mathematical programming:
- Programming languages: Julia (cuPDLP.jl), C (cuPDLP-C), Python with PyTorch/ROCm for portable AMD/NVIDIA support (Hu et al., 22 Aug 2025).
- Libraries: CUDA.jl and cusparse/cuBLAS for NVIDIA, with PyTorch tensor libraries offering ready ROCm support for AMD GPUs.
- Kernel strategies: Custom CUDA kernels, single-dimension thread blocks, and memory-resident sparse data structures.
- Presolve and integration: Advanced presolve routines are included in the C implementation (cuPDLP-C) to reduce problem size before the iterations begin (Lu et al., 2023).
Cross-platform compatibility is highlighted by successful implementations on both NVIDIA (cuPDLP, cuPDLP+, cuPDLP-C) and AMD (PDHG with PyTorch/ROCm) hardware, demonstrating scalable performance and software portability (Hu et al., 22 Aug 2025).
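A small illustration of this portability: the same PyTorch code path serves both vendors because ROCm builds of PyTorch expose AMD GPUs through the `torch.cuda` interface (HIP backend); the version check below is one common way to distinguish the builds.

```python
import torch

# On NVIDIA, torch.cuda is backed by CUDA/cuSPARSE; a ROCm build of PyTorch exposes
# AMD GPUs through the same torch.cuda interface, so the solver sketches above run
# unchanged on either vendor.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    print("GPU:", torch.cuda.get_device_name(0))
print("ROCm build:", getattr(torch.version, "hip", None) is not None)
```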
6. Theoretical Guarantees and Analytical Framework
The theoretical underpinnings of GPU-Accelerated PDLP are substantiated by:
- Operator splitting perspective: PDHG is interpreted as preconditioned Douglas–Rachford splitting on the monotone KKT operator, guiding its convergence and acceleration properties (Lu et al., 2 Jun 2025).
- Convergence rates:
  - Sublinear (ergodic): an $O(1/k)$ rate on the primal-dual gap of the averaged iterate.
  - Sublinear (last iterate): an $O(1/\sqrt{k})$ rate on the fixed-point residual and primal-dual gap for plain PDHG, improved to $O(1/k)$ under Halpern acceleration.
  - Linear (sharpness): Under sharpness conditions (e.g., following Hoffman’s lemma), the KKT residual is lower-bounded by a multiple of the distance to the solution set, and restarted variants attain $O(\log(1/\epsilon))$ iteration complexity (Lu et al., 2 Jun 2025); a schematic statement follows this list.
- Restarts and acceleration: Theoretical motivation links Halpern and reflected accelerations to improved contraction factors and confirms global linear convergence in the presence of problem sharpness.
- Infeasibility detection: Convergence of step differences to an infimal displacement certificate enables early detection of infeasible or unbounded instances (Lu et al., 2 Jun 2025).
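For reference, a schematic LaTeX statement of the sharpness condition and the resulting linear-rate complexity is given below; the constant $\alpha$, the use of the KKT residual as the sharpness surrogate, and the $\|K\|_2/\alpha$ dependence are notational conventions consistent with the restarted-PDHG literature rather than a verbatim result from the cited papers.

```latex
% Sharpness (Hoffman-type): the KKT residual dominates the distance to the optimal set Z*.
\exists\, \alpha > 0 :\qquad
  \alpha \,\operatorname{dist}\!\left(z,\ \mathcal{Z}^{*}\right) \;\le\; \mathrm{KKT}(z)
  \qquad \text{for all admissible } z .

% Consequence: restarted (reflected Halpern) PDHG contracts the distance to Z* by a
% constant factor per restart cycle, yielding linear convergence overall:
%   dist(z^k, Z*) <= eps  within  O( (||K||_2 / alpha) * log(1/eps) )  iterations.
```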
7. Extensions and Domain Impact
The GPU-Accelerated PDLP paradigm generalizes to broader problem classes, including:
- Quadratic Programming: Momentum-accelerated PDHG (PDHCG) and PDQP handle large-scale QPs with strong GPU scalability and fast empirical convergence (Lu et al., 2 Jun 2025).
- Semidefinite and Conic Programming: Low-rank variable factorization (Burer–Monteiro) and first-order ADMM methods (e.g., cuLoRADS, ALORA) enable the solution of very large SDPs.
- Nonlinear Programming: GPU-enabled condensed-space interior-point algorithms, as in MadNLP, minimize dependency on serial factorizations by restructuring nonlinear KKT systems into matrix-vector computations.
- Mixed Integer Programming: Integration with GPU-accelerated PDLP as projection or relaxation oracles within primal heuristics (Feasibility Pump, Fix-and-Propagate, Efficient Local Search) significantly boosts the number of feasible integer solutions and narrows primal gaps on challenging benchmarks (Çördük et al., 23 Oct 2025); a schematic sketch follows this list.
- Industrial Optimization: Applications in SCED (power systems engineering), supply chain, finance, and logistics directly benefit due to their large, sparse LP structure and need for real-time or near-real-time solutions (Hu et al., 22 Aug 2025).
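To make the MIP integration concrete, the sketch below shows the Feasibility Pump pattern around an LP projection oracle; `solve_lp_projection` is a hypothetical placeholder standing in for a warm-started GPU PDLP solve of the distance-minimizing LP, and the stopping rule is simplified relative to the cited heuristics.

```python
import torch

def feasibility_pump(solve_lp_projection, x_lp, int_mask, max_rounds=100):
    """Schematic Feasibility Pump: alternate roundings with LP projections.
    `solve_lp_projection(target)` is a placeholder for a (warm-started) GPU PDLP
    solve of  min ||x_I - target_I||_1  over the LP relaxation; `int_mask` marks
    the integer-constrained coordinates. Illustrative sketch only."""
    x = x_lp
    for _ in range(max_rounds):
        x_round = torch.where(int_mask, torch.round(x), x)   # round integer coordinates
        if torch.allclose(x[int_mask], x_round[int_mask]):
            return x_round                                    # LP point is already integral
        x = solve_lp_projection(x_round)                      # project back onto the LP relaxation
    return None                                               # budget exhausted, no feasible point
```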
Summary Table: Key GPU-Accelerated PDLP Developments
| Solver | Language | Hardware | Principal Innovations | Benchmark Highlights |
|---|---|---|---|---|
| cuPDLP.jl | Julia + CUDA | NVIDIA | Restarted PDHG, GPU-native SpMV, KKT-based restart | Gurobi-comparable, 3–20× vs CPU |
| cuPDLP-C | C + CUDA | NVIDIA | Advanced presolve, C-optimization, full KKT monitoring | Superior for very large LPs |
| cuPDLP+ | C/Julia + CUDA | NVIDIA | Reflected Halpern PDHG, PID update, FP-residual restart | 2–4× faster than cuPDLP |
| AMD PDHG | PyTorch + ROCm | AMD | Cross-platform, "fishnet casting," adaptive step-size | 36× speedup on MI325X vs CPU |
| FP/MIP-Heur. | CUDA/C++ | NVIDIA | GPU-ELS, probing cache, bulk rounding, parallel propagation | 221 feasible, 22% gap (MIPLIB2017) |
GPU-Accelerated PDLP constitutes a scalable, highly parallel alternative to traditional factorization-based LP solvers, delivering strong empirical and theoretical performance, especially on modern GPU hardware, and enabling practical solution of previously intractable large-scale problems across diverse domains.