GPU-Accelerated PDLP Optimization

Updated 24 October 2025
  • GPU-accelerated PDLP is a parallelized primal-dual optimization approach that reformulates linear programs as saddle-point problems to leverage GPU efficiency.
  • It utilizes sparse matrix–vector multiplications and custom GPU kernels, achieving significant speedups over traditional CPU-based solvers.
  • Advanced features such as restarted schemes, Halpern-type acceleration, and adaptive step-sizing enhance convergence and scalability for large-scale applications.

GPU-accelerated Primal-Dual Linear Programming (PDLP) refers to a class of first-order optimization methods—especially the Primal-Dual Hybrid Gradient (PDHG) family—deployed on graphics processing units (GPUs) to solve large-scale linear programming (LP) problems. Utilizing the inherent parallelism and memory bandwidth of modern GPUs, recent developments such as cuPDLP, cuPDLP-C, cuPDLP+, and adaptations for both NVIDIA and AMD hardware have significantly advanced the scalability, robustness, and empirical performance of first-order LP solvers. This paradigm has shifted the longstanding computational landscape dominated by simplex and interior-point algorithms toward highly parallelizable, factorization-free approaches. The following sections survey the foundational principles, algorithmic structures, empirical benchmarks, implementation strategies, theoretical frameworks, and broader implications of GPU-Accelerated PDLP.

1. Algorithmic Foundations and Reformulation

The underlying algorithm in GPU-Accelerated PDLP is the Primal-Dual Hybrid Gradient (PDHG) method together with its restarted and accelerated variants. The classical linear program is reformulated as a saddle-point problem so that both primal and dual variables are updated iteratively via matrix-vector multiplications and projections—operations highly amenable to GPU parallelism. The canonical PDHG iteration is:

\begin{aligned}
x^{(k+1)} &= \mathrm{proj}_{X}\!\left(x^{(k)} - \tau\,(c - K^\top y^{(k)})\right), \\
y^{(k+1)} &= \mathrm{proj}_{Y}\!\left(y^{(k)} + \sigma\,(q - K(2x^{(k+1)} - x^{(k)}))\right),
\end{aligned}

with step sizes τ and σ, scaling parameter η, and primal weight ω, and with projectors encoding the variable bounds and dual feasibility. The matrix K concatenates the equality and inequality constraints, and q collects their right-hand sides. Enhanced schemes such as cuPDLP+ employ reflected and restarted Halpern-type averages:

z^{(k+1)} = \frac{k+1}{k+2}\,\big((1+\gamma)\,\mathrm{PDHG}(z^{(k)}) - \gamma z^{(k)}\big) + \frac{1}{k+2}\,z^0,

with γ ∈ [0, 1]. Fixed-point residuals, KKT error metrics, or the primal-dual gap serve as the backbone for monitoring convergence and triggering restarts.
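
To make the update concrete, the following is a minimal CPU sketch (NumPy/SciPy) of the base PDHG step and the reflected Halpern averaging above, assuming box-constrained primal variables l ≤ x ≤ u and inequality constraints Kx ≥ q with nonnegative duals. The function names and step-size choice are illustrative; production solvers such as cuPDLP+ add diagonal scaling, adaptive step sizes, and restarts on top of this skeleton.

```python
# Minimal sketch of PDHG with reflected Halpern anchoring for
#   min c^T x  s.t.  K x >= q,  l <= x <= u   (dual variables y >= 0).
# Illustrative only: no scaling, restarts, or adaptive step sizes.
# K is assumed to be a SciPy sparse matrix (e.g., CSR).
import numpy as np
import scipy.sparse.linalg as spla

def pdhg_step(x, y, K, c, q, l, u, tau, sigma):
    """One PDHG iteration: projected primal step, then projected dual step."""
    x_new = np.clip(x - tau * (c - K.T @ y), l, u)                     # proj_X
    y_new = np.maximum(y + sigma * (q - K @ (2.0 * x_new - x)), 0.0)   # proj_Y
    return x_new, y_new

def reflected_halpern_pdhg(K, c, q, l, u, iters=5000, gamma=1.0):
    m, n = K.shape
    # The Frobenius norm upper-bounds ||K||_2, so this constant step size is
    # safe (if conservative); practical codes estimate ||K||_2 by power iteration.
    eta = 0.9 / spla.norm(K)
    x0, y0 = np.zeros(n), np.zeros(m)
    x, y = x0.copy(), y0.copy()
    for k in range(iters):
        tx, ty = pdhg_step(x, y, K, c, q, l, u, eta, eta)
        w = (k + 1.0) / (k + 2.0)   # Halpern weight (k+1)/(k+2)
        x = w * ((1.0 + gamma) * tx - gamma * x) + (1.0 - w) * x0
        y = w * ((1.0 + gamma) * ty - gamma * y) + (1.0 - w) * y0
    return x, y
```

On a GPU the two sparse matrix–vector products per iteration dominate the cost, which is precisely what makes the method amenable to the kernel-level optimizations discussed next.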

2. GPU-Centric Optimizations and Kernel Design

A central tenet of GPU-Accelerated PDLP solvers is maximizing in-device computation, thus minimizing expensive host-device data transfers. Because the algorithm is dominated by sparse matrix–vector multiplications (SpMV) and vector updates, nearly the entire computational flow can be mapped onto the GPU; a minimal device-resident loop is sketched after the list below. Key strategies include:

  • Storage of the constraint matrix in compressed sparse row (CSR) format.
  • Invocation of high-throughput GPU libraries, e.g., cusparseSpMV() with algorithm CUSPARSE_SPMV_CSR_ALG2 on NVIDIA and optimized PyTorch/ROCm backends on AMD (Hu et al., 22 Aug 2025).
  • Custom kernels for per-coordinate updates that respect the SIMD architecture and assign warps/threads efficiently.
  • Warm-starting that exploits solution continuity across related subproblems, as in its use within Feasibility Pump heuristics for MIP (Çördük et al., 23 Oct 2025).
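
The sketch below illustrates the device-residency idea in Python, assuming a recent PyTorch build with a CUDA or ROCm backend and support for sparse CSR matrix–dense vector products; the function name and signature are illustrative and are not taken from any of the cited codebases.

```python
# Device-resident PDHG inner loop (illustrative): the CSR constraint matrix and
# all iterate vectors stay in GPU memory; each iteration issues two SpMVs plus
# elementwise kernels, with no host-device transfers inside the loop.
import torch

def pdhg_on_device(K, Kt, c, q, l, u, iters, eta, device="cuda"):
    """K and Kt are the constraint matrix and its transpose, each stored as a
    torch sparse CSR tensor built once on the host."""
    K, Kt = K.to(device), Kt.to(device)
    c, q = c.to(device), q.to(device)
    l, u = l.to(device), u.to(device)
    x = torch.zeros(K.shape[1], device=device)
    y = torch.zeros(K.shape[0], device=device)
    for _ in range(iters):
        x_new = torch.clamp(x - eta * (c - Kt @ y), min=l, max=u)        # proj_X
        y = torch.clamp(y + eta * (q - K @ (2.0 * x_new - x)), min=0.0)  # proj_Y
        x = x_new
    return x, y
```

Because the loop never synchronizes with the host except for convergence checks (omitted here), SpMV throughput and kernel-launch overhead, rather than PCIe transfers, determine the achievable iteration rate.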

Empirical tuning of the step size η and the primal weight ω on modern GPUs is critical; recent innovations include PID-controlled adaptive updates of ω and the choice of a constant η = 0.99/‖A‖₂, with the spectral norm approximated via power iteration.
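
A common way to obtain the approximate spectral norm is a few rounds of power iteration on AᵀA; the sketch below is illustrative (the iteration count and helper name are arbitrary choices, not taken from the cited solvers).

```python
# Power-iteration estimate of ||A||_2, used to set the constant step size
# eta = 0.99 / ||A||_2; a few dozen iterations typically suffice in practice.
import numpy as np
import scipy.sparse as sp

def spectral_norm_estimate(A, iters=30, seed=0):
    v = np.random.default_rng(seed).standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A.T @ (A @ v)            # apply A^T A without forming it explicitly
        v = w / np.linalg.norm(w)
    # sqrt of the Rayleigh quotient v^T A^T A v approximates sigma_max(A)
    return float(np.sqrt(v @ (A.T @ (A @ v))))

A = sp.random(2000, 5000, density=1e-3, format="csr", random_state=0)
eta = 0.99 / spectral_norm_estimate(A)
```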

3. Algorithmic Enhancements: Restarts, Acceleration, and Adaptivity

Outperforming vanilla PDHG in practice requires multiple layers of heuristic and theoretical refinements:

  • Restarted schemes: Progress is measured via the fixed-point residual r(z) = ‖z − PDHG(z)‖_P, the KKT error, or a normalized duality gap. When the reduction in r(z) stagnates or fails to meet preset decay thresholds, a restart is triggered (Lu et al., 18 Jul 2025); a schematic version of this check appears after the list.
  • Halpern-type acceleration and reflection: Halpern steps anchor iterates to the initialization, while reflection steps (e.g., replacing PDHG(z) with (1+γ)·PDHG(z) − γz) double the contraction factor, substantially accelerating convergence to high accuracy (Lu et al., 18 Jul 2025).
  • Preconditioning: Scaling via diagonal matrices (e.g., Ruiz equilibration, Chambolle–Pock scaling) is employed to improve matrix condition numbers, enhancing both accuracy and convergence speed (Lu et al., 2 Jun 2025).
  • Projected updates and infeasibility detection: Custom projections enforce all bounds and feasibility constraints, and analysis of the step differences z^{k+1} − z^k provides infeasibility certificates and a characterization of unbounded instances (Lu et al., 2 Jun 2025).
  • PID-controlled primal weight: A proportional-integral-derivative mechanism dynamically tunes ω to balance the rate of progress between the primal and dual spaces (Lu et al., 18 Jul 2025).
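
As noted above, the restart decision itself reduces to a simple residual comparison. A schematic version is sketched below; the decay thresholds are illustrative placeholders in the spirit of the sufficient/necessary-decay rules used by restarted PDLP variants, not the exact constants of any cited solver.

```python
# Schematic restart test on the fixed-point residual r(z) = ||z - PDHG(z)||_P.
# beta_sufficient / beta_necessary are illustrative constants.
def should_restart(r_current, r_at_last_restart, r_previous,
                   beta_sufficient=0.2, beta_necessary=0.8):
    # Sufficient decay: the residual has dropped well below its value at the
    # last restart, so re-anchoring now locks in the progress.
    if r_current <= beta_sufficient * r_at_last_restart:
        return True
    # Necessary decay plus stagnation: some progress since the last restart,
    # but the residual has stopped decreasing between consecutive checks.
    if r_current <= beta_necessary * r_at_last_restart and r_current > r_previous:
        return True
    return False
```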

4. Empirical Performance and Benchmarking

Comprehensive benchmarks indicate that GPU-Accelerated PDLP solvers perform comparably to, or better than, state-of-the-art commercial solvers (e.g., Gurobi, COPT) on large-scale instances. Key findings include:

| Solver | Hardware | Benchmark Set | # Solved (ε = 10⁻⁴) | Relative Performance | Comments |
|---|---|---|---|---|---|
| cuPDLP.jl | NVIDIA H100 | MIPLIB 2017 LP relaxations | ~all | On par with Gurobi | 3–20× faster than CPU PDLP for large cases |
| cuPDLP-C | NVIDIA H100 80GB | Mittelmann's LP benchmark | 379/383 | 2–4× slower than COPT | 16.5 h (COPT) vs 916 s (cuPDLP-C) on 'zib03' |
| cuPDLP+ | NVIDIA, large scale | MIPLIB 2017 LP relaxations | ~all | 2–4× faster than cuPDLP | Improvements strongest at high accuracy |
| AMD PDHG | AMD MI325X | SCED + Netlib | N/A | 36× over CPU baseline | PyTorch/ROCm implementation |

As problem dimensionality increases (e.g., more than 10⁷ nonzeros), first-order GPU-accelerated solvers consistently outperform interior-point and simplex-based routines, which are limited by factorization bottlenecks (Lu et al., 2023; Lu et al., 2 Jun 2025; Hu et al., 22 Aug 2025). In mixed-integer programming (MIP), embedding GPU-accelerated PDLP within primal heuristics enables more extensive search and faster convergence, yielding 221 feasible solutions and a 22% primal gap on the MIPLIB 2017 presolved datasets (Çördük et al., 23 Oct 2025).

5. Implementation Ecosystem and Hardware Portability

The cuPDLP series exemplifies modern engineering for large-scale mathematical programming:

  • Programming languages: Julia (cuPDLP.jl), C (cuPDLP-C), Python with PyTorch/ROCm for portable AMD/NVIDIA support (Hu et al., 22 Aug 2025).
  • Libraries: CUDA.jl and cusparse/cuBLAS for NVIDIA, with PyTorch tensor libraries offering ready ROCm support for AMD GPUs.
  • Kernel strategies: Custom CUDA kernels, single-dimension thread blocks, and memory-resident sparse data structures.
  • Presolve and integration: Advanced presolve routines are included in the C implementation (cuPDLP-C) to reduce problem size before the iterations begin (Lu et al., 2023).

Cross-platform compatibility is highlighted by successful implementations on both NVIDIA (cuPDLP, cuPDLP+, cuPDLP-C) and AMD (PDHG with PyTorch/ROCm) hardware, demonstrating scalable performance and software portability (Hu et al., 22 Aug 2025).

6. Theoretical Guarantees and Analytical Framework

The theoretical underpinnings of GPU-Accelerated PDLP are substantiated by:

  • Operator splitting perspective: PDHG is interpreted as preconditioned Douglas–Rachford splitting on the monotone KKT operator, guiding its convergence and acceleration properties (Lu et al., 2 Jun 2025).
  • Convergence rates:
    • Sublinear (ergodic): O(1/k) for the averaged-iterate primal-dual gap.
    • Sublinear (last iterate): O(1/√k) for the fixed-point residual and primal-dual gap.
    • Linear (sharpness): Under sharpness conditions (e.g., following Hoffman's lemma), the KKT residual is lower-bounded by the distance to the solution set, yielding iteration complexity O(κ log(1/ε)) (Lu et al., 2 Jun 2025); a schematic statement follows this list.
  • Restarts and acceleration: Theoretical motivation links Halpern and reflected accelerations to improved contraction factors and confirms global linear convergence in the presence of problem sharpness.
  • Infeasibility detection: Convergence of step differences to an infimal displacement certificate enables early detection of infeasible or unbounded instances (Lu et al., 2 Jun 2025).
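
Schematically, the sharpness condition and the linear rate it implies can be written as

\alpha \,\mathrm{dist}(z, \mathcal{Z}^\star) \;\le\; \mathrm{KKT}(z) \quad \Longrightarrow \quad k(\epsilon) = \mathcal{O}\!\left(\kappa \log \tfrac{1}{\epsilon}\right),

where α > 0 is the sharpness (Hoffman-type) constant, the distance is measured to the optimal primal-dual set, and κ is a condition-number-like quantity that grows as α shrinks; the precise constants depend on the specific restart scheme and are given in the cited analyses.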

7. Extensions and Domain Impact

The GPU-Accelerated PDLP paradigm generalizes to broader problem classes, including:

  • Quadratic Programming: Momentum-accelerated PDHG (PDHCG) and PDQP handle large-scale QPs with strong GPU scalability and fast empirical convergence (Lu et al., 2 Jun 2025).
  • Semidefinite and Conic Programming: Low-rank variable factorization (Burer–Monteiro) and first-order ADMM methods (e.g., cuLoRADS, ALORA) enable the solution of very large SDPs.
  • Nonlinear Programming: GPU-enabled condensed-space interior-point algorithms, as in MadNLP, minimize dependency on serial factorizations by restructuring nonlinear KKT systems into matrix-vector computations.
  • Mixed Integer Programming: Integration with GPU-accelerated PDLP as projection or relaxation oracles within primal heuristics (Feasibility Pump, Fix-and-Propagate, Efficient Local Search) significantly boosts the number of feasible integer solutions and narrows primal gaps on challenging benchmarks (Çördük et al., 23 Oct 2025).
  • Industrial Optimization: Applications in SCED (power systems engineering), supply chain, finance, and logistics directly benefit due to their large, sparse LP structure and need for real-time or near-real-time solutions (Hu et al., 22 Aug 2025).

Summary Table: Key GPU-Accelerated PDLP Developments

| Solver | Language | Hardware | Principal Innovations | Benchmark Highlights |
|---|---|---|---|---|
| cuPDLP.jl | Julia + CUDA | NVIDIA | Restarted PDHG, GPU-native SpMV, KKT-based restarts | Gurobi-comparable, 3–20× vs CPU |
| cuPDLP-C | C + CUDA | NVIDIA | Advanced presolve, C-level optimization, full KKT monitoring | Superior for very large LPs |
| cuPDLP+ | C/Julia + CUDA | NVIDIA | Reflected Halpern PDHG, PID primal-weight update, fixed-point-residual restarts | 2–4× faster than cuPDLP |
| AMD PDHG | PyTorch + ROCm | AMD | Cross-platform, "fishnet casting," adaptive step size | 36× speedup on MI325X vs CPU |
| FP/MIP heuristics | CUDA/C++ | NVIDIA | GPU-ELS, probing cache, bulk rounding, parallel propagation | 221 feasible solutions, 22% gap (MIPLIB 2017) |

GPU-Accelerated PDLP constitutes a scalable, highly parallel alternative to traditional factorization-based LP solvers, delivering strong empirical and theoretical performance, especially on modern GPU hardware, and enabling practical solution of previously intractable large-scale problems across diverse domains.
