
GPU-Accelerated Primal Heuristics

Updated 18 November 2025
  • GPU-accelerated primal heuristics are algorithmic frameworks that exploit massively parallel GPU architectures to generate feasible solutions for complex optimization problems.
  • They map key operations like local search, rounding, and propagation onto GPU kernels, achieving significant speedups (up to 100×) over CPU-based methods.
  • Empirical outcomes demonstrate enhanced solution quality and runtime efficiency, with primal gaps reduced by up to 50% and performance comparable to commercial solvers.

GPU-accelerated primal heuristics are algorithmic frameworks designed to leverage the data-parallel and high-throughput computing capabilities of graphics processing units in order to rapidly generate feasible solutions for large-scale combinatorial and mixed-integer optimization problems. These methods retain internal feasibility of candidate solutions throughout the search, relying on the primal structure of the underlying problem, and map the key heuristic steps—local move search, rounding, bound-propagation, or augmentation—to massively parallel GPU kernels. Across integer linear programming, quadratic assignment, nonlinear integer optimization, and mixed-integer programming, GPU acceleration enables orders-of-magnitude speed-up versus CPU-only approaches, particularly when the heuristics are structured to exploit coalesced memory access, fine-grained parallel reductions, and inter-thread synchronization primitives.

1. Mathematical Formulation of Primal Heuristics

In the context of integer and mixed-integer programming, a primal heuristic constructs solutions $x \in S$ that satisfy all constraints in the feasible set, typically of the form $S = \{x \in \mathbb{Z}^n : Ax = b,\ l \le x \le u\}$, or, more generally, block-structured problems of the form

$$\min_{x \in \{0,1\}^n} c^\top x \quad \text{s.t.} \quad x_{I_j} \in X_j \ \forall j$$

Primal heuristics—including simulated annealing, local search (2-opt, Tabu), fix-and-propagate, feasibility pump, Lagrange decomposition, and Graver-best augmentation—maintain feasibility at every step. For instance, simulated annealing for the quadratic assignment problem (QAP) keeps a valid permutation $p$ by only swapping facility assignments, while fix-and-propagate in MIP rounds a continuous relaxation and repairs until all integrality and bound constraints are satisfied (Paul, 2012, Wei et al., 31 Oct 2025, Kempke et al., 12 Mar 2025).
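
For concreteness, here is a standard QAP formulation (a textbook statement, not taken from any one cited paper) showing why the swap move preserves feasibility: permutations are closed under transpositions, so

$$\min_{p \in S_N} \sum_{i=1}^{N} \sum_{j=1}^{N} f_{ij}\, d_{p(i)p(j)}, \qquad p' = p \circ (r\ s): \quad p'(r) = p(s),\ p'(s) = p(r),\ p'(k) = p(k) \text{ for } k \neq r,s.$$

Since $p'$ is again a bijection on $\{1,\dots,N\}$, every candidate is feasible by construction, and only the $O(N)$ cost terms involving positions $r$ and $s$ change; these are precisely the quantities the Δ-matrix of Section 2 caches.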

2. GPU Parallelization Strategies for Classical Heuristics

Effective GPU acceleration of primal heuristics requires careful mapping of algorithmic structure onto hardware primitives and memory layouts:

  • Δ-matrix Local Search: In QAP simulated annealing, the cost differentials for swaps (Δ-matrix) are stored as an $N \times N$ array; each GPU thread scans or updates multiple entries in parallel. Shared memory buffers "hot" rows/columns to reduce global memory access latency, and custom inter-block synchronization ensures safe staging of updates (Paul, 2012); see the CUDA sketch after the table below.
  • Thread-Block Structure: Tabu search and 2-opt heuristics assign each GPU thread to a unique candidate or permutation. Dynamic parallelism is exploited—parent threads launch child kernels for evaluating all swap moves in parallel. Warp-shuffle reduction, memory coalescing, and shared-memory tiling minimize occupancy bottlenecks and branch divergence (Novoa et al., 2023).
  • Bulk-Synchronous Dual/Primal Updates: In FastDOG's Lagrange decomposition, thousands of subproblems (encoded as binary decision diagrams (BDDs)) are processed in bulk-synchronous style, with atomic-min operations and deferred averaging for dual multipliers. Forward and backward shortest-path sweeps over BDDs are implemented as contiguous thread grids for coalesced memory access (Abbas et al., 2021).
  • Fix-and-Propagate: PDHG-based LP relaxation is fused with domain propagation and bulk rounding in one GPU pipeline, using compressed sparse row (CSR) matrix storage, full warp-level SpMV, and double propagation in fused kernels. Probing caches precompute bound effects across splits, enabling rapid infeasibility detection in the rounding phase (Çördük et al., 23 Oct 2025, Kempke et al., 12 Mar 2025).
| Heuristic Class | Parallelization Method | Hardware Feature Used |
|---|---|---|
| Simulated Annealing, 2-opt | Δ-matrix batched updates, swaps | Warps, shared memory, sync |
| Tabu/Local Search | Child-thread move evaluation | Dynamic parallelism |
| Fix-and-Propagate / Feasibility Pump | Bulk rounding + propagation | CSR SpMV, probing cache |
| Lagrange Decomposition | Bulk dual ascent on BDDs | atomicMin, coalesced access |
| Graver Augmentation | Multi-start direction extraction | Batched GEMV, Adam on GPU |
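
The following CUDA sketch illustrates the Δ-matrix scan pattern from the first bullet above. It is a minimal illustration written for this article (kernel and variable names are ours, not from Paul, 2012): each thread scans a strided slice of the flattened $N \times N$ swap-cost matrix, and a shared-memory tree reduction selects the most improving swap.

```cuda
#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;

__global__ void best_swap(const float* delta, int n2,
                          float* block_val, int* block_idx) {
    __shared__ float sval[BLOCK];
    __shared__ int   sidx[BLOCK];
    float v = FLT_MAX;
    int   idx = -1;
    // Grid-stride loop: consecutive threads read consecutive entries
    // (coalesced access over the flattened Delta matrix).
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n2;
         i += gridDim.x * blockDim.x) {
        float d = delta[i];
        if (d < v) { v = d; idx = i; }
    }
    sval[threadIdx.x] = v;
    sidx[threadIdx.x] = idx;
    __syncthreads();
    // Shared-memory tree reduction to the block-wide minimum.
    for (int s = BLOCK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s && sval[threadIdx.x + s] < sval[threadIdx.x]) {
            sval[threadIdx.x] = sval[threadIdx.x + s];
            sidx[threadIdx.x] = sidx[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        block_val[blockIdx.x] = sval[0];
        block_idx[blockIdx.x] = sidx[0];
    }
}

int main() {
    const int N = 64, n2 = N * N;
    float* h = new float[n2];
    for (int i = 0; i < n2; ++i) h[i] = (float)((i * 37) % 1001) - 500.0f;
    float *d_delta, *d_val; int* d_idx;
    cudaMalloc(&d_delta, n2 * sizeof(float));
    cudaMalloc(&d_val, sizeof(float));
    cudaMalloc(&d_idx, sizeof(int));
    cudaMemcpy(d_delta, h, n2 * sizeof(float), cudaMemcpyHostToDevice);
    best_swap<<<1, BLOCK>>>(d_delta, n2, d_val, d_idx);
    float v; int idx;
    cudaMemcpy(&v, d_val, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&idx, d_idx, sizeof(int), cudaMemcpyDeviceToHost);
    printf("best swap (r=%d, s=%d) with Delta = %f\n", idx / N, idx % N, v);
    cudaFree(d_delta); cudaFree(d_val); cudaFree(d_idx);
    delete[] h;
    return 0;
}
```

A multi-block launch would add a second host- or device-side pass over the per-block results; the single-block launch here keeps the example short.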

3. First-Order Methods and Randomized Sampling in GPU-Primal Heuristics

First-order methods, notably the Primal-Dual Hybrid Gradient (PDHG) algorithm, are central to recent GPU-accelerated primal heuristics. PDHG iterations consist of alternating projected gradient steps in primal and dual spaces, relying only on sparse matrix-vector products and simple box projections, which are highly amenable to GPU execution:

$$y^{k+1} = \min\{0,\ y^k + \sigma(A \bar{x}^k - b)\}$$

$$x^{k+1} = \mathrm{clip}_{[0,1]}\big(x^k - \tau(c + A^\top y^{k+1})\big)$$

GPU implementation fuses SpMV and projection operations in a single pass over data arrays, storing $A$ and $A^\top$ in CSR/CSC format and aligning vector buffers for coalesced accesses. Streaming interleaving of PDHG and sampling minimizes CPU-GPU synchronization (Wei et al., 31 Oct 2025, Çördük et al., 23 Oct 2025, Kempke et al., 12 Mar 2025).
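
A minimal sketch of the fused dual update above, under the simplifying assumption of one thread per constraint row (the cited solvers use warp-level SpMV; all names here are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Fused PDHG dual update y <- min(0, y + sigma*(A*xbar - b)): each thread
// owns one row, computes its CSR dot product, and applies the projection in
// the same pass, so the residual A*xbar - b never round-trips through memory.
__global__ void dual_update(const int* row_ptr, const int* cols,
                            const float* vals, const float* xbar,
                            const float* b, float* y, float sigma, int m) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= m) return;
    float acc = 0.0f;
    for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
        acc += vals[k] * xbar[cols[k]];               // sparse row dot product
    y[r] = fminf(0.0f, y[r] + sigma * (acc - b[r]));  // fused projection
}

int main() {
    // Tiny 3x3 system: A = diag(1,2,3) in CSR, b = (2,1,0), xbar = (1,1,1).
    const int m = 3;
    int h_ptr[] = {0, 1, 2, 3}, h_cols[] = {0, 1, 2};
    float h_vals[] = {1.f, 2.f, 3.f}, h_x[] = {1.f, 1.f, 1.f};
    float h_b[] = {2.f, 1.f, 0.f}, h_y[] = {0.f, 0.f, 0.f};
    int *d_ptr, *d_cols; float *d_vals, *d_x, *d_b, *d_y;
    cudaMalloc(&d_ptr, sizeof(h_ptr));   cudaMalloc(&d_cols, sizeof(h_cols));
    cudaMalloc(&d_vals, sizeof(h_vals)); cudaMalloc(&d_x, sizeof(h_x));
    cudaMalloc(&d_b, sizeof(h_b));       cudaMalloc(&d_y, sizeof(h_y));
    cudaMemcpy(d_ptr, h_ptr, sizeof(h_ptr), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cols, h_cols, sizeof(h_cols), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals, h_vals, sizeof(h_vals), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(h_b), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, sizeof(h_y), cudaMemcpyHostToDevice);
    dual_update<<<1, 32>>>(d_ptr, d_cols, d_vals, d_x, d_b, d_y, 1.0f, m);
    cudaMemcpy(h_y, d_y, sizeof(h_y), cudaMemcpyDeviceToHost);
    printf("y = (%.1f, %.1f, %.1f)\n", h_y[0], h_y[1], h_y[2]);  // (-1.0, 0.0, 0.0)
    return 0;
}
```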

Randomized sampling modules, as in GFORS, batch-convert fractional PDHG iterates into feasible binary solutions by drawing Bernoulli trials, applying feasibility-aware repairs (monotone relaxations, total-unimodular subproblem rounding), and updating the incumbent. All sampling and repairs run device-side, exploiting asynchronous streams for step overlap (Wei et al., 31 Oct 2025).
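
As an illustration of the device-side sampling step, the sketch below (our own, hypothetical rendering of the idea, not GFORS code) draws one Bernoulli trial per variable using counter-based Philox streams, so every thread gets an independent substream without host round-trips:

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Thread j draws u ~ U(0,1) and sets x_j = 1 with probability equal to the
// fractional PDHG iterate xfrac_j. Feasibility-aware repair kernels would
// run afterwards in the same stream.
__global__ void bernoulli_round(const float* xfrac, int* xbin, int n,
                                unsigned long long seed) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    curandStatePhilox4_32_10_t st;
    curand_init(seed, j, 0, &st);   // independent substream per variable
    xbin[j] = (curand_uniform(&st) < xfrac[j]) ? 1 : 0;
}

int main() {
    const int n = 8;
    float h_x[n] = {0.9f, 0.1f, 0.5f, 0.7f, 0.3f, 1.0f, 0.0f, 0.6f};
    int h_b[n];
    float* d_x; int* d_b;
    cudaMalloc(&d_x, sizeof(h_x));
    cudaMalloc(&d_b, sizeof(h_b));
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);
    bernoulli_round<<<1, 32>>>(d_x, d_b, n, 1234ULL);
    cudaMemcpy(h_b, d_b, sizeof(h_b), cudaMemcpyDeviceToHost);
    for (int j = 0; j < n; ++j) printf("%d ", h_b[j]);
    printf("\n");
    return 0;
}
```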

4. Generalization Across Domains and Heuristic Types

The parallelization principles observed—incremental cost matrices, batch updates, coalesced access, work partitioning—are applicable beyond QAP and ILP, including:

  • Vehicle Routing, Facility Location, Graph Partitioning: Move costs can be incrementally maintained for local search or greedy insertion heuristics.
  • Nonlinear Integer Optimization: MAPLE's Graver basis extraction leverages batch Adam optimization on continuous surrogate objectives for lattice directions, with multi-start augmentation in parallel (Liu et al., 18 Dec 2024).
  • Mixed Integer Programming: Bulk fix-and-propagate, feasibility pump, and efficient local search are integrated via unified GPU pipelines, enabling broader applicability and scalability (Çördük et al., 23 Oct 2025, Kempke et al., 12 Mar 2025).

These frameworks yield significant speed-ups (10–100× is typical), scale to instances with up to 8 million variables and 243 million nonzeros, and attain solution quality comparable to traditional exact solvers under constrained runtimes.

5. Performance Metrics and Empirical Outcomes

GPU-accelerated primal heuristics consistently improve both runtime and solution quality on standard benchmarks:

  • Speedup: Simulated annealing for QAP achieves 50–100× speedup for large iteration counts; GPU-2opt and Tabu search for QAPLIB instances reach up to 63× speedup over SIMD CPU baselines (Paul, 2012, Novoa et al., 2023).
  • Solution Quality: Heuristics such as Tabu and FP deliver solutions within 0.15%–1.03% of best known values in QAP; GFORS shows gaps <1% after 10 s for dense facility location and max-cut, matching or bettering Gurobi under identical time limits (Wei et al., 31 Oct 2025).
  • Feasible Roots and Gaps: GPU fix-and-propagate and FP yield 220–221 feasible solutions on MIPLIB2017, with primal gaps of 0.22–0.23%, outperforming CPU methods by 13% in solution counts and reducing gaps by almost 50% (Çördük et al., 23 Oct 2025).
| Benchmark / Method | Feasible Solutions | Primal Gap (%) | Speedup / Improvement |
|---|---|---|---|
| GPU Fix-and-Propagate (MIPLIB2017) | ~221 | 0.22–0.23 | +13% solutions vs CPU |
| GPU-SA, QAP (10⁷ iterations) | n/a | n/a | 50–102× |
| GPU-2opt, QAPLIB | n/a | 1.03 | 33× |
| GPU-Tabu, QAPLIB | n/a | 0.15 | 19× |
| GFORS (10 s, facility location) | n/a | <1 | n/a |
| MAPLE (QPLIB) | n/a | near-optimal | 28/30 wins vs CPLEX |

A plausible implication is that contemporary GPU-accelerated primal heuristics are now approaching, and in selected classes surpassing, the practical effectiveness of large-scale CPU and commercial solver portfolios for many real-world combinatorial and integer optimization workloads.

6. Flexibility, Limitations, and Directions for Future Research

Primal heuristics on GPUs show distinct strengths:

  • Solver-independence: MAPLE and GFORS require no branch-and-cut or generic MIP solvers; direction sets extracted for fixed constraint matrices can be reused across objective variations (Liu et al., 18 Dec 2024, Wei et al., 31 Oct 2025).
  • Scalability: Capacity to address problem sizes unmanageable for interior-point (barrier) solvers, demonstrated on energy-system unit-commitment (UC) models and large QAPs (Kempke et al., 12 Mar 2025).
  • Modularity: CUDA Graph-based scheduling, bulk rounding/propagation, and asynchronous stream overlap enable modular workflow integration, with minimal host-to-device communication.

Limitations are present:

  • Optimality: Most approaches yield high-quality incumbents without global optimality certificates.
  • Propagation bottleneck: For deepest fix-and-propagate dives, CPU-bound propagation steps and host-device synchronization may stall device-level pipelines; future work will investigate fully GPU-based bound propagation for large, deeply branched searches (Kempke et al., 12 Mar 2025).

Possible extensions include adaptive tuning of surrogate penalty parameters for Graver direction extraction, hybridization with CPU branch-and-bound methods, and exploiting problem-dependent sparsity or separability for matrix-vector computational reduction (Liu et al., 18 Dec 2024).

7. General Principles and Best Practices

Several principles recur across successful GPU-accelerated primal heuristic implementations:

  • Maintain incremental cost data structures to avoid exhaustive recomputation.
  • Partition workloads to match thread/block hardware granularity for amortized kernel overhead.
  • Design memory layouts for maximal coalesced access, buffering "hot" data in shared memory when feasible.
  • Fuse kernel steps (e.g., projection plus matrix-vector multiply, bulk rounding plus propagation) to minimize synchronization and launch cost.
  • Prefer bulk-synchronous update patterns and deferred reductions to eliminate fine-grained inter-thread communication bottlenecks (see the warp-shuffle sketch below).
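
As a concrete instance of the last two points, a warp-shuffle minimum reduction exchanges values register-to-register, with no shared memory, atomics, or fine-grained locking (a generic sketch, not tied to any cited implementation):

```cuda
#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

// 32 lanes cooperatively reduce to a minimum via register-to-register
// shuffles; no shared memory or atomics are needed within the warp.
__device__ float warp_min(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v = fminf(v, __shfl_down_sync(0xffffffffu, v, offset));
    return v;   // lane 0 ends up holding the warp-wide minimum
}

__global__ void reduce_min(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : FLT_MAX;   // pad inactive lanes with +inf
    v = warp_min(v);
    if ((threadIdx.x & 31) == 0) out[i / 32] = v;   // one result per warp
}

int main() {
    const int n = 32;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = (float)((i * 7) % 13) - 5.0f;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);
    reduce_min<<<1, 32>>>(d_in, d_out, n);
    float v;
    cudaMemcpy(&v, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("min = %f\n", v);   // -5.0 for this fill
    return 0;
}
```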

These methodological choices underpin the observed scalability and efficiency of GPU-accelerated primal heuristics in large-scale optimization, and can be ported across domains and heuristic paradigms (Paul, 2012, Novoa et al., 2023, Çördük et al., 23 Oct 2025, Wei et al., 31 Oct 2025, Liu et al., 18 Dec 2024, Abbas et al., 2021, Kempke et al., 12 Mar 2025).
