COMPASS: RL with Latent Space Search
- COMPASS is a reinforcement learning framework for combinatorial optimization that leverages a continuous latent space to generate diverse, instance-adapted policies.
- It integrates a Transformer-based encoder-decoder with CMA-ES latent search to efficiently handle NP-hard tasks such as TSP, CVRP, and JSSP.
- Empirical results demonstrate that COMPASS outperforms traditional RL methods, offering superior generalization and robustness under distribution shifts.
COMPASS (COMbinatorial Optimization with Policy Adaptation using Latent Space Search) is a reinforcement learning (RL) framework for combinatorial optimization that parameterizes a distribution over specialized policies conditioned on a continuous latent space. The approach is designed to address the limitations of standard RL heuristics for solving NP-hard combinatorial optimization tasks, particularly their restricted search diversity at inference and limited generalization under distribution shift. COMPASS introduces a pre-training procedure that anticipates instance-level search, enabling efficient adaptation to new or out-of-distribution problem instances at test time.
1. Problem Setting and Motivations
COMPASS targets classical NP-hard combinatorial optimization (CO) problems, including:
- Travelling Salesman Problem (TSP): Find a minimum-length tour visiting each of $n$ cities exactly once.
- Capacitated Vehicle Routing Problem (CVRP): Route a fleet of vehicles with limited capacity to serve all demands at minimum cost.
- Job-Shop Scheduling Problem (JSSP): Sequence operations on machines to minimize overall makespan.
The cardinality of the feasible set increases exponentially with instance size $n$, precluding exact enumeration for large-scale tasks. While RL has been investigated as a framework for learning construction heuristics, prior work predominantly focuses on training a single policy, limiting the diversity and adaptability of solutions at inference. Post-training search strategies either rely on stochastically sampling many trajectories from the same policy (yielding minor variations) or directly fine-tuning the policy on novel instances, which incurs significant computational overhead. COMPASS addresses these deficiencies by pre-training a family of policies conditioned on a continuous latent space and employing principled search (CMA-ES) to adapt solutions at inference.
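As a concrete sense of scale (with $\mathcal{X}_{\mathrm{TSP}}(n)$ denoting the set of distinct tours, notation introduced here for illustration), the symmetric TSP alone has a factorially growing feasible set:

$$\left|\mathcal{X}_{\mathrm{TSP}}(n)\right| = \frac{(n-1)!}{2}, \qquad \text{e.g.}\quad \left|\mathcal{X}_{\mathrm{TSP}}(100)\right| \approx 4.7 \times 10^{155}.$$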
2. Reinforcement Learning Formulation and Latent Space Conditioning
In COMPASS, each combinatorial optimization problem is modeled as an episodic Markov Decision Process in which the agent constructs a solution step by step. For an instance $\rho$ and trajectory $\tau$ with return $R(\tau)$, the policy-gradient objective is

$$J(\theta) = \mathbb{E}_{\rho \sim \mathcal{D}}\, \mathbb{E}_{\tau \sim \pi_\theta(\cdot \mid \rho)} \big[ R(\tau) \big],$$

with policy $\pi_\theta$. COMPASS introduces a continuous latent variable $z$, sampled uniformly over a bounded latent space $\mathcal{Z} \subset \mathbb{R}^d$, so the conditional policy becomes $\pi_\theta(\cdot \mid \rho, z)$. During training, for each instance, $N$ latent vectors $z_1, \dots, z_N$ are sampled, and each is rolled out to produce a trajectory $\tau_i$. The "best" latent, $z^* = \arg\max_{z_i} R(\tau_i)$, is the one yielding the highest trajectory reward. Only $z^*$ is used to update parameters via REINFORCE:

$$\nabla_\theta J(\theta) \approx \big( R(\tau^*) - b(\rho) \big)\, \nabla_\theta \log \pi_\theta(\tau^* \mid \rho, z^*),$$

where $b(\rho)$ is a learned baseline. This best-of-$N$ strategy encourages regions of the latent space to specialize towards different kinds of instances, promoting implicit policy diversity.
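A minimal sketch of this best-of-$N$ update in PyTorch is shown below; it assumes a hypothetical `policy.rollout(instance, z)` helper returning the trajectory reward together with the summed log-probability of its actions, and it uses a mean-reward baseline as a simplification of the learned baseline mentioned above.

```python
import torch

def compass_training_step(policy, optimizer, instance, n_latents=16, latent_dim=16):
    # Sample N latent vectors uniformly from the hypercube [-1, 1]^d.
    zs = torch.rand(n_latents, latent_dim) * 2 - 1

    # One rollout per latent; `policy.rollout` is a placeholder returning the
    # trajectory reward and the summed log-probability of its actions.
    rewards, log_probs = zip(*(policy.rollout(instance, z) for z in zs))
    rewards, log_probs = torch.stack(rewards), torch.stack(log_probs)

    # Best-of-N: only the highest-reward latent/trajectory contributes to the update.
    best = torch.argmax(rewards)
    baseline = rewards.mean().detach()  # simple stand-in for the learned baseline

    # REINFORCE on the winning trajectory only.
    loss = -(rewards[best].detach() - baseline) * log_probs[best]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards[best].item()
```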
3. Policy Architecture and Parameterization of Diversity
The policy is realized as a Transformer-based encoder-decoder (adopting the architecture of POMO):
- Encoder: Maps problem-specific features (e.g., coordinates, graph structure) to node embeddings.
- Decoder: Generates solutions by attending to embeddings and the current partial solution.
Latent conditioning is integrated by concatenating the sampled latent $z$ (or an affine transformation of it) to the key, query, and/or value vectors across decoder attention layers. Each distinct latent vector therefore defines a different policy within an infinite family parameterized by $z \in \mathcal{Z}$.
At inference, the latent space $\mathcal{Z}$ is explored, where each $z$ yields a different specialized heuristic. This formulation enables the representation of a vast repertoire of implicitly learned heuristics, facilitating adaptation beyond the representational reach of a single global policy.
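To illustrate the conditioning mechanism, the sketch below concatenates $z$ to the decoder's query and key inputs before projection. It is a simplified stand-in rather than the authors' implementation; the class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class LatentConditionedAttention(nn.Module):
    """Decoder attention where the latent z is concatenated to query/key inputs
    before projection (a simplified stand-in for the POMO-style decoder)."""

    def __init__(self, embed_dim=128, latent_dim=16, num_heads=8):
        super().__init__()
        self.q_proj = nn.Linear(embed_dim + latent_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim + latent_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, query, node_embeddings, z):
        # Broadcast z to every position, then concatenate along the feature axis.
        z_q = z.unsqueeze(1).expand(-1, query.size(1), -1)
        z_k = z.unsqueeze(1).expand(-1, node_embeddings.size(1), -1)
        q = self.q_proj(torch.cat([query, z_q], dim=-1))
        k = self.k_proj(torch.cat([node_embeddings, z_k], dim=-1))
        v = self.v_proj(node_embeddings)
        out, _ = self.attn(q, k, v)
        return out
```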
4. Inference-Time Latent Space Search: Covariance Matrix Adaptation
For a new problem instance $\rho$, the goal is to efficiently identify the latent $z$ producing the highest-quality solution under $\pi_\theta(\cdot \mid \rho, z)$. COMPASS applies parallel Covariance Matrix Adaptation Evolution Strategy (CMA-ES) components in the latent space:
- Initialization: Multiple CMA-ES components are initialized at the Voronoi centroids of $\mathcal{Z}$.
- Iterative Search:
  - For each component $j$, sample $M$ candidates $z_{j,1}, \dots, z_{j,M} \sim \mathcal{N}(\mu_j, \Sigma_j)$.
  - Evaluate rollouts $\pi_\theta(\cdot \mid \rho, z_{j,m})$ and receive rewards $R_{j,m}$.
  - Update $(\mu_j, \Sigma_j)$ using the standard CMA-ES rule based on $\{R_{j,m}\}_{m=1}^{M}$.
- Selection: After a fixed rollout budget $B$ is exhausted, return the best solution found.
The computational bottleneck is dominated by forward model rollouts; CMA-ES logic is negligible in wall time. This search protocol enables rapid adaptation per instance, substantially improving solution quality and generalization.
Inference-Time CMA-ES Pseudocode
```
Initialize components {(μ_j, Σ_j)}_{j=1..C} at the Voronoi centroids of Z
best_solution ← ∅
for step = 1 … B/(C·M) do
    for each component j = 1 … C do
        sample {z_{j,m}}_{m=1..M} ∼ N(μ_j, Σ_j)
        for m = 1 … M: evaluate R_{j,m} = rollout(π_θ(· | ρ, z_{j,m}))
        update (μ_j, Σ_j) with the CMA-ES rule using {R_{j,m}}_{m=1..M}
        update best_solution if any R_{j,m} improves on it
return best_solution
```
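A runnable approximation of this loop can be put together with the off-the-shelf pycma package. In the sketch below, `rollout_fn` is a placeholder for evaluating the pretrained conditional policy on one instance, and components are initialized at random points rather than Voronoi centroids.

```python
import numpy as np
import cma  # pip install cma

def latent_search(rollout_fn, latent_dim=16, n_components=3, popsize=8, budget=1600):
    """Search the latent space with several independent CMA-ES components.
    `rollout_fn(z) -> reward` wraps the pretrained conditional policy on one instance."""
    strategies = [
        cma.CMAEvolutionStrategy(
            np.random.uniform(-1.0, 1.0, latent_dim),  # random init (paper uses Voronoi centroids)
            0.3,
            {"popsize": popsize, "verbose": -9},
        )
        for _ in range(n_components)
    ]
    best_reward, best_z = -np.inf, None
    for _ in range(budget // (n_components * popsize)):
        for es in strategies:
            candidates = es.ask()                        # z_{j,m} ~ N(mu_j, Sigma_j)
            rewards = [rollout_fn(np.clip(z, -1.0, 1.0)) for z in candidates]
            es.tell(candidates, [-r for r in rewards])   # CMA-ES minimises, so negate rewards
            i = int(np.argmax(rewards))
            if rewards[i] > best_reward:
                best_reward, best_z = rewards[i], candidates[i]
    return best_z, best_reward
```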
5. Empirical Evaluation and Benchmarks
COMPASS was evaluated against standard RL baselines (POMO, Poppy, and EAS, an active-search fine-tuning method) and industrial solvers (Concorde/LKH3 for routing, OR-Tools for scheduling) on eleven canonical benchmarks, each using an equal rollout budget of 1,600 samples per instance. Table 1 summarizes primary results for representative instance sizes (average solution cost, with the optimality gap relative to the best industrial solver in parentheses):
| Task | Best Industrial | POMO | Poppy | EAS | COMPASS |
|---|---|---|---|---|---|
| TSP | 7.765 (0.000%) | 7.779 (0.185%) | 7.766 (0.013%) | 7.778 (0.161%) | 7.765 (0.002%) |
| CVRP | 15.650 (0.000%) | 15.713 (0.399%) | 15.663 (0.084%) | 15.663 (0.081%) | 15.594 (−0.361%) |
| JSSP | 807.6 (0.00%) | 862.1 (6.7%) | 849.7 (5.2%) | 858.4 (6.3%) | 845.5 (4.7%) |
For larger instances (TSP 125/150/200, CVRP 125/150/200, JSSP 15×15/20×15), COMPASS consistently achieved the best performance among RL-based methods and often closed much of the gap to industrial solvers.
6. Distribution Shift and Robustness Analysis
Robustness is evaluated on 180 out-of-distribution (OOD) test sets generated by applying nine geometric or structural mutation operators (explosion, implosion, cluster, rotation, linear/axis projection, expansion, compression, grid), each at ten mutation strengths. COMPASS outperformed all compared methods on all OOD distributions.
For example, under the strongest TSP mutation (power = 0.9), average tour lengths and optimality gaps were:
| Solver | Tour Length | Gap (%) |
|---|---|---|
| LKH3 | 6.732 | 0.000 |
| POMO | 6.770 | 0.557 |
| Poppy 16 | 6.744 | 0.176 |
| EAS | 6.753 | 0.308 |
| COMPASS | 6.738 | 0.091 |
COMPASS’s degradation in gap under increasing mutation power was substantially slower than other RL-based approaches, indicating superior generalization under strong distribution shift.
7. Key Insights, Limitations, and Prospective Extensions
- Conditional Policy Family: Training with a continuous latent space produces an implicit infinite family of specialized policies, each accessible by selecting an appropriate latent vector $z$.
- Instance Specialization: The best-of-$N$ objective induces structure in the latent space, with neighborhoods corresponding to problem-instance sub-types.
- Efficient Adaptation: CMA-ES in latent space enables principled, computationally efficient search for high-performing policies per instance, obviating the need for online backpropagation.
Limitations:
- Diversity is only implicit: No explicit diversity regularization is incorporated during training, potentially limiting latent coverage.
- Training inefficiency: Uniform latent sampling can, in principle, miss optimal conditionings for instances.
- Noisy search landscape: The latent space can be noisy; Bayesian optimization methods were less effective than CMA-ES within rollout budgets.
Future research directions include:
- Incorporating an unsupervised diversity bonus (e.g., VAE-style entropy regularization) to further encourage latent space coverage.
- Learning conditional priors or amortized inference networks to propose promising latent candidates per instance.
- Applying latent-space regularization such that smooth changes in $z$ yield gradual policy variations.
- Combining COMPASS’s search with fine-tuning (as in EAS) for enhanced performance, particularly on large-scale instances.
COMPASS represents a methodological advance for RL-based combinatorial optimization by synthesizing a parametric diversity of heuristics during training and leveraging efficient black-box search during inference to realize superior performance both in-distribution and under substantial distribution shifts.