COMPASS: RL with Latent Space Search

Updated 16 November 2025
  • COMPASS is a reinforcement learning framework for combinatorial optimization that leverages a continuous latent space to generate diverse, instance-adapted policies.
  • It integrates a Transformer-based encoder-decoder with CMA-ES latent search to efficiently handle NP-hard tasks such as TSP, CVRP, and JSSP.
  • Empirical results demonstrate that COMPASS outperforms traditional RL methods, offering superior generalization and robustness under distribution shifts.

COMPASS (COMbinatorial Optimization with Policy Adaptation using Latent Space Search) is a reinforcement learning (RL) framework for combinatorial optimization that parameterizes a distribution over specialized policies conditioned on a continuous latent space. The approach is designed to address the limitations of standard RL heuristics for solving NP-hard combinatorial optimization tasks, particularly their restricted search diversity at inference and limited generalization under distribution shift. COMPASS introduces a pre-training procedure that anticipates instance-level search, enabling efficient adaptation to new or out-of-distribution problem instances at test time.

1. Problem Setting and Motivations

COMPASS targets classical NP-hard combinatorial optimization (CO) problems, including:

  • Travelling Salesman Problem (TSP): Find a minimum-length tour visiting each of $n$ cities exactly once.
  • Capacitated Vehicle Routing Problem (CVRP): Route a fleet of vehicles with limited capacity to serve all demands at minimum cost.
  • Job-Shop Scheduling Problem (JSSP): Sequence operations on machines to minimize overall makespan.

The cardinality of the feasible set increases exponentially with instance size $n$, precluding exact enumeration for large-scale tasks. While RL has been investigated as a framework for learning construction heuristics, prior work predominantly trains a single policy, limiting the diversity and adaptability of solutions at inference. Post-training search strategies either stochastically sample many trajectories from the same policy (yielding only minor variations) or fine-tune the policy directly on novel instances, which incurs significant computational overhead. COMPASS addresses these deficiencies by pre-training a family of policies conditioned on a continuous latent space and employing principled search (CMA-ES) to adapt solutions at inference.

2. Reinforcement Learning Formulation and Latent Space Conditioning

In COMPASS, each combinatorial optimization problem is modeled as an episodic Markov Decision Process $(S, A, T, R, H)$, where the agent constructs a solution step-by-step. The policy gradient objective is given by

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right], \qquad R(\tau) = \sum_{t=0}^{H} \gamma^t \, r(s_t, a_t)$$

with policy $\pi_\theta(a \mid s)$. COMPASS introduces a continuous latent variable $z \in \mathbb{R}^{16}$, sampled uniformly over $[-1,1]^{16}$, so the conditional policy becomes $\pi_\theta(a \mid s, z)$. During training, for each instance, $N$ latent vectors $\{z_i\}_{i=1}^N$ are sampled, and each $\pi_\theta(\cdot \mid z_i)$ is rolled out to produce a trajectory. The “best” latent, $z_{i^*}$, is the one yielding the highest trajectory reward. Only $z_{i^*}$ is used to update parameters via REINFORCE:

$$\nabla_\theta J_{\mathrm{compass}} = \mathbb{E}_{\rho \sim \mathcal{D}} \, \mathbb{E}_{z_1, \ldots, z_N \sim p(z)} \, \mathbb{E}_{\tau_i \sim \pi_\theta(\cdot \mid z_i)}\left[\nabla_\theta \log \pi_\theta(\tau_{i^*} \mid z_{i^*}) \, (R_{i^*} - \mathcal{B})\right]$$

where $\mathcal{B}$ is a learned baseline. This best-of-$N$ strategy ensures regions of latent space specialize towards different kinds of instances, promoting implicit policy diversity.
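
To make the best-of-$N$ objective concrete, the sketch below implements one training step in PyTorch. The `rollout_fn`, `baseline_fn`, and optimizer arguments are hypothetical placeholders rather than the published code; the point is the control flow: sample $N$ latents, roll each out, and backpropagate only through the best trajectory.

```python
import torch

LATENT_DIM, N_LATENTS = 16, 8

def compass_training_step(optimizer, rollout_fn, baseline_fn, instance):
    """One best-of-N REINFORCE step (illustrative sketch).

    rollout_fn(instance, z) -> (log_prob, reward): hypothetical helper that
    rolls out the latent-conditioned policy pi_theta(. | z) on the instance.
    baseline_fn(instance) -> scalar tensor: the learned baseline B.
    """
    # Sample N latent vectors uniformly over [-1, 1]^16.
    latents = torch.rand(N_LATENTS, LATENT_DIM) * 2.0 - 1.0

    # One trajectory per latent; keep rewards and summed log-probabilities.
    log_probs, rewards = [], []
    for z in latents:
        log_prob, reward = rollout_fn(instance, z)
        log_probs.append(log_prob)
        rewards.append(reward)
    rewards = torch.stack(rewards)

    # Best-of-N: only the highest-reward latent contributes to the gradient.
    i_star = int(torch.argmax(rewards))
    advantage = (rewards[i_star] - baseline_fn(instance)).detach()

    # REINFORCE on the winning trajectory only.
    loss = -log_probs[i_star] * advantage
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(rewards[i_star])
```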

3. Policy Architecture and Parameterization of Diversity

The policy is realized as a Transformer-based encoder-decoder (adopting the architecture of POMO):

  • Encoder: Maps problem-specific features (e.g., coordinates, graph structure) to node embeddings.
  • Decoder: Generates solutions by attending to embeddings and the current partial solution.

Latent conditioning is integrated by concatenating the sampled $z$ (or an affine transformation of it) to each key, query, and/or value vector across decoder attention layers. Each distinct latent vector therefore defines a different policy within an infinite family parameterized by $(\theta, z)$.

At inference, the space $[-1, 1]^{16}$ is explored, where each $z$ yields a different specialized heuristic. This formulation enables the representation of a vast repertoire of implicitly learned heuristics, facilitating adaptation beyond the representational reach of a single global policy.
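
A minimal sketch of one way to realize this conditioning is given below, using a single-head attention layer in PyTorch; the class name, dimensions, and projection layout are illustrative assumptions, not the exact published architecture.

```python
import torch
import torch.nn as nn

class LatentConditionedAttention(nn.Module):
    """Single-head attention whose queries, keys, and values are conditioned
    on a latent vector z by concatenation (simplified illustration)."""

    def __init__(self, embed_dim: int = 128, latent_dim: int = 16):
        super().__init__()
        in_dim = embed_dim + latent_dim
        self.q_proj = nn.Linear(in_dim, embed_dim)
        self.k_proj = nn.Linear(in_dim, embed_dim)
        self.v_proj = nn.Linear(in_dim, embed_dim)
        self.scale = embed_dim ** -0.5

    def forward(self, query_emb, node_embs, z):
        # query_emb: (batch, 1, embed_dim)  current partial-solution context
        # node_embs: (batch, n, embed_dim)  encoder node embeddings
        # z:         (batch, latent_dim)    sampled latent vector
        z_q = z.unsqueeze(1).expand(-1, query_emb.size(1), -1)
        z_n = z.unsqueeze(1).expand(-1, node_embs.size(1), -1)

        q = self.q_proj(torch.cat([query_emb, z_q], dim=-1))
        k = self.k_proj(torch.cat([node_embs, z_n], dim=-1))
        v = self.v_proj(torch.cat([node_embs, z_n], dim=-1))

        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v
```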

4. Inference-Time Latent Space Search: Covariance Matrix Adaptation

For a new problem instance $\rho$, the goal is to efficiently identify the latent $z$ producing the highest-quality solution under $\pi_\theta(\cdot \mid z)$. COMPASS applies parallel Covariance Matrix Adaptation Evolution Strategy (CMA-ES) components in the latent space:

  • Initialization: Multiple components $\{(\mu_j, \Sigma_j)\}$ are initialized at Voronoi centroids of $[-1,1]^{16}$.
  • Iterative Search:
  1. For each component $j$, sample $M$ candidates $z_{j,m} \sim \mathcal{N}(\mu_j, \Sigma_j)$.
  2. Evaluate rollouts $\pi_\theta(\cdot \mid z_{j,m})$ and receive rewards $R_{j,m}$.
  3. Update $(\mu_j, \Sigma_j)$ using the standard CMA-ES update based on $R_{j,m}$.
  • Selection: After a fixed rollout budget $B$, return the best solution found.

The computational bottleneck is dominated by $O(B)$ forward model rollouts; the CMA-ES logic itself is negligible in wall-clock time. This search protocol enables rapid per-instance adaptation, substantially improving solution quality and generalization.

Inference-Time CMA-ES Pseudocode

Initialize components {(μ_j, Σ_j)}_{j=1…C} at Voronoi centroids of [-1,1]^16.
best_solution ← ∅
while rollout budget B not exhausted do
  for each component j = 1…C:
    sample {z_{j,m}}_{m=1…M} ∼ N(μ_j, Σ_j)
    for m = 1…M: evaluate R_{j,m} = rollout(π_θ(·|z_{j,m}))
    update (μ_j, Σ_j) by CMA-ES using {R_{j,m}}_{m=1…M}
    record best_solution if any R_{j,m} improves on it
return best_solution
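
For reference, a runnable sketch of this loop using the off-the-shelf `cma` package is shown below. The `rollout_reward` callable, the random component initialization, and the hyperparameter values are illustrative assumptions, not the paper's exact configuration (which initializes components at Voronoi centroids).

```python
import numpy as np
import cma  # pip install cma

LATENT_DIM = 16

def latent_search(rollout_reward, n_components=4, popsize=8, budget=1600, sigma0=0.3):
    """Parallel CMA-ES search over the latent space at inference time.

    rollout_reward(z) -> float is a hypothetical stand-in for decoding the
    instance with pi_theta(. | z) and returning the trajectory reward
    (e.g. negative tour length).
    """
    rng = np.random.default_rng(0)
    strategies = [
        cma.CMAEvolutionStrategy(
            rng.uniform(-1.0, 1.0, LATENT_DIM), sigma0,
            {"bounds": [-1.0, 1.0], "popsize": popsize, "verbose": -9},
        )
        for _ in range(n_components)
    ]

    best_reward, best_z, used = -np.inf, None, 0
    while used < budget:
        for es in strategies:
            candidates = es.ask()                       # M latent candidates
            rewards = [rollout_reward(np.asarray(z)) for z in candidates]
            used += len(candidates)
            es.tell(candidates, [-r for r in rewards])  # CMA-ES minimizes
            i = int(np.argmax(rewards))
            if rewards[i] > best_reward:
                best_reward, best_z = rewards[i], candidates[i]
    return best_z, best_reward
```

Rewards are negated before `es.tell` because CMA-ES minimizes its objective, whereas the latent search seeks the highest trajectory reward.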

5. Empirical Evaluation and Benchmarks

COMPASS was evaluated against standard RL baselines (POMO, Poppy, EAS—active search fine-tuning) and industrial solvers (Concorde/LKH3 for routing, OR-Tools for scheduling) on eleven canonical benchmarks, each using an equal rollout budget of 1,600 samples per instance. Table 1 summarizes primary results for representative instance sizes:

Task | Best Industrial | POMO | Poppy | EAS | COMPASS
TSP (n=100) | 7.765 (0.000%) | 7.779 (0.185%) | 7.766 (0.013%) | 7.778 (0.161%) | 7.765 (0.002%)
CVRP (n=100) | 15.650 (0.000%) | 15.713 (0.399%) | 15.663 (0.084%) | 15.663 (0.081%) | 15.594 (−0.361%)
JSSP (10×10) | 807.6 (0.00%) | 862.1 (6.7%) | 849.7 (5.2%) | 858.4 (6.3%) | 845.5 (4.7%)

For larger instances (TSP 125/150/200, CVRP 125/150/200, JSSP 15×15/20×15), COMPASS consistently achieved the best performance among RL-based methods and often closed much of the gap to industrial solvers.

6. Distribution Shift and Robustness Analysis

Robustness is evaluated on 180 out-of-distribution (OOD) test sets generated by applying nine geometric or structural mutation operators (explosion, implosion, cluster, rotation, linear/axis projection, expansion, compression, grid), each at ten mutation strengths. COMPASS outperformed all compared methods on all OOD distributions.

For example, under the strongest TSP mutation (power = 0.9), average tour lengths and optimality gaps were:

Solver | Tour Length | Gap (%)
LKH3 | 6.732 | 0.000
POMO | 6.770 | 0.557
Poppy 16 | 6.744 | 0.176
EAS | 6.753 | 0.308
COMPASS | 6.738 | 0.091

COMPASS’s degradation in gap under increasing mutation power was substantially slower than other RL-based approaches, indicating superior generalization under strong distribution shift.

7. Key Insights, Limitations, and Prospective Extensions

  • Conditional Policy Family: Training with a continuous latent space produces an implicit infinite family of specialized policies, each accessible by selecting an appropriate $z$.
  • Instance Specialization: The best-of-$N$ objective induces structure in the latent space, with neighborhoods corresponding to problem-instance sub-types.
  • Efficient Adaptation: CMA-ES in latent space enables principled, computationally efficient search for high-performing policies per instance, without requiring online backpropagation.

Limitations:

  • Diversity is only implicit: No explicit diversity regularization is incorporated during training, potentially limiting latent coverage.
  • Training inefficiency: Uniform latent sampling can, in principle, miss optimal conditionings for instances.
  • Noisy search landscape: The latent space can be noisy; Bayesian optimization methods were less effective than CMA-ES within rollout budgets.

Future research directions include:

  • Incorporating an unsupervised diversity bonus (e.g., VAE-style entropy regularization) to further encourage latent space coverage.
  • Learning conditional priors or amortized inference networks $q(z \mid \rho)$ to propose promising latent candidates per instance.
  • Applying latent-space regularization such that smooth changes in zz yield gradual policy variations.
  • Combining COMPASS’s search with fine-tuning (as in EAS) for enhanced performance, particularly on large-scale instances.

COMPASS represents a methodological advance for RL-based combinatorial optimization by synthesizing a parametric diversity of heuristics during training and leveraging efficient black-box search during inference to realize superior performance both in-distribution and under substantial distribution shifts.
