Cached Multiple Negatives Ranking Loss
- The paper introduces a contrastive, solver-agnostic loss that leverages a dynamic solution cache to significantly reduce optimization calls and computational cost.
- It employs a ranking-based, noise-contrastive estimation framework with multiple negatives and hard-negative variants to maintain a controllable trade-off between fidelity and efficiency.
- Empirical results on NP-hard tasks demonstrate that the method achieves comparable or superior decision quality with much faster training times compared to standard black-box and relaxation approaches.
The Cached Multiple Negatives Ranking Loss is a contrastive, solver-agnostic surrogate loss function designed for end-to-end learning in predict-and-optimize settings, particularly where the task involves combinatorial optimization over discrete feasible sets. It achieves significant computational savings by decoupling optimization calls from every forward pass, instead maintaining a dynamic solution cache that enables inner approximations of the feasible set. The approach provides a controllable trade-off between estimator fidelity and compute cost through probabilistic cache updates, and is empirically validated on NP-hard problems, where it matches or surpasses the predictive quality of state-of-the-art black-box and relaxation-based methods, while delivering substantial reductions in training time (Mulamba et al., 2020).
1. Formal Definition and Variants
Given a training dataset $D = \{(x_i, c_i)\}_{i=1}^{n}$, for each instance $i$:
- $\hat{c}_i = m(\omega, x_i)$ is the model's predicted cost vector,
- $v_i^* = \arg\min_{v \in V} f(v, c_i)$ is the optimal solution under the true cost $c_i$,
- $S_i \subseteq V$ is a cache of feasible (non-optimal) solutions for instance $i$.
Let $f(v, c)$ denote the task-specific objective; for typical problems (e.g., routing, matching, scheduling), $f$ is often linear, $f(v, c) = c^\top v$. The Cached Multiple Negatives Ranking Loss comprises two contrastive variants that penalize predicted costs under which the oracle solution $v_i^*$ fails to score lower than the cached negatives $v^s \in S_i$:
(a) Multiple-negatives (all-pairs) loss:
$$\mathcal{L}_{\mathrm{NCE}} = \sum_{i=1}^{n} \sum_{v^s \in S_i} \big( f(v_i^*, \hat{c}_i) - f(v^s, \hat{c}_i) \big)$$
(b) MAP (hard-negative) loss, contrasting only the strongest cached competitor $\hat{v}_i = \arg\min_{v \in S_i} f(v, \hat{c}_i)$:
$$\mathcal{L}_{\mathrm{MAP}} = \sum_{i=1}^{n} \big( f(v_i^*, \hat{c}_i) - f(\hat{v}_i, \hat{c}_i) \big)$$
For linear objectives, it is common to operate on the prediction error, $\hat{c}_i - c_i$, to avoid trivial minimizers such as $\hat{c}_i = 0$, under which every solution scores equally. The corresponding variants are:
$$\mathcal{L}_{\mathrm{NCE}}^{(\hat{c}-c)} = \sum_{i=1}^{n} \sum_{v^s \in S_i} (\hat{c}_i - c_i)^\top (v_i^* - v^s), \qquad \mathcal{L}_{\mathrm{MAP}}^{(\hat{c}-c)} = \sum_{i=1}^{n} (\hat{c}_i - c_i)^\top (v_i^* - \hat{v}_i).$$
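The prediction-error variants above can be sketched in a few lines of NumPy for a single instance with a linear objective. This is an illustrative implementation, not the authors' code; the function names `nce_loss` and `map_loss` are assumptions for this example.

```python
import numpy as np

def nce_loss(c_hat, c, v_star, negatives):
    """All-pairs loss for one instance with linear f(v, c) = c^T v:
    sum over cached negatives of (c_hat - c)^T (v_star - v_neg)."""
    err = c_hat - c                                # prediction error
    return float(sum(err @ (v_star - v) for v in negatives))

def map_loss(c_hat, c, v_star, negatives):
    """Hard-negative variant: contrast v_star only against the cached
    solution that scores best under the *predicted* cost c_hat."""
    v_hard = min(negatives, key=lambda v: float(c_hat @ v))
    return float((c_hat - c) @ (v_star - v_hard))
```

With a single cached negative the two variants coincide; they diverge once the cache holds several solutions, with the MAP form focusing all gradient signal on the strongest competitor.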
2. Noise-Contrastive and Ranking Interpretation
The loss is rooted in a ranking-driven, noise-contrastive estimation framework. Considering the Gibbs measure
$$p(v \mid \hat{c}) = \frac{\exp(-f(v, \hat{c}))}{\sum_{u \in V} \exp(-f(u, \hat{c}))},$$
the oracle solution $v_i^*$ is treated as a positive, and the cached solutions $v^s \in S_i$ as negatives. Maximizing the unnormalized likelihood ratio across all negatives,
$$\prod_{v^s \in S_i} \frac{\exp(-f(v_i^*, \hat{c}_i))}{\exp(-f(v^s, \hat{c}_i))},$$
whose logarithm is $\sum_{v^s \in S_i} \big( f(v^s, \hat{c}_i) - f(v_i^*, \hat{c}_i) \big)$, recovers, up to a sign, precisely the multiple-negatives loss defined above. The normalization constant cancels in the ratio, obviating the need to enumerate the full feasible set $V$.
3. Solver-Agnostic Solution Caching and Inner Approximation
Rather than invoking the combinatorial solver on each training forward pass, this approach maintains, for each instance $i$, a dynamic cache $S_i$ initialized with the oracle solution $v_i^*$. Whenever the solver is evaluated with a new predicted vector $\hat{c}_i$, the resulting solution is added to $S_i$ if not already present. Over time, $S_i$ accumulates diverse feasible solutions, forming an “inner approximation” to $V$. Unlike LP relaxations, which provide an outer approximation, cache-based methods maintain integrality and leverage previously discovered structure.
| | Outer approximation | Inner approximation |
|---|---|---|
| Typical method | LP relaxation | Solution cache $S_i$ |
| Integrality | No | Yes |
| Explores new solutions | No | Yes (via growing $S_i$) |
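A per-instance cache needs only two operations: grow with a newly solved solution (deduplicated) and look up the cheapest cached solution under a predicted cost. A minimal sketch, assuming a linear objective; the class and method names are illustrative, not from the paper's code:

```python
import numpy as np

class SolutionCache:
    """Inner approximation of the feasible set V for one instance,
    initialized with the oracle solution and grown with solver hits."""

    def __init__(self, v_star):
        self.solutions = [np.asarray(v_star, dtype=float)]

    def add(self, v):
        # keep only distinct feasible solutions
        if not any(np.array_equal(v, u) for u in self.solutions):
            self.solutions.append(np.asarray(v, dtype=float))

    def lookup(self, c_hat):
        # cheapest cached solution under the predicted cost (linear f)
        return min(self.solutions, key=lambda v: float(c_hat @ v))
```

Because every stored vector came from the solver (or the oracle), every lookup returns an integral, feasible solution, unlike an LP-relaxation round-trip.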
4. Cache Lookup and Training Algorithm
The method probabilistically alternates between cache lookup and full solver invocation for each training instance and epoch, thereby enabling scalable mixed-mode optimization:
```
Algorithm: Gradient Descent with Solution Cache
Input      : D = {(x_i, c_i)}, solver-call prob. p_solve
Initialize : model ω; caches S_i ← {v_i^*} for i = 1…n
for epoch = 1…E do
    for each (x_i, c_i) in D do
        ĉ_i ← m(ω, x_i)
        if rand() < p_solve:                 # growth step
            v ← Solver(ĉ_i)                  # v ∈ V
            S_i ← S_i ∪ {v}
        else:                                # cache lookup
            v ← argmin_{u ∈ S_i} f(u, ĉ_i)
        compute loss ℒ (e.g. ℒ_NCE or ℒ_MAP) using {v_i^*, S_i}
        ω ← ω − η ∇_ω ℒ
```
The hyperparameter $p_{\mathrm{solve}} \in [0, 1]$ controls the frequency of full solves. $p_{\mathrm{solve}} = 1$ recovers full-complexity training with maximal cache fidelity, while $p_{\mathrm{solve}} = 0$ yields a fast, static approximation. Empirically, $p_{\mathrm{solve}} \approx 5\%$ suffices to closely match the solution quality of the full-solve regime while greatly reducing computational burden.
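The algorithm above can be exercised end-to-end on a toy problem. The sketch below is illustrative only: the two-item feasible set, the linear model $\hat{c} = Wx$, and all sizes are assumptions, not the paper's benchmarks. It uses the hard-negative (MAP) loss on the prediction error, whose gradient with respect to $\hat{c}_i$ (treating $\hat{v}_i$ as fixed) is simply $v_i^* - \hat{v}_i$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feasible set: pick exactly one of two items, minimizing cost.
V = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
solver = lambda c: min(V, key=lambda v: float(c @ v))   # exact oracle

X = rng.normal(size=(20, 3))                  # features x_i
C = X @ rng.normal(size=(2, 3)).T             # true costs c_i
v_star = [solver(c) for c in C]               # oracle solutions v_i^*

W = 0.1 * rng.normal(size=(2, 3))             # model: ĉ = W x
caches = [[v] for v in v_star]                # S_i ← {v_i^*}
p_solve, eta = 0.05, 0.05

for epoch in range(30):
    for i, x in enumerate(X):
        c_hat = W @ x
        if rng.random() < p_solve:            # growth step: full solve
            v = solver(c_hat)
            if not any(np.array_equal(v, u) for u in caches[i]):
                caches[i].append(v)
        v_hat = min(caches[i], key=lambda v: float(c_hat @ v))  # lookup
        # MAP loss on the prediction error: (ĉ - c)^T (v* - v̂);
        # gradient w.r.t. ĉ is (v* - v̂), chained through ĉ = W x.
        W -= eta * np.outer(v_star[i] - v_hat, x)
```

Note that with $p_{\mathrm{solve}} = 0.05$, only about 5% of the iterations pay for a solver call; the rest are a cheap argmin over the small cache.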
5. Implementation Workflow and Hyperparameters
Implementation proceeds as follows:
- Precompute $v_i^*$ for each instance, and initialize $S_i \leftarrow \{v_i^*\}$.
- Select model $m(\omega, \cdot)$, optimizer (e.g., Adam), learning rate $\eta$, epochs $E$.
- Set $p_{\mathrm{solve}}$ to control cache refresh (e.g., 0.05).
- In each minibatch forward pass, randomly decide solver call vs. cache lookup using $p_{\mathrm{solve}}$.
- Compute one of the contrastive losses $\mathcal{L}_{\mathrm{NCE}}$ or $\mathcal{L}_{\mathrm{MAP}}$ and backpropagate.
- Validate on held-out data for regret, tuning $p_{\mathrm{solve}}$, $\eta$, and batch size.
No generalization bounds are provided, but the loss approximation improves monotonically as the cache $S_i$ grows.
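The validation criterion mentioned above, decision regret, is the extra true cost incurred by acting on the predicted costs instead of the true ones. A minimal sketch for a single minimization instance; the function name is illustrative:

```python
import numpy as np

def regret(c_true, v_decision, v_oracle):
    """Decision regret under the true cost, for a minimization problem:
    f(v_decision, c) - f(v_oracle, c), which is always >= 0."""
    return float(c_true @ v_decision - c_true @ v_oracle)
```

Summing (or averaging) this quantity over a held-out set gives the regret figures reported in the empirical comparisons below.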
6. Empirical Results and Comparative Performance
Studies on canonical NP-hard tasks demonstrate that cached multiple negatives ranking losses deliver competitive, or even superior, decision regret relative to black-box differentiation (e.g., SPO+, implicit gradients) and QPTL/IPM relaxations, at markedly lower computational cost. For example, on Knapsack-120:
- Full black-box (no cache): regret ≈ 528, per-epoch time ≈ 4s;
- Cached ($p_{\mathrm{solve}} = 5\%$): regret ≈ 562, per-epoch time ≈ 0.5s (88% faster).
For the largest energy-aware scheduling instance, cached methods match best regret (~18,500) yet reduce epoch time from ~42s to ~1.5s. The caching wrapper is also effective for SPO+ and black-box methods, reducing per-epoch cost by up to an order-of-magnitude with negligible impact on solution quality. This suggests that the caching strategy generalizes as a modular performance improvement for a family of predict-and-optimize workflows.
7. Significance, Limitations, and Extensions
The Cached Multiple Negatives Ranking Loss provides a principled framework for integrating contrastive learning objectives with combinatorial optimization under uncertainty. It does not require relaxation of integrality constraints and is solver-agnostic, with a single hyperparameter $p_{\mathrm{solve}}$ governing the trade-off between cost and fidelity. While no theoretical generalization bounds are presently given, the loss approximation is guaranteed to improve as caches grow. A plausible implication is that, for large and complex feasible sets, the approach can scale to regimes where frequent optimization is otherwise infeasible, without loss of predictive or prescriptive quality (Mulamba et al., 2020).