Cached Multiple Negatives Ranking Loss

Updated 12 December 2025
  • The paper introduces a contrastive, solver-agnostic loss that leverages a dynamic solution cache to significantly reduce optimization calls and computational cost.
  • It employs a ranking-based, noise-contrastive estimation framework with multiple negatives and hard-negative variants to maintain a controllable trade-off between fidelity and efficiency.
  • Empirical results on NP-hard tasks demonstrate that the method achieves comparable or superior decision quality with much faster training times compared to standard black-box and relaxation approaches.

The Cached Multiple Negatives Ranking Loss is a contrastive, solver-agnostic surrogate loss function designed for end-to-end learning in predict-and-optimize settings, particularly where the task involves combinatorial optimization over discrete feasible sets. It achieves significant computational savings by decoupling solver calls from the forward pass: rather than solving the optimization problem for every prediction, it maintains a dynamic solution cache that serves as an inner approximation of the feasible set. The approach provides a controllable trade-off between estimator fidelity and compute cost through probabilistic cache updates, and is empirically validated on NP-hard problems, where it matches or surpasses the predictive quality of state-of-the-art black-box and relaxation-based methods while delivering substantial reductions in training time (Mulamba et al., 2020).

1. Formal Definition and Variants

Given a training dataset $D = \{(x_i, c_i)\}_{i=1}^n$, for each instance $i$:

  • $m(\omega, x_i) = \hat{c}_i \in \mathbb{R}^d$ is the model's predicted cost vector,
  • $v_i^* = \arg\min_{v \in V} f(v, c_i)$ is the optimal solution under the true cost $c_i$,
  • $S_i \subseteq V$ is a cache of feasible solutions for instance $i$, whose elements serve as negatives.

Let $f(v, c)$ denote the task-specific objective; for typical problems (e.g., routing, matching, scheduling), $f$ is often linear, $f(v, c) = c^\top v$. The Cached Multiple Negatives Ranking Loss comprises two contrastive variants that push the predicted objective value of the oracle solution $v_i^*$ below that of every cached negative $v^s \in S_i$:

(a) Multiple-negatives (all-pairs) loss:

$$\mathcal{L}_{\mathrm{NCE}} = \sum_{i=1}^n \sum_{v^s \in S_i} \left[ f(v_i^*, m(\omega, x_i)) - f(v^s, m(\omega, x_i)) \right]$$

(b) MAP (hard-negative) loss:

$$\mathcal{L}_{\mathrm{MAP}} = \sum_{i=1}^n \max_{v^s \in S_i} \left[ f(v_i^*, m(\omega, x_i)) - f(v^s, m(\omega, x_i)) \right]$$

For linear objectives, it is common to operate on the prediction error $(\hat{c}_i - c_i)$ to avoid trivial minimizers. The corresponding variants are:

$$\mathcal{L}_{\mathrm{NCE}}^{(\hat{c}-c)} = \sum_{i=1}^n (m(\omega, x_i) - c_i)^\top \sum_{v^s \in S_i} (v_i^* - v^s)$$

$$\mathcal{L}_{\mathrm{MAP}}^{(\hat{c}-c)} = \sum_{i=1}^n (m(\omega, x_i) - c_i)^\top \left( v_i^* - \arg\min_{v^s \in S_i} f(v^s, m(\omega, x_i)) \right)$$
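For concreteness, here is a minimal PyTorch sketch of the two $(\hat{c}-c)$ variants for a single instance with a linear objective; the function names, tensor layout, and use of PyTorch are illustrative assumptions, not prescribed by the paper:

```python
import torch

def nce_loss(c_hat, c_true, v_star, cache):
    """L_NCE^(c_hat - c) for one instance with linear objective f(v, c) = c^T v.

    c_hat, c_true, v_star: 1-D tensors of length d.
    cache: tensor of shape (|S_i|, d) whose rows are cached feasible solutions.
    """
    diff = c_hat - c_true                       # prediction error (c_hat - c)
    return diff @ (v_star - cache).sum(dim=0)   # sum over all cached negatives

def map_loss(c_hat, c_true, v_star, cache):
    """L_MAP^(c_hat - c): contrast v_star with the hardest cached negative only."""
    hard = cache[torch.argmin(cache @ c_hat)]   # cached solution with lowest predicted cost
    return (c_hat - c_true) @ (v_star - hard)
```

Because the cache is initialized with $v_i^*$ itself, the corresponding term contributes zero to the sum (and the MAP variant returns zero when the prediction already ranks $v_i^*$ best), so keeping the oracle solution among the cached entries is harmless.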

2. Noise-Contrastive and Ranking Interpretation

The loss is rooted in a ranking-driven, noise-contrastive estimation framework. Under the Gibbs measure

$$p(v \mid \hat{c}) = \frac{1}{Z(\hat{c})} \exp\left(-f(v, \hat{c})\right), \qquad Z(\hat{c}) = \sum_{v' \in V} \exp\left(-f(v', \hat{c})\right)$$

the oracle solution $v_i^*$ is treated as the positive and the cached solutions $v^s \in S_i$ as negatives. Maximizing the product of likelihood ratios over all negatives,

$$\prod_{v^s \in S_i} \frac{p(v_i^* \mid \hat{c}_i)}{p(v^s \mid \hat{c}_i)} = \exp\left( \sum_{v^s \in S_i} \big[ -f(v_i^*, \hat{c}_i) + f(v^s, \hat{c}_i) \big] \right)$$

recovers, up to a sign, precisely the multiple-negatives loss defined above. The normalization constant $Z(\hat{c}_i)$ cancels in the ratio, obviating the need to enumerate the full feasible set $V$.
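Taking the negative logarithm makes the correspondence explicit (this is a restatement of the identity above, not an additional result):

$$-\log \prod_{v^s \in S_i} \frac{p(v_i^* \mid \hat{c}_i)}{p(v^s \mid \hat{c}_i)} = \sum_{v^s \in S_i} \left[ f(v_i^*, \hat{c}_i) - f(v^s, \hat{c}_i) \right],$$

which is exactly the $i$-th term of $\mathcal{L}_{\mathrm{NCE}}$; minimizing the loss therefore maximizes the likelihood ratio of the positive against the cached negatives.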

3. Solver-Agnostic Solution Caching and Inner Approximation

Rather than invoking the combinatorial solver on each training forward pass, the approach maintains, for each instance $i$, a dynamic cache $S_i$ initialized with the oracle solution $v_i^*$. Whenever the solver is evaluated on a new predicted vector $\hat{c}_i$, the resulting solution is added to $S_i$ if not already present. Over time, $S_i$ accumulates diverse feasible solutions, forming an "inner approximation" of $\mathrm{conv}(V)$. Unlike LP relaxations, which provide an outer approximation, cache-based methods maintain integrality and reuse previously discovered structure.

| | Outer Approximation | Inner Approximation |
|---|---|---|
| Typical method | LP relaxation | Cached $S_i$ |
| Integrality | No | Yes |
| Explores $V$ | No | Yes (via growing $S_i$) |
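A minimal sketch of such a per-instance cache follows; the class name and methods are illustrative, since the paper does not prescribe a particular data structure. The cache starts from the oracle solution and grows only when the solver returns a solution not yet stored.

```python
import numpy as np

class SolutionCache:
    """Dynamic cache S_i of feasible solutions, initialized with the oracle v_i^*."""

    def __init__(self, v_star):
        self.solutions = [np.asarray(v_star)]

    def add(self, v):
        """Growth step: insert a solver result if it is not already cached."""
        v = np.asarray(v)
        if not any(np.array_equal(v, u) for u in self.solutions):
            self.solutions.append(v)

    def lookup(self, c_hat):
        """Cache lookup: return the cached solution minimizing the predicted linear cost."""
        costs = [float(c_hat @ u) for u in self.solutions]
        return self.solutions[int(np.argmin(costs))]
```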

4. Cache Lookup and Training Algorithm

The method probabilistically alternates between a cache lookup and a full solver invocation for each training instance in each epoch, thereby enabling scalable mixed-mode optimization:

Algorithm: Gradient Descent with Solution Cache
Input      : D = {(x_i, c_i)}, solver-call probability p_solve
Initialize : model ω; caches S_i ← {v_i^*} for i = 1…n
for epoch = 1…E do
  for each (x_i, c_i) in D do
    ĉ_i ← m(ω, x_i)
    if rand() < p_solve:
        # growth step
        v ← Solver(ĉ_i)            # v ∈ V
        S_i ← S_i ∪ {v}
    else:
        # cache lookup
        v ← argmin_{u ∈ S_i} f(u, ĉ_i)
    Compute loss ℒ (e.g. ℒ_NCE or ℒ_MAP) using {v_i^*, S_i}
    ω ← ω − η ∇_ω ℒ

The hyperparameter $p_{\mathrm{solve}} \in [0,1]$ controls the frequency of full solves. $p_{\mathrm{solve}} = 1$ recovers full-complexity training with maximal cache fidelity, while $p_{\mathrm{solve}} \approx 0$ yields a fast, static approximation. Empirically, $p_{\mathrm{solve}} = 0.05$ suffices to closely match the solution quality of the full-solve regime while greatly reducing the computational burden.
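A compact PyTorch rendering of the loop above, reusing the SolutionCache sketch from Section 3 and the single-instance map_loss sketch from Section 1; the model, solver, and data layout are placeholders rather than parts of the published method:

```python
import random
import torch

def train_epoch(model, optimizer, data, caches, solver, p_solve=0.05):
    """One epoch over `data`, a list of (x_i, c_i, v_star_i) tensor triples."""
    for i, (x, c, v_star) in enumerate(data):
        c_hat = model(x)                                   # predicted cost vector
        if random.random() < p_solve:
            # growth step: call the combinatorial solver on the current prediction
            caches[i].add(solver(c_hat.detach()))
        # the MAP loss itself performs the cache lookup (argmin over S_i)
        negatives = torch.stack([torch.as_tensor(u, dtype=c_hat.dtype)
                                 for u in caches[i].solutions])
        loss = map_loss(c_hat, c, v_star, negatives)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```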

5. Implementation Workflow and Hyperparameters

Implementation proceeds as follows:

  • Precompute $v_i^*$ for each instance and initialize $S_i \leftarrow \{v_i^*\}$.
  • Select the model $m(\omega, \cdot)$, optimizer (e.g., Adam), learning rate $\eta$, and number of epochs $E$.
  • Set $p_{\mathrm{solve}}$ to control cache refresh (e.g., $0.05$).
  • In each minibatch forward pass, randomly decide between a solver call and a cache lookup using $p_{\mathrm{solve}}$.
  • Compute one of the contrastive losses $\mathcal{L}_{\mathrm{NCE}}^{(\hat{c}-c)}$ or $\mathcal{L}_{\mathrm{MAP}}^{(\hat{c}-c)}$ and backpropagate.
  • Validate on held-out data for regret, tuning $\eta$, $p_{\mathrm{solve}}$, and the batch size.

No generalization bounds are provided, but the loss approximation improves monotonically as $|S_i|$ increases.
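Since regret on held-out data is the validation metric mentioned above, a small evaluation sketch may be useful; it assumes a minimization problem with a linear objective and a `solver` callable returning the optimal feasible solution for a given cost vector:

```python
import numpy as np

def regret(solver, c_hat, c_true):
    """Decision regret: cost of acting on the prediction minus the hindsight-optimal cost."""
    v_pred = np.asarray(solver(c_hat))   # decision taken under predicted costs
    v_opt = np.asarray(solver(c_true))   # optimal decision under true costs
    return float(c_true @ v_pred - c_true @ v_opt)
```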

6. Empirical Results and Comparative Performance

Studies on canonical NP-hard tasks demonstrate that cached multiple negatives ranking losses deliver competitive, or even superior, decision regret relative to black-box differentiation (e.g., SPO+, implicit gradients) and QPTL/IPM relaxations, at markedly lower computational cost. For example, on Knapsack-120:

  • Full black-box (no cache): regret ≈ 528, per-epoch time ≈ 4s;
  • Cached $\mathcal{L}_{\mathrm{MAP}}^{(\hat{c}-c)}$: regret ≈ 562, per-epoch time ≈ 0.5s (≈88% faster).

For the largest energy-aware scheduling instance, cached methods match the best regret (≈18,500) while reducing epoch time from ≈42s to ≈1.5s. The caching wrapper is also effective for SPO+ and black-box methods, reducing per-epoch cost by up to an order of magnitude with negligible impact on solution quality. This suggests that the caching strategy generalizes as a modular performance improvement across a family of predict-and-optimize workflows.

7. Significance, Limitations, and Extensions

The Cached Multiple Negatives Ranking Loss provides a principled framework for integrating contrastive learning objectives with combinatorial optimization under uncertainty. It does not require relaxation of integrality constraints and is solver-agnostic, with a simple hyperparameter governing the trade-off between cost and fidelity. While no theoretical generalization bounds are presently given, loss approximation is guaranteed to improve as caches grow. A plausible implication is that, for large and complex feasible sets, the approach can scale to regimes where frequent optimization is otherwise infeasible, without loss of predictive or prescriptive quality (Mulamba et al., 2020).

References

  1. Mulamba, M., Mandi, J., Diligenti, M., Lombardi, M., Bucarey, V., and Guns, T. (2020). Contrastive Losses and Solution Caching for Predict-and-Optimize.
