
Multi-Similarity Loss in Deep Metric Learning

Updated 26 May 2026
  • Multi-Similarity Loss is a loss function for deep metric learning that integrates self, positive-relative, and negative-relative signals to improve embedding quality.
  • It employs a General Pair Weighting framework to mine and weight informative pairs, ensuring larger gradient signals and efficient training.
  • Extensions like MSCon and SMS adapt the loss for multi-attribute and soft-label scenarios, significantly boosting retrieval accuracy and generalization.

Multi-Similarity Loss is a class of loss functions central to modern deep metric learning and contrastive representation learning. It regularizes embedding models by leveraging information from multiple notions of similarity, outperforming traditional pair/triplet-based approaches in image retrieval, cross-modal retrieval, and robust representation learning. Integrating the General Pair Weighting (GPW) framework, multi-similarity loss enables principled and efficient mining and weighting of training pairs, and extensions such as Multi-Similarity Contrastive Loss (MSCon) and Symmetric Multi-Similarity Loss (SMS) exploit multiple metrics or soft-label information for enhanced performance and generalization.

1. Motivation and Historical Background

In classical deep metric learning, most pair-based objectives, such as the contrastive, triplet, and lifted-structure losses, relied on fixed rules for mining positive and negative pairs. These approaches were limited by redundant pair sampling and coarse, often uniform, weighting schemes. They typically exploited only a single “signal” per pair: the pair's own similarity (“self”), its similarity relative to the anchor's positives (“positive-relative”), or its similarity relative to the anchor's negatives (“negative-relative”).

Multi-Similarity Loss (MS Loss), introduced by Wang et al. (2019), addressed these limitations by combining three distinct similarity signals (self, positive-relative, and negative-relative) within a unified, differentiable formulation. This weighting exploits more of the information in each batch and yields larger, more informative gradients per step. Subsequent variants, such as Multi-Similarity Contrastive Loss (MSCon) and Symmetric Multi-Similarity Loss (SMS), extended this framework to settings with multiple, possibly uncertain, notions of similarity (Mu et al., 2023, Wang et al., 2024). Such scenarios are common in real-world data, where objects are annotated with multiple categorical or soft affiliations.

2. The General Pair Weighting and Multi-Similarity Loss Formulation

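Multi-Similarity Loss is rooted in the General Pair Weighting (GPW) view, in which the gradient of any pair-based metric learning loss decomposes into a weighted sum of pairwise similarity gradients:

$$\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial S_{ij}} \frac{\partial S_{ij}}{\partial \theta} = \sum_{i} \left( \sum_{j:\, y_j \neq y_i} w_{ij} \frac{\partial S_{ij}}{\partial \theta} - \sum_{j:\, y_j = y_i} w_{ij} \frac{\partial S_{ij}}{\partial \theta} \right)$$

where $w_{ij} = \left| \frac{\partial \mathcal{L}}{\partial S_{ij}} \right|$ is the weight of pair $(i, j)$, $y_i$ is the label of $x_i$, $S_{ij}$ is the cosine similarity between $f(x_i)$ and $f(x_j)$, and $f$ is a unit-normalizing embedding function. The signs reflect that a sensible loss decreases as positive-pair similarity grows ($\partial \mathcal{L} / \partial S_{ij} \le 0$) and increases with negative-pair similarity ($\partial \mathcal{L} / \partial S_{ij} \ge 0$).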

Original Multi-Similarity Loss

Given a batch $\{x_i\}_{i=1}^{m}$ with labels $y_i$, positives $P_i$, and negatives $N_i$, the loss is formulated as:

$$\mathcal{L}_{\text{MS}} = \frac{1}{m} \sum_{i=1}^{m} \left\{ \frac{1}{\alpha} \log\left[ 1 + \sum_{k \in P_i} e^{-\alpha (S_{ik} - \lambda)} \right] + \frac{1}{\beta} \log\left[ 1 + \sum_{k \in N_i} e^{\beta (S_{ik} - \lambda)} \right] \right\}$$

with sharpness parameters $\alpha$, $\beta$ and margin $\lambda$ (Wang et al., 2019). The loss is preceded by an explicit mining step that restricts $P_i$ and $N_i$ to “informative” positives and negatives based on relative similarity, followed by a soft weighting of each pair based on both its own similarity $S_{ij}$ and its hardness compared to other pairs.
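As a rough illustration of the formula, the sketch below evaluates the loss in plain Python: for each anchor, a log-sum-exp over positives with sharpness alpha plus a log-sum-exp over negatives with sharpness beta, both offset by the margin lambda. It omits the mining step, and the function names and default hyperparameter values are assumptions chosen for the example, not values taken from the text.

```python
import math

def cosine_similarity_matrix(embeddings):
    """Return S[i][j], the cosine similarity between embeddings i and j."""
    normed = []
    for v in embeddings:
        norm = math.sqrt(sum(x * x for x in v))
        normed.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, w)) for w in normed] for u in normed]

def ms_loss(sim, labels, alpha=2.0, beta=50.0, lam=0.5):
    """Multi-Similarity loss over all positives and negatives (no mining).

    alpha, beta, lam are illustrative defaults (an assumption of this sketch).
    """
    m = len(labels)
    total = 0.0
    for i in range(m):
        pos = [sim[i][k] for k in range(m) if k != i and labels[k] == labels[i]]
        neg = [sim[i][k] for k in range(m) if labels[k] != labels[i]]
        # Positive term: grows when positives fall below the margin.
        pos_term = math.log1p(sum(math.exp(-alpha * (s - lam)) for s in pos)) / alpha
        # Negative term: grows when negatives rise above the margin.
        neg_term = math.log1p(sum(math.exp(beta * (s - lam)) for s in neg)) / beta
        total += pos_term + neg_term
    return total / m

S = cosine_similarity_matrix([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_clustered = ms_loss(S, [0, 0, 1, 1])  # labels agree with the geometry
loss_scrambled = ms_loss(S, [0, 1, 0, 1])  # labels disagree with the geometry
```

With these toy embeddings, the labeling that matches the geometry should give a smaller loss than the scrambled labeling.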

Mining and Weighting Mechanism

  • Informative positive set: $\{\, j \in P_i : S_{ij} < \max_{k \in N_i} S_{ik} + \epsilon \,\}$
  • Informative negative set: $\{\, j \in N_i : S_{ij} > \min_{k \in P_i} S_{ik} - \epsilon \,\}$
  • Pairs that survive mining are then exponentially weighted and combined in the loss; $\epsilon$ is the mining margin.

This design ensures that only the most “violating” or “hard” pairs contribute significant gradient signal, improving both retrieval precision and training efficiency.
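The mining rule can be sketched as follows: a positive is kept if it is less similar to the anchor than the hardest negative plus a margin, and a negative is kept if it is more similar than the hardest positive minus the same margin. The function name, the default margin, and the choice to skip anchors that lack positives or negatives are assumptions of this sketch.

```python
def mine_informative_pairs(sim, labels, eps=0.1):
    """For each anchor i, return (kept_positive_indices, kept_negative_indices)."""
    m = len(labels)
    mined = []
    for i in range(m):
        pos = [k for k in range(m) if k != i and labels[k] == labels[i]]
        neg = [k for k in range(m) if labels[k] != labels[i]]
        if not pos or not neg:
            # Assumption: anchors without both kinds of pairs contribute nothing.
            mined.append(([], []))
            continue
        hardest_neg = max(sim[i][k] for k in neg)  # most similar negative
        hardest_pos = min(sim[i][k] for k in pos)  # least similar positive
        keep_pos = [j for j in pos if sim[i][j] < hardest_neg + eps]
        keep_neg = [j for j in neg if sim[i][j] > hardest_pos - eps]
        mined.append((keep_pos, keep_neg))
    return mined

sim = [
    [1.00, 0.90, 0.20, 0.85],
    [0.90, 1.00, 0.10, 0.30],
    [0.20, 0.10, 1.00, 0.80],
    [0.85, 0.30, 0.80, 1.00],
]
mined = mine_informative_pairs(sim, [0, 0, 1, 1])
```

In this toy matrix, anchor 0 keeps its positive (index 1) and only the hard negative (index 3), while anchor 1 keeps nothing because its negatives are already far below its positive.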

3. Extensions: Multi-Similarity Contrastive and Symmetric Multi-Similarity Losses

Multi-Similarity Contrastive Loss (MSCon)

When data carries multiple categorical or semantic attributes (e.g., category, closure, gender for images), each attribute induces a distinct similarity relation. MSCon, introduced by Mu et al. (2023), learns one projection head per similarity metric and forms a multi-similarity objective by summing one supervised contrastive (SupCon) loss per metric:

$$\mathcal{L}_{\text{MSCon}} = \sum_{c=1}^{C} \mathcal{L}_{c}^{\text{SupCon}}$$

where $C$ is the number of similarity metrics and $\mathcal{L}_{c}^{\text{SupCon}}$ is a SupCon loss over the $c$-th relational head.

Uncertainty-based Task Weighting

MSCon incorporates a learnable task-specific uncertainty $\sigma_c$ for each of the $C$ similarity metrics, yielding the regularized objective:

$$\mathcal{L}_{\text{MSCon}} = \sum_{c=1}^{C} \left( \frac{1}{\sigma_c^2} \mathcal{L}_{c}^{\text{SupCon}} + \log \sigma_c \right)$$

This weighting down-scales the contribution of “uncertain” or noisy similarity tasks, leading to better out-of-domain (OOD) generalization and more robust multi-attribute representations (Mu et al., 2023).
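A minimal sketch of this kind of uncertainty weighting is given below. It assumes the common form in which each task loss is scaled by the inverse variance and penalized by the log of the standard deviation; the exact constant on the regularizer, the log-sigma parameterization, and the function names are assumptions of this example rather than details confirmed by the text.

```python
import math

def uncertainty_weighted_loss(task_losses, log_sigmas):
    """Sum over tasks of loss_c / sigma_c^2 + log(sigma_c).

    Parameterized by log(sigma_c) so that sigma_c stays positive.
    """
    total = 0.0
    for loss_c, log_sigma in zip(task_losses, log_sigmas):
        inv_var = math.exp(-2.0 * log_sigma)  # 1 / sigma_c^2
        total += inv_var * loss_c + log_sigma
    return total

def optimal_log_sigma(loss_c):
    """Minimizer of loss_c * exp(-2 s) + s over s, i.e. sigma_c^2 = 2 * loss_c."""
    return 0.5 * math.log(2.0 * loss_c)

equal_weights = uncertainty_weighted_loss([1.0, 4.0], [0.0, 0.0])
noisy_task_sigma = optimal_log_sigma(4.0)
clean_task_sigma = optimal_log_sigma(1.0)
```

Under this form, a task with a persistently larger loss settles at a larger sigma and therefore a smaller weight, which is the down-scaling behaviour described above.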

Symmetric Multi-Similarity Loss

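For cross-modal or soft-label scenarios (e.g., video–text with soft correlation matrices), the Symmetric Multi-Similarity Loss (SMS) employs the difference between soft correlation scores, $R = c_{ij} - c_{ik}$, as the margin, enforcing a symmetric ordering via a hinge-style triplet loss:

$$\mathcal{L}_{\text{SMS}}(i, j, k) = \begin{cases} \left[ \gamma R - S_{ij} + S_{ik} \right]_+ & \text{if } R > 0 \\ \left[ -\gamma R + S_{ij} - S_{ik} \right]_+ & \text{if } R < 0 \\ \left[ \left| S_{ij} - S_{ik} \right| - \tau \right]_+ & \text{if } R = 0 \end{cases}$$

where $c_{ij}$ is the soft correlation between anchor $i$ and candidate $j$, $[\cdot]_+ = \max(0, \cdot)$, $\gamma$ controls the margin and $\tau$ is a relaxation factor to prevent degenerate updates when $R = 0$ (Wang et al., 2024).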
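The per-triplet behaviour can be sketched as below, under the assumption that the loss takes a three-case hinge form: the correlation difference sets the margin when it is nonzero, and a relaxed absolute-difference penalty applies when the two candidates are equally correlated with the anchor. The function name, argument names, and default values of gamma and tau are hypothetical.

```python
def sms_triplet_loss(s_ij, s_ik, c_ij, c_ik, gamma=0.6, tau=0.1):
    """Hinge-style symmetric loss for one (anchor i, candidate j, candidate k) triplet.

    s_* are embedding similarities, c_* are soft correlation labels.
    gamma and tau are illustrative defaults (an assumption of this sketch).
    """
    r = c_ij - c_ik
    if r > 0:
        # j is more relevant: s_ij should exceed s_ik by at least gamma * r.
        return max(0.0, gamma * r - s_ij + s_ik)
    if r < 0:
        # k is more relevant: s_ik should exceed s_ij by at least gamma * |r|.
        return max(0.0, -gamma * r + s_ij - s_ik)
    # Equal relevance: only penalize similarity gaps larger than tau.
    return max(0.0, abs(s_ij - s_ik) - tau)

ordered = sms_triplet_loss(0.9, 0.1, 1.0, 0.0)     # correct order, margin met
violated = sms_triplet_loss(0.1, 0.9, 1.0, 0.0)    # wrong order
tied_close = sms_triplet_loss(0.50, 0.45, 0.5, 0.5)  # equal labels, small gap
```

Swapping the roles of the two candidates leaves the value unchanged, which is the symmetry the name refers to.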

4. Algorithmic and Implementation Details

Multi-Similarity Loss and its derivatives are implemented via efficient matrix operations within deep learning frameworks:

  • Batch construction: Use multiple samples per class to enable informative positive and negative mining.
  • Pairwise similarity matrix computation: Compute all cosine similarities in the batch ($S_{ij} = f(x_i)^{\top} f(x_j)$); efficient masking is used to select anchor–positive and anchor–negative pairs.
  • Mining step: For each anchor, a vectorized reduction extracts the hardest positive and hardest negative, which are then used to construct the informative subsets of $P_i$ and $N_i$.
  • Weighting step: Exponential (softmax-like) weighting over the mined pairs for greater gradient selectivity.
  • Stabilization: Care is taken to avoid numerical overflow in the exponentials through judicious selection of the sharpness parameters $\alpha$ and $\beta$.
  • Batch size: Empirically, robust estimation requires a sufficiently large batch, with several samples per class, for effective mining.
  • Final update: Fully vectorized gradient calculation is supported, with no need for custom backward passes.
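One way to address the stabilization point, independent of parameter choice, is to evaluate each log(1 + sum of exponentials) term with the usual max-shift trick, treating the constant 1 as exp(0). The helper below is a sketch of that identity in plain Python; the function name is an assumption.

```python
import math

def log1p_sum_exp(zs):
    """Numerically stable log(1 + sum_k exp(z_k)).

    Uses log(1 + sum exp(z)) = m + log(exp(-m) + sum exp(z - m))
    with m = max(0, max(z)), so every exponent is <= 0.
    """
    m = max(0.0, max(zs, default=0.0))
    return m + math.log(math.exp(-m) + sum(math.exp(z - m) for z in zs))

small = log1p_sum_exp([0.5, -1.0])
large = log1p_sum_exp([1000.0, 999.0])  # naive math.exp(1000.0) would overflow
```

In the loss, `zs` would hold the scaled arguments, such as beta * (S_ik - lambda) for the negatives of one anchor.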

For multi-task cases (MSCon), each metric’s loss is weighted by the inverse variance $1/\sigma_c^2$ and self-regularized by $\log \sigma_c$; gradients are accumulated over all tasks before joint optimization (Mu et al., 2023).

5. Empirical Performance and Ablation Studies

Multi-Similarity Loss and its generalizations are reported to achieve strong results, often state of the art at the time of publication, on multiple benchmarks:

| Dataset | Loss/Method | Recall@1 (%) or Top-1 (%) | Key Setting/Attribute |
|---|---|---|---|
| CUB-200 | MS Loss | 65.7 | d=512, vs. 60.6 (ABE) |
| Cars-196 | MS Loss | 84.1 | vs. 81.4 (HTL) |
| In-Shop Clothes | MS Loss | 89.7 | vs. 80.9 (prior) |
| SOP | MS Loss | 78.2 | vs. 74.8 (ABE) |
| Zappos50k | MSCon | 97.17/94.37/85.98 | Category/Closure/Gender |
| MEDIC | MSCon | 81.00/79.14/81.69/85.15 | Multi-attribute, in-domain |
| EK-100 | SMS | 57.0/69.2, 62.1/73.0 | ViT-B, ViT-L (mAP/nDCG) |

Ablation studies reveal that:

  • Incorporating all three signals (P+S+N) yields stronger performance than using any single mining or weighting component (Wang et al., 2019).
  • Learned uncertainty weighting (in MSCon) significantly improves out-of-domain accuracy, especially when certain similarity metrics are noisy or intentionally corrupted (Mu et al., 2023).
  • Introducing the relaxation factor $\tau$ in SMS yields notable boosts in mAP (Wang et al., 2024).
  • SMS outperforms adaptive MI-MM variants by explicit utilization of soft-label differences and symmetric loss structure (Wang et al., 2024).

6. Comparative Analysis and Practical Implications

Multi-Similarity Loss unifies and extends traditional pair-based and triplet-based losses:

  • Contrastive loss: Only exploits self-similarity, with all mined pairs weighted equally.
  • Triplet/Histogram/Lifted structure: Partially exploit positive-relative or negative-relative signals, but lack joint mining and weighting.
  • MS Loss: Combines strict mining (positive-relative) with soft, differentiable weighting (self and negative-relative), yielding sharper gradient focus and better utilization of informative pairs (Wang et al., 2019).
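The difference in weighting can be made concrete by comparing the pair weights (the magnitudes of the loss derivative with respect to each similarity) for a single anchor. The sketch below assumes the similarity-based contrastive loss, which gives every positive and every negative above the margin the same weight, and the MS loss form with sharpness parameters alpha, beta and margin lambda; function names and default values are assumptions for illustration.

```python
import math

def contrastive_weights(pos_sims, neg_sims, lam=0.5):
    """Uniform weights: all positives, and negatives with similarity above lam."""
    pos_w = [1.0 for _ in pos_sims]
    neg_w = [1.0 if s > lam else 0.0 for s in neg_sims]
    return pos_w, neg_w

def ms_weights(pos_sims, neg_sims, alpha=2.0, beta=50.0, lam=0.5):
    """Soft weights: each pair is weighted relative to the anchor's other pairs."""
    pos_exp = [math.exp(-alpha * (s - lam)) for s in pos_sims]
    neg_exp = [math.exp(beta * (s - lam)) for s in neg_sims]
    pos_w = [e / (1.0 + sum(pos_exp)) for e in pos_exp]
    neg_w = [e / (1.0 + sum(neg_exp)) for e in neg_exp]
    return pos_w, neg_w

pos_sims, neg_sims = [0.9, 0.6], [0.7, 0.55, 0.1]
c_pos, c_neg = contrastive_weights(pos_sims, neg_sims)
m_pos, m_neg = ms_weights(pos_sims, neg_sims)
```

Here the contrastive weights treat the two above-margin negatives identically, whereas the MS weights concentrate almost entirely on the hardest negative and give the less similar positive the larger weight.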

Extensions such as MSCon and SMS are directly suited to multi-task and soft-label settings:

  • MSCon dynamically balances contributions from multiple relations by uncertainty-based weights, leading to generalizable and robust embedding models (Mu et al., 2023).
  • SMS generalizes the batch mining and weighting approach to soft, real-valued label relations, appropriate for complex retrieval benchmarks (Wang et al., 2024).

These methods are suited for retrieval, classification with multiple labels, and scenarios with heterogeneous or noisy supervision.

7. Limitations and Best Practices

Key limitations and recommendations include:

  • The mining margin $\epsilon$ and the weighting sharpness parameters must be selected appropriately to ensure the presence of informative pairs and avoid degenerate gradients.
  • Batch size must be sufficient to supply positives/negatives per anchor.
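  • Large values of $\alpha$ or $\beta$ may cause numerical overflow in the exponentials; parameter tuning or a numerically stable log-sum-exp evaluation is advised, and under mixed-precision training the loss is best computed in float32, since half precision overflows sooner.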
  • In multi-task or attribute-rich settings, uncertainty-based weighting is critical to prevent noisy tasks from degrading overall representations.
  • For soft-label scenarios, the relaxation term $\tau$ prevents wasted model capacity on near-duplicate label pairs.

Adhering to these guidelines, Multi-Similarity Loss remains a robust and adaptable family for high-performance metric learning across domains (Wang et al., 2019, Mu et al., 2023, Wang et al., 2024).
