Global Supervised Contrastive Loss
- GSupCon is an embedding learning objective that leverages a global memory bank to overcome batch-size limitations and enhance dataset-wide discrimination.
- It replaces local contrastive sampling with a global dictionary of positives and negatives, improving efficiency and representation quality.
- Empirical results show that GSupCon yields significant gains in mAP and rank-1 accuracy for re-identification and fine-grained classification tasks.
Global-Supervised Contrastive Loss (GSupCon) is an embedding learning objective that extends batch-based supervised contrastive learning to enable dataset-wide (global) discrimination during representation training. GSupCon addresses the inherent locality and batch-size dependence of classical supervised contrastive (SupCon) loss by leveraging a memory bank or global dictionary of features from the entire training set. This allows each anchor to be contrasted against true positives and negatives drawn from all available training examples, enhancing discriminative power and improving generalization, especially in large-scale identification and retrieval tasks such as vehicle re-identification, person re-identification, face recognition, and fine-grained classification (Hu et al., 2022, Kim et al., 2022, Khosla et al., 2020).
1. Motivation: From Local to Global Supervised Contrastive Learning
Standard supervised contrastive loss (SupCon) pulls together representations of samples with matching labels within a minibatch and pushes apart negative pairs, but it is intrinsically local: both positive and negative sets for an anchor are restricted to the current minibatch. This locality leads to several limitations. SupCon’s effectiveness depends on batch size since the number of available negative identities—and thus the tightness of class separation—is limited by the minibatch. Empirical and theoretical evidence indicates that increasing the number of negatives in the denominator improves generalization and class separation, but naively scaling up batch size demands prohibitively more GPU memory and can hinder convergence.
GSupCon overcomes these bottlenecks by constructing a global dictionary (i.e., a memory bank) from which the positives and negatives for each anchor are drawn across the entire training set. Each anchor is thus contrasted against every other sample, enforcing global class separation without escalating memory overhead: only the anchor feature receives gradients, while the positives and negatives in the dictionary are read-only and carry no gradient path (Hu et al., 2022).
2. Formal Mathematical Formulation
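Let $\mathcal{X} = \{x_1, \dots, x_N\}$ denote the complete set of training images with labels $y_1, \dots, y_N$, let $z_i$ be the anchor feature computed by the network for a sample $x_i$, and let $v_j$ be the (normalized) feature stored for image $x_j$ in the global dictionary $\mathcal{M}$. Define the positive set for anchor $i$ as $\mathcal{P}(i) = \{ j \neq i : y_j = y_i \}$ and the negative set as $\mathcal{N}(i) = \{ j : y_j \neq y_i \}$.
The GSupCon loss per batch $\mathcal{B}$ is:
$$\mathcal{L}_{\text{GSupCon}} = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \frac{-1}{|\mathcal{P}(i)|} \sum_{p \in \mathcal{P}(i)} \log \frac{\exp(z_i \cdot v_p / \tau)}{\sum_{a \in \mathcal{P}(i) \cup \mathcal{N}(i)} \exp(z_i \cdot v_a / \tau)}$$
where $\tau$ is the temperature hyperparameter. Only $z_i$ receives gradients; the features $v_j$ in the dictionary are updated using a momentum-based moving average (Hu et al., 2022).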
This formulation replaces batch-limited positive and negative sets with global populations, enforcing that every anchor is simultaneously repelled from all global negatives and attracted to all global positives.
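The loss described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `gsupcon_loss`, its argument layout, and the default `tau=0.1` are assumptions made for illustration, and the anchor's own dictionary entry is excluded from both the positives and the denominator, following the usual SupCon convention.

```python
import numpy as np

def gsupcon_loss(z, idx, labels_bank, bank, tau=0.1):
    """Mean global supervised contrastive loss for a batch of anchors.

    z:           (B, d) L2-normalized anchor features
    idx:         (B,)   dataset index of each anchor
    labels_bank: (N,)   class label of every dictionary entry
    bank:        (N, d) L2-normalized stored features (treated as constants)
    tau:         temperature; 0.1 is an illustrative default, not a source value
    """
    B = z.shape[0]
    logits = z @ bank.T / tau                          # (B, N) scaled similarities
    self_mask = np.zeros(logits.shape, dtype=bool)
    self_mask[np.arange(B), idx] = True
    logits = np.where(self_mask, -np.inf, logits)      # drop each anchor's own entry

    # Numerically stable log-softmax over all remaining dictionary entries.
    mx = logits.max(axis=1, keepdims=True)
    log_prob = logits - mx - np.log(np.exp(logits - mx).sum(axis=1, keepdims=True))

    # Global positives: same label as the anchor, excluding the anchor itself.
    pos = (labels_bank[None, :] == labels_bank[idx][:, None]) & ~self_mask
    n_pos = pos.sum(axis=1)
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1) / np.maximum(n_pos, 1)
    return per_anchor[n_pos > 0].mean()                # skip anchors with no positives
```

In an autograd framework the same computation would be applied with the dictionary detached from the graph, so that only the anchor features `z` receive gradients.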
3. Training Procedure and Dictionary Management
GSupCon requires maintaining a global feature dictionary $\mathcal{M}$:
- Initialization: $\mathcal{M}$ is seeded by running a few epochs with standard SupCon or by one forward pass over all training images.
- At each iteration:
  - Compute features $z_i$ for minibatch $\mathcal{B}$ of size $B$.
  - For each anchor $z_i$, retrieve all positives and negatives from $\mathcal{M}$ and compute $\mathcal{L}_{\text{GSupCon}}$ with global contrast.
  - Backpropagate; only the anchor $z_i$ carries gradient.
  - Update $v_i \leftarrow m\, v_i + (1 - m)\, z_i$ (optionally normalize), where $v_i$ is the dictionary entry for sample $i$ and $m$ is the momentum coefficient.
Dictionary entries are kept outside the gradient tape, minimizing memory use to $O(Nd)$ for the dictionary ($N$ entries of dimension $d$) and $O(Bd)$ per batch for backpropagation (Hu et al., 2022, Khosla et al., 2020).
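The dictionary bookkeeping in the procedure above might look like the following NumPy sketch. The helper names (`l2_normalize`, `init_bank`, `momentum_update`), the `encode` callable standing in for the network's forward pass, and the default `m=0.9` are all hypothetical choices for illustration, and the update direction (weight `m` on the old entry) is an assumed convention.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def init_bank(encode, images, batch_size=256):
    """Seed the dictionary with one forward pass over all training images.

    `encode` is a stand-in for the network: it maps a batch of images
    to a (batch, d) array of features.
    """
    feats = [encode(images[s:s + batch_size])
             for s in range(0, len(images), batch_size)]
    return l2_normalize(np.concatenate(feats, axis=0))

def momentum_update(bank, idx, z, m=0.9):
    """In-place update v_i <- m * v_i + (1 - m) * z_i, followed by renormalization.

    `z` must already be detached from any gradient computation.
    If `idx` contains duplicates, only the last write per index is kept.
    """
    bank[idx] = l2_normalize(m * bank[idx] + (1.0 - m) * z)
    return bank
```

Only the rows indexed by the current minibatch are touched, so the per-iteration cost of the update is proportional to the batch size rather than to the dictionary size.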
4. Key Hyperparameters and Practical Considerations
The effectiveness and efficiency of GSupCon depend on several core hyperparameters:
- Temperature $\tau$: Controls softmax sharpness; lower values sharpen the distribution.
- Batch size $B$: Primarily impacts update frequency rather than negative pool size; moderate values (e.g., 64) suffice.
- Memory bank momentum $m$: Sets the update rate for the dictionary.
- Loss weight $\lambda$: When combined with cross-entropy or other losses, set $\lambda$ to balance gradient magnitudes.
- Denominator efficiency: The denominator requires a sum across all $N$ dictionary entries; practical speedups include negative subsampling (e.g., 10k–50k per update) or nearest-neighbor focus.
For large datasets, dictionary storage is tractable: $N$ float32 features of dimension $d$ take $4Nd$ bytes, so the ≈2.4 GB figure corresponds to $Nd \approx 6 \times 10^8$ (e.g., about 300k images with $d = 2048$). Storage may be further reduced by quantization or per-class centroids (Hu et al., 2022).
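The storage estimate can be checked with a few lines of arithmetic, and the per-class centroid reduction is simple to sketch. The 300k × 2048 configuration below is an assumed example chosen because it reproduces the ≈2.4 GB figure; `bank_bytes` and `class_centroids` are hypothetical helper names, and half precision is shown only as one simple form of reduced-precision storage.

```python
import numpy as np

def bank_bytes(n_entries, dim, dtype=np.float32):
    """Bytes needed to store n_entries feature vectors of length dim."""
    return n_entries * dim * np.dtype(dtype).itemsize

def class_centroids(bank, labels, n_classes):
    """Collapse the dictionary to one L2-normalized mean vector per class."""
    sums = np.zeros((n_classes, bank.shape[1]), dtype=np.float64)
    np.add.at(sums, labels, bank)                       # accumulate features per class
    counts = np.bincount(labels, minlength=n_classes)[:, None]
    cent = sums / np.maximum(counts, 1)                 # avoid division by zero
    norms = np.linalg.norm(cent, axis=1, keepdims=True)
    return cent / np.maximum(norms, 1e-12)

# Assumed example sizes: 300k images, 2048-dimensional features.
print(bank_bytes(300_000, 2048) / 1e9, "GB in float32")
print(bank_bytes(300_000, 2048, np.float16) / 1e9, "GB in float16")
```

With centroids, the dictionary shrinks from one row per image to one row per identity, at the cost of no longer contrasting against individual samples.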
5. Empirical Evaluations and Discriminative Impact
GSupCon has demonstrated its efficacy on vehicle re-identification benchmarks of varying scale:
- VeRi-776 (576 train IDs): SupCon + GSupCon achieves a higher mAP than SupCon alone.
- VehicleID (13,164 train IDs): GSupCon improves rank-1 accuracy over SupCon, including in the combined-loss setting.
- VERI_Wild (30,671 train IDs): GSupCon surpasses SupCon in mAP.
Visualization of embedding spaces shows markedly tighter intra-class clusters and broader inter-class separations. Retrieval rankings for difficult positive pairs (e.g., extreme viewpoints) are significantly improved under GSupCon (Hu et al., 2022). On image classification, the global variant provides absolute improvements on standard metrics over local SupCon (Khosla et al., 2020).
6. Computational Complexity and Scalability
The per-batch time complexity is naively $O(BNd)$ for batch size $B$, dictionary size $N$, and feature dimension $d$, due to the global denominator. Subsampling negatives or approximating via nearest-neighbor search is the typical remedy. The memory bank eliminates the need for backpropagation through non-anchor features, so backprop memory scales only as $O(Bd)$. The approach scales gracefully to very large datasets, trading denominator computation for significant gains in representation quality and generalization, particularly when $N \gg$ batch size (Hu et al., 2022, Khosla et al., 2020).
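Negative subsampling, the remedy mentioned above, can be sketched for a single anchor as follows. The function name `gsupcon_subsampled`, the uniform sampling scheme, and the defaults `tau=0.1` and `n_neg=10_000` are assumptions for illustration; other sampling strategies (for example, nearest-neighbor selection) are equally possible.

```python
import numpy as np

def gsupcon_subsampled(z_i, i, labels, bank, tau=0.1, n_neg=10_000, rng=None):
    """Single-anchor loss with all global positives and a random subset of negatives.

    z_i:    (d,)   L2-normalized anchor feature
    i:      int    dataset index of the anchor
    labels: (N,)   class label of every dictionary entry
    bank:   (N, d) L2-normalized stored features
    """
    rng = np.random.default_rng() if rng is None else rng
    same = labels == labels[i]
    pos = np.flatnonzero(same)
    pos = pos[pos != i]                          # exclude the anchor's own entry
    if len(pos) == 0:
        return 0.0                               # no positives: nothing to attract
    neg = np.flatnonzero(~same)
    if len(neg) > n_neg:                         # uniform subsample of negatives
        neg = rng.choice(neg, size=n_neg, replace=False)
    keep = np.concatenate([pos, neg])
    logits = bank[keep] @ z_i / tau              # cost O((|pos| + n_neg) * d)
    mx = logits.max()
    log_denom = mx + np.log(np.exp(logits - mx).sum())
    return -(logits[:len(pos)] - log_denom).mean()
```

Because every dropped negative removes a positive term from the denominator, the subsampled value is a lower bound on the full-dictionary loss for that anchor; it coincides with the full loss whenever `n_neg` is at least the number of available negatives.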
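| Parameter | Symbol / Typical Value | Role |
|---|---|---|
| Temperature | $\tau$ | Distribution sharpness |
| Batch size | $B$ (e.g., 64) | Anchors per update; sets update frequency |
| Momentum | $m$ | Update rate of dictionary entries |
| Dictionary size | $N \times d$ | All features in train set |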
7. Extensions and Applications Beyond Vehicle ReID
GSupCon is domain-agnostic and compatible with any architecture or embedding paradigm relying on large-scale negative sampling. Identified application domains include:
- Person re-identification at city-scale: Gallery with millions of IDs.
- Face recognition in unconstrained settings: Maintains a global celebrity embedding bank.
- Fine-grained image retrieval: Birds, cars, products where inter-class variations are minimal.
- Semi-supervised/self-supervised learning: Mix supervised GSupCon with unsupervised momentum contrast for unlabeled data.
- Cross-modal contrastive learning: Build joint global dictionaries for image–text or other modalities.
- Few-shot classification: Use a global memory on a base dataset, adapt using GSupCon to imprint novel classes.
The central property enabling these applications is exhaustive, dataset-wide discrimination at each update, yielding far better manifold separation and embedding robustness as dataset scale increases (Hu et al., 2022).