Supervised Contrastive Learning (2004.11362v5)

Published 23 Apr 2020 in cs.LG, cs.CV, and stat.ML

Abstract: Contrastive learning applied to self-supervised representation learning has seen a resurgence in recent years, leading to state of the art performance in the unsupervised training of deep image models. Modern batch contrastive approaches subsume or significantly outperform traditional contrastive losses such as triplet, max-margin and the N-pairs loss. In this work, we extend the self-supervised batch contrastive approach to the fully-supervised setting, allowing us to effectively leverage label information. Clusters of points belonging to the same class are pulled together in embedding space, while simultaneously pushing apart clusters of samples from different classes. We analyze two possible versions of the supervised contrastive (SupCon) loss, identifying the best-performing formulation of the loss. On ResNet-200, we achieve top-1 accuracy of 81.4% on the ImageNet dataset, which is 0.8% above the best number reported for this architecture. We show consistent outperformance over cross-entropy on other datasets and two ResNet variants. The loss shows benefits for robustness to natural corruptions and is more stable to hyperparameter settings such as optimizers and data augmentations. Our loss function is simple to implement, and reference TensorFlow code is released at https://t.ly/supcon.

Connections Between Supervised Contrastive Loss, Cross Entropy Loss, Label Smoothing, and Noise Contrastive Loss

The paper "Connections between supervised contrastive loss, cross entropy loss, label smoothing and noise contrastive loss" examines the relationships between different loss functions commonly used in machine learning, focusing on the theoretical underpinnings that link them. The primary goal of this work is to scrutinize how contrastive loss, cross-entropy loss, label smoothing, and noise contrastive estimation (NCE) are interconnected.

Overview

The paper starts by defining the basics of contrastive loss, where the loss function aims to bring similar data points closer and push dissimilar ones apart in the embedding space. It introduces the multi-positive, multi-negative contrastive loss, formalized as:

L_c = - \sum_i \sum_{p_i} \log \frac{e^{z_i^T z_{p_i}/\tau}}{e^{z_i^T z_{p_i}/\tau} + \sum_j e^{z_i^T z_{n_i^j}/\tau}}

Here, z_i, z_{p_i}, and z_{n_i^j} are the embedding vectors of the anchor sample, its positives, and its negatives, respectively, and \tau is the temperature parameter.
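As a concrete illustration, the following NumPy sketch evaluates this multi-positive, multi-negative loss directly on a batch of embeddings and integer labels, treating all same-label samples as positives. The function name, batch construction, and the default temperature of 0.1 are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def multi_positive_contrastive_loss(z, labels, tau=0.1):
    """Sketch of L_c: every same-label sample is a positive, every
    different-label sample is a negative (illustrative, not the paper's code).

    z      : (N, D) array of embeddings (normalized to unit length below).
    labels : (N,) integer class labels.
    tau    : temperature parameter.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-norm embeddings
    sim = z @ z.T / tau                                # pairwise z_i^T z_j / tau
    n = len(labels)
    loss = 0.0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        negatives = [q for q in range(n) if labels[q] != labels[i]]
        for p in positives:
            # denominator: the anchor-positive pair plus all negatives of anchor i
            denom = np.exp(sim[i, p]) + np.exp(sim[i, negatives]).sum()
            loss -= np.log(np.exp(sim[i, p]) / denom)
    return loss

# Example: six random embeddings, two per class
rng = np.random.default_rng(0)
z = rng.normal(size=(6, 8))
labels = np.array([0, 0, 1, 1, 2, 2])
print(multi_positive_contrastive_loss(z, labels))
```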

Cross Entropy Loss as a Special Case

In the specific scenario where there is a single positive and C-1 negatives for C classes, the paper reformulates the contrastive loss to show its equivalence with the cross-entropy loss. Specifically, the cross-entropy loss L_{ce} is derived from:

L_x = - \sum_i \log \frac{e^{z_i^T y_{p_i}/\tau}}{e^{z_i^T y_{p_i}/\tau} + \sum_j e^{z_i^T y_{n_i^j}/\tau}}

which simplifies further to the weighted cross-entropy form:

L_{ce} = - \sum_i \sum_c \alpha_i^c \log \frac{e^{z_i^c/\tau}}{\sum_{c'} e^{z_i^{c'}/\tau}}

This equivalence provides a foundation for a mutual information (MI) interpretation of contrastive losses, in which data samples and labels are viewed as two views of the same underlying information.
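The reduction is easy to see concretely: when the single positive is the true-class prototype y_{p_i} and the C-1 negatives are the remaining class prototypes, the denominator becomes the full softmax normalizer over the logits z_i^T y_c / \tau. The sketch below makes this explicit; the prototype matrix Y and the helper name are assumptions for illustration.

```python
import numpy as np

def contrastive_as_cross_entropy(z, Y, labels, tau=0.1):
    """Single positive (true-class prototype) and C-1 negatives (the other
    prototypes): the contrastive loss L_x becomes softmax cross-entropy.

    z : (N, D) embeddings, Y : (C, D) class prototype vectors, labels : (N,).
    """
    logits = z @ Y.T / tau                          # (N, C): z_i^T y_c / tau
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].sum()
```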

Label Smoothing Interpretation

When examining label smoothing within the cross-entropy framework, the analysis derives an upper bound based on a special form of contrastive loss. By pulling \alpha_i^c back inside the logarithm and using \alpha_i^c \leq 1, the label-smoothed cross-entropy loss L_{cels} can be written with a specific temperature adjustment for each sample/class pair. In effect, label smoothing becomes a variant of the contrastive loss with multiple class-specific temperatures and multiple positives per sample.
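A minimal sketch of the weighted form appears below, assuming the standard smoothing weights \alpha_i^c = 1 - \epsilon for the true class and \epsilon/(C-1) for the others; the specific weight assignment and the function name are illustrative choices, not taken from the paper.

```python
import numpy as np

def label_smoothed_cross_entropy(logits, labels, tau=1.0, eps=0.1):
    """Label smoothing as the weighted cross-entropy L_ce with per-class
    weights alpha_i^c (assumed: 1 - eps for the true class, eps/(C-1) else).
    """
    n, c = logits.shape
    alpha = np.full((n, c), eps / (c - 1))
    alpha[np.arange(n), labels] = 1.0 - eps
    scaled = logits / tau
    scaled -= scaled.max(axis=1, keepdims=True)     # numerical stability
    log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
    return -(alpha * log_probs).sum()
```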

Noise Contrastive Estimation (NCE)

The paper also treats NCE as a special case of cross-entropy loss with only two classes, one for data and one for noise. This unifies the various losses under a common framework, facilitating an MI-based interpretation that accounts for both the data and noise distributions.
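To make the two-class view concrete, the sketch below writes NCE as binary cross-entropy over data-versus-noise scores. It assumes the scores are the classifier's logits (in practice these are typically log-density ratios against the noise distribution), and the function name is illustrative.

```python
import numpy as np

def nce_as_binary_cross_entropy(data_scores, noise_scores):
    """NCE viewed as two-class cross-entropy: class 1 = data, class 0 = noise.

    data_scores, noise_scores : 1-D arrays of classifier logits s(x)
    (assumed here; in NCE these are typically log p_model(x) - log p_noise(x)).
    """
    def log_sigmoid(s):
        return -np.logaddexp(0.0, -s)   # numerically stable log(sigma(s))
    # data samples should be classified as "data", noise samples as "noise"
    return -(log_sigmoid(data_scores).sum() + log_sigmoid(-noise_scores).sum())
```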

Key Points of Analysis

The paper presents several questions that probe deeper into the exploration of these loss functions:

  1. Optimal Form of Contrastive Loss: It suggests a need to investigate the optimal configurations for choosing positive and negative vectors, per-sample temperature scaling, and normalization. Analyzing these choices involves careful theoretical assessment and empirical validation.
  2. Incorporation of Noise: The inclusion of noise as additional negatives necessitates a refined MI interpretation. This scenario enriches the contrastive loss framework by combining data and noise distributions.
  3. Cross-Entropy and Contrastive Loss Analysis: Techniques developed for analyzing cross-entropy may prove beneficial in understanding contrastive losses and vice versa. This cross-pollination of analytical techniques can lead to new insights.

Implications and Future Work

Recognizing the interplay between these loss functions carries significant implications for both theoretical machine learning and practical applications. It suggests that the development and improvement of one type of loss function can inform the enhancement of others. For future work, examining the scenarios where these loss functions converge or diverge in their effectiveness will be essential. Additionally, investigating the impact of different configurations and parameters empirically could offer more robust guidelines for deploying these loss functions in various machine learning tasks.

In summary, this paper provides a comprehensive theoretical framework that bridges several key loss functions, potentially paving the way for more unified and efficient learning paradigms.

Authors (9)
  1. Prannay Khosla (2 papers)
  2. Piotr Teterwak (16 papers)
  3. Chen Wang (600 papers)
  4. Aaron Sarna (10 papers)
  5. Yonglong Tian (32 papers)
  6. Phillip Isola (84 papers)
  7. Aaron Maschinot (5 papers)
  8. Ce Liu (51 papers)
  9. Dilip Krishnan (36 papers)
Citations (3,938)