
Supervised Contrastive Learning

Published 23 Apr 2020 in cs.LG, cs.CV, and stat.ML | (2004.11362v5)

Abstract: Contrastive learning applied to self-supervised representation learning has seen a resurgence in recent years, leading to state of the art performance in the unsupervised training of deep image models. Modern batch contrastive approaches subsume or significantly outperform traditional contrastive losses such as triplet, max-margin and the N-pairs loss. In this work, we extend the self-supervised batch contrastive approach to the fully-supervised setting, allowing us to effectively leverage label information. Clusters of points belonging to the same class are pulled together in embedding space, while simultaneously pushing apart clusters of samples from different classes. We analyze two possible versions of the supervised contrastive (SupCon) loss, identifying the best-performing formulation of the loss. On ResNet-200, we achieve top-1 accuracy of 81.4% on the ImageNet dataset, which is 0.8% above the best number reported for this architecture. We show consistent outperformance over cross-entropy on other datasets and two ResNet variants. The loss shows benefits for robustness to natural corruptions and is more stable to hyperparameter settings such as optimizers and data augmentations. Our loss function is simple to implement, and reference TensorFlow code is released at https://t.ly/supcon.

Summary

  • The paper demonstrates the equivalence between contrastive loss and cross entropy loss in scenarios with one positive and multiple negatives, laying a theoretical foundation for these methods.
  • It reinterprets label smoothing as a variant of contrastive loss by integrating class-specific temperature adjustments, offering a novel perspective on regularization.
  • By unifying supervised contrastive, cross entropy, and noise contrastive losses, the study provides actionable insights for configuring loss functions to optimize model training.

Connections Between Supervised Contrastive Loss, Cross Entropy Loss, Label Smoothing, and Noise Contrastive Loss

The paper "Connections between supervised contrastive loss, cross entropy loss, label smoothing and noise contrastive loss" examines the relationships between different loss functions commonly used in machine learning, focusing on the theoretical underpinnings that link them. The primary goal of this work is to scrutinize how contrastive loss, cross-entropy loss, label smoothing, and noise contrastive estimation (NCE) are interconnected.

Overview

The paper starts by defining the basics of contrastive loss, where the loss function aims to bring similar data points closer and push dissimilar ones apart in the embedding space. It introduces the multi-positive, multi-negative contrastive loss, formalized as:

$$L_c = - \sum_i \sum_{p_i} \log \frac{e^{z_i^T z_{p_i}/\tau}}{e^{z_i^T z_{p_i}/\tau} + \sum_j e^{z_i^T z_{n_i^j}/\tau}}$$

Here, $z_i$, $z_{p_i}$, and $z_{n_i^j}$ are the embedding vectors of the sample, its positives, and its negatives, respectively, and $\tau$ is the temperature parameter.
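
A minimal NumPy sketch of this loss may help make the summation structure concrete. The function name, the L2-normalization step, and the averaging over positive terms are illustrative choices, not taken from the paper's reference code:

```python
import numpy as np

def contrastive_loss(z, labels, tau=0.1):
    """Multi-positive, multi-negative contrastive loss L_c (illustrative sketch).

    z: (N, d) array of embeddings, labels: (N,) integer class ids.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalize embeddings
    sim = z @ z.T / tau                               # pairwise z_i^T z_j / tau
    idx = np.arange(len(z))
    loss, n_terms = 0.0, 0
    for i in idx:
        positives = idx[(labels == labels[i]) & (idx != i)]
        negatives = idx[labels != labels[i]]
        denom_neg = np.exp(sim[i, negatives]).sum()   # sum_j exp(z_i^T z_{n_i^j} / tau)
        for p in positives:                           # one log-term per positive p_i
            num = np.exp(sim[i, p])
            loss -= np.log(num / (num + denom_neg))
            n_terms += 1
    return loss / max(n_terms, 1)

# Toy usage: four embeddings from two classes
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
print(contrastive_loss(z, np.array([0, 0, 1, 1])))
```

Note that, matching the formula above, the denominator for each positive contains only that positive and the negatives, not the other positives.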

Cross Entropy Loss as a Special Case

In the specific scenario where there is a single positive and $C-1$ negatives for $C$ classes, the paper reformulates the contrastive loss to show its equivalence with the cross-entropy loss. Specifically, the cross-entropy loss $L_{ce}$ is derived from:

$$L_x = - \sum_i \log \frac{e^{z_i^T y_{p_i}/\tau}}{e^{z_i^T y_{p_i}/\tau} + \sum_j e^{z_i^T y_{n_i^j}/\tau}}$$

And it simplifies further to:

$$L_{ce} = - \sum_i \sum_c \alpha_i^c \log \frac{e^{z_i^c/\tau}}{\sum_{c'} e^{z_i^{c'}/\tau}}$$

This equivalence provides a foundation for interpreting mutual information (MI) based on contrastive losses, with the data samples and labels viewed as different perspectives of the underlying information.
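
A small numerical check illustrates the reduction. Treating one class weight vector as the single positive $y_{p_i}$ and the remaining $C-1$ as negatives, the contrastive form coincides with softmax cross entropy over the logits $z_i^T y_c / \tau$. The shapes and random data below are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, C, tau = 5, 16, 4, 0.5
z = rng.normal(size=(N, d))          # sample embeddings z_i
Y = rng.normal(size=(C, d))          # class weight vectors y_c
labels = rng.integers(0, C, size=N)

logits = z @ Y.T / tau               # z_i^T y_c / tau, shape (N, C)

# Contrastive form: single positive = true class weight, negatives = the C-1 others,
# so the denominator e^{pos} + sum_j e^{neg_j} is just the sum over all classes.
contrastive = -np.log(
    np.exp(logits[np.arange(N), labels]) / np.exp(logits).sum(axis=1)
).sum()

# Standard softmax cross entropy over the same logits
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
cross_entropy = -log_probs[np.arange(N), labels].sum()

print(np.allclose(contrastive, cross_entropy))  # True
```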

Label Smoothing Interpretation

When examining label smoothing within the cross-entropy framework, the analysis moves towards defining an upper bound based on a special form of contrastive loss. By pulling $\alpha_i^c$ back inside the logarithm and assuming $\alpha_i^c \leq 1$, the label-smoothed cross-entropy loss $L_{cels}$ can be defined with specific temperature adjustments for each sample/class pair. Effectively, this portrays label smoothing as a variant where multiple class-specific temperatures and multiple positives per sample are integrated.
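
For reference, the sketch below shows a common label-smoothing formulation in which the weights $\alpha_i^c$ place mass $1-\epsilon$ on the true class and spread $\epsilon$ over the remaining classes (the smoothing factor $\epsilon$ and the function name are placeholders, not the paper's notation). The paper's reinterpretation then pulls these weights inside the logarithm as class-specific temperatures:

```python
import numpy as np

def label_smoothed_ce(logits, labels, eps=0.1, tau=1.0):
    """Label-smoothed cross entropy with smoothing weights alpha_i^c (sketch)."""
    N, C = logits.shape
    # alpha_i^c: mass 1 - eps on the true class, eps / (C - 1) on every other class
    alpha = np.full((N, C), eps / (C - 1))
    alpha[np.arange(N), labels] = 1.0 - eps
    log_probs = logits / tau - np.log(np.exp(logits / tau).sum(axis=1, keepdims=True))
    # eps = 0 recovers the plain cross entropy L_ce above
    return -(alpha * log_probs).sum()
```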

Noise Contrastive Estimation (NCE)

The paper also touches upon NCE as a specialized case of cross-entropy loss where there are only two classes (one for data and another for noise). This unifies various forms of losses under a common framework, facilitating an MI-based interpretation that builds on both data and noise distributions.
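
A minimal sketch of this two-class view: a binary classifier scores data samples against noise samples, and the loss is ordinary two-class cross entropy. The score inputs are placeholders, and the noise-density correction term used in full NCE is omitted for brevity:

```python
import numpy as np

def nce_loss(scores_data, scores_noise):
    """NCE viewed as two-class cross entropy: label 1 for data samples, 0 for noise.

    scores_*: unnormalized scores s(x); the log(k * p_noise(x)) correction of
    full NCE is left out to keep the sketch minimal.
    """
    sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
    return -(np.log(sigmoid(scores_data)).sum()
             + np.log(1.0 - sigmoid(scores_noise)).sum())

# Toy usage: 8 data scores vs. 16 noise scores
rng = np.random.default_rng(0)
print(nce_loss(rng.normal(1.0, 1.0, size=8), rng.normal(-1.0, 1.0, size=16)))
```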

Key Points of Analysis

The paper presents several questions that probe deeper into the exploration of these loss functions:

  1. Optimal Form of Contrastive Loss: It suggests a need to investigate the optimal configurations for choosing positive and negative vectors, per-sample temperature scaling, and normalization. Analyzing these choices involves careful theoretical assessment and empirical validation.
  2. Incorporation of Noise: The inclusion of noise as additional negatives necessitates a refined MI interpretation. This scenario enriches the contrastive loss framework by combining data and noise distributions.
  3. Cross-Entropy and Contrastive Loss Analysis: Techniques developed for analyzing cross-entropy may prove beneficial in understanding contrastive losses and vice versa. This cross-pollination of analytical techniques can lead to new insights.

Implications and Future Work

Recognizing the interplay between these loss functions carries significant implications for both theoretical machine learning and practical applications. It suggests that the development and improvement of one type of loss function can inform the enhancement of others. For future work, examining the scenarios where these loss functions converge or diverge in their effectiveness will be essential. Additionally, investigating the impact of different configurations and parameters empirically could offer more robust guidelines for deploying these loss functions in various machine learning tasks.

In summary, this paper provides a comprehensive theoretical framework that bridges several key loss functions, potentially paving the way for more unified and efficient learning paradigms.
