Supervised Contrastive Learning (2004.11362v5)

Published 23 Apr 2020 in cs.LG, cs.CV, and stat.ML

Abstract: Contrastive learning applied to self-supervised representation learning has seen a resurgence in recent years, leading to state of the art performance in the unsupervised training of deep image models. Modern batch contrastive approaches subsume or significantly outperform traditional contrastive losses such as triplet, max-margin and the N-pairs loss. In this work, we extend the self-supervised batch contrastive approach to the fully-supervised setting, allowing us to effectively leverage label information. Clusters of points belonging to the same class are pulled together in embedding space, while simultaneously pushing apart clusters of samples from different classes. We analyze two possible versions of the supervised contrastive (SupCon) loss, identifying the best-performing formulation of the loss. On ResNet-200, we achieve top-1 accuracy of 81.4% on the ImageNet dataset, which is 0.8% above the best number reported for this architecture. We show consistent outperformance over cross-entropy on other datasets and two ResNet variants. The loss shows benefits for robustness to natural corruptions and is more stable to hyperparameter settings such as optimizers and data augmentations. Our loss function is simple to implement, and reference TensorFlow code is released at https://t.ly/supcon.

Connections Between Supervised Contrastive Loss, Cross Entropy Loss, Label Smoothing, and Noise Contrastive Loss

The paper "Connections between supervised contrastive loss, cross entropy loss, label smoothing and noise contrastive loss" examines the relationships between different loss functions commonly used in machine learning, focusing on the theoretical underpinnings that link them. The primary goal of this work is to scrutinize how contrastive loss, cross-entropy loss, label smoothing, and noise contrastive estimation (NCE) are interconnected.

Overview

The paper starts by defining the basics of contrastive loss, where the loss function aims to bring similar data points closer and push dissimilar ones apart in the embedding space. It introduces the multi-positive, multi-negative contrastive loss, formalized as:

L_c = - \sum_i \sum_{p_i} \log \frac{e^{z_i^T z_{p_i}/\tau}}{e^{z_i^T z_{p_i}/\tau} + \sum_j e^{z_i^T z_{n_i^j}/\tau}}

Here, z_i, z_{p_i}, and z_{n_i^j} are the embedding vectors of the anchor sample, its positives, and its negatives, respectively, and \tau is the temperature parameter.
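As a concrete illustration, the following NumPy sketch evaluates this multi-positive, multi-negative loss directly on a batch of embeddings and integer labels, treating all same-label samples as positives. The function name, batch construction, and the default temperature of 0.1 are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def multi_positive_contrastive_loss(z, labels, tau=0.1):
    """Sketch of L_c: every same-label sample is a positive, every
    different-label sample is a negative (illustrative, not the paper's code).

    z      : (N, D) array of embeddings (normalized to unit length below).
    labels : (N,) integer class labels.
    tau    : temperature parameter.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-norm embeddings
    sim = z @ z.T / tau                                # pairwise z_i^T z_j / tau
    n = len(labels)
    loss = 0.0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        negatives = [q for q in range(n) if labels[q] != labels[i]]
        for p in positives:
            # denominator: the anchor-positive pair plus all negatives of anchor i
            denom = np.exp(sim[i, p]) + np.exp(sim[i, negatives]).sum()
            loss -= np.log(np.exp(sim[i, p]) / denom)
    return loss

# Example: six random embeddings, two per class
rng = np.random.default_rng(0)
z = rng.normal(size=(6, 8))
labels = np.array([0, 0, 1, 1, 2, 2])
print(multi_positive_contrastive_loss(z, labels))
```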

Cross Entropy Loss as a Special Case

In the specific scenario where there is a single positive and C-1 negatives for C classes, the paper reformulates the contrastive loss to show its equivalence with the cross-entropy loss. Specifically, the cross-entropy loss L_{ce} is derived from:

L_x = - \sum_i \log \frac{e^{z_i^T y_{p_i}/\tau}}{e^{z_i^T y_{p_i}/\tau} + \sum_j e^{z_i^T y_{n_i^j}/\tau}}

which simplifies further to the weighted cross-entropy form:

L_{ce} = - \sum_i \sum_c \alpha_i^c \log \frac{e^{z_i^c/\tau}}{\sum_{c'} e^{z_i^{c'}/\tau}}

This equivalence provides a foundation for a mutual information (MI) interpretation of contrastive losses, in which data samples and labels are viewed as two views of the same underlying information.
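The reduction is easy to see concretely: when the single positive is the true-class prototype y_{p_i} and the C-1 negatives are the remaining class prototypes, the denominator becomes the full softmax normalizer over the logits z_i^T y_c / \tau. The sketch below makes this explicit; the prototype matrix Y and the helper name are assumptions for illustration.

```python
import numpy as np

def contrastive_as_cross_entropy(z, Y, labels, tau=0.1):
    """Single positive (true-class prototype) and C-1 negatives (the other
    prototypes): the contrastive loss L_x becomes softmax cross-entropy.

    z : (N, D) embeddings, Y : (C, D) class prototype vectors, labels : (N,).
    """
    logits = z @ Y.T / tau                          # (N, C): z_i^T y_c / tau
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].sum()
```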

Label Smoothing Interpretation

When examining label smoothing within the cross-entropy framework, the analysis derives an upper bound based on a special form of contrastive loss. By pulling \alpha_i^c back inside the logarithm and using \alpha_i^c \leq 1, the label-smoothed cross-entropy loss L_{cels} can be written with a specific temperature adjustment for each sample/class pair. In effect, label smoothing becomes a variant of the contrastive loss with multiple class-specific temperatures and multiple positives per sample.
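A minimal sketch of the weighted form appears below, assuming the standard smoothing weights \alpha_i^c = 1 - \epsilon for the true class and \epsilon/(C-1) for the others; the specific weight assignment and the function name are illustrative choices, not taken from the paper.

```python
import numpy as np

def label_smoothed_cross_entropy(logits, labels, tau=1.0, eps=0.1):
    """Label smoothing as the weighted cross-entropy L_ce with per-class
    weights alpha_i^c (assumed: 1 - eps for the true class, eps/(C-1) else).
    """
    n, c = logits.shape
    alpha = np.full((n, c), eps / (c - 1))
    alpha[np.arange(n), labels] = 1.0 - eps
    scaled = logits / tau
    scaled -= scaled.max(axis=1, keepdims=True)     # numerical stability
    log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
    return -(alpha * log_probs).sum()
```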

Noise Contrastive Estimation (NCE)

The paper also treats NCE as a special case of cross-entropy loss with only two classes, one for data and one for noise. This unifies the various losses under a common framework, facilitating an MI-based interpretation that accounts for both the data and noise distributions.
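To make the two-class view concrete, the sketch below writes NCE as binary cross-entropy over data-versus-noise scores. It assumes the scores are the classifier's logits (in practice these are typically log-density ratios against the noise distribution), and the function name is illustrative.

```python
import numpy as np

def nce_as_binary_cross_entropy(data_scores, noise_scores):
    """NCE viewed as two-class cross-entropy: class 1 = data, class 0 = noise.

    data_scores, noise_scores : 1-D arrays of classifier logits s(x)
    (assumed here; in NCE these are typically log p_model(x) - log p_noise(x)).
    """
    def log_sigmoid(s):
        return -np.logaddexp(0.0, -s)   # numerically stable log(sigma(s))
    # data samples should be classified as "data", noise samples as "noise"
    return -(log_sigmoid(data_scores).sum() + log_sigmoid(-noise_scores).sum())
```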

Key Points of Analysis

The paper presents several questions that probe deeper into the exploration of these loss functions:

  1. Optimal Form of Contrastive Loss: It suggests a need to investigate the optimal configurations for choosing positive and negative vectors, per-sample temperature scaling, and normalization. Analyzing these choices involves careful theoretical assessment and empirical validation.
  2. Incorporation of Noise: The inclusion of noise as additional negatives necessitates a refined MI interpretation. This scenario enriches the contrastive loss framework by combining data and noise distributions.
  3. Cross-Entropy and Contrastive Loss Analysis: Techniques developed for analyzing cross-entropy may prove beneficial in understanding contrastive losses and vice versa. This cross-pollination of analytical techniques can lead to new insights.

Implications and Future Work

Recognizing the interplay between these loss functions carries significant implications for both theoretical machine learning and practical applications. It suggests that the development and improvement of one type of loss function can inform the enhancement of others. For future work, examining the scenarios where these loss functions converge or diverge in their effectiveness will be essential. Additionally, investigating the impact of different configurations and parameters empirically could offer more robust guidelines for deploying these loss functions in various machine learning tasks.

In summary, this paper provides a comprehensive theoretical framework that bridges several key loss functions, potentially paving the way for more unified and efficient learning paradigms.

Authors (9)
  1. Prannay Khosla (2 papers)
  2. Piotr Teterwak (16 papers)
  3. Chen Wang (600 papers)
  4. Aaron Sarna (10 papers)
  5. Yonglong Tian (32 papers)
  6. Phillip Isola (84 papers)
  7. Aaron Maschinot (5 papers)
  8. Ce Liu (51 papers)
  9. Dilip Krishnan (36 papers)
Citations (3,938)