Dissecting Supervised Contrastive Learning (2102.08817v4)

Published 17 Feb 2021 in stat.ML and cs.LG

Abstract: Minimizing cross-entropy over the softmax scores of a linear map composed with a high-capacity encoder is arguably the most popular choice for training neural networks on supervised learning tasks. However, recent works show that one can directly optimize the encoder instead, to obtain equally (or even more) discriminative representations via a supervised variant of a contrastive objective. In this work, we address the question whether there are fundamental differences in the sought-for representation geometry in the output space of the encoder at minimal loss. Specifically, we prove, under mild assumptions, that both losses attain their minimum once the representations of each class collapse to the vertices of a regular simplex, inscribed in a hypersphere. We provide empirical evidence that this configuration is attained in practice and that reaching a close-to-optimal state typically indicates good generalization performance. Yet, the two losses show remarkably different optimization behavior. The number of iterations required to perfectly fit to data scales superlinearly with the amount of randomly flipped labels for the supervised contrastive loss. This is in contrast to the approximately linear scaling previously reported for networks trained with cross-entropy.

Authors (4)
  1. Florian Graf (8 papers)
  2. Christoph D. Hofer (3 papers)
  3. Marc Niethammer (80 papers)
  4. Roland Kwitt (34 papers)
Citations (60)

Summary

This paper, "Dissecting Supervised Contrastive Learning" (Graf et al., 2021 ), investigates the fundamental differences between training neural networks for supervised classification using the standard Cross-Entropy (CE) loss and the Supervised Contrastive (SC) loss. The authors explore the geometry of learned representations and the dynamics of optimization.

Core Findings

  1. Similar Optimal Representation Geometry: The paper provides theoretical proofs (under mild assumptions like an ideal encoder, balanced classes, and norm constraints on representations/weights) that both CE and SC losses achieve their minimum when the representations of instances belonging to the same class collapse to a single point, and these points (one for each class) form the vertices of a regular simplex inscribed in a hypersphere. For the CE loss, the linear classifier weights are shown to be scalar multiples of these simplex vertices, also forming a regular simplex. This theoretical insight suggests that despite their different formulations, both losses aim for a similar highly discriminative configuration of representations at optimality. (A short numerical check of this simplex configuration follows this list.)
    • Practical Implication: This implies that regardless of whether you use CE or SC, if your model achieves a low loss and the encoder is sufficiently powerful, you should expect the learned representations to exhibit this structured, separated geometry. This geometry is beneficial for classification as classes are maximally separated relative to their variance.
  2. Different Optimization Dynamics: While the theoretical optima are similar, the paper demonstrates empirically that the two losses exhibit remarkably different optimization behaviors. Specifically, when training on data with increasing levels of random label corruption, the "time to fit" (iterations required to reach zero training error) for models trained with CE scales approximately linearly with the corruption level. In contrast, models trained with SC show a clearly superlinear increase in the time to fit. Beyond a certain corruption level, SC training fails to reach zero training error within a fixed iteration budget.
    • Practical Implication: This suggests that SC loss, due to its batch-wise formulation involving attraction and repulsion terms between pairs of samples, acts as an implicit regularizer during optimization. It is less prone to overfitting to noisy labels compared to the instance-wise CE loss. This inherent robustness is a significant practical advantage for SC, especially when dealing with datasets that may contain label noise.
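
The simplex geometry above can be sanity-checked numerically. Below is a small illustrative NumPy snippet (not from the authors' code) that constructs a regular simplex inscribed in the unit hypersphere and confirms that its vertices have pairwise cosine similarity $-1/(K-1)$, the value the paper tracks empirically.

```python
import numpy as np

K = 10  # e.g. CIFAR10

# One standard construction of a regular simplex on the unit sphere:
# take the K standard basis vectors of R^K, subtract their centroid,
# and rescale each vector to unit norm.
E = np.eye(K)
V = E - E.mean(axis=0, keepdims=True)          # center the basis vectors
V /= np.linalg.norm(V, axis=1, keepdims=True)  # project onto the unit sphere

cos = V @ V.T                                  # pairwise cosine similarities
off_diag = cos[~np.eye(K, dtype=bool)]
print(off_diag.min(), off_diag.max())          # both approx. -1/(K-1) = -0.111...
```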

Implementation and Application

  • Encoder Architecture: The paper uses a standard ResNet-18 as the encoder ($\phi_\theta$) for its experiments on CIFAR10 and CIFAR100. The output of the encoder is a high-dimensional vector (512 dimensions for ResNet-18).
  • Loss Functions:
    • Cross-Entropy (CE): The standard approach. The encoder output is fed into a linear classifier $W$, and the CE loss is computed between the softmax output and the one-hot encoded labels: $L_{CE} = -\frac{1}{N} \sum_{n=1}^N \log \frac{\exp(\mathbf{z}_n \cdot \mathbf{w}_{y_n})}{\sum_{l=1}^K \exp(\mathbf{z}_n \cdot \mathbf{w}_l)}$, where $\mathbf{z}_n = \phi_\theta(\mathbf{x}_n)$ and $\mathbf{w}_y$ is the weight vector for class $y$. L2 regularization on the weights ($+\lambda \|W\|_F^2$) is commonly added and is analyzed in the paper's theoretical section.
    • Supervised Contrastive (SC): The SC loss is applied directly to the normalized encoder outputs. The paper uses a variant where the representations are projected onto a hypersphere of radius $\rho = 1/\sqrt{\tau}$, related to the temperature $\tau$. The loss is computed over batches, pulling representations of the same class together and pushing representations of different classes apart based on cosine similarity. The core idea, as formulated by Khosla et al. [Khosla et al., 2020], is to minimize:

      $$L_{SC} = \sum_{i \in B} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\mathbf{z}_i \cdot \mathbf{z}_p / \tau)}{\sum_{a \in A(i)} \exp(\mathbf{z}_i \cdot \mathbf{z}_a / \tau)}$$

      where $B$ is a batch, $P(i)$ is the set of indices in $B$ with the same label as $i$ (excluding $i$), $A(i)$ is the set of indices in $B$ with any label (excluding $i$), $\mathbf{z} = \phi_\theta(\mathbf{x}) / \|\phi_\theta(\mathbf{x})\|_2$, and $\tau$ is a temperature hyperparameter. After training the encoder with SC, a separate linear classifier is trained on top of the frozen encoder features. (A minimal PyTorch sketch of this loss, together with the geometry metrics described below, appears after this list.)

  • Normalization: The theoretical results for both losses rely on representations being on a hypersphere or at least norm-bounded. In practice, batch normalization is commonly used in network architectures like ResNet-18, which can implicitly normalize the representations. For SC, explicit L2 normalization of the encoder output before computing the loss is standard practice.
  • Measuring Geometry: The paper uses cosine similarity to quantify how close the learned representations are to the theoretically optimal simplex configuration:
    • Cosine similarity between class means (should be close to $-1/(K-1)$ for a regular simplex).
    • Cosine similarity between classifier weights (for CE, these should also be close to $-1/(K-1)$).
    • Cosine similarity between individual representations and their class mean (quantifies within-class spread, should be close to 1 for collapse).
  • Training Details: Experiments were conducted using SGD with standard practices like L2 regularization, momentum, learning rate annealing, and data augmentation (random cropping and horizontal flipping).
  • Hardware Considerations: Training deep neural networks with contrastive losses like SC can be computationally demanding. The batch size ($b = 256$ in the paper) is crucial because the SC loss depends on interactions within a batch. Larger batches generally provide more negative examples (samples from different classes), which is beneficial for learning, but they require more memory and computation per iteration.
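
To make the loss and the geometry diagnostics above concrete, here is a minimal PyTorch sketch. It is illustrative only: the function names `sup_con_loss` and `geometry_metrics` and the default temperature are assumptions rather than the authors' code; the loss follows the Khosla et al. formulation given above.

```python
import torch
import torch.nn.functional as F

def sup_con_loss(features: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss over one batch (Khosla et al. formulation).

    features: (n, d) encoder outputs; labels: (n,) integer class labels.
    The temperature default is illustrative, not the paper's setting.
    """
    z = F.normalize(features, dim=1)                 # L2-normalize onto the unit hypersphere
    sim = z @ z.t() / tau                            # scaled pairwise cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # exclude i itself from A(i)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask  # P(i)
    pos_counts = pos_mask.sum(1).clamp(min=1)        # guard against samples with no positive in the batch
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(1) / pos_counts
    return loss.mean()

def geometry_metrics(features: torch.Tensor, labels: torch.Tensor):
    """Cosine-similarity diagnostics for closeness to the simplex configuration."""
    classes = labels.unique()
    means = torch.stack([features[labels == c].mean(0) for c in classes])
    means_n = F.normalize(means, dim=1)
    mean_cos = means_n @ means_n.t()                 # off-diagonal should approach -1/(K-1)

    within = torch.cat([
        F.cosine_similarity(features[labels == c], means[i].unsqueeze(0))
        for i, c in enumerate(classes)
    ])                                               # should approach 1 as each class collapses

    k = classes.numel()
    off_diag_mask = ~torch.eye(k, dtype=torch.bool, device=mean_cos.device)
    return mean_cos[off_diag_mask].mean().item(), within.mean().item()
```

In a training loop, `sup_con_loss` would be applied to the encoder outputs of each batch, a linear classifier would be fit on the frozen features afterwards, and `geometry_metrics` could be logged periodically to monitor the approach to the simplex configuration.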

Limitations and Considerations

  • ReLU Non-negativity: The authors note that standard ResNets apply a ReLU activation before the final embedding layer, which forces the representation coordinates to be non-negative. This restricts the representations and class means to a cone, limiting their ability to reach the theoretical minimum cosine similarity of $-1/(K-1)$ when $K$ is small (e.g., $K = 10$ for CIFAR10, where $-1/(K-1) = -1/9 \approx -0.11$, whereas the smallest cosine similarity achievable between non-negative vectors is 0, attained by orthogonal vectors). Architectures without this constraint might converge closer to the theoretical optimum for low $K$. (A toy check after this list illustrates the gap.)
  • Theoretical Assumptions: The proofs for the optimal geometry rely on an "ideal encoder" that can realize any point configuration. Real-world neural network encoders have limited capacity and architectural constraints that can prevent reaching the true optimum. The empirical results show an approach to the simplex, not perfect attainment.
  • Batch Size for SC: The SC loss formulation inherently depends on the batch. The quality of learned representations can be sensitive to batch size, as larger batches provide richer sets of positive and negative pairs.
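
A toy check of the ReLU non-negativity point (illustrative, not from the paper): with non-negative coordinates, the smallest achievable pairwise cosine similarity is 0 (orthogonal class means), so the simplex value $-1/(K-1)$ is out of reach.

```python
import numpy as np

K = 10
# With non-negative coordinates, the most "spread out" K directions are mutually
# orthogonal (e.g. the standard basis), giving pairwise cosine similarity 0,
# whereas the unconstrained simplex optimum would be -1/(K-1).
nonneg = np.eye(K)                    # K orthogonal, non-negative unit vectors
cos = nonneg @ nonneg.T
print(cos[~np.eye(K, dtype=bool)].max(), -1 / (K - 1))  # 0.0 vs. -0.111...
```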

Conclusion

The paper provides valuable insights for practitioners choosing between CE and SC loss. While both aim for similar representation geometry at optimality, SC's batch-wise nature imparts beneficial implicit regularization, making it more robust to label noise during training compared to the instance-wise CE loss. Implementing SC requires normalizing the encoder outputs and considering the impact of batch size and temperature on the optimization dynamics. The empirical analysis provides strong evidence that SC's practical benefits likely stem from its distinct optimization trajectory, leading to better generalization and robustness.