When Does Label Smoothing Help? (1906.02629v3)

Published 6 Jun 2019 in cs.LG and stat.ML

Abstract: The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that in addition to improving generalization, label smoothing improves model calibration which can significantly improve beam-search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing changes the representations learned by the penultimate layer of the network. We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in loss of information in the logits about resemblances between instances of different classes, which is necessary for distillation, but does not hurt generalization or calibration of the model's predictions.

Authors (3)
  1. Rafael Müller (3 papers)
  2. Simon Kornblith (53 papers)
  3. Geoffrey Hinton (38 papers)
Citations (1,805)

Summary

Analyzing the Efficiency and Effects of Label Smoothing in Neural Networks

The paper "When Does Label Smoothing Help?" by Rafael Müller, Simon Kornblith, and Geoffrey Hinton examines the practice of label smoothing (LS) in neural networks, aiming to elucidate its empirical benefits and underlying mechanics. The paper is particularly insightful for its methodical approach to examining how LS aids generalization and model calibration, and for its less-discussed finding about knowledge distillation.

Label Smoothing: Mechanisms and Representations

Label smoothing modifies the training process by blending the hard target labels with a uniform distribution across all labels. The authors argue that this technique prevents the network from becoming overconfident, which is a common issue in neural network training. However, the paper goes beyond this surface understanding by visualizing how LS alters the structure of the learned representations.
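Concretely, with smoothing parameter α and K classes, the smoothed target is y_k = (1 − α)·y_k + α/K. A minimal NumPy sketch of this blending (the α value below is illustrative):

```python
import numpy as np

def smooth_labels(hard_targets, alpha=0.1):
    """Blend one-hot targets with the uniform distribution over the labels."""
    num_classes = hard_targets.shape[-1]
    return hard_targets * (1.0 - alpha) + alpha / num_classes

# One-hot target for class 2 of 4 classes, smoothed with alpha = 0.1.
hard = np.array([0.0, 0.0, 1.0, 0.0])
soft = smooth_labels(hard)
print(soft)  # [0.025 0.025 0.925 0.025]
```

The smoothed vector remains a valid probability distribution, so the usual cross-entropy loss applies unchanged.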

The penultimate-layer representations are key to understanding the internal mechanics of this regularization technique. The authors present a visualization method revealing that LS prompts training examples from the same class to form tight clusters, with each example roughly equidistant from the templates of the other classes. This geometric regularity reduces the likelihood of overconfident misclassifications. Across datasets including CIFAR-10, CIFAR-100, and ImageNet, the visualizations consistently show LS enforcing this structure on penultimate-layer activations, and the associated generalization benefit holds across different network architectures.
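The paper's visualization projects penultimate-layer activations onto the 2-D plane spanned by the templates (final-layer weight vectors) of three chosen classes. A sketch of one reasonable way to compute that projection, assuming `templates` holds the three class weight vectors (this is an illustrative implementation, not necessarily the authors' exact code):

```python
import numpy as np

def project_to_class_plane(activations, templates):
    """Project activations onto the plane through three class templates.

    Builds an orthonormal basis (b1, b2) of the plane via Gram-Schmidt,
    using templates[0] as the origin, then returns 2-D coordinates.
    """
    v1 = templates[1] - templates[0]
    v2 = templates[2] - templates[0]
    b1 = v1 / np.linalg.norm(v1)
    v2 = v2 - (v2 @ b1) * b1          # remove the component along b1
    b2 = v2 / np.linalg.norm(v2)
    centered = activations - templates[0]
    return np.stack([centered @ b1, centered @ b2], axis=-1)

# Example: project the templates themselves; templates[0] lands at the origin.
templates = np.eye(3, 4)              # three toy 4-D class templates
coords = project_to_class_plane(templates, templates)
```

Plotting `coords` for many training examples of the three classes is what produces the cluster pictures described above.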

Model Calibration and Practical Implications

Calibration of neural networks is pivotal for downstream tasks such as beam search in language models and for decision-making in critical applications. The paper shows empirically that networks trained with LS are better calibrated than those trained on unaltered hard targets. Calibration is measured using the Expected Calibration Error (ECE), which LS significantly reduces compared to uncalibrated models.
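ECE partitions predictions into confidence bins and averages the gap between each bin's mean confidence and its accuracy, weighted by bin size. A minimal sketch (the bin count is illustrative; 15 bins is a common choice):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: bin-size-weighted average of |accuracy - confidence| per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, ece = len(confidences), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece

# Well calibrated: 90% accuracy at 90% confidence -> ECE = 0.
conf = np.full(10, 0.9)
ece_good = expected_calibration_error(conf, np.array([1]*9 + [0]))
# Overconfident: 50% accuracy at 90% confidence -> ECE = 0.4.
ece_bad = expected_calibration_error(conf, np.array([1]*5 + [0]*5))
```

A perfectly calibrated model has ECE of zero: whenever it reports 90% confidence, it is right 90% of the time.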

From a practical standpoint, LS aligns a model's predicted probabilities more closely with its observed accuracies, making its confidence estimates more trustworthy. This has notable implications:

  1. Image Classification: LS effectively calibrates ResNet-56 trained on CIFAR-100 and Inception-v4 on ImageNet, producing outputs where predictive probabilities are more reliable.
  2. Machine Translation: The Transformer architecture benefits from LS in training, yielding higher BLEU scores and better calibration metrics compared to networks trained with hard targets.

Analyzing the Impact on Knowledge Distillation

Knowledge distillation (KD) involves training a student network to mimic a teacher network's outputs. This process typically benefits from soft target probabilities that encode more information than hard labels. Intriguingly, the paper finds that teachers trained with LS, despite performing better themselves, produce worse students under KD. The authors attribute this to the loss of inter-class logit information caused by the tight, uniform clustering that LS enforces: the reduced mutual information between inputs and logits means the student network inherits a less informative training signal.
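A standard distillation objective mixes the hard-label cross-entropy with the cross-entropy against the teacher's temperature-softened distribution. The sketch below follows the usual Hinton-style formulation; the temperature `T` and mixing weight `beta` are illustrative hyperparameters:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_target,
                      T=4.0, beta=0.5):
    """beta * soft loss (vs. teacher at temperature T) + (1-beta) * hard loss."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    soft_loss = -(p_teacher * log_p_student).sum()
    hard_loss = -np.log(softmax(student_logits)[hard_target])
    return beta * soft_loss + (1 - beta) * hard_loss
```

When the teacher was trained with LS, its soft targets are closer to uniform over the incorrect classes, so `p_teacher` carries less of the inter-class similarity structure the student would otherwise exploit.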

Future Directions

The interplay between LS and knowledge distillation opens new avenues for investigation. Future research could explore hybrid techniques where LS is selectively applied or dynamically adjusted to balance improved generalization with retained inter-class relational information essential for KD.

Additionally, the paper's findings about mutual information and model representations might lead to innovations in regularization and transfer learning. The visualizations and analytical techniques introduced here could be extended to different architectures and tasks, providing a general framework for understanding various regularization approaches.

In summary, this paper presents comprehensive empirical evidence on the benefits of LS in neural network training while shedding light on its implications for model calibration and knowledge distillation. The visualization and interpretative methods proposed offer valuable tools for future research in neural network optimization and regularization techniques.
