
Regularizing Neural Networks by Penalizing Confident Output Distributions

Published 23 Jan 2017 in cs.NE and cs.LG (arXiv:1701.06548v1)

Abstract: We systematically explore regularizing neural networks by penalizing low entropy output distributions. We show that penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning. Furthermore, we connect a maximum entropy based confidence penalty to label smoothing through the direction of the KL divergence. We exhaustively evaluate the proposed confidence penalty and label smoothing on 6 common benchmarks: image classification (MNIST and Cifar-10), language modeling (Penn Treebank), machine translation (WMT'14 English-to-German), and speech recognition (TIMIT and WSJ). We find that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.

Summary

  • The paper introduces confidence penalty and label smoothing as innovative regularization methods that prevent overfitting by encouraging high-entropy predictions.
  • It demonstrates measurable improvements in accuracy, perplexity, and error rates across benchmarks like MNIST, CIFAR-10, Penn Treebank, WMT'14, TIMIT, and WSJ.
  • The study highlights that these techniques simplify training and are broadly applicable for enhancing neural network generalization on diverse tasks.

Introduction

The paper Regularizing Neural Networks by Penalizing Confident Output Distributions investigates an alternative method of regularizing neural networks to improve their generalization capabilities. Instead of traditional regularization techniques that act on weights or hidden activations, this research proposes to regularize the output distribution of neural networks by penalizing low entropy distributions, thereby preventing overly confident predictions.

Main Contributions

The study provides a comprehensive examination of two output regularization methods:

  1. Confidence Penalty: This method penalizes low entropy output distributions by adding a regularization term to the objective function that encourages higher entropy in the network's predictions.
  2. Label Smoothing: This technique smooths the training targets by mixing the one-hot labels with a uniform or unigram distribution, thereby preventing the network from assigning the full probability mass to any single class.
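The two regularizers above can be sketched as loss functions in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function names and the default weights `beta` and `eps` are chosen for this example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confidence_penalty_loss(logits, label, beta=0.1):
    """Negative log-likelihood minus beta times the entropy of the
    predicted distribution: low-entropy (overconfident) outputs are penalized."""
    p = softmax(logits)
    nll = -np.log(p[label])
    entropy = -(p * np.log(p)).sum()
    return nll - beta * entropy

def label_smoothing_loss(logits, label, eps=0.1):
    """Cross-entropy against a target that mixes the one-hot label
    with the uniform distribution over the K classes."""
    p = softmax(logits)
    k = logits.shape[-1]
    target = np.full(k, eps / k)
    target[label] += 1.0 - eps
    return -(target * np.log(p)).sum()
```

In both cases the extra term is largest when the network concentrates its probability mass on a single class, which is exactly the overconfident behavior being discouraged.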

Experimental Evaluation

The authors conducted an extensive empirical analysis across six common benchmarks:

  • Image classification (MNIST, CIFAR-10)
  • Language modeling (Penn Treebank)
  • Machine translation (WMT'14 English-to-German)
  • Speech recognition (TIMIT, WSJ)

The key findings are summarized as follows:

Image Classification

MNIST

For the MNIST dataset, the proposed methods were tested using a fully connected neural network with 1024 units per layer. The results showed that both label smoothing and the confidence penalty outperform traditional dropout, achieving test errors of 1.23 ± 0.06% and 1.17 ± 0.06%, respectively.

CIFAR-10

Using a densely connected convolutional network, both regularization techniques improved performance. The confidence penalty achieved a test error rate of 6.77%, marginally better than dropout alone.

Language Modeling

The experiments on the Penn Treebank dataset showed that the confidence penalty performed significantly better than label noise and label smoothing, reducing the test perplexity to 74.7 compared to 77.7 and 76.6 for label noise and smoothing respectively.

Machine Translation

On the WMT'14 English-to-German translation task, the study demonstrated that both label smoothing and the confidence penalty enhanced the BLEU scores. When combined with dropout, label smoothing slightly improved the BLEU score to 23.57, compared to 23.41 with dropout alone.

Speech Recognition

TIMIT

The application of label smoothing and confidence penalty in speech recognition on the TIMIT dataset showed a reduction in phoneme error rates (PER). Label smoothing achieved the lowest PER of 21.6%, compared to 23.2% with dropout alone.

WSJ

For the WSJ corpus, the confidence penalty and label smoothing reduced the word error rates (WER). Unigram label smoothing resulted in the best performance, lowering the WER from 14.2 to 11.0.
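Unigram smoothing, which performed best on WSJ, replaces the uniform mixing distribution with the empirical class frequencies estimated from the training labels. A possible sketch (the helper name is illustrative, not from the paper):

```python
import numpy as np

def unigram_smoothed_targets(labels, num_classes, eps=0.1):
    """Mix one-hot targets with the empirical class (unigram) distribution,
    so frequent classes receive more of the smoothing mass than rare ones."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    unigram = counts / counts.sum()
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps * unigram
```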

Discussion and Implications

The study's results underline the effectiveness of output regularization methods in enhancing the generalization performance of neural networks across diverse tasks. By penalizing confident predictions, both label smoothing and confidence penalty techniques help to prevent overfitting and maintain a well-calibrated output distribution.

The confidence penalty simplifies model training by operating on output distributions, which are parameterization invariant and therefore easier to optimize across different architectures. Label smoothing, while conceptually similar, can be adapted to various distributions, presenting opportunities for future exploration.
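The abstract's claim that the two regularizers are connected "through the direction of the KL divergence" can be made concrete. With $u$ the uniform distribution over $K$ classes, the entropy of the model's output satisfies

```latex
H\bigl(p_\theta(y \mid x)\bigr) = \log K - D_{\mathrm{KL}}\bigl(p_\theta(y \mid x) \,\|\, u\bigr),
```

so penalizing low entropy is, up to a constant, equivalent to adding $D_{\mathrm{KL}}(p_\theta \,\|\, u)$ to the loss, whereas uniform label smoothing adds $D_{\mathrm{KL}}(u \,\|\, p_\theta)$: the two regularizers differ only in which argument of the KL divergence is the model distribution.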

Future Directions

Future work might explore alternative target distributions for label smoothing beyond uniform or unigram distributions. Additionally, integrating these regularization methods with other advanced training techniques and experimenting with different neural network architectures could offer further insights.

Conclusion

This paper systematically evaluates and demonstrates the efficacy of two output regularization techniques across several supervised learning benchmarks. Both label smoothing and the confidence penalty show consistent improvements in model performance, suggesting their wide applicability and effectiveness in combating overfitting in large, deep neural networks.
