- The paper introduces Batch Renormalization to align training and inference by applying a per-dimension affine correction to the normalized activations.
- The method demonstrates significant improvements on small minibatches, reaching 76.5% ImageNet validation accuracy with Inception-v3 versus 74.2% for traditional batch normalization.
- The approach maintains computational efficiency while reducing dependence on unreliable minibatch statistics, benefiting diverse deep learning architectures.
Batch Renormalization: An Enhanced Approach to Deep Model Training
This paper presents Batch Renormalization, a proposed extension of Batch Normalization (batchnorm) aimed at improving the training of deep neural networks, particularly when minibatches are small or not independent and identically distributed (i.i.d.). Batch Normalization has become a staple for stabilizing and accelerating deep network training by normalizing internal activations over each minibatch. However, its dependence on minibatch statistics becomes problematic when the data does not meet the conditions batchnorm implicitly assumes, namely large, i.i.d. minibatches.
The Problem with Batch Normalization
Batch Normalization operates by normalizing activations based on the mean and variance computed from the minibatch during training. This dependency can become problematic:
- Small Minibatches: With fewer samples, the estimates of mean and variance become unreliable, impairing model quality.
- Non-i.i.d. Samples: For tasks like metric learning, where minibatches are biased by sampling strategies, batchnorm can lead to overfitting on the minibatch distribution rather than learning generalized representations.
During inference, batchnorm instead normalizes with moving averages of the activation statistics, which differ from the minibatch statistics used during training. The model is therefore evaluated with activations it never produced during training, and when the minibatch statistics are unreliable this discrepancy can destabilize training and hurt generalization.
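To make the mismatch concrete, here is a minimal NumPy sketch (not code from the paper) of how standard batchnorm normalizes with minibatch statistics during training but with moving averages at inference; the function names, momentum, and epsilon values are illustrative assumptions.

```python
import numpy as np

def batchnorm_train(x, gamma, beta, running_mean, running_var,
                    momentum=0.99, eps=1e-3):
    """Training step: normalize with the *minibatch* mean and variance."""
    mu_b = x.mean(axis=0)                      # per-feature minibatch mean
    var_b = x.var(axis=0)                      # per-feature minibatch variance
    x_hat = (x - mu_b) / np.sqrt(var_b + eps)
    # moving averages track population statistics for later use at inference
    running_mean = momentum * running_mean + (1 - momentum) * mu_b
    running_var = momentum * running_var + (1 - momentum) * var_b
    return gamma * x_hat + beta, running_mean, running_var

def batchnorm_infer(x, gamma, beta, running_mean, running_var, eps=1e-3):
    """Inference: normalize with the moving averages, not the current batch."""
    return gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta

# With only 4 examples per minibatch, the batch statistics are noisy, so the
# activations seen in training differ from what the same layer computes at
# inference (the moving averages here are set to the true mean 2 and variance 9).
rng = np.random.default_rng(0)
x_small = rng.normal(loc=2.0, scale=3.0, size=(4, 8))
gamma, beta = np.ones(8), np.zeros(8)
y_train, _, _ = batchnorm_train(x_small, gamma, beta, np.zeros(8), np.ones(8))
y_infer = batchnorm_infer(x_small, gamma, beta, np.full(8, 2.0), np.full(8, 9.0))
print(np.abs(y_train - y_infer).max())         # nonzero gap: train/inference mismatch
```

The printed gap is exactly the discrepancy Batch Renormalization is designed to remove.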
Introducing Batch Renormalization
Batch Renormalization aims to align the training and inference statistics by introducing an affine transformation that corrects for the discrepancies between minibatch statistics and moving averages. The method does so while retaining important benefits of batchnorm:
- Affine Transformation: A per-dimension correction with a scale term r and a shift term d is applied to the minibatch-normalized activations (see the sketch below). Because the correction maps the minibatch statistics onto the moving averages, the resulting activations effectively depend on individual examples rather than on the composition of the minibatch.
- Constancy in Expectation: The correction terms r and d are computed from the minibatch and moving statistics but are treated as constants during backpropagation, so no gradient flows through them. Together with the correction itself, this keeps the activations the network optimizes during training consistent with those it computes at inference.
The algorithm runs with essentially the same efficiency as batchnorm, preserving training speed at negligible extra computational cost.
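Below is a minimal NumPy sketch of the Batch Renormalization training-time forward pass under these assumptions: the moving statistics are updated with a momentum-style rule, and the clipping bounds r_max and d_max are fixed; in the paper they start at 1 and 0 (so training begins as plain batchnorm) and are gradually relaxed. The function names and hyperparameter values are illustrative, not the paper's reference code.

```python
import numpy as np

def batch_renorm_train(x, gamma, beta, running_mean, running_std,
                       r_max=3.0, d_max=5.0, momentum=0.99, eps=1e-3):
    """Training-time forward pass of Batch Renormalization (sketch).

    The minibatch-normalized activation is corrected by r and d so that,
    whenever the clipping is inactive,
        (x - mu_b) / sigma_b * r + d == (x - mu) / sigma
    with r = sigma_b / sigma and d = (mu_b - mu) / sigma, i.e. the output
    matches normalization by the moving statistics used at inference.
    """
    mu_b = x.mean(axis=0)
    sigma_b = np.sqrt(x.var(axis=0) + eps)
    # correction factors, clipped; in a real framework these are wrapped in a
    # stop-gradient so backpropagation treats them as constants
    r = np.clip(sigma_b / running_std, 1.0 / r_max, r_max)
    d = np.clip((mu_b - running_mean) / running_std, -d_max, d_max)
    x_hat = (x - mu_b) / sigma_b * r + d
    # update the moving statistics used at inference
    running_mean = momentum * running_mean + (1 - momentum) * mu_b
    running_std = momentum * running_std + (1 - momentum) * sigma_b
    return gamma * x_hat + beta, running_mean, running_std

def batch_renorm_infer(x, gamma, beta, running_mean, running_std):
    """Inference is identical to batchnorm inference: use the moving statistics."""
    return gamma * (x - running_mean) / running_std + beta
```

Gradients still flow through the minibatch mean and standard deviation exactly as in batchnorm; only r and d are excluded from backpropagation, which is what the "Constancy in Expectation" point above refers to.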
Experimental Results
Batch Renormalization was tested using the Inception-v3 model on the ImageNet dataset, with the following findings:
- Baseline Comparison: When applied to minibatches of standard size (32), Batch Renormalization matched or slightly exceeded the performance of traditional batchnorm, achieving a validation accuracy of 78.5%.
- Small Minibatches: With minibatches reduced to 4 examples, Batch Renormalization improved training speed and accuracy (76.5% at 130k steps) compared to batchnorm (74.2% at 210k steps), though larger minibatches still offered advantages.
- Non-i.i.d. Minibatches: In scenarios with structured minibatch sampling (2 images per label), Batch Renormalization achieved the same accuracy (78.5%) as with i.i.d. minibatch sampling, significantly outperforming batchnorm, which reached only 67%.
Implications and Future Work
Batch Renormalization provides a robust alternative for use cases where batch normalization becomes ineffective, particularly with small or biased minibatch sampling. It applies to a range of architectures, including Residual Networks and Generative Adversarial Networks, and offers potential improvements in situations where traditional batchnorm is less effective.
Future work could explore applying Batch Renormalization to recurrent networks, where the same running averages could potentially be used to normalize across all timesteps instead of recomputing statistics for each timestep individually.
In conclusion, Batch Renormalization represents a meaningful enhancement over batchnorm, offering consistent inference and training dynamics and facilitating improved performance across diverse training scenarios.