Train longer, generalize better: closing the generalization gap in large batch training of neural networks (1705.08741v2)

Published 24 May 2017 in stat.ML and cs.LG

Abstract: Background: Deep learning models are typically trained using stochastic gradient descent or one of its variants. These methods update the weights using their gradient, estimated from a small fraction of the training data. It has been observed that when using large batch sizes there is a persistent degradation in generalization performance - known as the "generalization gap" phenomena. Identifying the origin of this gap and closing it had remained an open problem. Contributions: We examine the initial high learning rate training phase. We find that the weight distance from its initialization grows logarithmically with the number of weight updates. We therefore propose a "random walk on random landscape" statistical model which is known to exhibit similar "ultra-slow" diffusion behavior. Following this hypothesis we conducted experiments to show empirically that the "generalization gap" stems from the relatively small number of updates rather than the batch size, and can be completely eliminated by adapting the training regime used. We further investigate different techniques to train models in the large-batch regime and present a novel algorithm named "Ghost Batch Normalization" which enables significant decrease in the generalization gap without increasing the number of updates. To validate our findings we conduct several additional experiments on MNIST, CIFAR-10, CIFAR-100 and ImageNet. Finally, we reassess common practices and beliefs concerning training of deep models and suggest they may not be optimal to achieve good generalization.

Closing the Generalization Gap in Large Batch Training of Neural Networks

The paper "Train longer, generalize better: closing the generalization gap in large batch training of neural networks" by Elad Hoffer, Itay Hubara, and Daniel Soudry explores the persistent issue of the generalization gap encountered while training deep neural networks (DNNs) with large batch sizes. This phenomenon, where models trained with large batches tend to generalize poorly despite extensive training, has significant implications for the scalability and efficiency of modern deep learning.

Key Contributions

The authors investigate the underpinnings of the generalization gap and propose strategies to mitigate it, highlighting:

  1. Random Walk on Random Landscape Model: The paper introduces a statistical model likened to a "random walk on a random landscape," explaining the "ultra-slow" diffusion of the weights during the initial high learning-rate phase of training. The authors empirically validate that the weight distance from initialization grows logarithmically with the number of updates, irrespective of batch size.
  2. Empirical Evidence: Through a series of experiments on datasets like MNIST, CIFAR-10, CIFAR-100, and ImageNet, the authors demonstrate that the so-called generalization gap arises from fewer updates rather than the batch size itself.
  3. Ghost Batch Normalization (GBN): A novel algorithm, Ghost Batch Normalization, computes batch-normalization statistics over smaller partitions, or "ghost batches," within the large batch (a minimal sketch follows this list). This method significantly reduces the generalization gap without increasing the number of updates.
  4. Adapted Training Regime: The authors recommend adapting the training regime such that large-batch training can generalize as effectively as small-batch training by maintaining a similar number of gradient updates.
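The paper describes GBN at the algorithmic level rather than tying it to a framework. The following is a minimal PyTorch-style sketch of the idea, assuming 2D convolutional features; the class name and the num_splits parameter are illustrative, and the handling of running statistics is simplified relative to the paper's averaging scheme.

```python
import torch
import torch.nn as nn

class GhostBatchNorm2d(nn.Module):
    """Sketch of Ghost Batch Normalization: the large batch is split into
    smaller "ghost" batches and each split is normalized with its own
    batch statistics."""

    def __init__(self, num_features, num_splits, **bn_kwargs):
        super().__init__()
        self.num_splits = num_splits
        # A single BatchNorm layer keeps one set of affine parameters;
        # running statistics are updated once per ghost batch (a
        # simplification of the paper's statistic averaging).
        self.bn = nn.BatchNorm2d(num_features, **bn_kwargs)

    def forward(self, x):
        if self.training:
            # Normalize each ghost batch independently, then reassemble.
            chunks = x.chunk(self.num_splits, dim=0)
            return torch.cat([self.bn(chunk) for chunk in chunks], dim=0)
        # Evaluation uses the accumulated running statistics as usual.
        return self.bn(x)

# Example: a batch of 4096 samples normalized in ghost batches of 128.
# gbn = GhostBatchNorm2d(64, num_splits=4096 // 128)
```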

Theoretical Analysis

A significant portion of the paper focuses on the theoretical analysis of the random walk behavior of SGD-based training:

  • Diffusion Analysis: In high-dimensional systems such as DNNs, the random walk of the weights on a random potential exhibits "ultra-slow" diffusion, with a rate governed by the potential's auto-covariance function. The model predicts that reaching wider local minima, which are associated with better generalization, requires a sufficiently large number of weight updates.
  • Empirical Validation: The observation that the weight distance from initialization grows logarithmically with the number of updates, consistently across batch sizes, supports the proposed model (a measurement sketch follows this list). Notably, for a fixed number of epochs, smaller batches perform more weight updates, which explains their historically better generalization.
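To make the empirical check concrete, the sketch below records the L2 distance of the weights from their initial values as training progresses; it is illustrative rather than taken from the paper, and train_step is a hypothetical user-supplied callable that performs one SGD update. Under the ultra-slow diffusion picture, the recorded distance should grow roughly like log(t) during the initial high learning-rate phase.

```python
import torch

def flatten_params(model):
    """Concatenate all model parameters into a single 1-D vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def track_weight_distance(model, train_step, num_updates, log_every=100):
    """Record ||w_t - w_0|| against the number of updates t."""
    w0 = flatten_params(model)
    history = []
    for t in range(1, num_updates + 1):
        train_step()  # one SGD update, supplied by the caller
        if t % log_every == 0:
            dist = torch.norm(flatten_params(model) - w0).item()
            history.append((t, dist))
    # Plotting dist against log(t) should yield an approximately straight
    # line if the growth is logarithmic.
    return history
```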

Practical Implications

The practical implications of this research are profound for the field of deep learning:

  • Scalability of Training: The findings suggest that the barriers to effective large-batch training can be mitigated by adjusting the learning rate and adopting Ghost Batch Normalization, thereby enabling better utilization of parallel and high-throughput computing infrastructure.
  • Improvement of Generalization: By adapting training regimes to extend the initial high learning-rate phase, model training can exploit the high-diffusion regime over more iterations, leading to improved generalization even with large batches.

Recommendations and Future Research

The recommendations derived from this paper, combined in the sketch that follows the list, include:

  1. Learning Rate Adjustment: Scale the learning rate proportionally to the square root of the batch size.
  2. Ghost Batch Normalization: Implement GBN to independently compute normalization statistics on smaller partitions within the large batch.
  3. Training Regime Adaptation: Extend the duration of the initial training phase with a high learning rate to allow the model to explore a larger parameter space effectively.
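The three recommendations can be folded into a single hyperparameter-adaptation step. The helper below is a hypothetical sketch (its name, arguments, and returned fields are illustrative, not from the paper): the learning rate is scaled by the square root of the batch-size ratio, normalization statistics are kept at the original small batch size via ghost batches, and the number of epochs is increased so the total update count matches the small-batch baseline.

```python
import math

def adapt_hyperparameters(base_lr, base_epochs, base_batch_size,
                          large_batch_size, ghost_batch_size=None):
    """Sketch combining the paper's three recommendations for large batches."""
    ratio = large_batch_size / base_batch_size
    # 1. Learning rate adjustment: scale with sqrt of the batch-size increase.
    lr = base_lr * math.sqrt(ratio)
    # 2. Ghost Batch Normalization: compute statistics over small virtual
    #    batches (defaulting to the original small batch size).
    ghost = ghost_batch_size or base_batch_size
    num_ghost_splits = max(1, large_batch_size // ghost)
    # 3. Regime adaptation: more epochs so the number of weight updates
    #    matches the small-batch baseline.
    epochs = int(round(base_epochs * ratio))
    return {"lr": lr, "ghost_splits": num_ghost_splits, "epochs": epochs}

# Example: moving from batch size 128 to 4096.
# adapt_hyperparameters(base_lr=0.1, base_epochs=90,
#                       base_batch_size=128, large_batch_size=4096)
```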

Conclusion

The paper challenges the previously held notion that large-batch training inherently leads to poorer generalization. It provides a theoretical grounding and practical methods to bridge the generalization gap, showing that, with the right training regime, large-batch training can achieve generalization performance comparable to or better than small-batch training. Future work may explore dynamically optimized learning-rate schedules and the impact of these methods across a broader range of neural network architectures and datasets.
