
Large Batch Training of Convolutional Networks (1708.03888v3)

Published 13 Aug 2017 in cs.CV

Abstract: A common way to speed up training of large convolutional networks is to add computational units. Training is then performed using data-parallel synchronous Stochastic Gradient Descent (SGD) with the mini-batch divided between computational units. With an increase in the number of nodes, the batch size grows. But training with a large batch size often results in lower model accuracy. We argue that the current recipe for large batch training (linear learning rate scaling with warm-up) is not general enough and training may diverge. To overcome these optimization difficulties we propose a new training algorithm based on Layer-wise Adaptive Rate Scaling (LARS). Using LARS, we scaled AlexNet up to a batch size of 8K, and ResNet-50 to a batch size of 32K without loss in accuracy.

Large Batch Training of Convolutional Networks

The paper presents an important exploration of large-batch training of Convolutional Neural Networks (CNNs). The primary focus is on optimizing training procedures so that efficiency and accuracy are preserved when the batch size is increased substantially. The authors argue that the common practice of linear learning rate scaling with a warm-up period is insufficient for scaling to larger batch sizes, as it often leads to instability and divergence during training.

Introduction

In the domain of CNNs, training is computationally intensive and often time-consuming. A common strategy to expedite the process is to add computational resources, such as multiple GPUs, and to distribute the load via data-parallel synchronous Stochastic Gradient Descent (SGD). However, this inherently increases the batch size, which can degrade model accuracy because the number of weight updates per epoch shrinks in proportion to the batch size. For ImageNet's roughly 1.28M training images, for example, a batch size of 256 yields about 5,000 updates per epoch, whereas a batch size of 8K yields only about 160.

Traditional approaches, such as linear scaling of the learning rate (LR) combined with an initial warm-up phase, have had limited success. For instance, while this recipe enabled the successful training of ResNet-50 with a batch size of 8K, pushing the batch size further led to training divergence, and it produced lower accuracy in networks such as AlexNet.
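
To make the recipe concrete, below is a minimal sketch of linear LR scaling with warm-up; the base values, the 5-epoch warm-up length, and the function name are illustrative assumptions, not values taken from the paper.

def scaled_lr_schedule(epoch, base_lr=0.1, base_batch=256, batch_size=8192, warmup_epochs=5):
    # Linear scaling rule: the target LR grows proportionally with the batch size.
    target_lr = base_lr * batch_size / base_batch
    if epoch < warmup_epochs:
        # Warm-up: ramp linearly from base_lr up to target_lr over the first epochs.
        return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
    return target_lr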

Layer-wise Adaptive Rate Scaling (LARS)

The authors propose a novel training algorithm, Layer-wise Adaptive Rate Scaling (LARS), to address the optimization difficulties posed by large batch sizes. LARS computes a separate learning rate for each layer of the network from the ratio of the norm of the layer's weights to the norm of its gradient update. This contrasts with adaptive algorithms such as Adam or RMSProp, which adapt the rate per individual weight rather than per layer, and it provides greater stability and control over the training process.
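
A rough sketch of this layer-wise rule in plain Python is given below, assuming the form local_lr = trust_coef * ||w|| / (||g|| + weight_decay * ||w||) combined with momentum SGD; the function names and the eps term are illustrative, and the hyperparameter values are only indicative defaults.

import numpy as np

def lars_local_lr(weights, grad, trust_coef=0.001, weight_decay=0.0005, eps=1e-9):
    # Layer-wise LR: scale by the ratio of the weight norm to the gradient norm,
    # so layers whose gradients are large relative to their weights take smaller steps.
    w_norm = np.linalg.norm(weights)
    g_norm = np.linalg.norm(grad)
    return trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)

def lars_sgd_step(weights, grad, momentum_buf, global_lr=1.0, momentum=0.9, weight_decay=0.0005):
    # One SGD-with-momentum step in which the global LR is rescaled per layer.
    local_lr = lars_local_lr(weights, grad, weight_decay=weight_decay)
    update = grad + weight_decay * weights  # L2-regularized gradient
    momentum_buf = momentum * momentum_buf + global_lr * local_lr * update
    return weights - momentum_buf, momentum_buf

In the paper the local LR multiplies a global LR that still follows its own schedule (warm-up plus decay); the sketch above folds that policy into global_lr for simplicity.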

Using LARS, the authors successfully scaled AlexNet up to a batch size of 8K and ResNet-50 to a batch size of 32K without any loss in accuracy, demonstrating the algorithm's robustness and effectiveness in maintaining model performance at larger scales.

Training Analysis and Results

Initially, the authors observed that increasing the LR in standard training, even with warm-up, resulted in significant accuracy drops when scaling beyond a certain batch size. For example, AlexNet's accuracy plummeted from 57.6% at a batch size of 256 to 44.8% at a batch size of 8K. Replacing Local Response Normalization (LRN) layers with Batch Normalization (BN) made the network more resilient to large LRs, although it still fell short of baseline accuracy.

The proposed LARS method significantly mitigates these issues. By dynamically adjusting the LR for each layer, LARS maintains training stability and model performance at substantially larger batch sizes. The empirical results showed that AlexNet-BN retained an accuracy of 58.0% at a batch size of 8K when trained with LARS.

Theoretical and Practical Implications

The introduction of LARS presents a pivotal advancement in the training of CNNs with large batches. Theoretically, it challenges the adequacy of linear LR scaling with warm-up and provides a more sophisticated approach tailored to layer-specific dynamics. Practically, it makes training large-scale neural networks on modern hardware clusters more feasible, significantly reducing training time without compromising model accuracy.

Future Directions

The research opens several avenues for further exploration. Extending LARS to other neural network architectures and tasks will help validate its general applicability. Additionally, examining the interplay between different types of normalization techniques and LARS could unlock further performance optimizations. The fundamental improvements to training stability and efficiency may drive advancements in real-time AI applications and enable the scaling of networks to unprecedented sizes.

Conclusion

The paper provides a compelling solution to the challenges involved in large batch training of CNNs. The Layer-wise Adaptive Rate Scaling (LARS) method successfully scales up the batch size without loss of accuracy, thereby setting a new paradigm in network training optimization. Moving forward, LARS could catalyze significant strides in machine learning research and its applications, propelling both theoretical advancements and practical implementations.

Authors (3)
  1. Yang You (173 papers)
  2. Igor Gitman (15 papers)
  3. Boris Ginsburg (111 papers)
Citations (805)