- The paper introduces CLR as a systematic approach that removes the need to experimentally search for the best global learning rate and schedule by cycling the learning rate between bounds.
- The paper demonstrates that cyclical learning policies, particularly the triangular window, enhance performance across diverse datasets and architectures like ResNets and DenseNets.
- The paper demonstrates practical benefits such as faster convergence and improved accuracy, often reaching a given accuracy in far fewer training iterations than fixed learning rates.
Cyclical Learning Rates for Training Neural Networks
Leslie N. Smith's paper "Cyclical Learning Rates for Training Neural Networks" introduces a method that addresses the critical issue of hyper-parameter tuning, focusing on the learning rate. The paper proposes Cyclical Learning Rates (CLR), a technique designed to remove the need for extensive experimentation in finding optimal learning rate values and schedules.
Contribution to the Field
The paper identifies three core contributions:
- Elimination of Extensive Hyper-parameter Tuning: CLR provides a systematic approach to setting global learning rates, thus eliminating the need for numerous experimental runs typically required to identify the optimal learning rates and schedules.
- Benefits of Varying Learning Rates: Contrary to the conventional wisdom that learning rates should monotonically decrease, the research demonstrates that letting the learning rate cyclically rise and fall improves overall performance, even though accuracy may temporarily worsen within a cycle.
- Practical Demonstration Across Architectures and Datasets: The efficacy of CLR is demonstrated with widely used architectures such as ResNets, Stochastic Depth networks, and DenseNets on CIFAR-10 and CIFAR-100, and with AlexNet and GoogLeNet on ImageNet.
Methodology and Results
Cyclical Learning Rates (CLR)
CLR varies the learning rate between prescribed minimum and maximum boundaries. The research explores different cyclical windows, including triangular, parabolic (Welch), and sinusoidal (Hann) forms, with the triangular window emerging as a simple choice that performs just as well.
- Triangular Learning Rate Policy: The learning rate increases linearly from a minimum value to a maximum boundary and then decreases back to the minimum, repeating cyclically. Variants include triangular2, which halves the learning rate range after each cycle, and exp_range, which applies an exponential decay to the range boundaries (see the sketch after this list).
- Establishment of Boundaries: Reasonable minimum and maximum learning rates are determined via an "LR range test": the learning rate is increased linearly over a few epochs while accuracy is monitored, and the bounds are set where accuracy first begins to rise and where its growth slows, becomes ragged, or reverses.
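The triangular family of policies reduces to a small closed-form function of the training iteration. The sketch below follows the formulation described in the paper; the function name and the default values for step_size, base_lr, max_lr, and gamma are illustrative, not taken from any specific experiment.

```python
import math

def cyclical_lr(iteration, step_size=2000, base_lr=0.001, max_lr=0.006,
                mode="triangular", gamma=0.99994):
    """Return the learning rate for a given training iteration.

    The rate ramps linearly from base_lr to max_lr over step_size
    iterations, then back down, so one full cycle spans 2 * step_size
    iterations.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    scale = max(0.0, 1.0 - x)  # triangular window in [0, 1]

    if mode == "triangular2":
        # Halve the amplitude of the window after every completed cycle.
        scale /= 2 ** (cycle - 1)
    elif mode == "exp_range":
        # Decay the amplitude exponentially with the iteration count.
        scale *= gamma ** iteration

    return base_lr + (max_lr - base_lr) * scale
```

The same idea underlies the LR range test: run one short training pass while sweeping the learning rate linearly from a very small value to a large one, then choose the minimum bound near the point where accuracy starts to climb and the maximum bound near the point where the climb stalls.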
Experimental Analysis
The paper presents comprehensive experimental results:
- CIFAR-10 and CIFAR-100 Datasets: Using the architectures and hyper-parameters provided with Caffe, the CLR method (triangular2 policy) reached the same accuracy (81.4%) in just 25,000 iterations instead of the standard 70,000. Additionally, adaptive learning-rate methods combined with CLR often reached their accuracy benchmarks more quickly (a framework-level usage sketch follows this list).
- Residual Networks and Variants: In experiments with ResNet, Stochastic Depth, and DenseNets, CLR methods consistently matched or exceeded the performance of a fixed-rate policy. For instance, DenseNets with CLR on CIFAR-10 reached an average accuracy of 93.33%, compared to 92.46% with fixed learning rates.
- ImageNet with AlexNet and GoogLeNet: For the AlexNet architecture, the triangular2 policy improved accuracy slightly, by about 0.4%, over fixed learning rates. For GoogLeNet, CLR showed a considerable improvement, with the triangular and exp_range policies reaching significantly better accuracy in fewer iterations.
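The experiments in the paper were run with Caffe, but the same policies are exposed by modern frameworks. As one illustration, PyTorch ships torch.optim.lr_scheduler.CyclicLR with triangular, triangular2, and exp_range modes; the model, data, and hyper-parameter values below are placeholders, not the paper's settings.

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 2)                       # stand-in for any network
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# base_lr and max_lr would normally come from an LR range test.
scheduler = optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=0.001, max_lr=0.006,
    step_size_up=2000, mode="triangular2")

for step in range(10_000):                     # placeholder training loop
    optimizer.zero_grad()
    loss = model(torch.randn(32, 10)).sum()    # dummy forward pass and loss
    loss.backward()
    optimizer.step()
    scheduler.step()                           # advance the cycle once per batch
```

Calling scheduler.step() once per batch, rather than once per epoch, matches the per-iteration cycling described in the paper.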
Implications and Future Work
The practical implications of CLR are substantial, offering a near-optimal setting of learning rates that reduces the computational overhead and accelerates the training process. Theoretical implications include potential insights into the dynamics of stochastic gradient descent and its susceptibility to saddle points within high-dimensional spaces.
Moving forward, further exploration could include applying CLR to recurrent neural networks and examining its effects across different domains. The theoretical basis of CLR could also be expanded to provide a deeper understanding of the underlying mechanics.
Overall, Leslie N. Smith's research presents a convincing argument for the adoption of cyclical learning rates in neural network training regimes, providing a tool that simplifies hyper-parameter tuning while enhancing performance.