Robust Learning Rate Selection for Stochastic Optimization via Splitting Diagnostic (1910.08597v5)
Abstract: This paper proposes SplitSGD, a new dynamic learning rate schedule for stochastic optimization. This method decreases the learning rate for better adaptation to the local geometry of the objective function whenever a stationary phase is detected, that is, whenever the iterates appear to be bouncing around a vicinity of a local minimum. The detection is performed by splitting the single thread into two and using the inner product of the gradients from the two threads as a measure of stationarity. Owing to this simple yet provably valid stationarity detection, SplitSGD is easy to implement and incurs essentially no additional computational cost over standard SGD. Through a series of extensive experiments, we show that this method is appropriate both for convex problems and for training (non-convex) neural networks, with performance that compares favorably to other stochastic optimization methods. Importantly, this method is observed to be very robust under a single set of default parameters across a wide range of problems and, moreover, can yield better generalization performance than other adaptive gradient methods such as Adam.
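The splitting diagnostic described in the abstract can be illustrated with a short sketch: two threads are launched from the same iterate, the inner products of their stochastic gradients serve as a stationarity signal, and the learning rate is decayed multiplicatively when stationarity is detected. The sketch below is only an illustration of the idea, not the paper's exact algorithm; the gradient oracle `grad_fn`, the window length, and the decay factor are hypothetical choices.

```python
import numpy as np

def split_diagnostic_sgd(grad_fn, w0, lr=0.1, decay=0.5, n_rounds=5,
                         split_len=50, seed=0):
    """Simplified sketch of an SGD loop with a splitting-based
    stationarity diagnostic (illustrative assumptions throughout)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for _ in range(n_rounds):
        # Split the single thread into two threads starting from the same iterate.
        w1, w2 = w.copy(), w.copy()
        dot_signs = []
        for _ in range(split_len):
            g1 = grad_fn(w1, rng)  # independent stochastic gradients
            g2 = grad_fn(w2, rng)
            dot_signs.append(np.sign(np.dot(g1, g2)))
            w1 -= lr * g1
            w2 -= lr * g2
        # If the gradients from the two threads are, on average, orthogonal or
        # negatively correlated, declare a stationary phase and decay the rate.
        if np.mean(dot_signs) <= 0:
            lr *= decay
        # Continue from one of the threads (an arbitrary illustrative choice).
        w = w1
    return w, lr

# Example usage on a noisy quadratic objective 0.5 * ||w||^2,
# whose stochastic gradient is w plus Gaussian noise.
final_w, final_lr = split_diagnostic_sgd(
    lambda w, rng: w + rng.normal(scale=0.5, size=w.shape),
    w0=np.ones(10))
```

The key design point conveyed by the sketch is that stationarity is inferred from the sign of gradient inner products across the two threads rather than from the loss value, so the diagnostic adds no extra gradient evaluations beyond those of the two SGD threads.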