
Robust Learning Rate Selection for Stochastic Optimization via Splitting Diagnostic (1910.08597v5)

Published 18 Oct 2019 in stat.ML, cs.LG, math.OC, and stat.ME

Abstract: This paper proposes SplitSGD, a new dynamic learning rate schedule for stochastic optimization. The method decreases the learning rate, for better adaptation to the local geometry of the objective function, whenever a stationary phase is detected, that is, whenever the iterates are likely to be bouncing around a vicinity of a local minimum. The detection is performed by splitting the single thread into two and using the inner product of the gradients from the two threads as a measure of stationarity. Owing to this simple yet provably valid stationarity detection, SplitSGD is easy to implement and incurs essentially no additional computational cost beyond standard SGD. Through a series of extensive experiments, we show that the method is suitable both for convex problems and for training (non-convex) neural networks, with performance comparing favorably to other stochastic optimization methods. Importantly, the method is observed to be very robust with a single set of default parameters across a wide range of problems and, moreover, can yield better generalization than other adaptive gradient methods such as Adam.
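
The splitting diagnostic described in the abstract lends itself to a compact sketch. The snippet below is a hypothetical, simplified rendering of the idea, not the paper's reference implementation: the function name, default values (decay factor, window lengths, the threshold on the fraction of negative inner products), and the choice to continue from the average of the two threads are illustrative assumptions.

```python
import numpy as np

def split_sgd(grad_fn, x0, lr=0.1, decay=0.5, rounds=10,
              single_steps=1000, split_steps=20, neg_frac=0.5, rng=None):
    """Sketch of a SplitSGD-style schedule (names and defaults are
    illustrative, not the paper's exact values).

    grad_fn(x, rng) should return a stochastic gradient at x.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for _ in range(rounds):
        # 1) Run plain SGD with the current constant learning rate.
        for _ in range(single_steps):
            x = x - lr * grad_fn(x, rng)
        # 2) Split into two threads started from the same iterate,
        #    each fed independent stochastic gradients.
        x1, x2 = x.copy(), x.copy()
        neg_signs = []
        for _ in range(split_steps):
            g1, g2 = grad_fn(x1, rng), grad_fn(x2, rng)
            # A negative inner product suggests the two threads are driven
            # by noise rather than a common descent direction (stationarity).
            neg_signs.append(np.dot(g1, g2) < 0)
            x1 = x1 - lr * g1
            x2 = x2 - lr * g2
        # 3) If the inner products are mostly negative, declare a stationary
        #    phase and decay the learning rate.
        if np.mean(neg_signs) >= neg_frac:
            lr *= decay
        # Continue from the average of the two threads (one simple choice).
        x = 0.5 * (x1 + x2)
    return x

# Toy usage: noisy gradient of 0.5 * ||x||^2.
g = lambda x, rng: x + rng.normal(scale=1.0, size=x.shape)
x_hat = split_sgd(g, x0=np.ones(5))
```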
