Relationship between Batch Size and Number of Steps Needed for Nonconvex Optimization of Stochastic Gradient Descent using Armijo Line Search (2307.13831v4)

Published 25 Jul 2023 in cs.LG and math.OC

Abstract: While stochastic gradient descent (SGD) can use various learning rates, such as constant or diminishing rates, previous numerical results showed that SGD performs better than other deep learning optimizers when it uses learning rates given by line search methods. In this paper, we perform a convergence analysis of SGD with a learning rate given by an Armijo line search for nonconvex optimization, showing that the upper bound on the expectation of the squared norm of the full gradient becomes small when the number of steps and the batch size are large. Next, we show that, for SGD with the Armijo-line-search learning rate, the number of steps needed for nonconvex optimization is a monotone decreasing convex function of the batch size; that is, the number of steps needed for nonconvex optimization decreases as the batch size increases. Furthermore, we show that the stochastic first-order oracle (SFO) complexity, which is the stochastic gradient computation cost, is a convex function of the batch size; that is, there exists a critical batch size that minimizes the SFO complexity. Finally, we provide numerical results that support our theoretical results. The numerical results indicate that the number of steps needed for training deep neural networks decreases as the batch size increases and that there exist critical batch sizes that can be estimated from the theoretical results.
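The method analyzed in the paper is mini-batch SGD whose learning rate at each step is chosen by an Armijo (backtracking) line search on the current mini-batch loss. The sketch below is a minimal, self-contained illustration on a toy least-squares problem, not the authors' implementation; the starting step size eta_max, sufficient-decrease constant c, and shrinkage factor beta are generic line-search hyperparameters assumed here for demonstration.

```python
# A minimal sketch (not the authors' code) of mini-batch SGD with an Armijo
# backtracking line search: at each step the learning rate is shrunk until the
# sufficient-decrease condition holds on the current mini-batch loss.
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: an average of per-sample least-squares losses (convex here,
# but the same procedure applies unchanged to nonconvex losses).
n, d = 1000, 20
A = rng.normal(size=(n, d))
y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def batch_loss_grad(x, idx):
    """Mini-batch loss and gradient for the samples indexed by idx."""
    r = A[idx] @ x - y[idx]
    return 0.5 * np.mean(r ** 2), A[idx].T @ r / len(idx)

def sgd_armijo(x0, batch_size, steps, eta_max=1.0, c=0.1, beta=0.5):
    """SGD whose step size is chosen by backtracking until the Armijo condition
    f_B(x - eta * g) <= f_B(x) - c * eta * ||g||^2 holds on the mini-batch B."""
    x = x0.copy()
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        loss, g = batch_loss_grad(x, idx)
        eta = eta_max
        while batch_loss_grad(x - eta * g, idx)[0] > loss - c * eta * (g @ g):
            eta *= beta          # backtrack: shrink the step size
            if eta < 1e-10:      # safeguard against an endless loop
                break
        x = x - eta * g
    return x

x_hat = sgd_armijo(np.zeros(d), batch_size=64, steps=200)
print("final full-sample loss:", 0.5 * np.mean((A @ x_hat - y) ** 2))
```

To see how a step count that is decreasing and convex in the batch size b can make the SFO complexity K(b) * b convex with an interior minimizer (the critical batch size), one can plug in a hypothetical step-count model. The functional form and the constants C1, C2, C3 below are assumptions for illustration only, not the bound derived in the paper.

```python
# Hypothetical illustration only: suppose the number of steps needed behaved
# like K(b) = C1 * b / (C2 * b - C3) for b > C3 / C2, which is monotone
# decreasing and convex in b. The SFO complexity K(b) * b is then convex in b
# and minimized at an interior "critical" batch size.
import numpy as np

C1, C2, C3 = 1.0e4, 1.0, 64.0
b = np.arange(80, 2049, dtype=float)
K = C1 * b / (C2 * b - C3)    # hypothetical number of steps needed
sfo = K * b                   # total stochastic gradient computations
print("critical batch size on this grid:", int(b[np.argmin(sfo)]))  # 128
print("closed-form minimizer 2*C3/C2:", 2 * C3 / C2)                # 128.0
```

On this hypothetical model, enlarging the batch beyond the critical value reduces the step count only marginally while the per-step gradient cost keeps growing, which is the trade-off the SFO-complexity analysis captures.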

Authors (2)
  1. Yuki Tsukada (1 paper)
  2. Hideaki Iiduka (34 papers)