AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods (2402.11215v3)

Published 17 Feb 2024 in cs.LG, math.OC, and stat.ML

Abstract: The choice of batch sizes in minibatch stochastic gradient optimizers is critical in large-scale model training for both optimization and generalization performance. Although large-batch training is arguably the dominant training paradigm for large-scale deep learning due to hardware advances, the generalization performance of the model deteriorates compared to small-batch training, leading to the so-called "generalization gap" phenomenon. To mitigate this, we investigate adaptive batch size strategies derived from adaptive sampling methods, originally developed only for stochastic gradient descent. Given the significant interplay between learning rates and batch sizes, and considering the prevalence of adaptive gradient methods in deep learning, we emphasize the need for adaptive batch size strategies in these contexts. We introduce AdAdaGrad and its scalar variant AdAdaGradNorm, which progressively increase batch sizes during training, while model updates are performed using AdaGrad and AdaGradNorm. We prove that AdAdaGradNorm converges with high probability at a rate of $\mathscr{O}(1/K)$ to find a first-order stationary point of smooth nonconvex functions within $K$ iterations. AdAdaGrad also demonstrates similar convergence properties when integrated with a novel coordinate-wise variant of our adaptive batch size strategies. We corroborate our theoretical claims by performing image classification experiments, highlighting the merits of the proposed schemes in terms of both training efficiency and model generalization. Our work unveils the potential of adaptive batch size strategies for adaptive gradient optimizers in large-scale model training.
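
The abstract describes the method only at a high level: AdaGrad-family updates coupled with a batch size that grows when a sampling criterion is triggered. The following is a minimal, illustrative sketch of that idea, not the authors' implementation. It pairs an AdaGradNorm-style scalar step size with a crude two-half "norm test" proxy for the gradient variance; the function names and hyperparameters such as `theta`, `init_batch`, and `max_batch` are hypothetical, and the paper's coordinate-wise variant and theoretical conditions are not reproduced here.

```python
import torch
from torch.utils.data import DataLoader


def grad_vector(model, loss_fn, x, y):
    """Flattened minibatch gradient over all trainable parameters.

    Assumes every trainable parameter receives a gradient from the loss.
    """
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.detach().flatten()
                      for p in model.parameters() if p.requires_grad])


def train_adagradnorm_adaptive_batch(model, loss_fn, dataset, *,
                                     lr=0.1, eps=1e-8, theta=0.5,
                                     init_batch=32, max_batch=4096, epochs=10):
    params = [p for p in model.parameters() if p.requires_grad]
    accum = torch.zeros(())        # scalar accumulator for the AdaGradNorm step size
    batch_size = init_batch

    for _ in range(epochs):
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
        for x, y in loader:
            half = x.shape[0] // 2
            if half == 0:
                continue
            # Gradients on two halves of the minibatch; their average is
            # (approximately) the full-minibatch gradient.
            g1 = grad_vector(model, loss_fn, x[:half], y[:half])
            g2 = grad_vector(model, loss_fn, x[half:], y[half:])
            g = 0.5 * (g1 + g2)

            # AdaGradNorm update: one scalar step size shared by all coordinates.
            accum += g.pow(2).sum()
            step = (lr / (accum.sqrt() + eps)).item()
            offset = 0
            with torch.no_grad():
                for p in params:
                    n = p.numel()
                    p.add_(g[offset:offset + n].view_as(p), alpha=-step)
                    offset += n

            # Crude norm test: ||g1 - g2||^2 / 4 serves as a variance proxy for the
            # minibatch gradient; if it dominates theta^2 * ||g||^2, double the batch.
            if 0.25 * (g1 - g2).pow(2).sum() > theta ** 2 * g.pow(2).sum():
                batch_size = min(2 * batch_size, max_batch)
                break              # rebuild the loader with the larger batch size
    return model
```

Doubling on a failed test is only one possible schedule; the paper's schemes derive the batch-size increase from adaptive sampling criteria, which the two-half proxy above only approximates.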
