DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent Method (2305.16284v4)

Published 25 May 2023 in cs.LG, math.OC, and stat.ML

Abstract: This paper proposes a new easy-to-implement parameter-free gradient-based optimizer: DoWG (Distance over Weighted Gradients). We prove that DoWG is efficient -- matching the convergence rate of optimally tuned gradient descent in convex optimization up to a logarithmic factor without tuning any parameters, and universal -- automatically adapting to both smooth and nonsmooth problems. While popular algorithms following the AdaGrad framework compute a running average of the squared gradients to use for normalization, DoWG maintains a new distance-based weighted version of the running average, which is crucial to achieve the desired properties. To complement our theory, we also show empirically that DoWG trains at the edge of stability, and validate its effectiveness on practical machine learning tasks.
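The abstract's key mechanism, replacing AdaGrad's plain running sum of squared gradients with a distance-weighted one, can be sketched in a few lines. The snippet below is an unofficial illustration, not the authors' reference implementation: the step size (running maximum distance from the starting point, squared, divided by the square root of the distance-weighted sum of squared gradient norms) and the small initial estimate r_eps follow our reading of the DoWG recipe and should be checked against the paper; all names and defaults are illustrative.

```python
import numpy as np

def dowg(grad, x0, steps=1000, r_eps=1e-6):
    """Sketch of a DoWG-style parameter-free gradient method.

    grad  : callable returning the gradient at a point
    x0    : initial iterate (NumPy array)
    r_eps : small initial distance estimate standing in for the unknown
            distance to the optimum (illustrative default)
    """
    x0 = np.array(x0, dtype=float)
    x = x0.copy()
    r_bar = r_eps   # running maximum distance from the starting point
    v = 0.0         # distance-weighted running sum of squared gradient norms
    for _ in range(steps):
        g = grad(x)
        r_bar = max(r_bar, float(np.linalg.norm(x - x0)))
        v += r_bar ** 2 * float(np.dot(g, g))      # weight ||g||^2 by the current r_bar^2
        eta = r_bar ** 2 / (np.sqrt(v) + 1e-12)    # distance^2 over weighted gradients (tiny guard for a zero first gradient)
        x = x - eta * g
    return x

# Toy usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final = dowg(lambda x: x, x0=np.ones(10))
```

Note that no learning rate is supplied: the effective step size is built entirely from observed distances and gradients, which is the parameter-free property the abstract highlights.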
