MADA: Meta-Adaptive Optimizers through hyper-gradient Descent (2401.08893v3)
Abstract: Following the introduction of Adam, several novel adaptive optimizers for deep learning have been proposed. These optimizers typically excel in some tasks but may not outperform Adam uniformly across all tasks. In this work, we introduce Meta-Adaptive Optimizers (MADA), a unified optimizer framework that can generalize several known optimizers and dynamically learn the most suitable one during training. The key idea in MADA is to parameterize the space of optimizers and dynamically search through it using hyper-gradient descent during training. We empirically compare MADA to other popular optimizers on vision and language tasks, and find that it consistently outperforms Adam and the other baselines while remaining robust to sub-optimally tuned hyper-parameters. During GPT-2 training and fine-tuning, MADA achieves a larger validation improvement over Adam than the other popular optimizers do. We also propose AVGrad, a modification of AMSGrad that replaces the maximum operator with averaging, which is better suited to hyper-gradient optimization. Finally, we provide a convergence analysis showing that parameterized interpolations of optimizers can improve their error bounds (up to constants), hinting at an advantage for meta-optimizers.
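To make the two ideas in the abstract concrete, here is a minimal NumPy sketch of (a) interpolating between optimizer updates with a coefficient that is itself learned by hyper-gradient descent, and (b) an AVGrad-style second moment that averages the sequence of second-moment estimates instead of taking AMSGrad's running maximum. This is an illustration only, not the paper's actual parameterization: the two-way Adam/momentum interpolation, the function name `mada_sketch`, the coefficient `c`, and the hyper-parameters `lr` and `hyper_lr` are our own assumptions for the sketch.

```python
# Minimal sketch of the MADA idea: interpolate between two optimizer updates with a
# learnable coefficient c, and adapt c during training by hyper-gradient descent.
# NOT the paper's parameterization (which covers several known optimizers and more
# than one coefficient); this toy keeps a single coefficient so the hyper-gradient
# derivation stays explicit.
import numpy as np


def mada_sketch(grad_fn, theta, steps=100, lr=1e-2, hyper_lr=1e-2,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """Toy interpolation between an Adam-style and a momentum-SGD-style update.

    grad_fn(theta) -> gradient of the loss at theta (theta is a 1-D array here).
    The mixing coefficient c in [0, 1] is trained online with hyper-gradients:
    since theta_t = theta_{t-1} - lr * (c * a_{t-1} + (1 - c) * s_{t-1}),
    dL/dc ~= g_t . (d theta_t / dc) = -lr * g_t . (a_{t-1} - s_{t-1}).
    """
    m = np.zeros_like(theta)      # first moment (shared by both updates)
    v = np.zeros_like(theta)      # Adam-style second moment
    v_avg = np.zeros_like(theta)  # AVGrad-style running *average* of v (vs. AMSGrad's max)
    c = 0.5                       # interpolation coefficient, learned online
    prev_adam, prev_sgd = np.zeros_like(theta), np.zeros_like(theta)

    for t in range(1, steps + 1):
        g = grad_fn(theta)

        # Hyper-gradient step on c, using the update directions from the previous step.
        hyper_grad = -lr * np.dot(g, prev_adam - prev_sgd)
        c = float(np.clip(c - hyper_lr * hyper_grad, 0.0, 1.0))

        # Moment updates.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # AVGrad: average the second-moment sequence rather than taking a running max.
        v_avg = ((t - 1) * v_avg + v) / t

        # Bias-corrected Adam-style direction using the averaged second moment
        # (bias correction is simplified here).
        m_hat = m / (1 - beta1 ** t)
        v_hat = v_avg / (1 - beta2 ** t)
        adam_update = m_hat / (np.sqrt(v_hat) + eps)
        sgd_update = m  # heavy-ball / momentum-SGD direction

        update = c * adam_update + (1 - c) * sgd_update
        theta = theta - lr * update
        prev_adam, prev_sgd = adam_update, sgd_update

    return theta, c


if __name__ == "__main__":
    # Toy quadratic problem: minimize 0.5 * ||theta||^2, whose gradient is theta.
    theta0 = np.ones(5)
    theta, c = mada_sketch(lambda th: th, theta0)
    print("final theta:", theta, "learned c:", c)
```

The averaging in `v_avg` also illustrates why the abstract calls AVGrad more hyper-gradient friendly: an average is smooth in its inputs, whereas AMSGrad's elementwise maximum is piecewise and passes gradient only through the current maximizer.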
Authors: Kaan Ozkara, Can Karakus, Parameswaran Raman, Mingyi Hong, Shoham Sabach, Branislav Kveton, Volkan Cevher