Is Scaling Learned Optimizers Worth It? Evaluating The Value of VeLO's 4000 TPU Months (2310.18191v2)
Abstract: We analyze VeLO (versatile learned optimizer), the largest-scale attempt to date to train a general-purpose "foundational" optimizer. VeLO was trained on thousands of machine learning tasks using over 4000 TPU months, with the goal of producing an optimizer that generalizes to new problems while being hyperparameter-free and outperforming industry standards such as Adam. We independently evaluate VeLO on the MLCommons optimizer benchmark suite. We find that, contrary to initial claims: (1) VeLO has a critical hyperparameter that requires problem-specific tuning, (2) VeLO does not necessarily outperform competitors in the quality of the solution found, and (3) VeLO is not faster than competing optimizers at reducing the training loss. These observations call into question VeLO's generality and the value of the investment in training it.
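As a rough illustration of the kind of head-to-head comparison the abstract describes, the sketch below runs Adam against a placeholder "learned optimizer" on a toy regression problem and reports the final training loss for each. This is a minimal sketch under loose assumptions: `velo_like_update`, the hand-written schedule inside it, and the toy quadratic task are stand-ins invented for illustration, not the released VeLO model or an MLCommons workload.

```python
# Illustrative only: a toy training-loss comparison between Adam (whose learning
# rate a fair baseline must tune) and a placeholder for a learned optimizer.
import numpy as np

def loss_and_grad(w, X, y):
    """Mean-squared-error loss and its gradient for linear regression."""
    residual = X @ w - y
    return 0.5 * np.mean(residual ** 2), X.T @ residual / len(y)

def adam_step(w, g, state, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One standard Adam update; `lr` is the hyperparameter being tuned."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), (m, v, t)

def velo_like_update(w, g, step, total_steps):
    """Placeholder for a learned optimizer that is told the training horizon.
    The decayed step size here is a toy schedule, NOT VeLO's learned rule."""
    lr = 1e-2 * (1 - step / total_steps)
    return w - lr * g

rng = np.random.default_rng(0)
X, w_true = rng.normal(size=(256, 10)), rng.normal(size=10)
y = X @ w_true

total_steps = 200
w_adam, adam_state = np.zeros(10), (np.zeros(10), np.zeros(10), 0)
w_velo = np.zeros(10)
for step in range(total_steps):
    loss_a, g_a = loss_and_grad(w_adam, X, y)
    w_adam, adam_state = adam_step(w_adam, g_a, adam_state)
    loss_v, g_v = loss_and_grad(w_velo, X, y)
    w_velo = velo_like_update(w_velo, g_v, step, total_steps)

print(f"final training loss: adam={loss_a:.4f}  learned-opt stub={loss_v:.4f}")
```

In an actual evaluation of the kind reported above, the stub would be replaced by the released VeLO optimizer, the toy task by the benchmark workloads, and the comparison would track training loss over wall-clock time or step budgets rather than a single final value.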