Neural Optimizer Equation, Decay Function, and Learning Rate Schedule Joint Evolution (2404.06679v1)
Abstract: A major contributor to the quality of a deep learning model is the selection of the optimizer. We propose a new dual-joint search space in the realm of neural optimizer search (NOS), along with an integrity check, to automate the process of finding deep learning optimizers. Our dual-joint search space simultaneously allows for the optimization of not only the update equation, but also internal decay functions and learning rate schedules for optimizers. We search the space using our proposed mutation-only, particle-based genetic algorithm, which can be massively parallelized for our domain-specific problem. We evaluate our candidate optimizers on the CIFAR-10 dataset using a small ConvNet. To assess generalization, the final optimizers were then transferred to large-scale image classification on CIFAR-100 and TinyImageNet, while also being fine-tuned on Flowers102, Cars196, and Caltech101 using EfficientNetV2Small. We found multiple optimizers, learning rate schedules, and Adam variants that outperformed Adam, as well as other standard deep learning optimizers, across the image classification tasks.
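The abstract describes the method only at a high level. The sketch below is a minimal illustration of the general shape of a mutation-only, particle-based search over jointly encoded optimizer components (update rule, internal decay function, learning rate schedule) guarded by an integrity check. Everything concrete here is an assumption for illustration: the operator pools, the toy quadratic fitness proxy, and the per-particle hill-climbing acceptance rule are hypothetical stand-ins, not the paper's grammar, integrity check, or evaluation pipeline.

```python
# Illustrative sketch only: a mutation-only, particle-based search over a
# jointly encoded optimizer (update rule, decay function, LR schedule).
# Operator pools, fitness proxy, and acceptance rule are hypothetical.
import math
import random

# Candidate = (update op, decay fn, schedule fn), each drawn from a tiny pool.
UPDATE_OPS = {
    "sign":     lambda g, m: math.copysign(1.0, m),
    "momentum": lambda g, m: m,
    "mixed":    lambda g, m: 0.5 * g + 0.5 * m,
}
DECAY_FNS = {
    "none":   lambda t, T: 1.0,
    "linear": lambda t, T: 1.0 - t / T,
    "cosine": lambda t, T: 0.5 * (1.0 + math.cos(math.pi * t / T)),
}
SCHEDULES = DECAY_FNS  # reuse the same primitive pool for the LR schedule

def random_candidate(rng):
    return (rng.choice(list(UPDATE_OPS)), rng.choice(list(DECAY_FNS)),
            rng.choice(list(SCHEDULES)))

def mutate(cand, rng):
    # Mutation-only: resample exactly one of the three components.
    parts = list(cand)
    i = rng.randrange(3)
    pool = [UPDATE_OPS, DECAY_FNS, SCHEDULES][i]
    parts[i] = rng.choice(list(pool))
    return tuple(parts)

def integrity_check(cand, steps=10):
    # Reject candidates that blow up on synthetic gradients (a stand-in for
    # the paper's integrity check on degenerate update equations).
    u, d, s = UPDATE_OPS[cand[0]], DECAY_FNS[cand[1]], SCHEDULES[cand[2]]
    w, m = 1.0, 0.0
    for t in range(steps):
        g = math.sin(t) * w                 # toy gradient
        m = 0.9 * d(t, steps) * m + g       # decayed momentum accumulator
        w -= 0.1 * s(t, steps) * u(g, m)    # scheduled update step
        if not math.isfinite(w):
            return False
    return True

def fitness(cand):
    # Toy proxy: how close the candidate drives w to 0 on a quadratic loss.
    u, d, s = UPDATE_OPS[cand[0]], DECAY_FNS[cand[1]], SCHEDULES[cand[2]]
    w, m, T = 5.0, 0.0, 50
    for t in range(T):
        g = 2.0 * w
        m = 0.9 * d(t, T) * m + g
        w -= 0.1 * s(t, T) * u(g, m)
    return -abs(w)

def evolve(particles=8, generations=20, seed=0):
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(particles)]
    for _ in range(generations):
        for i, cand in enumerate(population):
            child = mutate(cand, rng)
            # Each particle evolves independently, so the outer loop over
            # particles is trivially parallelizable.
            if integrity_check(child) and fitness(child) > fitness(cand):
                population[i] = child
    return max(population, key=fitness)

if __name__ == "__main__":
    print("best candidate:", evolve())
```

In the actual method, fitness would come from training a small ConvNet on CIFAR-10 rather than a toy quadratic, and the search space is far richer than these three fixed pools.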