QLABGrad: a Hyperparameter-Free and Convergence-Guaranteed Scheme for Deep Learning (2302.00252v2)
Abstract: The learning rate is a critical hyperparameter for deep learning tasks since it determines how far the model parameters move at each update. In practice, however, the learning rate is typically chosen by empirical judgment, which may not yield satisfactory results without intensive trial-and-error experiments. In this study, we propose a novel learning rate adaptation scheme called QLABGrad. Without any user-specified hyperparameter, QLABGrad automatically determines the learning rate by optimizing the Quadratic Loss Approximation-Based (QLAB) function for a given gradient descent direction, requiring only one extra forward propagation. We theoretically prove the convergence of QLABGrad under a Lipschitz-smoothness condition on the loss function. Experimental results with multiple architectures, including MLP, CNN, and ResNet, on the MNIST, CIFAR10, and ImageNet datasets demonstrate that QLABGrad outperforms various competing schemes for deep learning.
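The abstract describes the mechanism only at a high level, so the following is a minimal sketch of one plausible reading, not the authors' implementation: approximate the loss along the negative-gradient direction by a quadratic in the learning rate, estimate the curvature with the single extra forward pass at a probe step, and take the minimizer of that quadratic as the step size. The probe step `alpha0`, the `fallback` rate, and the helper name `qlab_step` are illustrative assumptions.

```python
import torch


def qlab_step(model, loss_fn, x, y, alpha0=1e-2, fallback=1e-3):
    """One update with a quadratic-loss-approximation-based step size (illustrative sketch).

    Along the descent direction d = -g, model the loss as
        phi(alpha) ~= phi(0) - alpha * ||g||^2 + 0.5 * a * alpha^2,
    estimate the curvature a from one extra forward pass at a probe step
    alpha0, and step with the quadratic's minimizer alpha* = ||g||^2 / a.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Forward/backward pass at the current parameters.
    model.zero_grad()
    loss0 = loss_fn(model(x), y)
    loss0.backward()
    grads = [p.grad.detach().clone() for p in params]
    grad_sq = sum((g * g).sum() for g in grads)  # ||g||^2 = -phi'(0)

    with torch.no_grad():
        # One extra forward pass at the probe point theta - alpha0 * g.
        for p, g in zip(params, grads):
            p.add_(g, alpha=-alpha0)
        loss_probe = loss_fn(model(x), y)
        for p, g in zip(params, grads):
            p.add_(g, alpha=alpha0)  # restore the original parameters

        # Fit the curvature of the 1-D quadratic from the probe evaluation.
        curvature = 2.0 * (loss_probe - loss0.detach() + alpha0 * grad_sq) / alpha0**2

        # Minimizer of the quadratic; fall back if the curvature is not positive.
        alpha = (grad_sq / curvature).item() if curvature > 0 else fallback

        # Apply the adaptive step theta <- theta - alpha * g.
        for p, g in zip(params, grads):
            p.add_(g, alpha=-alpha)

    return loss0.item(), alpha
```

In a usual minibatch loop this would be called as `loss, lr = qlab_step(net, criterion, xb, yb)`; whether QLABGrad uses exactly this curvature estimate or a different probe scheme is not specified in the abstract.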