Variational Stochastic Gradient Descent for Deep Neural Networks (2404.06549v1)
Abstract: Optimizing deep neural networks is one of the main tasks in successful deep learning. Current state-of-the-art optimizers are adaptive gradient-based optimization methods such as Adam. Recently, there has been an increasing interest in formulating gradient-based optimizers in a probabilistic framework for better estimation of gradients and modeling uncertainties. Here, we propose to combine both approaches, resulting in the Variational Stochastic Gradient Descent (VSGD) optimizer. We model gradient updates as a probabilistic model and utilize stochastic variational inference (SVI) to derive an efficient and effective update rule. Further, we show how our VSGD method relates to other adaptive gradient-based optimizers like Adam. Lastly, we carry out experiments on two image classification datasets and four deep neural network architectures, where we show that VSGD outperforms Adam and SGD.
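The abstract describes the general recipe (treat the noisy minibatch gradient as an observation of a latent "true" gradient, infer a per-parameter belief about it with stochastic variational inference, and update with the uncertainty-aware estimate) but does not spell out the update equations. The sketch below is only a hedged illustration of that family of optimizers, not the paper's actual VSGD rule: the class name `VSGDLike`, the hyperparameters `beta` and `eps`, and the exponential-moving-average surrogate for the SVI-derived updates are all assumptions made for illustration.

```python
# Hypothetical sketch of a "variational SGD"-style adaptive optimizer in PyTorch.
# NOT the paper's VSGD update rule (the abstract does not state it); it only shows
# the general pattern: keep a per-parameter belief (mean + second moment) about the
# true gradient and step with the mean, downweighted where uncertainty is high.
import torch


class VSGDLike(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, beta=0.9, eps=1e-8):
        defaults = dict(lr=lr, beta=beta, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr, beta, eps = group["lr"], group["beta"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if len(state) == 0:
                    # Crude summary of a belief over the true gradient.
                    state["mean"] = torch.zeros_like(p)
                    state["sq"] = torch.zeros_like(p)
                m, v = state["mean"], state["sq"]
                # Exponential-moving-average updates stand in here for the
                # SVI-derived posterior updates described in the abstract.
                m.mul_(beta).add_(g, alpha=1 - beta)
                v.mul_(beta).addcmul_(g, g, value=1 - beta)
                # Step with the estimated gradient, scaled down where the
                # (approximate) uncertainty about it is large.
                p.add_(m / (v.sqrt() + eps), alpha=-lr)
        return loss
```

Used like any other `torch.optim` optimizer (construct with `model.parameters()`, call `loss.backward()`, then `opt.step()`); the key difference claimed in the paper is that the mean/variance bookkeeping follows from a probabilistic model and SVI rather than fixed moving averages, which is also how the authors relate VSGD to Adam.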
Authors: Haotian Chen, Anna Kuzina, Babak Esmaeili, Jakub M. Tomczak