Variational Stochastic Gradient Descent for Deep Neural Networks (2404.06549v1)

Published 9 Apr 2024 in cs.LG and stat.ML

Abstract: Optimizing deep neural networks is one of the main tasks in successful deep learning. Current state-of-the-art optimizers are adaptive gradient-based optimization methods such as Adam. Recently, there has been an increasing interest in formulating gradient-based optimizers in a probabilistic framework for better estimation of gradients and modeling uncertainties. Here, we propose to combine both approaches, resulting in the Variational Stochastic Gradient Descent (VSGD) optimizer. We model gradient updates as a probabilistic model and utilize stochastic variational inference (SVI) to derive an efficient and effective update rule. Further, we show how our VSGD method relates to other adaptive gradient-based optimizers like Adam. Lastly, we carry out experiments on two image classification datasets and four deep neural network architectures, where we show that VSGD outperforms Adam and SGD.

Authors (4)
  1. Haotian Chen (30 papers)
  2. Anna Kuzina (13 papers)
  3. Babak Esmaeili (10 papers)
  4. Jakub M Tomczak (1 paper)

Summary

Introducing Variational Stochastic Gradient Descent: A Probabilistic Approach to Optimizing Deep Neural Networks

Introduction to VSGD

This paper presents Variational Stochastic Gradient Descent (VSGD), an optimization technique that combines traditional stochastic gradient descent (SGD) with principles of stochastic variational inference (SVI). VSGD responds to the optimization challenges posed by deep neural networks (DNNs), which are characterized by vast parameter spaces and complex loss landscapes. By modeling gradient updates within a probabilistic framework, VSGD improves gradient estimation while explicitly accounting for the uncertainty inherent in the optimization process.

Probabilistic Modeling of SGD

The foundation of VSGD is to treat the true gradient as a latent variable and the noisy mini-batch gradient as observed data within a probabilistic model. This makes gradient noise explicit and yields more robust gradient estimates than traditional SGD. Specifically, VSGD represents the true and noisy gradients with Gaussian distributions whose precision variables capture the corresponding uncertainties. Stochastic variational inference is then used to efficiently approximate the posterior distribution over the true gradient.
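
As a rough illustration of this setup, a generative model of this kind can be sketched as follows; the notation and the Gamma hyperprior are assumptions made here for concreteness, not the paper's exact specification.

```latex
% Minimal sketch of a gradient-denoising model of this kind.
% Symbols and priors are illustrative assumptions, not the paper's exact model.
\begin{align*}
  g_t \mid \lambda_g &\sim \mathcal{N}\big(\mu_{t-1},\ \lambda_g^{-1}\big)
    && \text{latent ``true'' gradient} \\
  \hat{g}_t \mid g_t,\ \lambda_{\hat{g}} &\sim \mathcal{N}\big(g_t,\ \lambda_{\hat{g}}^{-1}\big)
    && \text{observed noisy mini-batch gradient} \\
  \lambda_g,\ \lambda_{\hat{g}} &\sim \mathrm{Gamma}(\alpha,\ \beta)
    && \text{precisions of the two Gaussians}
\end{align*}
```

In such a model, SVI maintains an approximate posterior $q(g_t, \lambda_g, \lambda_{\hat{g}})$, and the posterior mean of $g_t$ serves as a denoised gradient in the parameter update.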

Theoretical Connections and Empirical Evaluation

A noteworthy aspect of the VSGD formulation is its theoretical connection to established optimizers such as Adam and normalized SGD. By situating VSGD within the broader landscape of gradient-based optimizers, the paper presents it as a unifying framework that recovers several existing methods under specific parameter settings.
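
For reference, the standard Adam update is shown below. Loosely speaking, VSGD's posterior-mean gradient plays the role of the first-moment estimate $m_t$, while its posterior precision terms play the role of the adaptive scaling $1/(\sqrt{\hat{v}_t}+\epsilon)$; the precise correspondence under specific parameter settings is derived in the paper, not here.

```latex
% Standard Adam update (Kingma & Ba, 2015), shown for comparison only.
\begin{align*}
  m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,\hat{g}_t,\\
  v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,\hat{g}_t^{\,2},\\
  \theta_t &= \theta_{t-1} - \eta\,\frac{m_t/(1-\beta_1^{t})}{\sqrt{v_t/(1-\beta_2^{t})}+\epsilon}.
\end{align*}
```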

Empirical evaluations on two image classification datasets and four DNN architectures show that VSGD achieves faster convergence and lower generalization error than widely used optimizers such as Adam and SGD, demonstrating its effectiveness in optimizing overparameterized DNNs.

VSGD Versus Constant VSGD

The paper also introduces Constant VSGD, a simplification of the original VSGD that assumes a constant variance ratio between the true and observed gradient distributions. This simplification makes comparisons with Adam and SGD with momentum more direct and illustrates how the VSGD framework can be adapted to different optimization settings.
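
To convey the flavor of such a constant-ratio simplification, here is a minimal, self-contained sketch of an SGD step that denoises gradients with an elementwise Kalman-style filter under a fixed observation-to-process variance ratio. The class name, constants, and update form are hypothetical illustrations, not the paper's Constant VSGD algorithm.

```python
import numpy as np

class ConstantRatioDenoisedSGD:
    """Illustrative sketch only: SGD that filters noisy gradients with a
    Kalman-style update under a fixed observation/process variance ratio.
    Not the paper's Constant VSGD algorithm."""

    def __init__(self, lr=0.1, noise_ratio=4.0):
        self.lr = lr          # learning rate
        self.r = noise_ratio  # assumed Var(noisy grad) / Var(true-gradient drift)
        self.mu = None        # running estimate of the true gradient
        self.var = None       # its estimation variance (in drift units)

    def step(self, params, noisy_grad):
        if self.mu is None:
            self.mu = np.zeros_like(noisy_grad)
            self.var = np.ones_like(noisy_grad)
        # Predict: let the true gradient drift by one unit of process variance.
        var_pred = self.var + 1.0
        # Update: Kalman gain with a constant observation-noise ratio r.
        gain = var_pred / (var_pred + self.r)
        self.mu = self.mu + gain * (noisy_grad - self.mu)
        self.var = (1.0 - gain) * var_pred
        # Use the denoised gradient estimate for the parameter update.
        return params - self.lr * self.mu
```

In a training loop one would call `params = opt.step(params, grad)` each iteration. Because the ratio is fixed, the gain settles to a constant and the filtered gradient reduces to an exponential moving average of noisy gradients, the same form as the first-moment estimate in Adam and SGD with momentum, which is what makes the comparison direct.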

Future Directions and Broader Impact

Looking ahead, the research opens avenues for modeling dependencies between gradients and for incorporating second-order momentum into VSGD updates. Beyond image classification, VSGD may also prove useful in areas such as generative modeling and reinforcement learning.

This paper is a significant contribution to the optimization toolbox available for deep learning research. By bridging probabilistic modeling and gradient-based optimization, VSGD makes a compelling case for a more nuanced approach to training DNNs, and it offers a practical way to improve both the effectiveness and the efficiency of neural network training.
