Gradient Descent, Stochastic Optimization, and Other Tales (2205.00832v2)
Abstract: The goal of this paper is to debunk and dispel the magic behind black-box optimizers and stochastic optimizers, and to build a solid foundation for how and why these techniques work. The manuscript crystallizes this knowledge by deriving the mathematics behind the strategies from simple intuitions. This tutorial does not shy away from addressing both the formal and informal aspects of gradient descent and stochastic optimization methods, and in doing so it hopes to provide readers with a deeper understanding of these techniques, as well as of when, how, and why to apply them. Gradient descent is one of the most popular algorithms for performing optimization and by far the most common way to optimize machine learning tasks. Its stochastic version has received considerable attention in recent years, particularly for optimizing deep neural networks, where the gradient computed on a single sample or a mini-batch of samples is used to save computational resources and to escape saddle points. In 1951, Robbins and Monro published *A stochastic approximation method*, one of the first modern treatments of stochastic optimization, which estimates local gradients with a new batch of samples. Stochastic optimization has since become a core technology in machine learning, largely owing to the development of the backpropagation algorithm for fitting neural networks. The sole aim of this article is to give a self-contained introduction to the concepts and mathematical tools used in gradient descent and stochastic optimization.
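To make the mini-batch gradient step described above concrete, the following minimal Python sketch runs stochastic gradient descent on a synthetic least-squares problem. It is an illustration, not the paper's implementation: the quadratic objective, learning rate, and batch size are assumptions chosen purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: minimize f(w) = (1/2n) * ||X w - y||^2.
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

def batch_gradient(w, idx):
    """Gradient of the least-squares loss restricted to the mini-batch idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(d)
lr = 0.1          # learning rate (illustrative choice)
batch_size = 32   # mini-batch size (illustrative choice)

for step in range(500):
    # Sample a fresh batch, as in the Robbins-Monro scheme, then take
    # the SGD step: w <- w - lr * (stochastic gradient estimate).
    idx = rng.choice(n, size=batch_size, replace=False)
    w -= lr * batch_gradient(w, idx)

print("distance to w_true:", np.linalg.norm(w - w_true))
```

Because each update touches only `batch_size` rows of `X`, the per-step cost is independent of the dataset size, which is the computational saving the abstract refers to.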
- Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
- Amir Beck. First-order methods in optimization. SIAM, 2017.
- Sue Becker and Yann LeCun. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 Connectionist Models Summer School, 1988.
- Stephen Boyd and Lieven Vandenberghe. Introduction to applied linear algebra: vectors, matrices, and least squares. Cambridge University Press, 2018.
- Yann Dauphin, Harm de Vries, and Yoshua Bengio. Equilibrated adaptive learning rates for non-convex optimization. Advances in Neural Information Processing Systems, 28, 2015.
- Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems, 27, 2014.
- Timothy Dozat. Incorporating Nesterov momentum into Adam. 2016.
- Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. Advances in Neural Information Processing Systems, 30, 2017.
- John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
- Reeves Fletcher and Colin M Reeves. Function minimization by conjugate gradients. The Computer Journal, 7(2):149–154, 1964.
- Gabriel Goh. Why momentum really works. Distill, 2(4):e6, 2017.
- Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
- Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
- Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE, 2013.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012a.
- Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent. Coursera lecture slides, 2012b.
- Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6c: The momentum method. Coursera lecture slides, 2012c.
- Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.
- Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E Hopcroft, and Kilian Q Weinberger. Snapshot ensembles: Train 1, get M for free. arXiv preprint arXiv:1704.00109, 2017.
- Arieh Iserles. A first course in the numerical analysis of differential equations. Number 44. Cambridge University Press, 2009.
- Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732. PMLR, 2017.
- Perry J Kaufman. Smarter trading. McGraw-Hill, 1995.
- Perry J Kaufman. Trading systems and methods, + website, volume 591. John Wiley & Sons, 2013.
- Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
- Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
- Kenneth Levenberg. A method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics, 2(2):164–168, 1944.
- Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- Jun Lu. Numerical matrix decomposition. arXiv preprint arXiv:2107.02579, 2021.
- Jun Lu. AdaSmooth: An adaptive learning rate method based on effective ratio. arXiv preprint arXiv:2204.00825, Proceedings of ICSADL, 2022a.
- Jun Lu. Exploring classic quantitative strategies. arXiv preprint arXiv:2202.11309, 2022b.
- Jun Lu. Matrix decomposition and applications. arXiv preprint arXiv:2201.00145, Eliva Press, 2022c.
- Jun Lu and Shao Yi. Reducing overestimating and underestimating volatility via the augmented blending-ARCH model. Applied Economics and Finance, 9(2):48–59, 2022.
- Donald W Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.
- Sachin Mehta, Mohammad Rastegari, Linda Shapiro, and Hannaneh Hajishirzi. ESPNetv2: A light-weight, power efficient, and general purpose convolutional neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9190–9200, 2019.
- Jorge Nocedal and Stephen J Wright. Numerical optimization. Springer, 1999.
- Brendan O'Donoghue and Emmanuel Candes. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.
- Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
- Martin Popel and Ondřej Bojar. Training tips for the transformer model. arXiv preprint arXiv:1804.00247, 2018.
- Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.
- Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- Sam Roweis. Levenberg-Marquardt optimization. Notes, University of Toronto, 1996.
- Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
- David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
- Heinz Rutishauser. Theory of gradient methods. In Refined iterative methods for computation of the solution and the eigenvalues of self-adjoint boundary value problems, pages 24–49. Springer, 1959.
- Jonathan Richard Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994.
- Leslie N Smith. Cyclical learning rates for training neural networks. In 2017 IEEE winter conference on applications of computer vision (WACV), pages 464–472. IEEE, 2017.
- Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, page 1100612. International Society for Optics and Photonics, 2019.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147. PMLR, 2013.
- Lloyd N Trefethen and David Bau III. Numerical linear algebra, volume 50. SIAM, 1997.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Kenneth S Williams. The nth power of a 2 × 2 matrix. Mathematics Magazine, 65(5):336, 1992.
- Matthew D Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.