Signal Processing Meets SGD: From Momentum to Filter (2311.02818v7)
Abstract: In deep learning, stochastic gradient descent (SGD) and its momentum-based variants are widely used for optimization, yet their internal dynamics remain underexplored. In this paper, we analyze gradient behavior through a signal-processing lens, isolating the key factors that shape gradient updates and revealing a critical limitation: momentum techniques lack the flexibility to adequately balance the bias and variance components of the gradient, which leads to inaccurate gradient estimates. To address this issue, we introduce SGDF (SGD with Filter), a novel method based on Wiener filter principles that derives an optimal time-varying gain to refine gradient updates by minimizing the mean-square error of the gradient estimate. This yields an optimal first-order gradient estimate that effectively balances noise reduction and signal preservation. Furthermore, the approach can be extended to adaptive optimizers, enhancing their generalization potential. Empirical results show that SGDF achieves better convergence and generalization than traditional momentum methods and performs competitively with state-of-the-art optimizers.
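To make the filtering idea concrete, below is a minimal, illustrative PyTorch sketch of a per-coordinate "filtered SGD" step: the raw stochastic gradient (noisy observation) is fused with a running filtered estimate (prior) through a time-varying gain chosen to minimize per-coordinate mean-square error, and the parameter update then follows the fused estimate. This sketch uses a scalar Kalman-style recursion to illustrate the time-varying gain, not the paper's exact Wiener-filter derivation, and the class name, hyperparameters (beta, q, eps), and variance trackers are assumptions for illustration only.

```python
import torch

class FilteredSGD(torch.optim.Optimizer):
    """Illustrative sketch of a filtered SGD step: per coordinate, fuse the
    current stochastic gradient with a running estimate using a time-varying
    gain that minimizes mean-square error, then descend along the fused
    estimate. Kalman-style recursion used for illustration; names and
    hyperparameters are assumptions, not the paper's exact formulation."""

    def __init__(self, params, lr=1e-2, beta=0.99, q=1e-3, eps=1e-8):
        super().__init__(params, dict(lr=lr, beta=beta, q=q, eps=eps))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr, beta, q, eps = group['lr'], group['beta'], group['q'], group['eps']
            for p in group['params']:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if not state:
                    state['m'] = g.clone()           # filtered gradient estimate (prior)
                    state['P'] = torch.ones_like(g)  # uncertainty of the estimate
                    state['R'] = torch.ones_like(g)  # estimated gradient-noise variance
                m, P, R = state['m'], state['P'], state['R']
                # Track the gradient-noise variance from squared innovations.
                R.mul_(beta).addcmul_(g - m, g - m, value=1 - beta)
                # Prior uncertainty grows by a small process-noise term q so the
                # gain stays time-varying instead of collapsing to zero.
                P.add_(q)
                # Gain minimizing per-coordinate mean-square error: trust the raw
                # gradient more when the prior is uncertain relative to the noise.
                K = P / (P + R + eps)
                # Fuse prior and observation; shrink the posterior uncertainty.
                m.add_(K * (g - m))
                P.mul_(1.0 - K)
                # SGD step along the filtered gradient.
                p.add_(m, alpha=-lr)
```

In a training loop, this would simply replace `torch.optim.SGD`: the gain K adapts per coordinate and per step, whereas classical momentum applies a fixed, pre-chosen averaging coefficient, which is the inflexibility the abstract highlights.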