Bolstering Stochastic Gradient Descent with Model Building (2111.07058v4)
Abstract: The stochastic gradient descent method and its variants constitute the core optimization algorithms that achieve good convergence rates for solving machine learning problems. These rates are obtained in particular when the algorithms are fine-tuned for the application at hand. Although this tuning process can incur large computational costs, recent work has shown that these costs can be reduced by line search methods that iteratively adjust the step length. We propose an alternative approach to stochastic line search: a new algorithm based on forward-step model building. This model-building step incorporates second-order information that allows adjusting not only the step length but also the search direction. Noting that deep learning model parameters come in groups (layers of tensors), our method builds its model and calculates a new step for each parameter group. This novel diagonalization approach makes the selected step lengths adaptive. We provide a convergence rate analysis and experimentally show that the proposed algorithm achieves faster convergence and better generalization on well-known test problems. More precisely, the proposed algorithm, SMB, requires less tuning and performs comparably to other adaptive methods.
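To make the pattern described in the abstract concrete, below is a minimal sketch in plain NumPy of the general idea: take a forward (trial) SGD step, evaluate the stochastic gradient at the trial point on the same mini-batch, and then compute a separate step for each parameter group from the information gathered at both points. The function name `model_building_step`, the secant-style scaling, the fallback rule, and the default values of `lr` and `eps` are illustrative assumptions; the sketch does not reproduce SMB's quadratic-model coefficients (which can also change the search direction, not just the step length) and omits any acceptance test between the trial step and the model step.

```python
import numpy as np

def model_building_step(params, grad_fn, lr=0.1, eps=1e-8):
    """Illustrative forward-step model building (not the exact SMB formulas).

    params : list of np.ndarray, one entry per parameter group (e.g. one per layer).
    grad_fn: callable returning a list of stochastic gradients, one per group,
             evaluated on the SAME mini-batch at both points.
    """
    grads = grad_fn(params)                                  # gradients at the current point
    trial = [p - lr * g for p, g in zip(params, grads)]      # forward (trial) SGD step
    trial_grads = grad_fn(trial)                             # gradients at the trial point

    new_params = []
    for p, g, gt in zip(params, grads, trial_grads):
        s = -lr * g                                          # trial step for this group
        y = gt - g                                           # per-group curvature information
        curv = float(np.dot(y.ravel(), s.ravel()))
        if curv > eps:
            # Secant-style scaling: a stand-in for SMB's model step that adapts
            # the step length of each group using second-order information.
            alpha = float(np.dot(s.ravel(), s.ravel())) / curv
            new_params.append(p - alpha * g)
        else:
            # No usable curvature estimate: fall back to the plain trial step.
            new_params.append(p + s)
    return new_params

if __name__ == "__main__":
    # Toy usage: two parameter groups and a deterministic quadratic "loss" sum ||p||^2.
    rng = np.random.default_rng(0)
    params = [rng.normal(size=3), rng.normal(size=(2, 2))]
    grad_fn = lambda ps: [2.0 * p for p in ps]
    params = model_building_step(params, grad_fn)
```

Because each parameter group gets its own scaling, the effective step lengths differ across layers, which is the per-group adaptivity the abstract refers to; the actual SMB model step additionally mixes the trial step and the two gradients to adjust the direction.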
Authors:
- S. Ilker Birbil
- Ozgur Martin
- Gonenc Onay
- Figen Oztoprak