Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates (1905.09997v5)

Published 24 May 2019 in cs.LG, math.OC, and stat.ML

Abstract: Recent works have shown that stochastic gradient descent (SGD) achieves the fast convergence rates of full-batch gradient descent for over-parameterized models satisfying certain interpolation conditions. However, the step-size used in these works depends on unknown quantities and SGD's practical performance heavily relies on the choice of this step-size. We propose to use line-search techniques to automatically set the step-size when training models that can interpolate the data. In the interpolation setting, we prove that SGD with a stochastic variant of the classic Armijo line-search attains the deterministic convergence rates for both convex and strongly-convex functions. Under additional assumptions, SGD with Armijo line-search is shown to achieve fast convergence for non-convex functions. Furthermore, we show that stochastic extra-gradient with a Lipschitz line-search attains linear convergence for an important class of non-convex functions and saddle-point problems satisfying interpolation. To improve the proposed methods' practical performance, we give heuristics to use larger step-sizes and acceleration. We compare the proposed algorithms against numerous optimization methods on standard classification tasks using both kernel methods and deep networks. The proposed methods result in competitive performance across all models and datasets, while being robust to the precise choices of hyper-parameters. For multi-class classification using deep networks, SGD with Armijo line-search results in both faster convergence and better generalization.

Summary of "Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates"

The paper "Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates" explores enhancements to Stochastic Gradient Descent (SGD) by utilizing line-search techniques to optimize step-size parameters without manual intervention. The research is grounded in the context of over-parametrized models which satisfy certain interpolation conditions—where the model can perfectly fit the training data. The authors propose a novel approach of integrating line-search methods, specifically stochastic variants of Armijo and Lipschitz conditions, to enable deterministic convergence rates for different function classes, including convex, strongly-convex, and non-convex functions.

Key Contributions

  1. Line-Search for SGD and SEG: The authors refine SGD through line-search methods, focusing on a stochastic adaptation of the Armijo condition that sets the step-size automatically at each iteration. The goal is to bring the parameter-free step-size selection that line-searches provide in deterministic optimization into the stochastic setting (a minimal sketch of the resulting update appears after this list).
  2. Interpolation Condition: The paper leverages the interpolation condition, central to modern over-parameterized models, to retain the convergence guarantees of full-batch gradient descent within the stochastic framework. Interpolation ensures that SGD can match the convergence rates of deterministic methods when the model is expressive enough to fit the training data exactly.
  3. Convergence Results: Convergence proofs are provided for the Armijo and Lipschitz line-searches under different assumptions:
     - Convex and Strongly-Convex Settings: SGD with the stochastic Armijo line-search attains the convergence rates of deterministic gradient descent without knowledge of the Lipschitz constant, thanks to the automatic step-size adjustment.
     - Non-Convex Cases: Under additional assumptions such as the strong growth condition, the Armijo line-search yields convergence to a stationary point at a rate of O(1/T), subject to a constraint on the maximum step-size.
  4. Stochastic Extra-Gradient Method Application: By employing the Lipschitz line-search strategy, the paper extends beyond SGD to the stochastic extra-gradient (SEG) method for variational inequality problems, including non-convex minimization and bilinear min-max (saddle-point) problems. SEG with the Lipschitz line-search attains linear convergence for an important class of problems satisfying interpolation and the restricted secant inequality (a sketch of this line-search also appears below).
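To make item 1 concrete, here is a minimal NumPy sketch of SGD with a stochastic Armijo backtracking line-search. It assumes a user-supplied `loss_and_grad(w, batch)` callable and an indexable NumPy array `data`; the constant c = 0.1 follows the value suggested in the paper, while the backtracking factor `beta` and the simple reset of the step-size to `eta_max` at every iteration are simplifications of the heuristics the authors actually use.

```python
import numpy as np

def sgd_armijo(loss_and_grad, w, data, n_epochs=10, eta_max=1.0,
               c=0.1, beta=0.7, batch_size=32, seed=0):
    """SGD with a backtracking (stochastic Armijo) line-search.

    loss_and_grad(w, batch) -> (loss, grad) on a mini-batch; w and grad
    are flat NumPy vectors. The step-size is reset to eta_max at every
    iteration and shrunk by `beta` until the Armijo condition holds on
    the sampled mini-batch.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(n_epochs):
        for _ in range(n // batch_size):
            batch = data[rng.choice(n, size=batch_size, replace=False)]
            loss, grad = loss_and_grad(w, batch)
            grad_sq = np.dot(grad, grad)
            eta = eta_max
            # Backtrack until the mini-batch Armijo condition holds:
            # f_i(w - eta * grad) <= f_i(w) - c * eta * ||grad||^2
            while loss_and_grad(w - eta * grad, batch)[0] > loss - c * eta * grad_sq:
                eta *= beta
                if eta < 1e-10:  # safeguard against an endless loop
                    break
            w = w - eta * grad
    return w
```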
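Item 4's stochastic extra-gradient method can be sketched in the same style. The backtracking test below, eta * ||grad(w') - grad(w)|| <= c * ||w' - w||, is one way to instantiate a "Lipschitz" line-search that estimates a local Lipschitz constant on the sampled mini-batch; the constants `c` and `beta` are illustrative choices, not the paper's exact settings, and `grad(w, batch)` stands in for the stochastic gradient (or, more generally, the operator of a variational inequality).

```python
import numpy as np

def seg_lipschitz(grad, w, data, n_iters=1000, eta_max=1.0,
                  c=0.5, beta=0.7, batch_size=32, seed=0):
    """Stochastic extra-gradient with a backtracking Lipschitz line-search."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(n_iters):
        batch = data[rng.choice(n, size=batch_size, replace=False)]
        g = grad(w, batch)
        eta = eta_max
        # Shrink eta until eta * ||grad(w') - grad(w)|| <= c * ||w' - w||,
        # i.e. until eta is consistent with a local Lipschitz estimate
        # computed on the same mini-batch.
        while True:
            w_half = grad_step = w - eta * g          # extrapolation step
            g_half = grad(w_half, batch)
            if eta * np.linalg.norm(g_half - g) <= c * np.linalg.norm(w_half - w) or eta < 1e-10:
                break
            eta *= beta
        w = w - eta * g_half                          # update step
    return w
```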

Practical Implications

The insights from this work have significant implications for large-scale machine learning, where manually tuning hyper-parameters such as the learning rate is costly and often unreliable. By embedding an adaptive line-search, the proposed methods achieve faster convergence and reduced sensitivity to hyper-parameter choices, making them robust in practical scenarios such as multi-class classification with deep neural networks. The empirical evaluations on kernel methods and deep networks show that the line-search variants deliver competitive performance across models and datasets, often matching or surpassing adaptive gradient methods on standard benchmarks.

This research points to future directions such as broader applications in non-convex optimization and stochastic momentum or acceleration techniques under interpolation conditions. It presents a structured pathway for bringing deterministic optimization principles into the stochastic methods prevalent in deep learning, at the cost of only the extra function evaluations required by the backtracking line-search.

Authors (6)
  1. Sharan Vaswani (35 papers)
  2. Aaron Mishkin (12 papers)
  3. Issam Laradji (37 papers)
  4. Mark Schmidt (74 papers)
  5. Gauthier Gidel (76 papers)
  6. Simon Lacoste-Julien (95 papers)
Citations (195)