Fast and Faster Convergence of SGD for Over-Parameterized Models
This paper, authored by Vaswani, Bach, and Schmidt, analyzes the convergence of stochastic gradient descent (SGD) for over-parameterized models. The authors focus on a setting that is highly relevant to contemporary machine learning, where models have enough capacity to fit the training data exactly, i.e., to interpolate it and drive the training loss to zero.
The central results hinge on the stochastic gradients satisfying a strong growth condition (SGC). Under this assumption, the authors show that constant step-size SGD with Nesterov acceleration matches the convergence rates of deterministic accelerated methods in both the convex and strongly-convex settings. This provides a theoretical underpinning for the empirical success of Nesterov acceleration reported for SGD in practice.
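For reference, the SGC requires the stochastic gradients to be controlled by the full gradient; up to notation, with $f = \mathbb{E}_i[f_i]$ and a constant $\rho \ge 1$, it reads

\[
  \mathbb{E}_{i}\big[\|\nabla f_i(w)\|^2\big] \;\le\; \rho\,\|\nabla f(w)\|^2
  \qquad \text{for all } w .
\]

In particular, the stochastic gradients must all vanish at any stationary point of $f$, which is exactly what interpolation provides.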
Moreover, the authors extend the analysis to non-convex settings, proving that under the SGC, constant step-size SGD finds a first-order stationary point as efficiently as deterministic gradient descent. This is a noteworthy contribution, as these are the first accelerated and non-convex rates established for SGD under interpolation-type assumptions.
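Stated up to constants (the exact constants in the paper may differ), the non-convex guarantee has the familiar form: for an $L$-smooth $f$ satisfying the SGC with constant $\rho$, SGD with the constant step size $\eta = 1/(\rho L)$ run for $T$ iterations satisfies

\[
  \min_{0 \le k \le T-1} \mathbb{E}\big[\|\nabla f(w_k)\|^2\big]
  \;=\; O\!\left(\frac{\rho L\,\big(f(w_0) - f^*\big)}{T}\right),
\]

matching the $O(1/T)$ rate of deterministic gradient descent on smooth non-convex problems.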
To broaden the scope, the paper also introduces a weaker and more easily verified weak growth condition (WGC), under which constant step-size SGD still attains the deterministic convergence rates for smooth convex problems. This makes the analysis applicable to a wider class of functions, including finite-sum objectives built from common machine-learning losses such as the squared and squared-hinge losses.
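Up to notation, the WGC replaces the full-gradient bound with a bound in terms of suboptimality: with $L$ the smoothness constant of $f$, $w^*$ a minimizer, and a constant $\rho \ge 1$,

\[
  \mathbb{E}_{i}\big[\|\nabla f_i(w)\|^2\big] \;\le\; 2\rho L\,\big(f(w) - f(w^*)\big)
  \qquad \text{for all } w .
\]

Since $\|\nabla f(w)\|^2 \le 2L\,(f(w) - f(w^*))$ for smooth convex $f$, the SGC implies the WGC in the convex case, but not conversely.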
In practical terms, the authors show that any smooth convex finite-sum problem satisfying interpolation automatically satisfies the WGC. This covers models that can interpolate the training data, a situation common for deep neural networks and for non-parametric models.
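The argument is a one-line calculation; here is a sketch under the assumptions that each $f_i$ is convex and $L_{\max}$-smooth and that interpolation holds, i.e., $\nabla f_i(w^*) = 0$ for every $i$ at a common minimizer $w^*$:

\[
  \mathbb{E}_i\big[\|\nabla f_i(w)\|^2\big]
  \;\le\; 2 L_{\max}\,\mathbb{E}_i\big[f_i(w) - f_i(w^*)\big]
  \;=\; 2 L_{\max}\,\big(f(w) - f(w^*)\big),
\]

using the standard smoothness inequality $\|\nabla f_i(w)\|^2 \le 2 L_{\max}\,(f_i(w) - \min_v f_i(v))$ together with the fact that interpolation makes $w^*$ a minimizer of each $f_i$. This is the WGC with $\rho = L_{\max}/L$.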
One intriguing application is to the perceptron: the authors show that the squared-hinge loss satisfies the SGC on linearly separable data, so the accelerated rates yield an improved mistake bound for a perceptron-like algorithm. This is consistent with the classical analysis under linear separability while improving the dependence on the problem parameters relative to the traditional bound.
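As a toy illustration (the data, step size, and update below are my own choices, not the paper's construction), one can run constant step-size SGD on the squared-hinge loss over linearly separable data and count online mistakes as a perceptron would:

```python
# Toy sketch, not the paper's algorithm: constant step-size SGD on the
# squared-hinge loss f_i(w) = max(0, 1 - y_i <x_i, w>)^2 for separable data.
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 20
w_sep = rng.standard_normal(d)          # hidden separator
X = rng.standard_normal((n, d))
y = np.sign(X @ w_sep)                  # linearly separable labels by construction

w = np.zeros(d)
eta = 0.1                               # illustrative constant step size (not tuned)
mistakes = 0
for t in range(5 * n):
    i = rng.integers(n)
    margin = y[i] * (X[i] @ w)
    if margin <= 0:                     # perceptron-style mistake counter
        mistakes += 1
    if margin < 1:                      # squared-hinge gradient is nonzero here
        w += eta * 2.0 * (1.0 - margin) * y[i] * X[i]

print("online mistakes:", mistakes)
```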
The experimental section complements the theory with evidence on synthetic and real datasets. The authors compare constant step-size SGD against its accelerated variant in a series of controlled evaluations, reinforcing the theoretical findings and showing when acceleration outperforms, or merely matches, plain SGD, while each stochastic update remains far cheaper than a full-gradient step.
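A minimal sketch of this kind of comparison (the synthetic data, fixed momentum, and step size below are illustrative choices of mine, not the paper's experimental setup) might look like this:

```python
# Minimal sketch (not the authors' code): constant step-size SGD vs. a
# Nesterov-style accelerated variant on an over-parameterized least-squares
# problem, where interpolation holds because d > n.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                      # more parameters than samples -> interpolation
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)      # consistent system: zero loss is attainable

L = np.max(np.sum(A ** 2, axis=1))  # per-example smoothness of 0.5*(a_i.w - b_i)^2
eta = 1.0 / L                       # constant step size (assumed, not tuned as in the paper)
momentum = 0.9                      # fixed momentum for illustration, not the paper's schedule

def loss(w):
    return 0.5 * np.mean((A @ w - b) ** 2)

def stoch_grad(w):
    i = rng.integers(n)             # sample one example uniformly at random
    return (A[i] @ w - b[i]) * A[i]

T = 5000
w_sgd = np.zeros(d)
w_acc = np.zeros(d)
v = np.zeros(d)

for t in range(T):
    w_sgd -= eta * stoch_grad(w_sgd)

    # Nesterov-style update: gradient evaluated at the look-ahead point.
    y = w_acc + momentum * v
    v = momentum * v - eta * stoch_grad(y)
    w_acc += v

print(f"SGD loss:         {loss(w_sgd):.3e}")
print(f"Accelerated loss: {loss(w_acc):.3e}")
```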
Taken together, this work contributes a solid theoretical foundation and practical insight into the convergence of SGD for high-capacity models that can interpolate the data. It also opens avenues for future research on line-search techniques and on adaptive methods for parameter-rich models, all resting on simple but powerful growth conditions.