Fast and Faster Convergence of SGD for Over-Parameterized Models
This paper, authored by Vaswani, Bach, and Schmidt, analyzes the convergence of stochastic gradient descent (SGD) for over-parameterized models. The authors focus on a setting that is highly relevant to contemporary machine learning, where models have enough capacity to fit the training data exactly, i.e., to interpolate it and drive the training loss to zero.
The central results hinge on the stochastic gradients satisfying a strong growth condition (SGC). Under this assumption, the authors show that constant step-size SGD with Nesterov acceleration matches the convergence rates of deterministic accelerated methods in both the convex and strongly-convex settings. This provides a theoretical underpinning for the empirical success of Nesterov acceleration reported for SGD in practice.
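For reference, the SGC requires the stochastic gradients to be controlled by the full gradient; up to notation, with $f = \mathbb{E}_i[f_i]$ and a constant $\rho \ge 1$, it reads

\[
  \mathbb{E}_{i}\big[\|\nabla f_i(w)\|^2\big] \;\le\; \rho\,\|\nabla f(w)\|^2
  \qquad \text{for all } w .
\]

In particular, the stochastic gradients must all vanish at any stationary point of $f$, which is exactly what interpolation provides.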
Moreover, the authors extend the analysis to non-convex settings, proving that under the SGC, constant step-size SGD finds a first-order stationary point as efficiently as deterministic gradient descent. This is a noteworthy contribution, as these are the first accelerated and non-convex rates established for SGD under interpolation-type assumptions.
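Stated up to constants (the exact constants in the paper may differ), the non-convex guarantee has the familiar form: for an $L$-smooth $f$ satisfying the SGC with constant $\rho$, SGD with the constant step size $\eta = 1/(\rho L)$ run for $T$ iterations satisfies

\[
  \min_{0 \le k \le T-1} \mathbb{E}\big[\|\nabla f(w_k)\|^2\big]
  \;=\; O\!\left(\frac{\rho L\,\big(f(w_0) - f^*\big)}{T}\right),
\]

matching the $O(1/T)$ rate of deterministic gradient descent on smooth non-convex problems.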
To broaden the scope, the paper also introduces a weaker and more easily verified weak growth condition (WGC), under which constant step-size SGD still attains the deterministic convergence rates for smooth convex problems. This makes the analysis applicable to a wider class of functions, including finite-sum objectives built from common machine-learning losses such as the squared and squared-hinge losses.
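Up to notation, the WGC replaces the full-gradient bound with a bound in terms of suboptimality: with $L$ the smoothness constant of $f$, $w^*$ a minimizer, and a constant $\rho \ge 1$,

\[
  \mathbb{E}_{i}\big[\|\nabla f_i(w)\|^2\big] \;\le\; 2\rho L\,\big(f(w) - f(w^*)\big)
  \qquad \text{for all } w .
\]

Since $\|\nabla f(w)\|^2 \le 2L\,(f(w) - f(w^*))$ for smooth convex $f$, the SGC implies the WGC in the convex case, but not conversely.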
In practical terms, the authors show that any smooth convex finite-sum problem satisfying interpolation automatically satisfies the WGC. This covers models that can interpolate the training data, a situation common for deep neural networks and for non-parametric models.
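The argument is a one-line calculation; here is a sketch under the assumptions that each $f_i$ is convex and $L_{\max}$-smooth and that interpolation holds, i.e., $\nabla f_i(w^*) = 0$ for every $i$ at a common minimizer $w^*$:

\[
  \mathbb{E}_i\big[\|\nabla f_i(w)\|^2\big]
  \;\le\; 2 L_{\max}\,\mathbb{E}_i\big[f_i(w) - f_i(w^*)\big]
  \;=\; 2 L_{\max}\,\big(f(w) - f(w^*)\big),
\]

using the standard smoothness inequality $\|\nabla f_i(w)\|^2 \le 2 L_{\max}\,(f_i(w) - \min_v f_i(v))$ together with the fact that interpolation makes $w^*$ a minimizer of each $f_i$. This is the WGC with $\rho = L_{\max}/L$.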
One intriguing application is to the perceptron: the authors show that the squared-hinge loss satisfies the SGC on linearly separable data, so the accelerated rates yield an improved mistake bound for a perceptron-like algorithm. This is consistent with the classical analysis under linear separability while improving the dependence on the problem parameters relative to the traditional bound.
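As a toy illustration (the data, step size, and update below are my own choices, not the paper's construction), one can run constant step-size SGD on the squared-hinge loss over linearly separable data and count online mistakes as a perceptron would:

```python
# Toy sketch, not the paper's algorithm: constant step-size SGD on the
# squared-hinge loss f_i(w) = max(0, 1 - y_i <x_i, w>)^2 for separable data.
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 20
w_sep = rng.standard_normal(d)          # hidden separator
X = rng.standard_normal((n, d))
y = np.sign(X @ w_sep)                  # linearly separable labels by construction

w = np.zeros(d)
eta = 0.1                               # illustrative constant step size (not tuned)
mistakes = 0
for t in range(5 * n):
    i = rng.integers(n)
    margin = y[i] * (X[i] @ w)
    if margin <= 0:                     # perceptron-style mistake counter
        mistakes += 1
    if margin < 1:                      # squared-hinge gradient is nonzero here
        w += eta * 2.0 * (1.0 - margin) * y[i] * X[i]

print("online mistakes:", mistakes)
```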
The experimental section complements the theory with evidence on synthetic and real datasets. The authors compare constant step-size SGD against its accelerated variant in a series of controlled evaluations, reinforcing the theoretical findings and showing when acceleration outperforms, or merely matches, plain SGD, while each stochastic update remains far cheaper than a full-gradient step.
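A minimal sketch of this kind of comparison (the synthetic data, fixed momentum, and step size below are illustrative choices of mine, not the paper's experimental setup) might look like this:

```python
# Minimal sketch (not the authors' code): constant step-size SGD vs. a
# Nesterov-style accelerated variant on an over-parameterized least-squares
# problem, where interpolation holds because d > n.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                      # more parameters than samples -> interpolation
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)      # consistent system: zero loss is attainable

L = np.max(np.sum(A ** 2, axis=1))  # per-example smoothness of 0.5*(a_i.w - b_i)^2
eta = 1.0 / L                       # constant step size (assumed, not tuned as in the paper)
momentum = 0.9                      # fixed momentum for illustration, not the paper's schedule

def loss(w):
    return 0.5 * np.mean((A @ w - b) ** 2)

def stoch_grad(w):
    i = rng.integers(n)             # sample one example uniformly at random
    return (A[i] @ w - b[i]) * A[i]

T = 5000
w_sgd = np.zeros(d)
w_acc = np.zeros(d)
v = np.zeros(d)

for t in range(T):
    w_sgd -= eta * stoch_grad(w_sgd)

    # Nesterov-style update: gradient evaluated at the look-ahead point.
    y = w_acc + momentum * v
    v = momentum * v - eta * stoch_grad(y)
    w_acc += v

print(f"SGD loss:         {loss(w_sgd):.3e}")
print(f"Accelerated loss: {loss(w_acc):.3e}")
```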
Taken together, this work contributes a solid theoretical foundation and practical insight into the convergence of SGD for high-capacity models that can interpolate the data. It also opens avenues for future research on line-search techniques and on adaptive methods for parameter-rich models, all resting on simple but powerful growth conditions.