Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence (2002.10542v3)

Published 24 Feb 2020 in math.OC, cs.LG, and stat.ML

Abstract: We propose a stochastic variant of the classical Polyak step-size (Polyak, 1987) commonly used in the subgradient method. Although computing the Polyak step-size requires knowledge of the optimal function values, this information is readily available for typical modern machine learning applications. Consequently, the proposed stochastic Polyak step-size (SPS) is an attractive choice for setting the learning rate for stochastic gradient descent (SGD). We provide theoretical convergence guarantees for SGD equipped with SPS in different settings, including strongly convex, convex and non-convex functions. Furthermore, our analysis results in novel convergence guarantees for SGD with a constant step-size. We show that SPS is particularly effective when training over-parameterized models capable of interpolating the training data. In this setting, we prove that SPS enables SGD to converge to the true solution at a fast rate without requiring the knowledge of any problem-dependent constants or additional computational overhead. We experimentally validate our theoretical results via extensive experiments on synthetic and real datasets. We demonstrate the strong performance of SGD with SPS compared to state-of-the-art optimization methods when training over-parameterized models.

Authors (4)
  1. Nicolas Loizou (38 papers)
  2. Sharan Vaswani (35 papers)
  3. Issam Laradji (37 papers)
  4. Simon Lacoste-Julien (95 papers)
Citations (169)

Summary

Overview of Stochastic Polyak Step-size for SGD

The paper examines a stochastic variant of the Polyak step-size for stochastic gradient descent (SGD), yielding an adaptive learning rate aimed at fast convergence. The motivation is the need to train over-parameterized models in modern machine learning, where a single hand-tuned constant step-size is often a poor fit. The authors argue that this adaptive step-size is particularly well suited to over-parameterized models that can interpolate the training data, where it achieves fast convergence without requiring knowledge of problem-dependent constants or adding computational overhead.

The authors begin with step-size selection for SGD, a central factor in its convergence. Various strategies have been explored over time, including constant step-sizes, decreasing schedules, and adaptive methods. Inspired by the classical Polyak step-size used in deterministic subgradient methods, they propose a stochastic Polyak step-size (SPS). They argue that the per-sample optimal values f_i^*, required to compute SPS, are readily available in standard machine learning applications (often simply zero for unregularized losses), making SPS an attractive choice for setting the learning rate of SGD.
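To make the rule concrete, below is a minimal sketch (not the authors' reference implementation) of one SGD step using a bounded variant of SPS; the function name `sps_step`, the constants `c` and `gamma_max`, and the default f_i^* = 0 are illustrative assumptions.

```python
import numpy as np

def sps_step(w, loss_fn, grad_fn, f_i_star=0.0, c=0.5, gamma_max=10.0):
    """One SGD step with a (bounded) stochastic Polyak step-size.

    gamma = min( (f_i(w) - f_i^*) / (c * ||grad f_i(w)||^2), gamma_max ),
    where f_i is the loss of the currently sampled example/mini-batch.
    """
    loss = loss_fn(w)          # f_i(w) for the sampled loss
    g = grad_fn(w)             # stochastic gradient of that loss
    gamma = min((loss - f_i_star) / (c * np.dot(g, g) + 1e-12), gamma_max)
    return w - gamma * g
```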

In terms of theoretical contributions, the authors provide convergence guarantees across several settings: strongly convex, convex, and non-convex functions, as well as functions satisfying the Polyak-Lojasiewicz (PL) condition. A notable departure from traditional analyses is that the variance of the stochastic gradients is not assumed to be bounded, a common assumption in this literature; the analysis goes through because the Polyak step-size itself adapts to the magnitude of the sampled gradient.
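Schematically, one reason no bounded-variance or bounded-gradient assumption is needed is that smoothness alone keeps the step-size well behaved; assuming each f_i is L_i-smooth with infimum f_i^*, a standard inequality lower-bounds SPS (the notation here is illustrative):

```latex
% Smoothness of f_i gives  f_i(x) - f_i^* >= ||grad f_i(x)||^2 / (2 L_i),
% so the stochastic Polyak step-size is automatically bounded below:
\gamma_k \;=\; \frac{f_i(x_k) - f_i^*}{c\,\lVert \nabla f_i(x_k) \rVert^2}
\;\ge\; \frac{1}{2\,c\,L_{\max}},
\qquad L_{\max} \;:=\; \max_i L_i .
```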

The paper further examines SGD for over-parameterized models, which are ubiquitous in contemporary AI systems and particularly in deep neural networks. Here, the authors show that, under the interpolation condition, a condition commonly satisfied by such models, SPS enables SGD to converge to the true solution at a fast rate without any additional problem-dependent constants, which simplifies deployment and avoids extra computational overhead.
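For concreteness, one common way to state the interpolation condition (the σ² notation below is illustrative) is that a single minimizer simultaneously minimizes every individual loss, so the optimality-gap noise at the solution vanishes:

```latex
f_i(x^*) \;=\; f_i^* \quad \text{for all } i
\qquad\Longleftrightarrow\qquad
\sigma^2 \;:=\; \mathbb{E}_i\!\left[ f_i(x^*) - f_i^* \right] \;=\; 0 .
```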

Experimentally, the authors validate their theoretical claims on a range of synthetic and real datasets, demonstrating strong performance of SGD equipped with SPS against leading optimization methods when training over-parameterized models. Their findings indicate that SPS is highly competitive, often surpassing state-of-the-art methods under the tested conditions.

The practical implications of this paper are multifaceted. By introducing an adaptive learning rate that relies only on readily available optimal function values, the work aims to improve SGD's effectiveness on complex, real-world machine learning tasks. Its theoretical foundations also point to promising extensions, such as applying SPS in decentralized or distributed optimization frameworks.
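As a rough illustration of how little machinery this requires in practice, here is a hypothetical PyTorch-style training loop in which plain SGD sets its learning rate per mini-batch via the Polyak step-size, assuming the per-batch optimal loss is zero (a sketch under stated assumptions, not the authors' implementation):

```python
import torch

def train_with_sps(model, loss_fn, data_loader, c=0.5, gamma_max=1.0, epochs=1):
    """Plain SGD whose learning rate is set per mini-batch by SPS,
    assuming the optimal per-batch loss f_i^* is 0."""
    params = [p for p in model.parameters() if p.requires_grad]
    for _ in range(epochs):
        for x, y in data_loader:
            model.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            grad_sq = sum((p.grad ** 2).sum() for p in params)  # ||grad||^2
            gamma = min(loss.item() / (c * grad_sq.item() + 1e-12), gamma_max)
            with torch.no_grad():
                for p in params:
                    p -= gamma * p.grad  # SGD update with the SPS learning rate
```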

In summary, this paper contributes a careful analysis of the stochastic Polyak step-size for SGD, offering technical insight into its advantages as an adaptive learning rate. The work suggests meaningful gains in optimization efficiency for modern AI model training and reflects a broader effort to advance SGD methodology.