New Perspectives on the Polyak Stepsize: Surrogate Functions and Negative Results
(2505.20219v1)
Published 26 May 2025 in math.OC and cs.LG
Abstract: The Polyak stepsize has been proven to be a fundamental stepsize in convex optimization, giving near-optimal gradient descent rates across a wide range of assumptions. The universality of the Polyak stepsize has also inspired many stochastic variants, with theoretical guarantees and strong empirical performance. Despite the many theoretical results, our understanding of the convergence properties and shortcomings of the Polyak stepsize or its variants is both incomplete and fractured across different analyses. We propose a new, unified, and simple perspective for the Polyak stepsize and its variants as gradient descent on a surrogate loss. We show that each variant is equivalent to minimizing a surrogate function with stepsizes that adapt to a guaranteed local curvature. Our general surrogate loss perspective is then used to provide a unified analysis of existing variants across different assumptions. Moreover, we show a number of negative results proving that the non-convergence suggested by some of the upper bounds is indeed real.
Summary
The paper provides a unified surrogate framework by interpreting the Polyak stepsize as gradient descent on a squared loss function, explaining its parameter-free adaptivity.
It recovers standard convergence rates under various assumptions, translating O(1/T) rates for the surrogate function into O(1/√T) convergence for the original objective.
The study highlights fundamental limitations, showing that without interpolation the method may become unstable and converge only to a neighborhood of the optimum.
The paper "New Perspectives on the Polyak Stepsize: Surrogate Functions and Negative Results" (2505.20219) offers a novel, unified perspective on the Polyak stepsize and its numerous variants, interpreting them as gradient descent on a surrogate loss function. This framework provides a clear explanation for the adaptivity of the Polyak stepsize and uncovers fundamental limitations regarding convergence to the exact minimum.
Core Idea: Polyak as Gradient Descent on a Surrogate
The central thesis is that the deterministic Polyak stepsize:
$$x_{t+1} = x_t - \frac{f(x_t) - f^\star}{\|g_t\|^2}\, g_t, \quad \text{where } g_t \in \partial f(x_t),$$
can be viewed as performing subgradient descent on the surrogate function $\phi(x) = \tfrac{1}{2}(f(x) - f^\star)^2$. The stepsize used in this surrogate view is $\eta_t = \tfrac{1}{\|g_t\|^2}$ with respect to the subgradient of $\phi$, which is $h(x)\,g_t = (f(x) - f^\star)\,g_t$.
The paper shows that this specific stepsize $\tfrac{1}{\|g_t\|^2}$ on the surrogate (equivalently, the effective stepsize $\tfrac{f(x_t) - f^\star}{\|g_t\|^2}$ on $f$) is related to a local curvature property of the surrogate function $\phi$, termed "local star upper curvature" (LSUC).
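To make the equivalence concrete, here is a minimal numerical sketch (the toy quadratic $f$ with known $f^\star$ is an assumption for illustration, not from the paper) checking that one gradient step on $\phi(x) = \tfrac{1}{2}(f(x) - f^\star)^2$ with stepsize $1/\|g_t\|^2$ coincides with the classical Polyak update on $f$:

```python
import numpy as np

# Toy objective: f(x) = 0.5 * ||x - a||^2, with minimum value f_star = 0.
a = np.array([1.0, -2.0])
f = lambda x: 0.5 * np.sum((x - a) ** 2)
grad_f = lambda x: x - a
f_star = 0.0

x = np.array([3.0, 4.0])
g = grad_f(x)

# Classical Polyak step on f.
x_polyak = x - (f(x) - f_star) / np.dot(g, g) * g

# Gradient step on the surrogate phi(x) = 0.5 * (f(x) - f_star)^2,
# whose gradient is (f(x) - f_star) * grad_f(x), with stepsize 1 / ||g||^2.
grad_phi = (f(x) - f_star) * g
x_surrogate = x - (1.0 / np.dot(g, g)) * grad_phi

print(np.allclose(x_polyak, x_surrogate))  # True: the two updates coincide
```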
Key Findings from the Surrogate Perspective:
Unified Explanation for Adaptivity: The surrogate $\phi(x) = \tfrac{1}{2}(f(x) - f(x^\star))^2$ is always locally curved, specifically $\|g_y\|^2$-LSUC around any point $y$. This local curvature constant depends only on the gradient norm at the current point, removing the need to estimate global function constants (like the Lipschitz constant or strong convexity parameter). This explains the parameter-free and adaptive nature of the Polyak stepsize.
Unified Analysis: This perspective allows for a unified analysis of the Polyak stepsize under various assumptions (Lipschitz, self-bounded, sharp, quadratic growth) by analyzing the properties of the surrogate function $\phi$. Standard convergence rates for convex optimization can be recovered for $\phi$, which then translate to rates for the original function $f$ by taking the square root of the suboptimality gap. For example, for a $G$-Lipschitz function $f$, $\phi(\bar{x}_T)$ converges as $O(1/T)$, implying $f(\bar{x}_T) - f^\star$ converges as $O(1/\sqrt{T})$.
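Concretely, if the surrogate analysis gives $\phi(\bar{x}_T) \le C/T$ for some problem-dependent constant $C$ (a schematic bound; the exact constant is left unspecified here), then

$$\tfrac{1}{2}\bigl(f(\bar{x}_T) - f^\star\bigr)^2 = \phi(\bar{x}_T) \le \frac{C}{T} \quad\Longrightarrow\quad f(\bar{x}_T) - f^\star \le \sqrt{\frac{2C}{T}} = O\!\left(\frac{1}{\sqrt{T}}\right).$$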
Generalization to a Family of Algorithms: The framework is extended to a generalized Polyak stepsize algorithm (Algorithm 1) that operates on a surrogate $\psi(x) = \tfrac{1}{2}h^2(x)$, where $h:\mathbb{R}^d \to \mathbb{R}_{\ge 0}$ is a convex function. Different choices of $h$ correspond to various existing stochastic Polyak-like methods.
Generalized Polyak Stepsize (Algorithm 1) and Stochastic Settings
In the stochastic setting, minimizing $F(x) = \mathbb{E}_{\xi\sim\mathcal{D}}[f(x,\xi)]$ is considered. Algorithm 1 proposes an update based on a stochastic surrogate $h(x,\xi_t) \ge 0$:
$$x_{t+1} = x_t - \eta_t\, h(x_t,\xi_t)\, g_t, \quad \text{where } g_t \in \partial h(x_t,\xi_t) \text{ and } \eta_t = \min\!\left(\frac{1}{\|g_t\|^2},\, \frac{\gamma}{h(x_t,\xi_t)}\right).$$
Different choices for $h(x,\xi_t)$ correspond to existing methods (see the sketch after this list):
$\mathrm{SPS}_{\max}^{\ell}$ [OrvietoLJL22]: $h(x,\xi_t) = f(x,\xi_t) - q(\xi_t)$, where $q(\xi_t)$ is a lower bound on $\inf_x f(x,\xi_t)$.
$\mathrm{SPS}^{+}$ [GarrigosGS23]: $h(x,\xi_t) = \bigl(f(x,\xi_t) - f(x^\star,\xi_t)\bigr)_+$, where $x^\star$ is the minimizer of $F(x)$.
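A minimal sketch of this generalized update, assuming the $\mathrm{SPS}_{\max}^{\ell}$ choice $h(x,\xi) = f(x,\xi) - q(\xi)$ on a toy least-squares problem (the data, sampling scheme, and value of $\gamma$ below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))      # toy data for a least-squares problem
b = rng.normal(size=20)
gamma = 1.0                       # clipping parameter

def f_i(x, i):                    # per-sample loss f(x, xi_i) = 0.5 * (a_i^T x - b_i)^2
    return 0.5 * (A[i] @ x - b[i]) ** 2

def grad_f_i(x, i):
    return (A[i] @ x - b[i]) * A[i]

q = np.zeros(len(b))              # q(xi_i): lower bound on inf_x f(x, xi_i); 0 works here

x = np.zeros(5)
for t in range(1000):
    i = rng.integers(len(b))                  # sample xi_t
    h = f_i(x, i) - q[i]                      # surrogate value h(x_t, xi_t) >= 0
    g = grad_f_i(x, i)                        # g_t in the subdifferential of h(., xi_t)
    gnorm2 = g @ g
    if h == 0.0 or gnorm2 == 0.0:
        continue                              # sample already at its minimum
    eta = min(1.0 / gnorm2, gamma / h)        # clipped stepsize of Algorithm 1
    x = x - eta * h * g                       # x_{t+1} = x_t - eta_t h(x_t, xi_t) g_t
```

Note that $\eta_t\, h(x_t,\xi_t) = \min\!\bigl(h(x_t,\xi_t)/\|g_t\|^2,\ \gamma\bigr)$, i.e. the clipping caps the effective stepsize on $f$ at $\gamma$.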
The analysis using the generalized surrogate $\psi(x) = \tfrac{1}{2}H^2(x)$, where $H(x) = \mathbb{E}_{\xi\sim\mathcal{D}}[h(x,\xi)]$, characterizes the convergence behavior of these methods. A key finding is the appearance of a convergence neighborhood whose size depends on $H(x^\star) = \mathbb{E}[h(x^\star,\xi)]$. If $H(x^\star) > 0$, convergence to the exact minimum of $F(x)$ is generally not guaranteed, and the method converges only to a neighborhood.
Negative Results: Instability and Neighborhood of Convergence
A significant contribution of the paper is demonstrating that the neighborhood of convergence and potential instability are inherent issues, not just artifacts of the theoretical analysis. This occurs when $h(x^\star) > 0$ in the deterministic setting, or $H(x^\star) > 0$ in the stochastic setting (which happens, for instance, in stochastic Polyak variants without interpolation).
Deterministic Instability: When $h^\star = \min_x h(x) > 0$, the fixed point $x^\star$ of the deterministic generalized Polyak update can be unstable (Propositions 1 and 2). Near the minimum, the stepsize becomes arbitrarily large relative to the local curvature, pushing iterates away. The paper provides a concrete example of a smooth, strongly convex function where the deterministic algorithm cycles and fails to converge to the minimum, even for the best iterate or average iterate (Proposition 3). Another example shows that for a simple 1D quadratic with $h^\star > 0$, the set of initializations leading to convergence has measure zero (Proposition 4).
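To see how this instability arises, consider a toy 1D construction (an illustration built here under an assumed $h$, not the paper's counterexample): take $h(x) = \tfrac{1}{2}x^2 + c$ with $c > 0$, so $h^\star = c > 0$, and apply the unclipped generalized Polyak step.

```python
# Toy 1D illustration of the instability when h_star = min h > 0.
# Here h(x) = 0.5 * x**2 + c with c > 0, so the unclipped generalized
# Polyak step is x - h(x) / h'(x), which blows up as x approaches the
# minimizer x = 0.
c = 0.1

def h(x):
    return 0.5 * x ** 2 + c

def grad_h(x):
    return x

x = 1.0
for t in range(10):
    g = grad_h(x)
    x = x - h(x) / g ** 2 * g          # unclipped step: (h(x) / |g|^2) * g
    print(f"t={t:2d}  x={x:+.4f}  h(x)={h(x):.4f}")
# The iterates never settle at x = 0: once |x| is small, the step
# length ~ c / |x| throws them back out, so they keep oscillating.
```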
Stochastic Failure to Converge: In the stochastic case, $H(x^\star) = \mathbb{E}[h(x^\star,\xi)]$ being positive (e.g., in SPS without interpolation) also leads to a convergence neighborhood. The paper shows an example with a mixture of two quadratics where SPS fails to converge to the minimum in expectation (Proposition 5).
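The flavor of this failure can be reproduced with a toy mixture of two 1D quadratics of different curvatures (an illustrative construction under assumed data, not necessarily the instance used in Proposition 5): on each quadratic the unclipped SPS step moves halfway towards that component's minimizer, so the iterates hover around the midpoint of the two minimizers rather than around the minimizer of the averaged objective.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two 1D quadratics f_i(x) = 0.5 * L_i * (x - a_i)^2 with different curvatures,
# sampled with equal probability; interpolation fails since a_1 != a_2.
L = np.array([1.0, 10.0])
a = np.array([0.0, 1.0])

x_star = (L @ a) / L.sum()        # minimizer of F(x) = E[f_i(x)]  (~0.909 here)

x = 0.0
iterates = []
for t in range(20000):
    i = rng.integers(2)
    fi = 0.5 * L[i] * (x - a[i]) ** 2        # f_i(x), with f_i^star = 0
    gi = L[i] * (x - a[i])                   # gradient of f_i
    if gi != 0.0:
        x = x - fi / gi ** 2 * gi            # unclipped SPS step = 0.5 * (x - a_i)
    iterates.append(x)

print(f"minimizer of F:       {x_star:.3f}")
print(f"mean of SPS iterates: {np.mean(iterates[5000:]):.3f}")  # ~0.5, biased away
```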
These negative results highlight that existing upper bounds featuring non-vanishing terms depending on $H(x^\star)$ are necessary and reflect a real limitation of these methods when interpolation ($h(x^\star,\xi) = 0$ a.s.) does not hold.
Practical Implications
The surrogate function perspective offers a new way to understand and potentially design new adaptive optimization algorithms. One could aim to design surrogate functions $h$ with desirable curvature properties.
The analysis of negative results emphasizes the critical importance of interpolation ($h(x^\star,\xi) = 0$ almost surely) for achieving convergence to the exact minimum in stochastic settings. If interpolation does not hold, these methods are guaranteed to converge only to a neighborhood.
The instability demonstrated in the deterministic and stochastic cases provides crucial insight into the behavior of Polyak-like methods and the necessity of techniques like stepsize clipping (as used in ALI-G and $\mathrm{SPS}_{\max}$) to control the stepsize and prevent divergence or cycling, especially near points where $h(x,\xi)$ is small but $h(x^\star)$ or $H(x^\star)$ is positive.
Limitations
The framework primarily relies on the convexity of the function $h$ used in the surrogate. Extending this perspective to more general non-convex settings is an open question. Understanding the optimal class of surrogate functions for achieving fast rates and tight convergence neighborhoods also remains an area for future research.