
Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition (1608.04636v4)

Published 16 Aug 2016 in cs.LG, math.OC, stat.CO, and stat.ML

Abstract: In 1963, Polyak proposed a simple condition that is sufficient to show a global linear convergence rate for gradient descent. This condition is a special case of the Łojasiewicz inequality proposed in the same year, and it does not require strong convexity (or even convexity). In this work, we show that this much-older Polyak-Łojasiewicz (PL) inequality is actually weaker than the main conditions that have been explored to show linear convergence rates without strong convexity over the last 25 years. We also use the PL inequality to give new analyses of randomized and greedy coordinate descent methods, sign-based gradient descent methods, and stochastic gradient methods in the classic setting (with decreasing or constant step-sizes) as well as the variance-reduced setting. We further propose a generalization that applies to proximal-gradient methods for non-smooth optimization, leading to simple proofs of linear convergence of these methods. Along the way, we give simple convergence results for a wide variety of problems in machine learning: least squares, logistic regression, boosting, resilient backpropagation, L1-regularization, support vector machines, stochastic dual coordinate ascent, and stochastic variance-reduced gradient methods.

Citations (1,135)

Summary

  • The paper establishes that the PL inequality, a less restrictive condition than strong convexity, guarantees global linear convergence for various optimization methods.
  • It provides rigorous convergence proofs for gradient, stochastic, and proximal-gradient methods, highlighting improved rates even in non-convex settings.
  • Applications to least squares and logistic regression illustrate the practical impact of leveraging the PL condition in simplifying convergence analyses.

Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition

Summary

This paper revisits the Polyak-Łojasiewicz (PL) inequality proposed by Polyak in 1963 and repositions it as a less restrictive, yet sufficient, condition for establishing a global linear convergence rate for gradient methods. The authors argue that, in contrast to traditional analyses that rely on strong convexity (SC), the PL inequality suffices to establish linear convergence for several optimization algorithms, including randomized and greedy coordinate descent methods, sign-based gradient descent methods, stochastic gradient methods, and proximal-gradient methods for non-smooth optimization.

Introduction

Gradient descent and its variants (coordinate descent, stochastic gradient descent) are fundamental tools in large-scale optimization for machine learning. Traditional analyses require strong convexity to ensure linear convergence rates. However, many machine learning problems (e.g., least squares, logistic regression) are not strongly convex. Various conditions, such as Error Bounds (EB), Essential Strong Convexity (ESC), Weak Strong Convexity (WSC), the Restricted Secant Inequality (RSI), and Quadratic Growth (QG), have been proposed to guarantee linear convergence without SC. The paper revisits the older Polyak-Łojasiewicz (PL) inequality, shows that it is no stronger than these more recent conditions, and uses it to simplify the corresponding convergence proofs.

Polyak-Łojasiewicz Inequality

The PL inequality requires that, for some $\mu > 0$,

$$\frac{1}{2}\,\|\nabla f(x)\|^2 \;\geq\; \mu\,(f(x) - f^*).$$

It implies that every stationary point is a global minimizer, without requiring the minimizer to be unique. Notably, PL yields global linear convergence of gradient descent even in non-convex settings, and the resulting proofs are simpler than those based on SC or the more complex alternative conditions.
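To see why, the argument for gradient descent with step size $1/L$ on an $L$-smooth function takes essentially two lines (this is the standard proof; notation as above). The descent lemma gives

$$f(x_{k+1}) \;\leq\; f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2,$$

and applying the PL inequality to the gradient norm yields

$$f(x_{k+1}) - f^* \;\leq\; \left(1 - \frac{\mu}{L}\right)(f(x_k) - f^*), \qquad\text{hence}\qquad f(x_k) - f^* \;\leq\; \left(1 - \frac{\mu}{L}\right)^k (f(x_0) - f^*).$$

No convexity is used anywhere in this argument.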

Relationships Between Conditions

The relationships between the conditions (for $L$-smooth functions) are established as:

$$\text{SC} \rightarrow \text{ESC} \rightarrow \text{WSC} \rightarrow \text{RSI} \rightarrow \text{EB} \equiv \text{PL} \rightarrow \text{QG}.$$

For convex functions:

$$\text{RSI} \equiv \text{EB} \equiv \text{PL} \equiv \text{QG}.$$

Thus QG is the weakest of these conditions, but on its own it does not guarantee that stationary points are global minimizers; EB and PL are therefore the weakest conditions considered that still yield global linear convergence to a minimizer.
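As an example of one link in this chain, SC implies PL by a one-line argument (a standard fact, not specific to this paper): minimizing both sides of the strong-convexity lower bound over $y$ gives

$$f^* \;\geq\; \min_y \left[ f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2 \right] \;=\; f(x) - \frac{1}{2\mu}\|\nabla f(x)\|^2,$$

which rearranges to the PL inequality with the same constant $\mu$.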

Relevant Problems

Notable special cases satisfying the PL inequality include:

  1. Strongly-convex functions.
  2. Functions of the form $f(x) = g(Ax)$ for a strongly-convex $g$ and a matrix $A$, e.g., least squares (a numerical check of this case appears after this list).
  3. Logistic regression functions over compact sets.
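As a concrete illustration of case 2: the least-squares objective $f(x) = \tfrac{1}{2}\|Ax - b\|^2$ is not strongly convex when $A$ is rank-deficient, yet it satisfies the PL inequality with $\mu$ equal to the smallest nonzero eigenvalue of $A^\top A$. The following sketch (illustrative only, not code from the paper) checks this numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-deficient least squares: f(x) = 0.5*||Ax - b||^2 is convex but not
# strongly convex (A^T A is singular), yet it satisfies the PL inequality
# with mu equal to the smallest *nonzero* eigenvalue of A^T A.
n, d, r = 50, 20, 10
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))   # rank r < d
b = rng.standard_normal(n)

x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
f_star = 0.5 * np.linalg.norm(A @ x_ls - b) ** 2

sigma = np.linalg.svd(A, compute_uv=False)
mu = np.min(sigma[sigma > 1e-8 * sigma.max()]) ** 2             # smallest nonzero sigma, squared

ratios = []
for _ in range(1000):
    x = rng.standard_normal(d)
    gap = 0.5 * np.linalg.norm(A @ x - b) ** 2 - f_star         # f(x) - f*
    grad_sq = np.linalg.norm(A.T @ (A @ x - b)) ** 2            # ||grad f(x)||^2
    ratios.append(0.5 * grad_sq / gap)

print(f"mu = {mu:.4f}, smallest observed (0.5*||grad||^2)/(f - f*) = {min(ratios):.4f}")
# Every ratio should be >= mu, i.e. the PL inequality holds despite no strong convexity.
```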

Convergence of Huge-Scale Methods

This section presents new convergence analyses, based on the PL inequality, for randomized and greedy coordinate descent, stochastic gradient, and sign-based gradient methods.

Randomized Coordinate Descent

For functions whose gradient is coordinate-wise $L$-Lipschitz continuous, randomized coordinate descent with step size $1/L$ achieves

$$\mathbb{E}[f(x_k) - f^*] \;\leq\; \left(1 - \frac{\mu}{dL}\right)^k [f(x_0) - f^*],$$

where $d$ is the number of coordinates, extending the classical strongly-convex rate to the broader PL setting.
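A minimal sketch of this update for the least-squares objective (illustrative; here the coordinate-wise Lipschitz constants are the squared column norms of $A$, and $L$ is taken as their maximum):

```python
import numpy as np

def randomized_cd(A, b, iters=5000, seed=0):
    """Randomized coordinate descent on f(x) = 0.5 * ||Ax - b||^2 with step size 1/L."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    L = np.max(np.sum(A**2, axis=0))       # bound on the coordinate-wise Lipschitz constants
    x = np.zeros(d)
    r = A @ x - b                          # maintain the residual r = Ax - b
    for _ in range(iters):
        i = rng.integers(d)                # sample a coordinate uniformly at random
        g_i = A[:, i] @ r                  # partial derivative of f with respect to x_i
        x[i] -= g_i / L                    # coordinate step with step size 1/L
        r -= (g_i / L) * A[:, i]           # keep the residual consistent with the new x
    return x
```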

Greedy Coordinate Descent

Using the Gauss-Southwell (GS) rule, which updates the coordinate with the largest partial derivative in absolute value, greedy coordinate descent achieves

$$f(x_k) - f^* \;\leq\; \left(1 - \frac{\mu_{L[\infty]}}{L}\right)^k [f(x_0) - f^*],$$

where $\mu_{L[\infty]}$ is the PL constant measured in the $\infty$-norm; it lies between $\mu/d$ and $\mu$, so the greedy rate is never worse than the randomized rate and can be up to $d$ times better.
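Only the selection rule changes relative to the randomized sketch above (again illustrative):

```python
import numpy as np

def greedy_cd(A, b, iters=2000):
    """Greedy (Gauss-Southwell) coordinate descent on f(x) = 0.5 * ||Ax - b||^2."""
    n, d = A.shape
    L = np.max(np.sum(A**2, axis=0))       # uniform bound on the coordinate-wise Lipschitz constants
    x = np.zeros(d)
    r = A @ x - b                          # residual r = Ax - b
    for _ in range(iters):
        g = A.T @ r                        # full gradient
        i = int(np.argmax(np.abs(g)))      # GS rule: coordinate with largest |partial derivative|
        x[i] -= g[i] / L
        r -= (g[i] / L) * A[:, i]
    return x
```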

Stochastic Gradient Methods

Under standard assumptions (in particular, a bound $\mathbb{E}\|\nabla f_i(x)\|^2 \leq C^2$ on the stochastic gradients), stochastic gradient methods satisfy

$$\mathbb{E}[f(x_k) - f^*] \;\leq\; \frac{LC^2}{2k\mu^2} \quad \text{(decreasing step sizes)}$$

$$\mathbb{E}[f(x_k) - f^*] \;\leq\; (1 - 2\mu\alpha)^k [f(x_0) - f^*] + \frac{LC^2\alpha}{4\mu} \quad \text{(constant step size $\alpha$)}$$

matching rates for strongly-convex functions.
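A schematic of the two regimes on a finite-sum least-squares objective (illustrative only; the $\Theta(1/k)$ schedule below is generic and not the exact schedule used in the paper's proof):

```python
import numpy as np

def sgd(A, b, mu, iters=20000, step="decreasing", alpha=1e-3, seed=0):
    """SGD on f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2, sampling one row per step.

    step="decreasing": a Theta(1/k) schedule, matching the O(1/k) bound above.
    step="constant":   fixed alpha, giving fast convergence to a noise-dominated region.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for k in range(iters):
        i = rng.integers(n)
        g = (A[i] @ x - b[i]) * A[i]      # unbiased estimate of the full gradient
        a_k = 1.0 / (mu * (k + 1)) if step == "decreasing" else alpha
        x -= a_k * g
    return x
```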

Proximal-Gradient Generalization

For non-smooth problems of the form $F(x) = f(x) + g(x)$, with $f$ smooth and $g$ convex but possibly non-smooth, a generalization called the proximal-PL inequality,

$$\frac{1}{2}\,\mathcal{D}_g(x, L) \;\geq\; \mu\,(F(x) - F^*), \qquad \mathcal{D}_g(x, \alpha) \equiv -2\alpha \min_y \left[ \langle \nabla f(x), y - x\rangle + \frac{\alpha}{2}\|y - x\|^2 + g(y) - g(x) \right],$$

ensures linear convergence of proximal-gradient methods with step size $1/L$.

Relevant Problems

Relevant cases include:

  1. Strongly-convex $f$.
  2. $f$ satisfying the PL inequality with $g$ constant (recovering the original PL setting).
  3. Composite objectives with $f(x) = h(Ax)$ for strongly-convex $h$ and polyhedral $g$, e.g., $\ell_1$-regularized least squares (sketched below).
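Case 3 covers, for instance, $\ell_1$-regularized least squares, for which the proximal operator is soft-thresholding. A minimal proximal-gradient (ISTA) sketch for this case (illustrative, not code from the paper):

```python
import numpy as np

def prox_gradient_l1(A, b, lam, iters=500):
    """Proximal gradient (ISTA) for F(x) = 0.5 * ||Ax - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth part's gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)           # gradient of the smooth term f
        z = x - grad / L                   # gradient step with step size 1/L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # prox of (lam/L) * ||.||_1
    return x
```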

Discussion

The shift to the PL and proximal-PL inequalities shows that linear convergence can be established without SC. The work unifies and simplifies convergence analyses across a range of machine learning methods, suggesting that future analyses can be based on these weaker conditions rather than on strong convexity. While the paper is purely theoretical and contains no experiments, the results imply that standard algorithms converge linearly on a broader class of problems than earlier strong-convexity-based analyses indicated.

In conclusion, the paper provides a significant theoretical shift in understanding gradient and proximal-gradient methods, greatly reducing the complexity of convergence proofs and expanding the applicability of linear convergence guarantees.
