- The paper establishes that the PL inequality, a less restrictive condition than strong convexity, guarantees global linear convergence for various optimization methods.
- It provides rigorous convergence proofs for gradient, coordinate descent, stochastic gradient, and proximal-gradient methods, establishing linear rates even in certain non-convex settings.
- Applications to least squares and logistic regression illustrate the practical impact of leveraging the PL condition in simplifying convergence analyses.
Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition
Summary
This paper revisits the Polyak-Łojasiewicz (PL) inequality, proposed by Polyak in 1963, and repositions it as a less restrictive yet sufficient condition for establishing a global linear convergence rate for gradient methods. The authors challenge analyses that rely on strong convexity (SC), arguing that PL suffices to establish linear convergence for several optimization algorithms, including randomized and greedy coordinate descent methods, sign-based gradient descent methods, stochastic gradient methods, and proximal-gradient methods for non-smooth optimization.
Introduction
Gradient descent and its variants (coordinate descent, stochastic gradient descent) are fundamental tools in large-scale optimization for machine learning. Traditional analyses require strong convexity to ensure linear convergence rates, yet many machine learning problems (e.g., least squares, logistic regression) are not strongly convex. Conditions such as Error Bounds (EB), Essential Strong Convexity (ESC), Weak Strong Convexity (WSC), the Restricted Secant Inequality (RSI), and Quadratic Growth (QG) have been proposed to guarantee linear convergence without SC. The paper revisits the much older Polyak-Łojasiewicz (PL) inequality, shows that, together with the equivalent EB condition, it is the weakest of these that still guarantees convergence to global minimizers, and uses it to simplify convergence proofs.
Polyak-Łojasiewicz Inequality
The PL inequality is given by: $\frac{1}{2}\|\nabla f(x)\|^2 \geq \mu \left(f(x) - f^*\right)$ for all $x$ and some $\mu > 0$.
It implies that every stationary point is a global minimizer, without requiring uniqueness of the solution or even convexity. Notably, PL yields global linear convergence of gradient descent in certain non-convex settings; the paper's example $f(x) = x^2 + 3\sin^2(x)$ is non-convex yet satisfies PL with $\mu = 1/32$. The resulting proofs are also markedly simpler than those based on SC or the alternative conditions above.
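As a concrete illustration (my own sketch, not from the paper), the code below runs gradient descent on that non-convex example and checks the PL inequality along the iterates; $L = 8$ follows from $|f''(x)| = |2 + 6\cos(2x)| \leq 8$, and $\mu = 1/32$ is the constant stated in the paper.

```python
import numpy as np

# Gradient descent on the paper's non-convex PL example f(x) = x^2 + 3 sin^2(x),
# which has f* = 0 at x = 0 and satisfies the PL inequality with mu = 1/32.
f = lambda x: x**2 + 3 * np.sin(x)**2
grad = lambda x: 2 * x + 3 * np.sin(2 * x)   # using d/dx[3 sin^2 x] = 3 sin(2x)

L, mu = 8.0, 1.0 / 32.0   # |f''(x)| = |2 + 6 cos(2x)| <= 8; mu from the paper
x0 = x = 3.0              # arbitrary starting point

for k in range(60):
    # PL check along the path: 0.5 * f'(x)^2 >= mu * (f(x) - f*)
    assert 0.5 * grad(x)**2 >= mu * f(x)
    x -= grad(x) / L      # gradient step with step size 1/L

# Theory predicts f(x_k) - f* <= (1 - mu/L)^k * (f(x_0) - f*)
print(f"f(x_60) = {f(x):.2e}, bound = {(1 - mu/L)**60 * f(x0):.2e}")
```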
Relationships Between Conditions
The paper establishes the following implications among the conditions, for functions with Lipschitz-continuous gradients: $\text{SC} \rightarrow \text{ESC} \rightarrow \text{WSC} \rightarrow \text{RSI} \rightarrow \text{EB} \equiv \text{PL} \rightarrow \text{QG}$
For convex functions the last four coincide: $\text{RSI} \equiv \text{EB} \equiv \text{PL} \equiv \text{QG}$
Thus QG is the least restrictive condition, but it alone does not guarantee that stationary points are global minimizers; PL and EB are therefore the weakest conditions that do, making them the most broadly useful for establishing linear convergence.
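To see why the leftmost implication holds with the same constant $\mu$ (a standard argument, not unique to this paper): minimizing both sides of the strong-convexity bound $f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2$ over $y$ gives $f^*$ on the left and $f(x) - \frac{1}{2\mu}\|\nabla f(x)\|^2$ on the right (the minimum is attained at $y = x - \frac{1}{\mu}\nabla f(x)$), which rearranges to the PL inequality.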
Relevant Problems
Notable special cases satisfying the PL inequality include:
- Strongly-convex functions.
- Compositions $f(x) = g(Ax)$ of a strongly-convex function $g$ with a linear map $A$, e.g., least squares $f(x) = \|Ax - b\|^2$, which satisfies PL even when $A$ is rank-deficient and $f$ is therefore not strongly convex (see the sketch after this list).
- The logistic regression objective, which satisfies PL over any compact set.
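A minimal numerical sketch of the least-squares case (my own illustration; the dimensions are arbitrary). The matrix is wide, so $A^\top A$ is singular and $f$ is not strongly convex, yet gradient descent converges linearly in function value:

```python
import numpy as np

rng = np.random.default_rng(0)
# Rank-deficient least squares: f(x) = 0.5 * ||Ax - b||^2 with a wide A,
# so A^T A is singular and f is not strongly convex, but f = g(Ax) for
# strongly-convex g(z) = 0.5 * ||z - b||^2, hence f satisfies PL.
A = rng.standard_normal((20, 50))
b = rng.standard_normal(20)

L = np.linalg.norm(A, 2)**2          # Lipschitz constant of the gradient
x = np.zeros(50)
for k in range(300):
    x -= (A.T @ (A @ x - b)) / L     # gradient descent with step 1/L

print(0.5 * np.linalg.norm(A @ x - b)**2)   # approaches f* = 0 linearly
```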
Convergence of Huge-Scale Methods
This section presents new convergence analyses, all derived from the PL inequality, for randomized and greedy coordinate descent, stochastic gradient, and sign-based gradient methods.
Randomized Coordinate Descent
For coordinate-wise $L$-Lipschitz continuous gradients, randomized coordinate descent with step size $1/L$ achieves: $\mathbb{E}[f(x_k) - f^*] \leq \left(1 - \frac{\mu}{dL}\right)^k [f(x_0) - f^*]$
where $d$ is the number of coordinates, matching the classical SC rate while applying to the strictly larger PL class.
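A sketch of the update (my own illustration on a random least-squares instance; a practical implementation would maintain the residual $Ax - b$ incrementally rather than recompute it):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 50))
b = rng.standard_normal(20)

# Coordinate-wise Lipschitz constant: max diagonal entry of the Hessian A^T A.
L = (A.T @ A).diagonal().max()

x = np.zeros(50)
for k in range(5000):
    i = rng.integers(50)               # pick a coordinate uniformly at random
    g_i = A[:, i] @ (A @ x - b)        # partial derivative along coordinate i
    x[i] -= g_i / L                    # 1/L step on that coordinate alone

print(0.5 * np.linalg.norm(A @ x - b)**2)   # decays like (1 - mu/(d*L))^k
```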
Greedy Coordinate Descent
Using the Gauss-Southwell (GS) rule, which selects the coordinate with the largest-magnitude partial derivative, greedy coordinate descent achieves: $f(x_k) - f^* \leq \left(1 - \frac{\mu_1}{L}\right)^k [f(x_0) - f^*]$
where $\mu_1$ is the PL constant measured in the $\infty$-norm; since $\mu/d \leq \mu_1 \leq \mu$, the GS rule can be up to $d$ times faster than random coordinate selection.
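The same sketch with greedy selection (again my own illustration); only the coordinate choice changes:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
L = (A.T @ A).diagonal().max()   # coordinate-wise Lipschitz constant

x = np.zeros(50)
for k in range(5000):
    g = A.T @ (A @ x - b)        # full gradient (fine for a small sketch)
    i = np.argmax(np.abs(g))     # Gauss-Southwell: largest partial derivative
    x[i] -= g[i] / L

print(0.5 * np.linalg.norm(A @ x - b)**2)   # decays like (1 - mu_1/L)^k
```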
Stochastic Gradient Methods
Under standard assumptions (bounded stochastic gradients, $\mathbb{E}[\|\nabla f_i(x)\|^2] \leq C^2$), stochastic gradient methods satisfy: $\mathbb{E}[f(x_k) - f^*] \leq \frac{LC^2}{2k\mu^2}$ (decreasing step sizes)
$\mathbb{E}[f(x_k) - f^*] \leq (1 - 2\mu\alpha)^k [f(x_0) - f^*] + \frac{LC^2\alpha}{4\mu}$ (constant step size $\alpha < \frac{1}{2\mu}$)
matching the rates known for strongly-convex functions: $O(1/k)$ convergence with decreasing steps, and linear convergence to a region whose size scales with $\alpha$ for constant steps.
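A constant-step sketch (all values are my own illustrative choices) on a noisy least-squares problem, showing the fast-then-plateau behavior the constant step-size bound predicts:

```python
import numpy as np

rng = np.random.default_rng(2)
# f(x) = (1/m) * sum_i 0.5 * (a_i^T x - b_i)^2, sampling one row per step.
m, n = 200, 20
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)  # noisy targets

x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # minimizer, for reference
f = lambda x: 0.5 * np.mean((A @ x - b)**2)

alpha = 0.01                                    # constant step size
x = np.zeros(n)
for k in range(50000):
    i = rng.integers(m)                         # sample one example uniformly
    x -= alpha * (A[i] @ x - b[i]) * A[i]       # stochastic gradient step

print(f(x) - f(x_star))   # linear decay down to a plateau of size O(alpha)
```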
Proximal-Gradient Generalization
For non-smooth problems of the form $F(x) = f(x) + g(x)$, with $f$ smooth and $g$ convex but possibly non-smooth, the paper proposes the proximal-PL inequality: $\frac{1}{2}\mathcal{D}_g(x, L) \geq \mu \left(F(x) - F^*\right)$
where $\mathcal{D}_g(x, \alpha) \equiv -2\alpha \min_y \left[\langle \nabla f(x), y - x \rangle + \frac{\alpha}{2}\|y - x\|^2 + g(y) - g(x)\right]$. This condition ensures linear convergence of the proximal-gradient method: $F(x_k) - F^* \leq \left(1 - \frac{\mu}{L}\right)^k [F(x_0) - F^*]$.
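A minimal proximal-gradient sketch on $\ell_1$-regularized least squares, one of the problems covered by the proximal-PL analysis (dimensions and $\lambda$ are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(3)
# Lasso: F(x) = f(x) + g(x) with f(x) = 0.5*||Ax - b||^2 and g = lam*||.||_1.
A = rng.standard_normal((40, 100))
b = rng.standard_normal(40)
lam = 0.1
L = np.linalg.norm(A, 2)**2          # Lipschitz constant of grad f

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

x = np.zeros(100)
for k in range(500):
    z = x - (A.T @ (A @ x - b)) / L  # forward (gradient) step on f
    x = soft_threshold(z, lam / L)   # backward (prox) step on g

print(0.5 * np.linalg.norm(A @ x - b)**2 + lam * np.abs(x).sum())
```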
Relevant Problems
Relevant cases include:
- Strongly-convex f.
- $f$ satisfying the PL inequality with $g$ constant (recovering the smooth PL setting).
- Composite objectives with $f(x) = h(Ax)$ for strongly-convex $h$ and polyhedral $g$, covering $\ell_1$-regularized least squares as in the sketch above.
Discussion
By replacing SC with the PL and proximal-PL inequalities, this work unifies and simplifies convergence analyses across a range of machine learning methods, suggesting that future convergence guarantees should be stated in terms of these weaker conditions rather than strong convexity. The paper is purely theoretical and reports no experiments, but its results imply that standard algorithms enjoy stronger guarantees on common problems, such as least squares and logistic regression, than earlier analyses established.
In conclusion, the paper provides a significant theoretical shift in the understanding of gradient and proximal-gradient methods, greatly reducing the complexity of convergence proofs and expanding the applicability of linear convergence guarantees.