Inexact Proximal Gradient Method

Updated 3 July 2025
  • Inexact Proximal Gradient Method is an iterative approach for composite convex problems that allows approximate gradient and prox computations.
  • It recovers the optimal convergence rates of the exact method by controlling error decay: O(1/k) for the non-accelerated and O(1/k²) for the accelerated scheme.
  • The method enables efficient handling of large-scale problems with structured sparsity by adaptively tightening inner solver tolerances as iterations progress.

An inexact proximal gradient method is an iterative optimization approach for composite problems in which the calculation of either the gradient of the smooth term, the proximal operator of the non-smooth term, or both, is allowed to be computationally approximate rather than exact. This methodology is particularly relevant in large-scale convex optimization, structured regularization, and signal processing, where exact subproblem solves are prohibitively expensive or infeasible. Rigorous convergence rate analysis for such inexact methods was first established for both non-accelerated and accelerated schemes, showing that, under appropriate error decay, the efficiency of exact proximal methods can be retained (1109.2415).

1. Mathematical Formulation and Iterative Scheme

A canonical optimization objective is

$$\min_{x \in \mathbb{R}^d} f(x) = g(x) + h(x),$$

where $g$ is convex and differentiable with an $L$-Lipschitz continuous gradient, and $h$ is convex but possibly non-smooth (e.g., inducing structured sparsity). The standard proximal-gradient (PG) iteration is

$$x_k = \operatorname{prox}_{h, L}\left(y_{k-1} - \frac{1}{L} g'(y_{k-1})\right),$$

where $y_{k-1}$ is a momentum/extrapolation point. In the inexact framework, two primary sources of inaccuracy are permitted at each iteration:

  • The gradient of $g$ is approximated up to an error vector $e_k$.
  • The proximal operator is computed only approximately, so that the actual update $x_k$ solves

$$\min_{x} h(x) + \frac{L}{2} \left\| x - \left( y_{k-1} - \frac{1}{L}\bigl(g'(y_{k-1}) + e_k\bigr) \right) \right\|^2$$

up to an additive error $\varepsilon_k$ in function value.

The accelerated inexact scheme replaces $y_{k-1}$ by a combination of previous iterates following Nesterov's extrapolation, but the above error sources are retained.
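To make the iteration concrete, the following is a minimal Python sketch of the inexact (optionally accelerated) proximal-gradient loop. The helpers `approx_grad` (returning a gradient whose error $e_k$ is bounded by a tolerance) and `approx_prox` (solving the proximal subproblem up to accuracy $\varepsilon_k$) are hypothetical user-supplied callables, not part of the paper; the tolerance schedules are discussed in the next section.

```python
import numpy as np

def inexact_prox_grad(approx_grad, approx_prox, x0, L, n_iter=100,
                      grad_tol=None, prox_tol=None, accelerated=True):
    """Sketch of an inexact (accelerated) proximal-gradient method.

    approx_grad(y, tol)    -> approximate gradient of g at y, error ||e_k|| <= tol
    approx_prox(z, L, tol) -> approximate prox_{h,L}(z), accurate to tol in the
                              prox-subproblem objective value (the eps_k above)
    grad_tol, prox_tol     -> callables k -> tolerance at iteration k (assumed)
    """
    x = y = np.asarray(x0, dtype=float)
    t = 1.0  # Nesterov momentum parameter
    for k in range(1, n_iter + 1):
        e_tol = grad_tol(k) if grad_tol else 0.0
        p_tol = prox_tol(k) if prox_tol else 0.0
        grad = approx_grad(y, e_tol)                   # gradient with error e_k
        x_new = approx_prox(y - grad / L, L, p_tol)    # approximate prox step
        if accelerated:
            t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
            y = x_new + ((t - 1.0) / t_new) * (x_new - x)  # extrapolation point
            t = t_new
        else:
            y = x_new
        x = x_new
    return x
```

Setting both tolerances to zero recovers the exact method; the inexact analysis only requires that the tolerance schedules decay fast enough.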

2. Error Criteria and Their Role in Convergence

The convergence analysis requires explicit control on the error sequences $\{e_k\}$ and $\{\varepsilon_k\}$. The core requirements are:

  • For basic (non-accelerated) inexact PG, summability of $\{\|e_k\|\}$ and $\{\sqrt{\varepsilon_k}\}$ guarantees the optimal $O(1/k)$ rate in function value for convex problems.
  • For accelerated inexact PG, a faster decay is essential: $\{k \|e_k\|\}$ and $\{k \sqrt{\varepsilon_k}\}$ must both be summable to achieve the accelerated $O(1/k^2)$ rate.
  • In strongly convex cases, geometric (linear) error decay suffices: as long as the errors contract at a rate faster than the method's own contraction factor, the linear convergence of the exact method is preserved.

The specific bounds are given via aggregate error sums, for example,

$$A_k = \sum_{i=1}^k \left( \frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}} \right), \qquad B_k = \sum_{i=1}^k \frac{\varepsilon_i}{L},$$

and appear in the function value convergence bounds.

A notable conclusion is that per-iteration computations can be relaxed early in the optimization, provided error tolerances are gradually tightened according to these prescriptions.
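One simple way to encode these decay conditions is to hand the outer loop explicit tolerance schedules. The sketch below is one possible choice consistent with the summability requirements; the constants `c` and `delta` are free parameters, not values prescribed by the paper. These functions can be wrapped (e.g., `lambda k: grad_tol_schedule(k)`) and passed as the `grad_tol`/`prox_tol` hooks of the loop sketched earlier.

```python
def grad_tol_schedule(k, c=1.0, delta=0.5, accelerated=True):
    """Tolerance on the gradient error ||e_k|| at iteration k.

    Summability requirements from the error criteria above:
      non-accelerated: sum ||e_k|| < inf     ->  ||e_k|| = O(1/k^(1+delta))
      accelerated:     sum k*||e_k|| < inf   ->  ||e_k|| = O(1/k^(2+delta))
    """
    power = 2.0 + delta if accelerated else 1.0 + delta
    return c / k ** power

def prox_tol_schedule(k, c=1.0, delta=0.5, accelerated=True):
    """Tolerance on the prox-subproblem error eps_k at iteration k.

    The bounds involve sqrt(eps_k), so eps_k must decay twice as fast:
      non-accelerated: sqrt(eps_k) summable    ->  eps_k = O(1/k^(2+2*delta))
      accelerated:     k*sqrt(eps_k) summable  ->  eps_k = O(1/k^(4+2*delta))
    """
    power = 4.0 + 2.0 * delta if accelerated else 2.0 + 2.0 * delta
    return c / k ** power
```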

3. Convergence Rates and Sensitivity to Inexactness

The key convergence results are as follows:

  • Basic inexact PG (convex, non-strongly convex):

$$f\left(\frac{1}{k}\sum_{i=1}^k x_i \right) - f(x^\ast) \leq \frac{L}{2k} \left( \| x_0 - x^\ast\| + 2A_k + \sqrt{2B_k} \right)^2$$

With rapidly decaying errors, the standard $O(1/k)$ rate for exact PG is retained.

  • Accelerated inexact PG:

$$f(x_k) - f(x^\ast) \leq \frac{2L}{(k+1)^2} \left( \| x_0 - x^\ast\| + 2\widetilde{A}_k + \sqrt{2\widetilde{B}_k} \right)^2$$

where $\widetilde{A}_k$ and $\widetilde{B}_k$ incorporate error sums weighted by the iteration index (hence the faster decay needed).

  • Strongly convex case: Linear convergence is possible provided the error contracts geometrically faster than the method's contraction factor.

In all cases, the accelerated method is more sensitive to inexactness than the non-accelerated method; inadequate error decay can degrade or destroy the accelerated rate.

| Method | Error-free Rate | Error Decay Needed | Achieved Rate | Sensitivity |
|---|---|---|---|---|
| Non-accelerated (convex) | $O(1/k)$ | $O(1/k^{1+\delta})$ | $O(1/k)$ | Less |
| Accelerated (convex) | $O(1/k^2)$ | $O(1/k^{2+\delta})$ | $O(1/k^2)$ | More |
| Strongly convex, non-accelerated | Linear | Geometric (rate $< 1-\mu/L$) | Linear | |
| Strongly convex, accelerated | Linear | Geometric (rate $< 1-\sqrt{\mu/L}$) | Linear | |
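The sensitivity difference can also be checked numerically: with a gradient-error decay $\|e_k\| = 1/k^p$, the plain-PG requirement $\sum_k \|e_k\| < \infty$ holds for any $p > 1$, whereas the accelerated requirement $\sum_k k\|e_k\| < \infty$ needs $p > 2$. A small illustrative sketch (not an experiment from the paper):

```python
import numpy as np

K = np.arange(1, 200_001)
for p in (1.5, 2.0, 2.5):
    e = 1.0 / K ** p                 # hypothetical gradient-error decay ||e_k||
    plain = np.cumsum(e)             # partial sums of ||e_k||   (non-accelerated requirement)
    accel = np.cumsum(K * e)         # partial sums of k*||e_k|| (accelerated requirement)
    print(f"p={p}: sum ||e_k|| ~ {plain[-1]:.2f}, sum k*||e_k|| ~ {accel[-1]:.2f}")
# p=1.5 and p=2.0 keep the plain sum bounded, but the weighted (accelerated)
# sum stays bounded only for p > 2 -- too-slow decay destroys the O(1/k^2) rate.
```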

4. Implications for Structured Sparsity and Complex Proximal Operators

In many modern high-dimensional settings, $h$ encodes structured sparsity via regularizers such as group lasso, total variation, or nuclear norm, for which the proximal mapping does not have a closed form and must be found by iterative methods. The paper demonstrates that:

  • Proximal steps can be computed approximately (using, for example, a fixed number of block coordinate descent steps or solving to a specified duality gap), provided the errors satisfy the required decay.
  • Larger errors are acceptable at earlier iterations; only the later iterates (closer to optimality) require high accuracy, which yields significant computational savings overall.
  • This justifies an "adaptive accuracy" policy—tightening the inner solver stopping criteria as the outer PG/accelerated PG converges.

For instance, in overlapping group lasso with hundreds of groups, the inner dual subproblem can be solved only to a moderate gap early on; empirical evidence in the paper shows that this results in faster overall progress.
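As a concrete illustration of an inexact proximal step, the sketch below computes the prox of a 1-D total-variation regularizer by projected gradient ascent on its dual, stopping once the duality gap (which upper-bounds the subproblem suboptimality, i.e., the role played by $\varepsilon_k$) falls below a requested tolerance. This is an illustrative inner solver under stated assumptions, not the specific solver used in the paper's experiments.

```python
import numpy as np

def prox_tv1d_approx(z, lam, tol, max_iter=5000):
    """Approximate prox of lam*TV for a 1-D signal z.

    Prox subproblem:  min_x 0.5*||x - z||^2 + lam * sum_i |x[i+1] - x[i]|.
    Solved by projected gradient on the box-constrained dual; the duality gap
    bounds the subproblem suboptimality, so `tol` controls the eps_k accuracy.
    """
    z = np.asarray(z, dtype=float)
    p = np.zeros(z.size - 1)                       # dual variable, one per difference
    step = 0.25                                    # 1 / ||D D^T||, since ||D D^T|| <= 4
    x = z.copy()
    for _ in range(max_iter):
        x = z + np.diff(np.r_[0.0, p, 0.0])        # x = z - D^T p
        dx = np.diff(x)                            # D x
        gap = lam * np.abs(dx).sum() - p @ dx      # duality gap >= 0
        if gap <= tol:
            break
        p = np.clip(p + step * dx, -lam, lam)      # projected (ascent) gradient step
    return x

# Example: looser tolerance early, tighter tolerance later (adaptive accuracy).
z = np.random.default_rng(0).normal(size=50)
x_rough = prox_tv1d_approx(z, lam=0.5, tol=1e-2)   # cheap, for early outer iterations
x_tight = prox_tv1d_approx(z, lam=0.5, tol=1e-8)   # expensive, for late outer iterations
```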

5. Guiding Principles for Algorithm Design

A plausible implication is that real-world implementations should:

  • Allocate computational effort to lower inner accuracy (looser tolerances) at early iterations and increase it over time.
  • Monitor and adjust error tolerances based on the convergence analysis; for accelerated PG, the decay should be faster than for standard PG.
  • Determine error decay schedules (e.g., $\varepsilon_k \sim O(1/k^{2+\delta})$) in advance or adaptively.
  • For structured problems, exploit inexactness in the proximity computation to save iterations, as long as convergence criteria are not violated.

This adaptive strategy is theoretically supported and empirically justified for a broad range of composite convex optimization problems.
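Putting these principles together, an adaptive-accuracy accelerated loop might look like the following sketch. It reuses the `prox_tv1d_approx` inner solver sketched above; the function `fista_tv_adaptive`, its tolerance schedule, and the least-squares smooth term are illustrative choices, not the paper's exact experimental setup.

```python
import numpy as np

def fista_tv_adaptive(A, b, lam, L, n_iter=200, delta=0.5):
    """Accelerated PG for 0.5*||A x - b||^2 + lam*TV(x), with the prox solved
    inexactly to a tolerance tightened as eps_k ~ 1/k^(4+2*delta), so that
    k*sqrt(eps_k) is summable as the accelerated analysis requires.

    L is the Lipschitz constant of the smooth gradient, e.g. np.linalg.norm(A, 2)**2.
    """
    x = y = np.zeros(A.shape[1])
    t = 1.0
    for k in range(1, n_iter + 1):
        grad = A.T @ (A @ y - b)                    # exact gradient of the smooth term
        eps_k = 1.0 / k ** (4.0 + 2.0 * delta)      # inner (prox) tolerance schedule
        # prox_{h,L}(z) with h = lam*TV reduces to the prox of (lam/L)*TV;
        # constant factors in L for the gap are glossed over in this sketch.
        x_new = prox_tv1d_approx(y - grad / L, lam / L, eps_k)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x
```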

6. Practical Impact and Summary

Inexact proximal gradient methods, as established in this work, enable the use of cheaper, inexact subroutines without loss of overall convergence speed, provided that error sequences are carefully controlled. In many contemporary applications—such as high-dimensional regression, sparse signal recovery, image restoration, and structured prediction models—this allows larger problem instances and more complex regularizers to be handled efficiently.

The principal takeaway is that the global convergence rates of classical proximal-gradient and accelerated schemes can be preserved under inexactness, provided errors in gradient computation and proximal solution decay according to explicit schedules. Adaptive and problem-structure-exploiting implementations are theoretically and practically advantageous for large-scale optimization.

References
  1. M. Schmidt, N. Le Roux, F. Bach. "Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization." arXiv:1109.2415.