Inexact Proximal Gradient Method
- The inexact proximal gradient method is an iterative approach for composite convex problems that allows approximate gradient and proximal computations.
- It retains the optimal convergence rates by controlling error decay, achieving O(1/k) for non-accelerated and O(1/k²) for accelerated schemes.
- The method enables efficient handling of large-scale problems with structured sparsity by adaptively tightening inner solver tolerances as iterations progress.
An inexact proximal gradient method is an iterative optimization approach for composite problems in which the calculation of either the gradient of the smooth term, the proximal operator of the non-smooth term, or both, is allowed to be computationally approximate rather than exact. This methodology is particularly relevant in large-scale convex optimization, structured regularization, and signal processing, where exact subproblem solves are prohibitively expensive or infeasible. Rigorous convergence rate analysis for such inexact methods was first established for both non-accelerated and accelerated schemes, showing that, under appropriate error decay, the efficiency of exact proximal methods can be retained (1109.2415).
1. Mathematical Formulation and Iterative Scheme
A canonical optimization objective is

$$\min_{x \in \mathbb{R}^d} \; F(x) := f(x) + g(x),$$

where $f$ is convex and differentiable with an $L$-Lipschitz continuous gradient, and $g$ is convex but possibly non-smooth (e.g., inducing structured sparsity). The standard proximal-gradient (PG) iteration is

$$x_k = \operatorname{prox}_{g/L}\!\left(y_k - \tfrac{1}{L}\nabla f(y_k)\right), \qquad \operatorname{prox}_{g/L}(z) := \arg\min_{x}\; \tfrac{L}{2}\|x - z\|^2 + g(x),$$

where $y_k$ is a momentum/extrapolation point (equal to $x_{k-1}$ in the non-accelerated method). In the inexact framework, two primary sources of inaccuracy are permitted at each iteration:
- The gradient of $f$ is approximated up to an error vector $e_k$, i.e., the iteration uses $\nabla f(y_k) + e_k$.
- The proximal operator is computed only approximately, so that the actual update $x_k$ solves

$$\min_{x}\; \tfrac{L}{2}\Bigl\|x - \Bigl(y_k - \tfrac{1}{L}\bigl(\nabla f(y_k) + e_k\bigr)\Bigr)\Bigr\|^2 + g(x)$$

up to an additive error $\varepsilon_k$ in function value.

The accelerated inexact scheme replaces $y_k$ by a Nesterov-style extrapolation of the two previous iterates $x_{k-1}$ and $x_{k-2}$, but the above error sources are retained.
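A minimal sketch of this iteration is given below, assuming user-supplied callbacks `grad_f_approx` and `prox_approx`; these names and the interface are illustrative, not an API from the paper.

```python
import numpy as np

def inexact_apg(grad_f_approx, prox_approx, x0, L, n_iters=200):
    """Minimal sketch of an inexact accelerated proximal-gradient loop.

    grad_f_approx(y, k) -> approximate gradient of f at y (error e_k allowed).
    prox_approx(z, k)   -> approximate prox of g/L at z (additive error eps_k
                           in the subproblem objective allowed).
    Both callbacks receive the outer iteration counter k so they can tighten
    their internal tolerances as k grows.
    """
    x_prev = np.asarray(x0, dtype=float).copy()
    x = x_prev.copy()
    for k in range(1, n_iters + 1):
        # Nesterov extrapolation point (standard (k-1)/(k+2) momentum weight);
        # setting beta = 0 recovers the non-accelerated scheme.
        beta = (k - 1) / (k + 2)
        y = x + beta * (x - x_prev)
        # Approximate gradient step followed by an approximate proximal step.
        z = y - grad_f_approx(y, k) / L
        x_prev, x = x, prox_approx(z, k)
    return x
```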
2. Error Criteria and Their Role in Convergence
The convergence analysis requires explicit control on the error sequences $\{\|e_k\|\}$ (gradient errors) and $\{\varepsilon_k\}$ (proximal errors). The core requirements are:
- For basic (non-accelerated) inexact PG, summability of $\{\|e_k\|\}$ and $\{\sqrt{\varepsilon_k}\}$ guarantees the optimal $O(1/k)$ rate in function value for convex problems.
- For accelerated inexact PG, a faster decay is essential: $\{k\|e_k\|\}$ and $\{k\sqrt{\varepsilon_k}\}$ must both be summable to achieve the accelerated $O(1/k^2)$ rate.
- In strongly convex cases, geometric (linear) error decay is required: if $\|e_k\|$ and $\sqrt{\varepsilon_k}$ decrease geometrically at a rate faster than the method's contraction factor, the linear convergence rate of the exact method is preserved.
The specific bounds are stated via aggregate error sums, for example,

$$\mathcal{A}_k = \sum_{i=1}^{k}\left(\frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}}\right), \qquad \mathcal{B}_k = \sum_{i=1}^{k}\frac{\varepsilon_i}{L},$$

which appear in the function-value convergence bounds.
A notable conclusion is that per-iteration computations can be relaxed early in the optimization, provided error tolerances are gradually tightened according to these prescriptions.
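As a quick numerical illustration of this point, the sketch below evaluates the aggregate sums $\mathcal{A}_k$ and $\mathcal{B}_k$ for a polynomially decaying error schedule; the constants and decay exponents are illustrative choices satisfying the summability conditions, not values from the paper.

```python
import numpy as np

# Illustrative error schedules: ||e_i|| = 1/i^2 and eps_i = 1/i^3, which make
# both ||e_i|| and sqrt(eps_i) summable (the basic-PG requirement).
L = 10.0
i = np.arange(1, 10_001)
e_norm = 1.0 / i**2
eps = 1.0 / i**3

A = np.cumsum(e_norm / L + np.sqrt(2.0 * eps / L))   # aggregate sum A_k
B = np.cumsum(eps / L)                               # aggregate sum B_k

# The partial sums flatten out quickly: the early, loose iterations consume
# most of the error budget, yet A_k and B_k stay bounded as k grows.
for k in (10, 100, 1_000, 10_000):
    print(f"k={k:6d}  A_k={A[k-1]:.4f}  B_k={B[k-1]:.4f}")
```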
3. Convergence Rates and Sensitivity to Inexactness
The key convergence results are as follows:
- Basic inexact PG (convex, not strongly convex): the function-value error of the averaged iterate $\bar{x}_k = \tfrac{1}{k}\sum_{i=1}^{k} x_i$ satisfies
  $$F(\bar{x}_k) - F(x^\star) \;\le\; \frac{L}{2k}\Bigl(\|x_0 - x^\star\| + 2\mathcal{A}_k + \sqrt{2\mathcal{B}_k}\Bigr)^2.$$
  With rapidly decaying errors, $\mathcal{A}_k$ and $\mathcal{B}_k$ remain bounded and the standard $O(1/k)$ rate of exact PG is retained.
- Accelerated inexact PG:
  $$F(x_k) - F(x^\star) \;\le\; \frac{2L}{(k+1)^2}\Bigl(\|x_0 - x^\star\| + 2\tilde{\mathcal{A}}_k + \sqrt{2\tilde{\mathcal{B}}_k}\Bigr)^2,$$
  where $\tilde{\mathcal{A}}_k = \sum_{i=1}^{k} i\left(\frac{\|e_i\|}{L} + \sqrt{\frac{2\varepsilon_i}{L}}\right)$ and $\tilde{\mathcal{B}}_k = \sum_{i=1}^{k} \frac{i^2\varepsilon_i}{L}$ incorporate error sums weighted by the iteration index (hence the faster decay needed).
- Strongly convex case: Linear convergence is possible provided the error contracts geometrically faster than the method's contraction factor.
In all cases, the accelerated method is more sensitive to inexactness than the non-accelerated method; inadequate error decay can degrade or destroy the accelerated rate.
| Method | Error-free Rate | Error Decay Needed | Achieved Rate | Sensitivity |
|---|---|---|---|---|
| Non-accelerated (convex) | $O(1/k)$ | $\lVert e_k\rVert$, $\sqrt{\varepsilon_k}$ summable | $O(1/k)$ | Less |
| Accelerated (convex) | $O(1/k^2)$ | $k\lVert e_k\rVert$, $k\sqrt{\varepsilon_k}$ summable | $O(1/k^2)$ | More |
| Strongly convex, non-accelerated | Linear | Geometric, faster than the contraction factor | Linear | — |
| Strongly convex, accelerated | Linear | Geometric, faster than the contraction factor | Linear | — |
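To make the sensitivity contrast in the table concrete, the sketch below plugs two hypothetical error schedules into the bounds reconstructed above; the constants, schedules, and distance $\|x_0 - x^\star\|$ are illustrative, not values from the paper's experiments.

```python
import numpy as np

def basic_bound(e, eps, L, R, k):
    """Evaluate L/(2k) * (R + 2*A_k + sqrt(2*B_k))^2 for the basic method."""
    A = np.sum(e[:k] / L + np.sqrt(2.0 * eps[:k] / L))
    B = np.sum(eps[:k] / L)
    return L / (2 * k) * (R + 2 * A + np.sqrt(2 * B)) ** 2

def accel_bound(e, eps, L, R, k):
    """Evaluate 2L/(k+1)^2 * (R + 2*A~_k + sqrt(2*B~_k))^2 for the accelerated method."""
    i = np.arange(1, k + 1)
    A = np.sum(i * (e[:k] / L + np.sqrt(2.0 * eps[:k] / L)))
    B = np.sum(i**2 * eps[:k] / L)
    return 2 * L / (k + 1) ** 2 * (R + 2 * A + np.sqrt(2 * B)) ** 2

L_const, R, K = 10.0, 1.0, 10_000
i = np.arange(1, K + 1)
slow_e, slow_eps = 1.0 / i**2, 1.0 / i**3   # summable for basic PG, too slow once weighted by i
fast_e, fast_eps = 1.0 / i**3, 1.0 / i**5   # fast enough for the accelerated conditions

for k in (100, 1_000, 10_000):
    print(k,
          basic_bound(slow_e, slow_eps, L_const, R, k),   # keeps shrinking like 1/k
          accel_bound(slow_e, slow_eps, L_const, R, k),   # degrades: weighted sums keep growing
          accel_bound(fast_e, fast_eps, L_const, R, k))   # preserves the 1/k^2 behaviour
```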
4. Implications for Structured Sparsity and Complex Proximal Operators
In many modern high-dimensional settings, $g$ encodes structured sparsity via regularizers such as the group lasso, total variation, or the nuclear norm, for which the proximal mapping has no closed form and must be computed by an iterative inner solver. The paper demonstrates that:
- Proximal steps can be computed approximately (using, for example, a fixed number of block coordinate descent steps or solving to a specified duality gap), provided the errors satisfy the required decay.
- Larger errors are acceptable at earlier iterations, leading to significant computational savings since later iterates (closer to optimality) require more accuracy.
- This justifies an "adaptive accuracy" policy—tightening the inner solver stopping criteria as the outer PG/accelerated PG converges.
For instance, in overlapping group lasso with hundreds of groups, the inner dual subproblem can be solved only to a moderate gap early on; empirical evidence in the paper shows that this results in faster overall progress.
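As a concrete illustration of an inexact proximal step controlled by a duality-gap tolerance, the sketch below evaluates the proximal operator of a one-dimensional total-variation penalty by projected gradient ascent on its dual, stopping once the gap falls below a prescribed tolerance. Total variation stands in here for the paper's overlapping group lasso example purely for brevity; the pattern (inner dual solve, gap-based stopping, looser tolerances at early outer iterations) is the same.

```python
import numpy as np

def D(x):   # forward differences: (D x)_i = x[i+1] - x[i]
    return np.diff(x)

def Dt(u):  # adjoint of D
    out = np.zeros(len(u) + 1)
    out[:-1] -= u
    out[1:] += u
    return out

def prox_tv_inexact(z, lam, tol, max_iter=10_000):
    """Approximate prox of lam*TV at z:  min_x 0.5*||x - z||^2 + lam*||D x||_1.

    Solved by projected gradient ascent on the dual
        max_{|u_i| <= lam}  u^T D z - 0.5*||D^T u||^2,
    stopping when the duality gap  lam*||D x||_1 - u^T (D x)  drops below tol.
    """
    u = np.zeros(len(z) - 1)
    step = 0.25                       # 1 / ||D D^T||, since ||D D^T|| <= 4
    x, gap = z.copy(), np.inf
    for _ in range(max_iter):
        x = z - Dt(u)                 # primal point recovered from the dual variable
        gap = lam * np.abs(D(x)).sum() - u @ D(x)
        if gap <= tol:
            break
        u = np.clip(u + step * D(x), -lam, lam)   # projected dual ascent step
    return x, gap

# Usage: loosen the inner tolerance early and tighten it as the outer iteration k grows.
z = np.random.default_rng(0).normal(size=200)
for k in (1, 10, 100):
    x_k, gap_k = prox_tv_inexact(z, lam=0.5, tol=1.0 / k**4)
    print(k, gap_k)
```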
5. Guiding Principles for Algorithm Design
A plausible implication is that real-world implementations should:
- Use looser inner tolerances (lower accuracy) at early iterations and tighten them as the iterations progress.
- Monitor and adjust error tolerances based on the convergence analysis; for accelerated PG, the decay should be faster than for standard PG.
- Determine error decay schedules (e.g., $\varepsilon_k = O(1/k^{2+\delta})$ for basic PG and $\varepsilon_k = O(1/k^{4+\delta})$ for accelerated PG, with $\delta > 0$) in advance or adaptively; a minimal schedule sketch follows this list.
- For structured problems, exploit inexactness in the proximity computation to save iterations, as long as convergence criteria are not violated.
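A minimal sketch of such a schedule, assuming the hypothetical helper names below and the polynomial rates suggested above:

```python
def inner_tolerance(k, accelerated=True, c=1e-2, delta=0.1):
    """Per-outer-iteration tolerance eps_k for the inexact prox subproblem.

    The exponents are chosen so that sqrt(eps_k) (basic PG) or k*sqrt(eps_k)
    (accelerated PG) is summable, matching the decay conditions above.
    """
    power = 4 + delta if accelerated else 2 + delta
    return c / k**power

# Example wiring at outer iteration k (prox_approx is any inner solver that
# stops at a given duality gap, such as the TV sketch in the previous section):
#   x_k = prox_approx(z_k, tol=inner_tolerance(k))
```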
This adaptive strategy is theoretically supported and empirically justified for a broad range of composite convex optimization problems.
6. Practical Impact and Summary
Inexact proximal gradient methods, as established in this work, enable the use of cheaper, inexact subroutines without loss of overall convergence speed, provided that error sequences are carefully controlled. In many contemporary applications—such as high-dimensional regression, sparse signal recovery, image restoration, and structured prediction models—this allows larger problem instances and more complex regularizers to be handled efficiently.
The principal takeaway is that the global convergence rates of classical proximal-gradient and accelerated schemes can be preserved under inexactness, provided errors in gradient computation and proximal solution decay according to explicit schedules. Adaptive and problem-structure-exploiting implementations are theoretically and practically advantageous for large-scale optimization.