
A Proximal Stochastic Gradient Method with Progressive Variance Reduction (1403.4699v1)

Published 19 Mar 2014 in math.OC and stat.ML

Abstract: We consider the problem of minimizing the sum of two convex functions: one is the average of a large number of smooth component functions, and the other is a general convex function that admits a simple proximal mapping. We assume the whole objective function is strongly convex. Such problems often arise in machine learning, known as regularized empirical risk minimization. We propose and analyze a new proximal stochastic gradient method, which uses a multi-stage scheme to progressively reduce the variance of the stochastic gradient. While each iteration of this algorithm has similar cost as the classical stochastic gradient method (or incremental gradient method), we show that the expected objective value converges to the optimum at a geometric rate. The overall complexity of this method is much lower than both the proximal full gradient method and the standard proximal stochastic gradient method.

Citations (729)

Summary

  • The paper presents a proximal stochastic gradient method (Prox-SVRG) that achieves geometric convergence by progressively reducing gradient variance under strong convexity.
  • It employs a multi-stage scheme with periodic full gradient computations to enhance efficiency in large-scale, regularized empirical risk minimization problems.
  • Experimental results demonstrate that Prox-SVRG outperforms both traditional full and stochastic gradient methods, as well as advanced algorithms like Prox-SDCA.

A Proximal Stochastic Gradient Method with Progressive Variance Reduction

The paper "A Proximal Stochastic Gradient Method with Progressive Variance Reduction" by Lin Xiao and Tong Zhang addresses the optimization problem of minimizing the sum of two convex functions: one is an average of a large number of smooth components, and the other is a general convex function that admits a straightforward proximal mapping. This type of problem often arises in machine learning, particularly in the context of regularized empirical risk minimization (ERM).

Problem Definition and Background

The optimization problem under consideration is:

$$\min_{x \in \mathbb{R}^d} P(x) = F(x) + R(x),$$

where

$$F(x) = \frac{1}{n} \sum_{i=1}^n f_i(x),$$

and $R(x)$ is a convex function that may be non-differentiable. The authors assume that $P(x)$ is strongly convex.
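For instance, with the $\ell_1$ regularizer $R(x) = \lambda \|x\|_1$ (as in the lasso), the "simple proximal mapping" the paper assumes has a closed-form soft-thresholding solution. A minimal NumPy sketch (the function name `prox_l1` is ours, not the paper's):

```python
import numpy as np

def prox_l1(x, t):
    # prox_{t*||.||_1}(x) = argmin_u  t*||u||_1 + 0.5*||u - x||^2,
    # which has the closed form sign(x) * max(|x| - t, 0) (soft-thresholding).
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
```

Any regularizer whose proximal mapping is this cheap to evaluate fits the framework.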

Such a framework is highly relevant in machine learning for problems like least-squares regression and logistic regression, especially when dealing with massive datasets. Handling these large-scale problems efficiently requires scalable algorithms, prompting the need for methods beyond conventional full gradient or simple stochastic gradient approaches.

Proximal Stochastic Gradient Method

The proposed algorithm, Prox-SVRG (Proximal Stochastic Variance-Reduced Gradient), is a multi-stage stochastic gradient method that progressively reduces the variance of the stochastic gradient estimates. This variance is the drawback inherent in traditional stochastic gradient methods: it forces diminishing step sizes and degrades convergence rates, especially in strongly convex settings.

Key Features

  • Multi-Stage Scheme: At each stage, the algorithm computes a full gradient periodically while performing a series of proximal stochastic gradient steps based on a modified gradient that includes variance reduction.
  • High Computational Efficiency: Each iteration has a cost similar to a classic stochastic gradient method, but Prox-SVRG achieves geometric convergence in expectation, significantly improving overall computational complexity.
  • Theoretical Guarantees: Under standard conditions (Lipschitz continuity of gradients and strong convexity), the method converges with complexity $O\left( (n + L_Q / \mu) \log (1/\epsilon) \right)$, where $L_Q$ is a condition number influenced by the sampling distribution and the scaling constants of the component functions.
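The multi-stage scheme above can be sketched on a concrete lasso-type instance, $\min_x \frac{1}{2n}\|Ax - b\|^2 + \lambda \|x\|_1$. This is a simplified illustration, not the paper's reference implementation; the function names, the uniform sampling, and the heuristic parameter choices are our assumptions:

```python
import numpy as np

def prox_l1(x, t):
    # Proximal mapping of t * ||.||_1 (soft-thresholding).
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_svrg(A, b, lam, eta, stages, m, seed=0):
    # Prox-SVRG sketch for min_x (1/2n)||Ax - b||^2 + lam*||x||_1,
    # with f_i(x) = 0.5*(a_i^T x - b_i)^2 and R(x) = lam*||x||_1.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x_tilde = np.zeros(d)                        # stage snapshot
    for _ in range(stages):
        g_tilde = A.T @ (A @ x_tilde - b) / n    # periodic full gradient
        x = x_tilde.copy()
        running_sum = np.zeros(d)
        for _ in range(m):                       # inner stochastic steps
            i = rng.integers(n)
            a = A[i]
            # Variance-reduced stochastic gradient.
            v = a * (a @ x - b[i]) - a * (a @ x_tilde - b[i]) + g_tilde
            x = prox_l1(x - eta * v, eta * lam)  # proximal step, constant eta
            running_sum += x
        x_tilde = running_sum / m                # average inner iterates
    return x_tilde
```

Each stage costs one full gradient plus $m$ cheap stochastic steps, which is how the $O(n + L_Q/\mu)$ per-accuracy-digit cost arises.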

Convergence Analysis

The strength of Prox-SVRG lies in its theoretical backing. The authors derive a comprehensive convergence analysis showing that the variance of the modified stochastic gradient can be controlled and driven down over iterations, which permits a constant step size and yields the geometric convergence rate.
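Concretely, at inner iteration $k$ of a stage with snapshot $\tilde{x}$, the method draws a random index $i_k$ and uses the variance-reduced gradient

$$v_k = \nabla f_{i_k}(x_{k-1}) - \nabla f_{i_k}(\tilde{x}) + \nabla F(\tilde{x}),$$

which is unbiased, $\mathbb{E}\left[ v_k \mid x_{k-1} \right] = \nabla F(x_{k-1})$, and whose variance shrinks as both $x_{k-1}$ and $\tilde{x}$ approach the optimum; this vanishing variance is what allows the constant step size.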

Numerical Results

Experiments confirm the practical efficiency of Prox-SVRG, significantly outperforming both full gradient methods (e.g., Prox-FG) and standard proximal stochastic gradient methods (Prox-SG). Notably, Prox-SVRG performs comparably or superiorly to other advanced methods like Prox-SDCA and Prox-SAG, especially in cases with large condition numbers and substantial component-wise variability.

Implications and Future Directions

Practically, Prox-SVRG can handle large-scale optimization problems typical in machine learning scenarios, such as training logistic regression models on extensive datasets. Theoretically, the introduction of variance reduction within a proximal framework sheds light on bridging the gap between the stochastic and full gradient methods, offering a pathway to devising even more efficient algorithms.

Future research could explore:

  • Extensions of Prox-SVRG to non-convex settings often encountered in deep learning.
  • Adaptive methods that automatically balance the trade-off between gradient evaluations and variance reduction steps.
  • Improved analysis and understanding of weighted sampling strategies to further optimize Prox-SVRG's performance.

Conclusion

Lin Xiao and Tong Zhang's work presents a substantial advance in the domain of proximal stochastic gradient methods, particularly in handling large datasets with a structured composite objective function. Prox-SVRG not only enhances convergence rates but also provides a robust theoretical foundation applicable to various machine learning applications, marking a significant step toward more efficient large-scale optimization algorithms.