
Non-convex Finite-Sum Optimization Via SCSG Methods (1706.09156v4)

Published 28 Jun 2017 in math.OC and cs.CC

Abstract: We develop a class of algorithms, as variants of the stochastically controlled stochastic gradient (SCSG) methods (Lei and Jordan, 2016), for the smooth non-convex finite-sum optimization problem. Assuming the smoothness of each component, the complexity of SCSG to reach a stationary point with $\mathbb{E}\|\nabla f(x)\|^{2}\le \epsilon$ is $O\left(\min\{\epsilon^{-5/3}, \epsilon^{-1}n^{2/3}\}\right)$, which strictly outperforms the stochastic gradient descent. Moreover, SCSG is never worse than the state-of-the-art methods based on variance reduction and it significantly outperforms them when the target accuracy is low. A similar acceleration is also achieved when the functions satisfy the Polyak-Lojasiewicz condition. Empirical experiments demonstrate that SCSG outperforms stochastic gradient methods on training multi-layers neural networks in terms of both training and validation loss.

Citations (239)

Summary

  • The paper demonstrates that SCSG methods achieve complexity of O(min{ε^(-5/3), ε^(-1) n^(2/3)}) to reach stationary points, advancing theoretical bounds over SGD.
  • Experiments on multi-layer neural networks show that SCSG consistently attains lower training and validation loss than standard stochastic gradient methods.
  • The work extends SCSG under the Polyak-Lojasiewicz condition, paving the way for research integrating momentum and adaptive stepsizes in nonconvex optimization.
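For reference, the Polyak-Lojasiewicz (PL) condition cited in the last bullet is the standard gradient-dominance inequality: for some constant $\mu > 0$ (notation ours),

$$f(x) - \inf_{y} f(y) \;\le\; \frac{1}{2\mu}\,\|\nabla f(x)\|^{2} \quad \text{for all } x,$$

so a small gradient norm forces a small objective gap; it holds for strongly convex functions but also for many nonconvex ones.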

Overview of Nonconvex Finite-Sum Optimization via SCSG Methods

This paper introduces a class of algorithms based on the stochastically controlled stochastic gradient (SCSG) methods for the smooth nonconvex finite-sum optimization problem, i.e., minimizing $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$ with each component $f_i$ smooth but possibly nonconvex. These methods carry variance reduction into the nonconvex regime and improve on both the theoretical bounds and the empirical performance of stochastic gradient descent (SGD) and of existing variance-reduction techniques.
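To make the structure of the method concrete, here is a minimal NumPy sketch of an SCSG-style loop: an outer batch-gradient step followed by a geometrically distributed number of SVRG-style inner steps. It is an illustrative paraphrase under our own assumptions, not the authors' reference implementation; the helper `grad_avg`, the batch sizes `B` and `b`, and the stepsize `eta` are hypothetical placeholders.

```python
import numpy as np

def scsg(grad_avg, x0, n, epochs=50, B=256, b=8, eta=0.05, seed=0):
    """Illustrative SCSG-style loop (a sketch, not the paper's code).

    grad_avg(x, idx): average gradient of the components f_i, i in idx, at x.
    n: number of components in the finite sum f(x) = (1/n) * sum_i f_i(x).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(epochs):
        # Outer step: estimate the full gradient on a random batch of size B << n.
        batch = rng.choice(n, size=B, replace=False)
        g_ref = grad_avg(x, batch)   # reference gradient at the snapshot
        x_ref = x.copy()             # snapshot point
        # Geometric number of inner steps: expected inner cost is O(B/b)
        # mini-batches per epoch, independent of n.
        N = rng.geometric(b / (B + b))
        for _ in range(N):
            mini = rng.choice(n, size=b, replace=False)
            # SVRG-style variance-reduced gradient estimate.
            nu = grad_avg(x, mini) - grad_avg(x_ref, mini) + g_ref
            x = x - eta * nu
    return x
```

Because the reference gradient is computed on a batch of size B rather than on all n components, each epoch costs O(B) gradient evaluations in expectation; this is what lets SCSG avoid the full passes over the data that classical variance-reduction methods such as SVRG require.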

Key Contributions

The authors present a comprehensive analysis of SCSG tailored for nonconvex optimization, revealing several important findings:

  1. Complexity Analysis: The paper establishes that SCSG reaches a stationary point with $\mathbb{E}\|\nabla f(x)\|^2 \leq \epsilon$ at a cost of $O(\min\{\epsilon^{-5/3}, \epsilon^{-1} n^{2/3}\})$ component-gradient evaluations. This strictly improves on traditional SGD at every target accuracy (a small numerical comparison is given after this list).
  2. Comparative Performance: SCSG is never worse than the state-of-the-art variance-reduction methods, and it significantly outperforms them when the target accuracy is low, while retaining a better data-dependent constant factor than SGD in that low-accuracy regime.
  3. Theoretical Extension: The analysis is extended to objectives satisfying the Polyak-Lojasiewicz condition, under which SCSG achieves a correspondingly accelerated convergence rate.
  4. Empirical Validation: The empirical results, obtained from experiments on multi-layer neural networks, show that SCSG consistently reduces both training and validation loss more effectively than traditional stochastic gradient methods.
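As a rough numerical illustration of the complexity bound in item 1, the snippet below evaluates the two terms of the min for a hypothetical dataset size and compares them with the standard O(ε^(-2)) complexity of SGD for nonconvex problems; the numbers are illustrative only.

```python
# SCSG bound: min(eps**(-5/3), eps**(-1) * n**(2/3));  SGD bound: eps**(-2).
n = 10**4                                # illustrative number of components
for eps in (1e-2, 1e-3, 1e-4, 1e-5):     # target accuracy E||grad f||^2 <= eps
    scsg = min(eps ** (-5 / 3), eps ** (-1) * n ** (2 / 3))
    sgd = eps ** (-2)
    print(f"eps={eps:.0e}  SCSG ~ {scsg:.1e}  SGD ~ {sgd:.1e}")
```

Setting the two terms equal shows the crossover at roughly ε ≈ 1/n: for looser accuracies the ε^(-5/3) term is active, for tighter ones the ε^(-1) n^(2/3) term takes over, and in both regimes the bound stays below SGD's ε^(-2) whenever ε < 1.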

Theoretical Implications

The results in this paper have foundational implications for nonconvex optimization, closing gaps in the existing literature on convergence guarantees. The analysis shows how variance-reduction mechanisms can effectively suppress gradient noise without full passes over the data, and it paves the way for future research that combines SCSG with other acceleration strategies such as adaptive stepsizes and momentum in deep learning.
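The mechanism is visible in the estimator used in the inner loop of the sketch above (notation ours: $\tilde{x}$ is the snapshot point, $g$ the batch gradient computed there, and $\mathcal{I}$ the current mini-batch):

$$\nu = \nabla f_{\mathcal{I}}(x) - \nabla f_{\mathcal{I}}(\tilde{x}) + g, \qquad \mathbb{E}_{\mathcal{I}}\!\left[\nu\right] = \nabla f(x) - \nabla f(\tilde{x}) + g.$$

Because both stochastic terms are evaluated on the same mini-batch, they nearly cancel whenever $x$ stays close to $\tilde{x}$, so the variance of $\nu$ shrinks even though no full gradient over all $n$ components is ever formed.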

Practical Implications

From a practical standpoint, the demonstrated efficiency of SCSG methods can significantly impact real-world applications where nonconvex optimization plays a critical role, such as in machine learning and statistical modeling. The ability to achieve good performance with modest computational resources may encourage broader adoption in complex model training tasks.

Future Directions

The paper surfaces several avenues for future exploration, particularly the integration of SCSG with momentum and adaptive stepsize algorithms. Furthermore, addressing the challenges of implementing SCSG in large-scale systems with inherent noise and exploring its applications beyond the framework of finite-sum problems represent promising research opportunities.

In conclusion, this paper effectively enhances the toolkit available for tackling nonconvex optimization problems, offering both theoretical rigor and practical advancements. The methodologies and insights presented hold considerable promise for future developments in optimization and algorithm design.