SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives
(1407.0202v3)
Published 1 Jul 2014 in cs.LG, math.OC, and stat.ML
Abstract: In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports non-strongly convex problems directly, and is adaptive to any inherent strong convexity of the problem. We give experimental results showing the effectiveness of our method.
The paper introduces SAGA, an incremental gradient method that reduces variance using averaged gradients and achieves linear convergence in strongly convex scenarios.
It adapts to non-strongly convex composite objectives by integrating a proximal operator for regularization, streamlining empirical risk minimization.
Experimental results on datasets like MNIST and COVTYPE demonstrate SAGA’s efficiency and reliability compared to algorithms such as SAG, SVRG, and SDCA.
The paper "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives" introduces a novel incremental gradient method known as SAGA. This method addresses some of the limitations inherent in existing incremental gradient algorithms such as SAG, SDCA, MISO, and SVRG, particularly in handling non-strongly convex composite objectives.
Overview of SAGA
The SAGA algorithm is an incremental optimization method designed to efficiently handle composite objectives that are common in machine learning tasks. Such objectives typically consist of an empirical risk term and a regularizer. SAGA differentiates itself by supporting non-strongly convex problems directly and adapting to any inherent strong convexity in the problem, making it versatile across various problem settings.
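Concretely, the composite objectives in question take the following standard form (notation chosen here for illustration, not quoted verbatim from the paper):

$$
F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x) + h(x),
$$

where each $f_i$ is convex with a Lipschitz-continuous gradient (the empirical risk over $n$ training examples) and $h$ is a convex but possibly non-smooth regularizer, such as the L1 norm, which SAGA handles through its proximal operator.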
Algorithm Description
SAGA builds upon the theoretical frameworks of its predecessors SAG and SVRG, with notable improvements. The algorithm maintains a table of past per-example gradients together with their running average; at each step it samples one example, computes its fresh gradient, and corrects it using the stored gradient for that example and the table average. The result is an unbiased gradient estimate whose variance shrinks as the iterates converge, which permits a constant step size and ensures linear convergence for a broader class of problems.
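In the paper's notation, with iterate $w^k$, stored points $\phi_i^k$ at which each example's gradient was last evaluated, step size $\gamma$, and regularizer $h$, a SAGA step for a uniformly sampled index $j$ reads:

$$
\phi_j^{k+1} = w^k, \qquad
w^{k+1} = \operatorname{prox}_{\gamma}^{h}\!\left( w^k - \gamma\left[ f_j'(\phi_j^{k+1}) - f_j'(\phi_j^{k}) + \frac{1}{n}\sum_{i=1}^{n} f_i'(\phi_i^{k}) \right] \right),
$$

with all other table entries left unchanged, $\phi_i^{k+1} = \phi_i^{k}$ for $i \neq j$. The bracketed term is an unbiased estimate of the full gradient, since the stored-gradient correction has zero mean over the random choice of $j$.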
Theoretical Convergence
The paper provides rigorous proofs of SAGA's convergence rates, establishing improvements over SAG and SVRG under certain conditions. In the strongly convex case it obtains better constants, while the same algorithm applies unmodified to non-strongly convex problems. This dual capability stems from the algorithm's adaptivity: SAGA automatically benefits from any strong convexity present in the problem, with its convergence guarantee tightening accordingly, without requiring the strong convexity parameter as an input.
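As an illustration, here is a minimal sketch of the SAGA update on a toy least-squares problem with an optional L1 proximal step. The problem setup, function name, and step-size choice are hypothetical stand-ins for exposition, not the paper's code.

```python
import numpy as np

def saga(A, b, gamma, n_iter, lam=0.0, seed=0):
    """Minimal SAGA sketch (illustrative, not the authors' implementation).

    Minimizes (1/2n) * ||Ax - b||^2 + lam * ||x||_1 by keeping a table of
    past per-example gradients and combining each fresh gradient with the
    stored one and the running average (variance-reduced, unbiased estimate).
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    # Stored gradient of f_i(x) = (1/2)(a_i . x - b_i)^2 for each example i.
    grad_table = A * (A @ x - b)[:, None]
    grad_avg = grad_table.mean(axis=0)
    for _ in range(n_iter):
        j = rng.integers(n)
        g_new = A[j] * (A[j] @ x - b[j])
        # SAGA gradient estimate: fresh gradient, minus stored gradient,
        # plus the table average.
        v = g_new - grad_table[j] + grad_avg
        # Refresh the running average and the table entry for example j.
        grad_avg += (g_new - grad_table[j]) / n
        grad_table[j] = g_new
        x = x - gamma * v
        # Proximal step for the L1 regularizer (soft-thresholding);
        # a no-op when lam == 0.
        x = np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)
    return x
```

Because the gradient estimate's variance vanishes at the optimum, a constant step size (e.g., on the order of one over the largest per-example smoothness constant) suffices, unlike plain SGD which needs a decaying schedule.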
Experimental Validation
Extensive experiments demonstrate the effectiveness of SAGA. The algorithm was tested on several datasets like MNIST, COVTYPE, IJCNN1, and MILLIONSONG, for binary classification and least-squares prediction tasks. Both L2 and L1 regularizations were evaluated, showing that SAGA performs comparably to other state-of-the-art methods such as SVRG, Finito, and SDCA. The results indicate that while SAGA is not always the fastest, it offers a balance between requiring fewer gradient evaluations and practical convergence speed, making it a versatile tool in various problem settings.
Implications and Future Work
Practically, SAGA's ability to handle non-strongly convex problems directly, without adding an artificial strong-convexity regularization term, simplifies parameter tuning in empirical risk minimization tasks. Theoretically, SAGA bridges the gap between stochastic variance reduction methods and incremental average gradient techniques, offering a unified view that simplifies the analysis and implementation of these algorithms.
Future directions could involve extending SAGA to handle more complex and larger-scale machine learning problems more efficiently. Specifically, enhancing its performance further through adaptations to specific types of data (e.g., sparse data) or model architectures (e.g., neural networks) might be fruitful. Additionally, exploring hybrid methods that combine features of SAGA with other stochastic optimization techniques could yield new insights and performance gains.
In conclusion, SAGA provides a robust, efficient, and adaptable optimization framework for machine learning tasks, outperforming several existing methods in theoretical convergence rates and practical applicability.