- The paper presents a mini-batch semi-stochastic gradient descent algorithm that improves convergence rates by reducing gradient variance through batch processing.
- It combines full gradient evaluations with stochastic updates, allowing efficient parallelization for large-scale convex optimization problems.
- Experimental results demonstrate that mS2GD reduces overall gradient evaluations and achieves faster linear convergence compared to traditional methods.
Overview of "Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting"
The paper introduces mini-batch semi-stochastic gradient descent (mS2GD), a method for minimizing the sum of a strongly convex smooth function, expressed as an average of many smooth convex losses, and a nonsmooth convex regularizer. The approach extends the semi-stochastic gradient descent methodology (S2GD) with a mini-batching strategy, improving both the convergence rate and computational efficiency. The key idea is to compute gradient estimates from small batches of samples, which also opens the door to parallel implementation.
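Concretely, the problem takes the composite form sketched below (the notation is typical of this line of work; the symbols are ours, not necessarily the paper's exact ones):

```latex
\min_{x \in \mathbb{R}^d} \; P(x) := f(x) + R(x),
\qquad f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),
```

where each $f_i$ is smooth and convex, $f$ is strongly convex, and $R$ is convex but possibly nonsmooth (for example, an $\ell_1$ penalty handled through its proximal operator).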
Background and Motivation
In optimization problems, especially those prevalent in machine learning and signal processing, it is common to encounter objectives that are the sum of many smooth convex components paired with a nonsmooth convex penalty. Traditional deterministic methods like proximal gradient descent become expensive when the dataset is large, since every iteration requires a full pass over the data, motivating the use of stochastic variants.
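For concreteness, a single deterministic proximal gradient step for an ℓ1-regularized problem might look like the following sketch (the function names are ours, not the paper's); note that the full gradient makes each iteration's cost scale with the number of data points:

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proximal_gradient_step(x, grad_f, step, lam):
    """One deterministic proximal gradient step for f(x) + lam * ||x||_1.

    grad_f is the full gradient of the smooth part at x; for n data
    points this costs n component-gradient evaluations, which is what
    makes the deterministic method expensive at scale.
    """
    return soft_threshold(x - step * grad_f, step * lam)

# Toy example: minimize 0.5 * ||x - b||^2 + lam * ||x||_1,
# whose exact solution is soft_threshold(b, lam).
b = np.array([3.0, -0.5, 1.2])
lam = 1.0
x = np.zeros_like(b)
for _ in range(200):
    x = proximal_gradient_step(x, x - b, step=0.5, lam=lam)
```

Here the gradient of the toy smooth part is simply `x - b`, so the iteration converges to the soft-thresholded vector.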
Stochastic gradient descent (SGD) offers an iteration cost independent of the dataset size, but suffers from non-vanishing gradient variance, which forces decreasing step sizes and hinders reaching high accuracy. More recent methods such as SAG, SDCA, and S2GD reduce this variance, achieving linear convergence with a constant step size, but they were originally analyzed without mini-batching.
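The variance-reduction idea these methods share, and which mS2GD inherits, can be sketched as follows: with $x$ a reference point whose full gradient $\nabla f(x)$ has already been computed, the gradient estimate at the current inner iterate $y$ uses a randomly sampled component $i$:

```latex
g = \nabla f_i(y) - \nabla f_i(x) + \nabla f(x).
```

This estimate is unbiased, and its variance shrinks as both $y$ and $x$ approach the minimizer, which is why a constant step size suffices.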
The Mini-Batch S2GD Algorithm
The mS2GD algorithm extends S2GD by using mini-batches in the stochastic updates. The algorithm operates in epochs: each epoch begins with a full gradient computation at a reference point, followed by multiple stochastic steps. Each stochastic step uses a mini-batch of component functions to form a variance-reduced gradient estimate. This batching of data points allows for two significant efficiency gains:
- Batching Efficiency: Aggregating gradients over mini-batches reduces the gradient variance, leading to fewer iterations to achieve a given accuracy level.
- Parallelization: The mini-batch approach lends itself to parallelization, potentially decreasing wall-time significantly compared to pure stochastic updates.
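The epoch structure above can be rendered as a minimal sketch (our own simplified version, not the paper's pseudocode: the inner-loop length is fixed rather than randomized, and the helper names are ours):

```python
import numpy as np

def ms2gd(grad_i, prox, n, x0, step, epochs, inner_iters, batch_size, rng=None):
    """Minimal sketch of mini-batch semi-stochastic gradient descent.

    grad_i(x, idx) returns the average gradient of the components
    indexed by idx; prox(x, t) is the proximal operator of the
    regularizer with scaling t.
    """
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for _ in range(epochs):
        mu = grad_i(x, np.arange(n))   # full gradient at the reference point
        y = x.copy()
        for _ in range(inner_iters):
            idx = rng.choice(n, size=batch_size, replace=False)  # mini-batch
            # variance-reduced mini-batch gradient estimate
            g = grad_i(y, idx) - grad_i(x, idx) + mu
            y = prox(y - step * g, step)  # proximal step
        x = y                             # new reference point
    return x

# Toy ridge-regularized least squares: f_i(x) = 0.5 * (a_i @ x - b_i)^2,
# with R(x) = (0.1 / 2) * ||x||^2 handled via its proximal operator.
rng = np.random.default_rng(42)
n, d = 200, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
grad_i = lambda x, idx: A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)
prox = lambda x, t: x / (1.0 + t * 0.1)
x = ms2gd(grad_i, prox, n, np.zeros(d), step=0.05, epochs=30,
          inner_iters=100, batch_size=10, rng=rng)
```

The inner-loop mini-batch gradients are independent of one another, which is exactly where a parallel implementation would distribute work.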
Theoretical Insights
Theoretical analysis shows that mS2GD converges linearly in expectation under strong convexity and smoothness assumptions, and that an appropriately chosen mini-batch size speeds up the method relative to its non-mini-batched counterpart. Because the mini-batch gradients can be computed independently, the method also maps naturally onto the parallel and distributed computing paradigms integral to handling large-scale data.
Notably, the authors show that for mini-batch sizes up to a specific threshold, the total computational workload, measured in component gradient evaluations, remains lower than that of the non-mini-batched approach. This threshold depends on the size of the dataset and the conditioning of the objective, in particular its strong convexity parameter.
Numerical Experimentation
The authors conduct experiments comparing mS2GD's performance against several contemporary optimization algorithms, including classic SGD variants, the deterministic accelerated proximal method FISTA, and the stochastic variance-reduced method Prox-SVRG. The experiments, run on standard machine learning benchmark datasets, consistently show mS2GD achieving faster convergence and higher efficiency when leveraging mini-batches and parallelism.
Practical and Theoretical Implications
The mS2GD algorithm provides a robust framework for efficiently solving large-scale convex optimization problems. In practice, the method's adaptability to parallel computation aligns with the trend towards distributed machine learning models, where managing computation efficiently across multiple processors enhances scalability. Theoretically, mS2GD contributes to the growing body of work on bridging the gap between stochastic approximation and deterministic gradient methods, providing a viable method that retains the benefits of both approaches.
Future Directions
Future research could explore adaptive strategies within the mini-batch framework to dynamically adjust batch sizes or step sizes based on convergence diagnostics. Further investigation into the applications of mS2GD across more diverse problem domains, such as non-convex optimization problems, would significantly expand the method's utility.
The paper successfully outlines significant advancements in efficiently tackling optimization problems by combining stochastic gradient insights with batch processing, paving the way for further innovations in optimization methodologies.