The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning (1712.06559v3)

Published 18 Dec 2017 in cs.LG and stat.ML

Abstract: In this paper we aim to formally explain the phenomenon of fast convergence of SGD observed in modern machine learning. The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss (classification and regression) close to zero. While it is still unclear why these interpolated solutions perform well on test data, we show that these regimes allow for fast convergence of SGD, comparable in number of iterations to full gradient descent. For convex loss functions we obtain an exponential convergence bound for {\it mini-batch} SGD parallel to that for full gradient descent. We show that there is a critical batch size $m^*$ such that: (a) SGD iteration with mini-batch size $m\leq m^*$ is nearly equivalent to $m$ iterations of mini-batch size $1$ (\emph{linear scaling regime}). (b) SGD iteration with mini-batch $m> m^*$ is nearly equivalent to a full gradient descent iteration (\emph{saturation regime}). Moreover, for the quadratic loss, we derive explicit expressions for the optimal mini-batch and step size and explicitly characterize the two regimes above. The critical mini-batch size can be viewed as the limit for effective mini-batch parallelization. It is also nearly independent of the data size, implying $O(n)$ acceleration over GD per unit of computation. We give experimental evidence on real data which closely follows our theoretical analyses. Finally, we show how our results fit in the recent developments in training deep neural networks and discuss connections to adaptive rates for SGD and variance reduction.

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

The paper "The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning" by Siyuan Ma, Raef Bassily, and Mikhail Belkin offers a detailed investigation into the convergence properties of Stochastic Gradient Descent (SGD) in the context of modern over-parametrized learning architectures. The focus is on understanding why SGD, particularly with small mini-batch sizes, shows remarkable performance efficiency despite theoretical analyses often predicting slower convergence compared to full Gradient Descent (GD).

The authors present a formal explanation for this rapid convergence by leveraging the concept of interpolation: over-parametrized models are trained to drive the empirical loss close to zero. This observation aligns with the common practice of achieving a nearly perfect fit on the training data, a ubiquitous phenomenon in deep learning and other modern machine learning methods.
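Concretely, writing $f_w$ for the model with parameters $w$ and $\ell$ for the per-sample loss (notation introduced here for illustration rather than taken from the paper), interpolation means that the trained parameters $\hat{w}$ drive the empirical loss essentially to zero:

$$\mathcal{L}(\hat{w}) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_{\hat{w}}(x_i),\, y_i\big) \;\approx\; 0, \qquad \text{equivalently}\quad f_{\hat{w}}(x_i) \approx y_i \ \text{for every training pair } (x_i, y_i).$$

It is this regime, rather than any particular architecture, that drives the fast-convergence results summarized below.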

Main Contributions

  1. Exponential Convergence of SGD: The authors demonstrate that in the interpolated regime, SGD converges exponentially for convex loss functions. This is a significant theoretical insight that aligns with the empirical success of SGD. The paper connects this exponential convergence with the Kaczmarz method and other prior works on quadratic functions, although the novelty lies in relating these to modern over-parametrized machine learning architectures.
  2. Critical Mini-batch Size: A critical mini-batch size $m^*$ is identified, which demarcates two regimes of SGD behavior (a small simulation illustrating them follows this list):
     - Linear scaling: for mini-batch sizes $m \leq m^*$, one SGD iteration is nearly equivalent to $m$ iterations with mini-batch size one.
     - Saturation: for $m > m^*$, one SGD iteration approximates a full gradient descent iteration.

The authors derive that the critical mini-batch size $m^*$ is nearly independent of the data size, implying an $O(n)$ computational acceleration over GD per unit of computation.

  3. Quadratic Loss Analysis: The paper provides a sharp characterization of the convergence regimes for quadratic losses, deriving explicit expressions for optimal mini-batch and step sizes. This allows for precise determination of the linear scaling and saturation regimes.
  4. Experimental Verification: The work is supported by empirical evidence across various datasets, confirming the theoretical findings about mini-batch SGD's efficiency and adaptability in over-parametrized settings.
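The two regimes lend themselves to a quick empirical probe. The sketch below is a minimal illustration on assumed synthetic data, not the paper's experiments, and it tunes a constant step size by crude grid search rather than using the optimal step size the paper derives in closed form for the quadratic loss: it runs mini-batch SGD on a noiseless, over-parametrized least-squares problem (so an interpolating solution exists) and gives every batch size the same budget of per-sample gradient evaluations. The qualitative pattern predicted by the analysis is a roughly flat loss at fixed compute for small batch sizes (linear scaling) that degrades once the batch size exceeds a data-dependent critical value (saturation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic over-parametrized, noiseless regression problem: d > n and the
# targets are exactly realizable, so an interpolating solution with zero
# training loss exists. The feature covariance has a few dominant directions
# and a low-variance bulk (an illustrative choice of spectrum).
n, d = 200, 400
spectrum = np.concatenate([np.ones(5), 0.05 * np.ones(d - 5)])
X = rng.standard_normal((n, d)) * np.sqrt(spectrum)
y = X @ rng.standard_normal(d)          # noiseless targets => interpolation is achievable

def train_loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def run_sgd(m, lr, budget):
    """Mini-batch SGD with batch size m and constant step size lr, stopped after
    `budget` per-sample gradient evaluations so every batch size is compared at
    the same computational cost."""
    w = np.zeros(d)
    with np.errstate(over="ignore", invalid="ignore"):   # divergent step sizes just return inf
        for _ in range(budget // m):
            idx = rng.integers(0, n, size=m)             # sample a mini-batch with replacement
            Xb, yb = X[idx], y[idx]
            w -= lr * Xb.T @ (Xb @ w - yb) / m           # stochastic gradient step on the quadratic loss
            if not np.isfinite(w).all():
                return np.inf
    return train_loss(w)

budget = 20_000                                          # total per-sample gradient evaluations
for m in [1, 2, 4, 8, 16, 32, 64, 128, n]:
    # Crude grid search over constant step sizes -- a stand-in for the
    # closed-form optimal step size derived in the paper for quadratic losses.
    losses = [run_sgd(m, lr, budget) for lr in np.geomspace(0.01, 2.0, 8)]
    print(f"batch size {m:3d}: best training loss at fixed compute = {min(losses):.2e}")
```

Comparing at equal compute, rather than equal iteration count, is what makes the saturation regime visible: beyond the critical batch size, additional per-iteration parallelism no longer buys faster convergence per gradient evaluation.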

Implications and Future Work

The findings have both practical and theoretical implications. Practically, they underscore the effectiveness of using small mini-batch sizes in SGD to achieve significant computational savings without compromising convergence rates. Theoretically, this work opens avenues for further exploration into why interpolated solutions generalize well to unseen data, a question that remains only partially answered.

Future research could further investigate the generalization abilities of interpolated models, particularly in non-convex settings or with different types of loss functions. Moreover, extending these insights to develop adaptive learning strategies that dynamically adjust mini-batch sizes based on convergence characteristics could provide substantial improvements in training efficiency for large-scale machine learning models.

In summary, the paper provides a robust theoretical foundation for the efficiency of SGD in over-parametrized models, offering insights that are validated through rigorous analysis and empirical evidence. These findings contribute significantly to the understanding and refinement of optimization strategies in modern machine learning frameworks.

Authors (3)
  1. Siyuan Ma (39 papers)
  2. Raef Bassily (32 papers)
  3. Mikhail Belkin (76 papers)
Citations (275)