
On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport (1805.09545v2)

Published 24 May 2018 in math.OC, cs.NE, and stat.ML

Abstract: Many tasks in machine learning and signal processing can be solved by minimizing a convex function of a measure. This includes sparse spikes deconvolution or training a neural network with a single hidden layer. For these problems, we study a simple minimization method: the unknown measure is discretized into a mixture of particles and a continuous-time gradient descent is performed on their weights and positions. This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-product of optimal transport theory. Numerical experiments show that this asymptotic behavior is already at play for a reasonable number of particles, even in high dimension.

Citations (678)

Summary

  • The paper presents a particle-based gradient descent approach that leverages optimal transport to achieve global convergence in over-parameterized models.
  • It shows that, with suitable initialization and sufficiently many particles, the non-convex particle dynamics still converge to global minimizers of the underlying convex objective.
  • Empirical results on sparse deconvolution and neural network training indicate that this asymptotic behavior is already visible with a moderate number of particles.

Overview of the Paper on Global Convergence of Gradient Descent for Over-parameterized Models via Optimal Transport

The paper "On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport," authored by Lénaïc Chizat and Francis Bach, explores the convergence properties of gradient descent in the context of machine learning models that are over-parameterized. This work provides an in-depth theoretical analysis by leveraging the theory of optimal transport and focuses on models where the parameterization involves a large number of components, such as neural networks with extensive hidden layers.

Problem Definition and Background

The paper addresses the problem of minimizing a convex function defined over an infinite-dimensional space of measures. This framework is pertinent to several machine learning tasks, including sparse spikes deconvolution and neural network training. The core objective is to solve

$$J^* = \min_{\mu \in \mathcal{M}(\Theta)} J(\mu),$$

where the functional $J(\mu)$ combines a loss term $R$ and a regularizer $G$.
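A typical instance of this structure, written here as a hedged sketch (the map $\Phi$ and the exact composition are assumptions for illustration, not notation quoted from the paper), applies a smooth convex loss to a linear measurement of the measure:

$$J(\mu) = R\!\left(\int_{\Theta} \Phi(\theta)\,\mathrm{d}\mu(\theta)\right) + G(\mu),$$

where, for a single-hidden-layer network, $\Phi(\theta)$ would be the output function of one hidden unit with parameters $\theta$, and $G$ could be, for example, a total-variation (sparsity) penalty.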

The approach taken by the authors discretizes the unknown measure into a mixture of particles and performs gradient descent on their weights and positions. Although the resulting finite-particle objective is non-convex, the paper shows that a suitable initialization together with a sufficiently large number of particles leads to global convergence towards minimizers, as in the sketch below.
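A minimal numerical sketch of this particle parameterization, assuming a single-hidden-layer ReLU network with squared loss and an $\ell_1$ penalty on the weights (these concrete choices, and the use of discrete-time SGD in place of the paper's continuous-time gradient flow, are illustrative assumptions):

```python
import torch

# Particle discretization of the measure: m particles, each with a weight w_i
# and a position theta_i in R^d. Gradient descent acts jointly on (w, theta).
# ReLU features, squared loss, l1 penalty, and plain SGD are illustrative
# choices; the paper analyzes the continuous-time, many-particle limit.

torch.manual_seed(0)
m, d, n = 200, 10, 256                           # particles, input dim, samples
X = torch.randn(n, d)
y = torch.sin(X[:, 0])                           # toy regression target

w = (torch.randn(m) / m).requires_grad_()        # particle weights
theta = torch.randn(m, d, requires_grad=True)    # particle positions

opt = torch.optim.SGD([w, theta], lr=0.05)
lam = 1e-4
for step in range(2000):
    opt.zero_grad()
    feats = torch.relu(X @ theta.T)              # Phi(theta_i)(x_j), shape (n, m)
    pred = feats @ w                             # prediction of the particle mixture
    loss = ((pred - y) ** 2).mean() + lam * w.abs().sum()
    loss.backward()
    opt.step()

print(f"final training loss: {loss.item():.4f}")
```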

Theoretical Contributions

  1. Particle Gradient Descent: The method translates the minimization problem into a particle-based gradient descent framework. It relies on continuous-time gradient flows and evaluates the convergence behavior as the number of particles $m$ approaches infinity.
  2. Wasserstein Gradient Flows: Central to the paper's analysis is the characterization of many-particle limits using Wasserstein gradient flows, derived from optimal transport theory. This perspective allows the examination of convergence properties in a more abstract space of probability measures.
  3. Global Convergence Criteria: The authors present conditions under which the non-convex particle gradient descent can reliably reach global minima. They identify critical homogeneity properties of specific functions and requisite initialization patterns in the limit of an infinite number of particles.
  4. Sard Type Regularity and Homogeneity: Two cases are extensively analyzed (a brief numerical illustration of the homogeneity distinction follows this list):
    • 2-Homogeneous Functions: Relevant for neural networks with ReLU activation functions.
    • 1-Homogeneous Functions: Applicable to bounded functions, such as those with sigmoid activations.
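As a quick, hypothetical check of the homogeneity property separating these two cases (the unit definitions below are illustrative, not code from the paper): a ReLU unit is positively 2-homogeneous in its parameters, whereas a bounded sigmoid unit scales only through its outer weight.

```python
import torch

# ReLU unit f(w, b; x) = w * relu(b . x): scaling both parameters by s > 0
# scales the output by s**2, i.e. f is positively 2-homogeneous in (w, b).
# Sigmoid unit g(w, b; x) = w * sigmoid(b . x): sigmoid is bounded, so only
# the outer weight w contributes a 1-homogeneous scaling.
def relu_unit(w, b, x):
    return w * torch.relu(b @ x)

def sigmoid_unit(w, b, x):
    return w * torch.sigmoid(b @ x)

w, b, x, s = torch.randn(()), torch.randn(5), torch.randn(5), 3.0
print(torch.allclose(relu_unit(s * w, s * b, x), s**2 * relu_unit(w, b, x)))  # True
print(torch.allclose(sigmoid_unit(s * w, b, x), s * sigmoid_unit(w, b, x)))   # True
```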

Empirical Insights and Implications

The paper supports its theoretical findings with empirical evidence. Numerical experiments indicate that the predicted asymptotic behavior is already observable with a moderate, finite number of particles, in both sparse deconvolution and neural network training scenarios; a toy version of the deconvolution setup is sketched below.
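A toy, hypothetical version of the sparse spikes deconvolution setting (the Gaussian kernel, grid, and penalty below are illustrative assumptions): each particle carries a spike amplitude and a location, and the particles start spread over the domain with zero weights, in the spirit of the initialization requirement discussed above.

```python
import torch

# Toy sparse deconvolution with particles: observe a signal built from a few
# Gaussian-blurred spikes and fit it with m weighted particle locations.
torch.manual_seed(1)
t = torch.linspace(0.0, 1.0, 200)                       # observation grid
true_loc = torch.tensor([0.2, 0.5, 0.8])
true_amp = torch.tensor([1.0, -0.7, 0.5])

def kernel(loc):
    # Gaussian kernel translates evaluated on the grid t, shape (len(loc), len(t))
    return torch.exp(-((t[None, :] - loc[:, None]) ** 2) / (2 * 0.03 ** 2))

y = true_amp @ kernel(true_loc)                         # observed (noiseless) signal

m = 50
w = torch.zeros(m, requires_grad=True)                  # spike amplitudes, start at 0
theta = torch.linspace(0.0, 1.0, m).requires_grad_()    # spike locations, spread out

opt = torch.optim.SGD([w, theta], lr=0.05)
lam = 1e-3
for step in range(3000):
    opt.zero_grad()
    pred = w @ kernel(theta)                            # particle approximation of y
    loss = ((pred - y) ** 2).mean() + lam * w.abs().sum()
    loss.backward()
    opt.step()
```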

Implications:

  • The analysis offers a principled explanation for why training large, suitably initialized single-hidden-layer networks, despite the non-convexity of the optimization problem, can reach global optima in the many-particle limit.
  • The analysis underscores the potential of particle methods in high-dimensional optimization and connects naturally to mean-field descriptions of wide networks.

Future Directions:

  • Extending the framework to multilayer networks and deeper architectures.
  • Quantitative convergence results could further elucidate the particle complexity required for practical applications.
  • Exploration of stochastic variations of the proposed methodology, such as stochastic gradient descent.

Conclusion

This paper contributes significantly to the understanding of gradient descent dynamics in over-parameterized systems and highlights the interplay between optimal transport and non-convex optimization. It provides a rigorous theoretical foundation for the practice of using large, well-initialized parameterizations in machine learning, particularly for neural networks.