General convergence of noiseless distributional dynamics

Establish a general convergence result for the continuity-equation distributional dynamics ∂_t ρ_t(θ) = 2 ξ(t) ∇_θ · (ρ_t(θ) ∇_θ Ψ(θ; ρ_t)) that arises as the mean-field limit of noiseless stochastic gradient descent on two-layer neural networks: show that ρ_t converges to a fixed point (e.g., a global minimizer of the risk) under broad assumptions on the activation σ_*, the data distribution P(X, Y), the step-size schedule ξ(t), and the initialization ρ_0.
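
For concreteness, in the standard mean-field parametrization with square loss (a sketch of the usual definitions, which are not spelled out in the statement above), the potential Ψ takes the form

Ψ(θ; ρ) = V(θ) + ∫ U(θ, θ′) ρ(dθ′),  with  V(θ) = −E[Y σ_*(X; θ)]  and  U(θ₁, θ₂) = E[σ_*(X; θ₁) σ_*(X; θ₂)],

so that Ψ(θ; ρ) is one half of the first variation of the population risk R(ρ) = E[(Y − ∫ σ_*(X; θ) ρ(dθ))²], and the continuity equation above is the Wasserstein gradient flow of R run at a speed rescaled by ξ(t).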

Background

The authors prove global convergence for noisy SGD by analyzing the associated diffusion PDE (distributional dynamics with a Laplacian term), which is a gradient flow of a strongly convex free energy and admits a unique fixed point. In contrast, the noiseless case corresponds to a continuity equation without diffusion, for which only stability results for certain fixed points are provided.
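
Written out schematically (up to the paper's exact constants and the precise form of the regularization), the noisy dynamics read

∂_t ρ_t(θ) = 2 ξ(t) ∇_θ · (ρ_t(θ) ∇_θ Ψ_λ(θ; ρ_t)) + 2 ξ(t) β⁻¹ Δ_θ ρ_t(θ),

where Ψ_λ augments Ψ with a ridge penalty in θ and β > 0 is an inverse temperature. This PDE is the (time-rescaled) Wasserstein gradient flow of a free energy of the form F(ρ) = (1/2) R_λ(ρ) + β⁻¹ ∫ ρ(θ) log ρ(θ) dθ, and it is the entropy term that convexifies F and singles out a unique fixed point. Sending β → ∞ drops the Laplacian, and with it exactly this convexifying term, which is why the noiseless continuity equation falls outside the scope of that argument.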

A general theorem guaranteeing convergence of the noiseless distributional dynamics (and hence of noiseless SGD in the mean-field limit) remains unavailable, motivating the question of identifying conditions under which these dynamics converge to a fixed point.

References

In the next sections we state our results about convergence of the distributional dynamics to its fixed point. In the case of noisy SGD (and for the corresponding diffusion PDE), a general convergence result can be established (although at the cost of an additional regularization). For noiseless SGD (and the corresponding continuity equation), we do not have such a general result.

A Mean Field View of the Landscape of Two-Layers Neural Networks (1804.06561 - Mei et al., 2018) in Subsection "Convergence: noiseless SGD"