Stochastic subgradient method converges on tame functions (1804.07795v3)

Published 20 Apr 2018 in math.OC and cs.LG

Abstract: This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method, and its proximal extension, with rigorous convergence guarantees for a wide class of problems arising in data science---including all popular deep learning architectures.

Citations (237)

Summary

  • The paper establishes that the stochastic subgradient method converges to first-order stationary points for tame functions under weak regularity conditions.
  • It employs a Lyapunov function derived from the objective to guarantee a systematic decrease along iterations, ensuring convergence.
  • The study leverages Whitney stratifiability to extend convergence results to semialgebraic, nonsmooth, and nonconvex functions common in deep learning.

Convergence of the Stochastic Subgradient Method on Tame Functions

The paper "Stochastic Subgradient Method Converges on Tame Functions" addresses an important question in the field of optimization and learning algorithms: What are the convergence guarantees of the stochastic subgradient method when dealing with nonsmooth and nonconvex functions? The authors tackle this question head-on by establishing that, for a broad class of such functions, the stochastic subgradient method is capable of converging to points that are first-order stationary, which include critical points of the function being optimized.

Context and Methodology

The core problem setup considers minimizing a function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ that is locally Lipschitz continuous. The stochastic subgradient method, a commonly used optimization technique, is an iterative update perturbed by stochastic noise. The updates take the form:

x_{k+1} = x_k - \alpha_k (y_k + \xi_k),

where $y_k$ belongs to the Clarke subdifferential of $f$ at $x_k$, $\alpha_k$ is a diminishing step size, and $\xi_k$ represents the stochastic noise. Notably, no convexity or smoothness assumptions are imposed, which gives the method considerable flexibility.
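
As a concrete illustration (a minimal numpy sketch assuming a hand-picked toy objective and a hand-coded subgradient oracle; none of these choices come from the paper, which treats the general setting), the update above can be run as follows:

import numpy as np

def f(x):
    # Toy objective: nonsmooth, nonconvex, and semialgebraic (illustrative only).
    return abs(x[0]) + min(abs(x[1] - 1.0), abs(x[1] + 1.0))

def clarke_subgradient(x):
    # Return one element of the Clarke subdifferential of f at x.
    g0 = np.sign(x[0])                                  # subgradient of |x_0|
    if abs(x[1] - 1.0) <= abs(x[1] + 1.0):              # active piece of the min
        g1 = np.sign(x[1] - 1.0)
    else:
        g1 = np.sign(x[1] + 1.0)
    return np.array([g0, g1])

rng = np.random.default_rng(0)
x = np.array([2.0, -0.3])
for k in range(1, 20_001):
    alpha_k = 1.0 / k                        # diminishing: sum = inf, sum of squares < inf
    y_k = clarke_subgradient(x)              # subgradient oracle
    xi_k = rng.normal(scale=0.1, size=2)     # zero-mean stochastic noise
    x = x - alpha_k * (y_k + xi_k)

print("final iterate:", x, "f(x) =", f(x))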

Key Results

A significant contribution of this paper is the analysis of convergence behavior under minimal assumptions and on a broad class of functions. The results are general: they apply to semialgebraic functions, which arise in many practical problems in data science, including the losses of most popular deep learning architectures.
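
For reference (a standard definition, not specific to this paper): a set $S \subseteq \mathbb{R}^d$ is semialgebraic if it is a finite union of sets carved out by finitely many polynomial equalities and strict inequalities, and a function is semialgebraic when its graph is a semialgebraic subset of $\mathbb{R}^{d+1}$:

S = \bigcup_{i=1}^{m} \left\{ x \in \mathbb{R}^{d} : p_{i1}(x) = \cdots = p_{ik}(x) = 0, \; q_{i1}(x) > 0, \ldots, q_{il}(x) > 0 \right\}, \qquad p_{ij},\, q_{ij} \ \text{polynomials}.

Tame (definable) functions relax the polynomial requirement to definability in an o-minimal structure, and Whitney stratifiable functions, the setting of the paper's most general result, are broader still.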

  1. Convergence Guarantees: The paper establishes that under certain conditions (particularly, Whitney stratifiability of the graph of the function), every limit point of the iterates produced by the stochastic subgradient method is a critical (first-order stationary) point of $f$.
  2. Lyapunov Function Construction: The authors employ a Lyapunov function—essentially the objective function itself—and demonstrate that it decreases along the trajectories formed by the iterations. This ensures eventual convergence to a point that satisfies the necessary condition for local optimality (critical point).
  3. Characterization of Function Classes: The paper identifies additional structural properties that make functions amenable to the stochastic method. Whitney stratifiability allows the graph of the function to be decomposed into manageable manifold pieces, enabling the use of differential inclusions to describe the dynamics of descent (sketched schematically after this list).
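
In broad strokes (a schematic restatement of the descent argument, not a verbatim statement from the paper), the iterates are interpreted as a noisy discretization of the subgradient differential inclusion, and the objective itself serves as the Lyapunov function that cannot increase along the inclusion's trajectories:

\dot{x}(t) \in -\partial f\bigl(x(t)\bigr) \ \text{for a.e. } t \ge 0, \qquad f\bigl(x(t)\bigr) \le f\bigl(x(0)\bigr) \ \text{for all } t \ge 0,

with strict decrease whenever the trajectory is not already at a critical point; this is what rules out convergence to non-stationary limit points.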

Implications and Prospects

The theoretical implications of this work are broad, offering a framework in which many subgradient-based algorithms can be evaluated for convergence in non-standard settings. Practically, the results justify the use of stochastic subgradient methods for training deep networks with nonsmooth activation functions, such as ReLU. This observation is key given the prominence of such networks in machine learning.
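
As an entirely illustrative instance of that point, the numpy sketch below takes a single stochastic subgradient step on a toy one-hidden-layer ReLU regressor; the model, data, and step size are invented for the example, and the (H > 0) mask is one valid selection from ReLU's subdifferential at its kink:

import numpy as np

rng = np.random.default_rng(1)
W1, w2 = rng.normal(size=(8, 3)), rng.normal(size=8)      # tiny ReLU regressor
X, y = rng.normal(size=(32, 3)), rng.normal(size=32)      # one mini-batch

def loss_and_subgrads(W1, w2, X, y):
    # Forward pass: one hidden ReLU layer, squared loss.
    H = np.maximum(X @ W1.T, 0.0)            # ReLU is nonsmooth at 0
    r = H @ w2 - y
    loss = 0.5 * np.mean(r ** 2)
    # Backward pass: (H > 0) selects one element of ReLU's subdifferential [0, 1] at the kink.
    g_w2 = H.T @ r / len(y)
    g_W1 = (np.outer(r, w2) * (H > 0)).T @ X / len(y)
    return loss, g_W1, g_w2

alpha = 0.05                                  # illustrative step size
loss, g_W1, g_w2 = loss_and_subgrads(W1, w2, X, y)
W1, w2 = W1 - alpha * g_W1, w2 - alpha * g_w2  # one stochastic subgradient step
print(f"loss before step: {loss:.4f}")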

The paper also sets the stage for further research into convergence behavior on even wider classes of functions, possibly by examining other notions of subdifferential or alternative measures of convergence.

Conclusion

This paper successfully extends the understanding of convergence properties for the stochastic subgradient method in settings that deviate from classical convex and smooth scenarios. By providing rigorous guarantees under mild structural assumptions, it establishes that such optimization techniques can be practical and reliable tools even when faced with complex, real-world problems characterized by nonconvexity and nonsmoothness. The results are likely to stimulate continued exploration into the structural complexities of optimization landscapes encountered in data science and machine learning applications.
