On convergence rates of subgradient descent on semialgebraic functions

Published 18 Apr 2026 in math.OC | (2604.17060v1)

Abstract: We analyze the constant step size subgradient method on nonsmooth, nonconvex functions. We identify geometric assumptions on the objective function under which i) its domain admits a partition (stratification) into smooth manifolds (strata) on which the function is smooth; ii) a global projection formula for Clarke subgradients holds; and iii) quantitative curvature bounds hold on each stratum. Under these conditions, we prove that the iterates of the subgradient method locally shadow a Riemannian gradient descent on nearby strata, which we use to measure stationarity. We introduce a selection rule for the active stratum and develop a mechanism that assembles local descent inequalities across successive strata into explicit convergence rates. These rates are expressed in terms of the number of dimensions present in the stratification, improve as the number of strata decreases, and recover, up to constants, the classical rates in the smooth case. We show that the stated assumptions follow from the existence of Lipschitz stratifications of semialgebraic sets, and are therefore automatically satisfied for semialgebraic functions and, more generally, for functions definable in polynomially bounded o-minimal structures, yielding the first known convergence rates in these settings. As intermediate results of independent interest, we establish tubular neighborhood estimates for Lipschitz stratifications and a global projection formula for Clarke subgradients. Finally, we show that our framework extends to decreasing step size and recovers, via an alternative argument, the recently announced result of Lai and Song on sequential convergence of the subgradient method with step sizes 1/k.

Authors (2)

Summary

The paper introduces explicit finite-time rates by linking subgradient descent iterates to Riemannian gradient flows on stratified smooth manifolds.
It leverages geometric stratifications with global projection formulas and Lipschitz bounds to rigorously control subgradient errors.
Dynamic stratum selection and quantified error bounds provide nonasymptotic stationarity certificates for nonsmooth optimization tasks.

Convergence Rates of Subgradient Descent for Semialgebraic Functions

Introduction and Problem Context

The paper develops a finite-time complexity analysis for subgradient descent (SubGD) when applied to broad classes of nonsmooth, nonconvex semialgebraic functions. This extends the theory significantly beyond previous results, which were typically asymptotic or limited to smooth/convex regimes. The analysis leverages the geometric structure of semialgebraic functions, specifically the existence of stratifications of the domain into smooth manifolds ("strata"), and develops new tools for tracking algorithmic progress across such stratifications.

Geometric Structure and Stratification

A central insight is that semialgebraic functions admit a partition of their domain (a stratification) into finitely many smooth manifolds, on each of which the function is smooth. More importantly, the Clarke subdifferential at a point on a stratum can be projected onto the stratum's tangent space to recover the Riemannian gradient of the function restricted to that stratum. The paper formalizes this via a global projection formula for Clarke subgradients, and provides quantitative, uniform Lipschitz bounds on the variation of tangent planes and the Riemannian gradients across each stratum.
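As a minimal illustration of the projection formula, consider a toy example chosen here (it is not taken from the paper): $f(x, y) = |x| + y^2$ with the stratum $\{x = 0\}$. At a point $(0, y)$ the Clarke subdifferential is $[-1, 1] \times \{2y\}$, and the tangent space of the stratum is spanned by $e_2$. The sketch below samples that set and checks that every sampled subgradient projects to the gradient of the restriction $y \mapsto y^2$. All function names are hypothetical.

```python
import numpy as np

def clarke_subgradients_on_axis(y, num=5):
    """Sample Clarke subgradients of f(x, y) = |x| + y**2 at (0, y): the set [-1, 1] x {2y}."""
    return [np.array([t, 2.0 * y]) for t in np.linspace(-1.0, 1.0, num)]

# Orthogonal projector onto the tangent space of the stratum {x = 0}, spanned by e_2.
P_TANGENT = np.array([[0.0, 0.0],
                      [0.0, 1.0]])

def riemannian_gradient_on_axis(y):
    """Gradient of the restriction y -> f(0, y) = y**2, embedded in R^2."""
    return np.array([0.0, 2.0 * y])

y0 = 0.7
projected = [P_TANGENT @ v for v in clarke_subgradients_on_axis(y0)]

# Every sampled subgradient has the same tangential component.
all_match = all(np.allclose(p, riemannian_gradient_on_axis(y0)) for p in projected)
print(all_match)
```

In this toy case the projection is exact and independent of which subgradient is picked; the paper's formula is stated globally and, per the summary, comes with an error bound away from the stratum.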

The implications are twofold: (i) subgradient descent iterates, although evolving in a nonsmooth, nonconvex landscape, locally shadow Riemannian gradient flows on nearby strata; and (ii) finite-time error bounds can be quantified in terms of the geometry of the stratification and the stationarity property along these smooth manifolds.
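The shadowing behaviour in (i) can be observed on a deliberately simple example, assumed here for illustration only: $f(x, y) = |x| + \tfrac{1}{2} y^2$ with constant step size $\gamma$. The $x$-coordinate reaches the stratum $\{x = 0\}$ and then oscillates within distance $\gamma$ of it, while the $y$-coordinate follows gradient descent on the restriction of $f$ to that stratum. Because this example is separable the tangential match is exact; in general one should expect agreement only up to an error term.

```python
import numpy as np

def subgradient(p):
    """One Clarke subgradient of f(x, y) = |x| + 0.5 * y**2 (0 is selected at x = 0)."""
    x, y = p
    return np.array([np.sign(x), y])

def run_subgd(p0, gamma, num_steps):
    """Constant-step subgradient method; returns all iterates as an array."""
    iterates = [np.array(p0, dtype=float)]
    for _ in range(num_steps):
        iterates.append(iterates[-1] - gamma * subgradient(iterates[-1]))
    return np.array(iterates)

def run_stratum_gd(y0, gamma, num_steps):
    """Gradient descent on the restriction of f to the stratum {x = 0}."""
    ys = [float(y0)]
    for _ in range(num_steps):
        ys.append(ys[-1] - gamma * ys[-1])
    return np.array(ys)

gamma, num_steps = 0.01, 500
traj = run_subgd([0.305, 2.0], gamma, num_steps)
shadow = run_stratum_gd(2.0, gamma, num_steps)

dist_to_stratum = np.abs(traj[:, 0])          # distance of each iterate to {x = 0}
tangential_gap = np.abs(traj[:, 1] - shadow)  # gap to gradient descent on the stratum
print(dist_to_stratum[100:].max() <= gamma, tangential_gap.max())
```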

Figure 1: Visualization of piecewise smooth structure enabled by stratification; subgradient descent trajectory transitions between strata.

Technical Contributions: Error Bounds and Algorithmic Construction

Assumptions and Main Theorems

The main theorems operate under piecewise smoothness and quantitative geometric regularity of the stratification, conditions which are shown to hold automatically for any semialgebraic function (and more generally for any function definable in a polynomially bounded o-minimal structure) on a compact domain. Explicitly:

Each stratum is a $C^p$ submanifold and $f$ restricted to it is $C^p$ .
The inner and Euclidean metrics on the stratum are equivalent.
There is a Lipschitz bound on the tangent space variation and a global error bound for projecting Clarke subgradients onto stratum tangent spaces.

These geometric properties enable the establishment of explicit finite-time convergence rates. More precisely: for constant step size SubGD, for any sufficiently small step size $\gamma$ and any number of steps $K$ , there exists a sequence of active strata $(\Psi_k)_{k=1}^K$ such that the iterates $x_k$ remain within $O(\gamma^{\alpha+\operatorname{rank}(\Psi_k)\beta})$ of the active stratum at each iteration, and:

$\frac{1}{K} \sum_{k=1}^K \|\nabla g_{\Psi_k}(x_k)\|^2 \leq C \left( \frac{f(x_1) - f(x_{K+1})}{\gamma K} + \gamma^{\beta-\alpha} + \gamma^{2\alpha} \right)$

where $g_{\Psi_k}=f\circ \pi_{\Psi_k}$ and the rates depend on the hierarchical structure of the stratification via the so-called "rank" (the number of distinct stratum dimensions).
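To see how the three terms of the bound trade off, the sketch below evaluates the right-hand side and picks a step size that balances the first term against the slower of the two power terms. It assumes $0 < \alpha < \beta$, that $f(x_1) - f(x_{K+1})$ is bounded by a constant `gap`, and that the resulting $\gamma$ is small enough for the theorem to apply. With $s = \min(\beta - \alpha, 2\alpha)$, the choice $\gamma = K^{-1/(1+s)}$ makes the bound of order $K^{-s/(1+s)}$. The numerical values of $\alpha$ and $\beta$ are hypothetical and are not the paper's exponents.

```python
def rate_bound(gap, gamma, K, alpha, beta, C=1.0):
    """Right-hand side C * (gap / (gamma * K) + gamma**(beta - alpha) + gamma**(2 * alpha))."""
    return C * (gap / (gamma * K) + gamma ** (beta - alpha) + gamma ** (2 * alpha))

def balanced_step(K, alpha, beta):
    """Step size gamma = K**(-1 / (1 + s)) with s = min(beta - alpha, 2 * alpha)."""
    s = min(beta - alpha, 2 * alpha)
    return K ** (-1.0 / (1.0 + s))

# Hypothetical exponents, chosen only so that the arithmetic is easy to follow.
alpha, beta, gap = 0.5, 1.5, 1.0
for K in (10**2, 10**4, 10**6):
    gamma = balanced_step(K, alpha, beta)
    print(K, gamma, rate_bound(gap, gamma, K, alpha, beta))
```

With these values $s = 1$, so $\gamma = K^{-1/2}$ and each of the three terms equals $K^{-1/2}$.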

Notion of Stationarity

The convergence rate is measured in terms of the norm of the Riemannian gradient of $f$ restricted to the current active stratum, which, via the global geometric error control, yields a non-asymptotic stationarity certificate. Importantly, this avoids relying on surrogates such as the Moreau envelope or Goldstein stationarity, which are commonly used in prior work and come with weaker convergence guarantees.

Figure 2: Subgradient descent trajectory with transitions between two strata; trajectory shadows Riemannian gradient flow along each stratum.

Combinatorial Construction: Strata Selection Mechanism

An essential component is the dynamic mechanism for "active stratum selection." The algorithm constructs, at each iteration, a stratum $\Psi_k$ such that the iterate $x_k$ is sufficiently close to $\Psi_k$ (but separated from lower-dimensional strata), and only switches strata if the iterate traverses the "buffer zone", the region outside of the thin tubular neighborhood of the current stratum.

The paper gives a constructive, combinatorial algorithm for this selection process (see Algorithm 1 in the paper), showing that the number of switches can be bounded and thus the potential switching-induced error in the finite-time bound remains controllable.
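The buffer-zone idea can be sketched as a hysteresis rule. The code below is not the paper's Algorithm 1; it is a simplified illustration on the real line with two strata, the point $\{0\}$ and its complement, and with hypothetical radii `r_inner` and `r_outer` standing in for the thin and thick neighborhoods. The active stratum becomes the point once the iterate enters the inner neighborhood, and reverts only after the iterate leaves the outer one, so jitter inside the buffer zone causes no switches.

```python
def select_active_strata(xs, r_inner, r_outer):
    """Hysteresis rule with strata {0} ("point") and R minus {0} ("open").

    Switch to "point" once |x| <= r_inner; switch back to "open" only after
    |x| > r_outer, i.e. after the iterate has crossed the buffer zone.
    """
    assert 0 < r_inner < r_outer
    active, labels, switches = "open", [], 0
    for x in xs:
        previous = active
        if active == "open" and abs(x) <= r_inner:
            active = "point"
        elif active == "point" and abs(x) > r_outer:
            active = "open"
        switches += active != previous   # count a switch whenever the label changes
        labels.append(active)
    return labels, switches

# Iterates that enter the inner tube, jitter inside the buffer zone, then leave.
xs = [1.0, 0.4, 0.05, -0.15, 0.18, -0.12, 0.5, 0.9]
labels, switches = select_active_strata(xs, r_inner=0.1, r_outer=0.3)
print(labels, switches)
```

Separating the two radii is what keeps the number of switches small: with a single threshold, an iterate oscillating around it would change its active stratum at every step.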

Figure 3: Schematic depiction of stratification of a set, along with inner (thin) and outer (thick) neighborhoods defining buffer zones for stratum selection.

Quantitative Rates and Dimensional Dependence

The rates depend polynomially on the step size $\gamma$, with the exponent determined by the number of distinct active stratum dimensions (the "rank"). In the worst case, the resulting dependence is exponential in the dimension, but in practice the number of active stratum dimensions along actual trajectories can be much smaller. The analysis is resilient to arbitrary semialgebraic (or definable) nonsmoothness and recovers classical smooth rates as a special case.

Methodological Generalizations

Extensions to Decreasing Step Sizes

The framework extends to variable, decreasing step sizes. Using a standard doubling trick and adapting the combinatorial assignment of active strata to time-varying neighborhood scales, one obtains non-asymptotic complexity guarantees. For sufficiently fast step decay, the sequential convergence of the iterates is also obtained.
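The summary only says that a standard doubling trick is used, so the following is a generic sketch of such a schedule rather than the paper's construction. The step size is held constant on epochs whose length doubles while the step halves, which lets a constant-step analysis be applied epoch by epoch. The parameters `gamma0` and `first_epoch_len` are assumptions made for illustration.

```python
def doubling_schedule(gamma0, num_epochs, first_epoch_len=1):
    """Piecewise-constant steps: epoch j has length first_epoch_len * 2**j and step gamma0 / 2**j."""
    steps = []
    for j in range(num_epochs):
        steps.extend([gamma0 / 2**j] * (first_epoch_len * 2**j))
    return steps

steps = doubling_schedule(gamma0=0.5, num_epochs=4)
print(len(steps), steps[:7])
```

With `first_epoch_len=1`, the step used at iteration $k$ (counting from 1) lies between `gamma0 / k` and `2 * gamma0 / k`, so the schedule tracks a $1/k$ decay up to a factor of two.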

Sequential Convergence (Lojasiewicz Setting)

Notably, the developed machinery enables recovery of recent results on sequential convergence of SubGD under a $1/k$ step-size schedule, matching and providing an alternative proof for Lai and Song (2025). This follows via the classical Kurdyka–Lojasiewicz inequality, now measured against the projected Riemannian gradients on strata, and the explicit control over the nonstationarity terms introduced by stratum switching.
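Sequential convergence under $1/k$ steps can be observed on the simplest nonsmooth example, $f(x) = |x|$. This is a toy illustration chosen here, not the paper's argument. The iterate first moves toward $0$ by the harmonic steps, and once it has overshot, its distance to $0$ after step $k$ is at most $1/k$.

```python
def subgd_harmonic(x0, num_steps):
    """Subgradient method on f(x) = |x| with step size 1/k at iteration k."""
    xs = [float(x0)]
    for k in range(1, num_steps + 1):
        x = xs[-1]
        g = (x > 0) - (x < 0)   # a subgradient of |x|; 0 is selected at x = 0
        xs.append(x - g / k)
    return xs

xs = subgd_harmonic(x0=3.3, num_steps=2000)
print(abs(xs[-1]))
```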

Figure 4: Illustration of a full subgradient trajectory, with buffer-structured transitions among strata.

Theoretical and Practical Implications

Theoretical Significance

This work provides the first explicit, finite-time rates for subgradient descent on broad (semialgebraic/definable) nonsmooth, nonconvex functions, replacing previous asymptotic-only results and connecting the complexity of the algorithm directly to geometric invariants of the objective via stratification. The construction of global Lipschitz stratifications and the derivations of global projection properties represent a significant technical expansion of the geometric analysis toolkit available for nonsmooth optimization.

Practical Consequences

While the worst-case rates have exponential dimension dependence, the authors note this is not intrinsic, and practical problem instances (including deep neural networks, whose loss landscapes are semialgebraic upon compact restriction) often traverse strata of higher rank and lower effective complexity. The analysis reveals fine structure in the trajectory of SubGD, especially in architectures with ReLU-type nonsmoothness, and points toward principled ways to quantify optimization error and stationarity regardless of whether the encountered nonsmoothness is "accidental" or structural.

Open Directions and Future Developments

The framework suggests several future lines of investigation:

Dimension-free complexity: Whether geometric or algorithmic methods could yield dimension-independent rates (removing the exponential dependence on the dimension) for broader function classes.
Extension to stochastic algorithms: Generalizing the analysis to stochastic subgradient methods, which are the norm in large-scale learning.
Sharpness via Lojasiewicz exponents: Incorporating path-dependent (trajectory-specific) Lojasiewicz exponents could yield sharper rates for certain regimes or networks.
Algorithmic exploitation of strata structure: Dynamic bias toward higher-rank strata (or explicit stratification-aware algorithms) might improve convergence in practice.

Conclusion

The paper establishes a rich, geometric convergence theory for subgradient descent on semialgebraic functions, founded on explicit control of the interaction between nonsmooth dynamics and stratified smooth structure. The techniques offer a powerful platform for both further theoretical exploration and practical error certification in modern nonsmooth optimization contexts.
