Generalized Smoothness in Nonconvex Optimization
- The paper introduces the expected smoothness (ES) framework to extend classical Lipschitz smoothness for realistic stochastic gradients in nonconvex settings.
- It establishes optimal convergence rates for SGD in both general nonconvex and Polyak–Łojasiewicz regimes, linking performance to problem-specific constants.
- It offers actionable guidelines for minibatch and importance sampling strategies, validated through synthetic and real data experiments.
Generalized smoothness in nonconvex optimization encompasses a spectrum of concepts and technical frameworks that extend classical smoothness assumptions, such as global Lipschitz continuity of the gradient, to better capture the geometric and algorithmic realities in large-scale machine learning and related nonconvex applications. These generalized smoothness properties underpin modern convergence analyses, facilitate robust algorithm design, and enable precise sample complexity guarantees for first-order methods—especially stochastic gradient descent (SGD)—even when the objective function is neither convex nor classically smooth.
1. Expected Smoothness: Formulation and Motivation
The expected smoothness assumption (ES) was introduced to overcome the limitations of prior variance and growth conditions (such as strong growth or bounded variance) in stochastic optimization of nonconvex functions (Khaled et al., 2020). ES controls the second moment of the stochastic gradient $g(x)$ and is defined as follows: there exist nonnegative constants $A$, $B$, and $C$ such that for all $x$,

$$\mathbb{E}\left[\|g(x)\|^2\right] \le 2A\left(f(x) - f^{\inf}\right) + B\,\|\nabla f(x)\|^2 + C,$$

where $f^{\inf}$ denotes the infimum of $f$.
This condition is strictly weaker, and hence more general, than previous assumptions: it does not force an interpolation property (i.e., it does not require $g(x) = 0$ whenever $\nabla f(x) = 0$), it accommodates real sampling mechanisms (subsampling, compression), and it matches how stochastic gradients behave in non-interpolating regimes or with nonconvex $f$. The ES assumption is verified to be the weakest among a hierarchy of commonly used hypotheses in the nonconvex SGD literature.
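To make the ES condition concrete, the following sketch (an illustration, not the paper's code) numerically checks the ES inequality on a toy finite-sum least-squares problem, using constants that are valid for uniform single-element sampling: $A = \max_i L_i$, $B = 0$, $C = 2A\,\Delta^{\inf}$, where $L_i$ is the smoothness constant of the $i$-th component.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A_mat = rng.normal(size=(n, d))
b = rng.normal(size=n)

# Finite sum f(x) = (1/n) * sum_i f_i(x) with f_i(x) = 0.5 * (a_i^T x - b_i)^2.
def grad_i(x, i):
    return (A_mat[i] @ x - b[i]) * A_mat[i]

def f(x):
    return 0.5 * np.mean((A_mat @ x - b) ** 2)

# ES constants for uniform single-element sampling (g(x) = grad f_i, i uniform):
# each f_i has smoothness L_i = ||a_i||^2 and infimum 0, so one valid choice is
# A_es = max_i L_i, B_es = 0, C_es = 2 * A_es * Delta_inf with Delta_inf = f_inf.
L_comp = np.linalg.norm(A_mat, axis=1) ** 2
x_star, *_ = np.linalg.lstsq(A_mat, b, rcond=None)
f_inf = f(x_star)
A_es, C_es = L_comp.max(), 2.0 * L_comp.max() * f(x_star)

# Check E||g(x)||^2 <= 2*A_es*(f(x) - f_inf) + C_es at random points.
for _ in range(100):
    x = rng.normal(size=d)
    second_moment = np.mean([np.sum(grad_i(x, i) ** 2) for i in range(n)])
    bound = 2 * A_es * (f(x) - f_inf) + C_es
    assert second_moment <= bound + 1e-9
print("ES inequality verified at 100 random points")
```

Note that interpolation fails here (the residual at the least-squares solution is nonzero), so $C > 0$: exactly the regime where ES is strictly more general than bounded-variance or strong-growth conditions.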
2. Convergence Rates for SGD under Generalized Smoothness
The ES property directly enables optimal convergence rates for SGD in the nonconvex setting. Assuming $f$ is $L$-smooth:
- For general nonconvex functions, to reach an $\varepsilon$-stationary point (i.e., $\min_{0 \le k < K} \mathbb{E}\|\nabla f(x_k)\|^2 \le \varepsilon^2$), SGD requires $O(\varepsilon^{-4})$ stochastic gradient evaluations. Up to absolute constants, the step size and iteration requirements from Corollary 4 take the form
$$\gamma = \min\left\{\frac{1}{LB},\ \frac{1}{\sqrt{LAK}},\ \frac{\varepsilon^2}{2LC}\right\}, \qquad K = O\!\left(\frac{\delta_0 L B}{\varepsilon^2} + \frac{\delta_0 L\,(\delta_0 A + C)}{\varepsilon^4}\right),$$
where $\delta_0 := f(x_0) - f^{\inf}$.
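Up to absolute constants, these parameter choices can be sketched as a small helper (a hypothetical function mirroring the $O$-form bounds above, not the paper's code):

```python
import math

def sgd_nonconvex_budget(L, A, B, C, delta0, eps):
    """Iteration budget and step size for SGD under expected smoothness,
    up to absolute constants (a sketch of the O-form bounds)."""
    # K = O( delta0*L*B/eps^2 + delta0*L*(delta0*A + C)/eps^4 )
    K = max(1, math.ceil(delta0 * L * (B + (delta0 * A + C) / eps**2) / eps**2))
    # gamma = min{ 1/(L*B), 1/sqrt(L*A*K), eps^2/(2*L*C) }, dropping absent terms
    candidates = [1.0 / (L * B) if B > 0 else math.inf,
                  1.0 / math.sqrt(L * A * K) if A > 0 else math.inf,
                  eps**2 / (2 * L * C) if C > 0 else math.inf]
    return K, min(candidates)

K, gamma = sgd_nonconvex_budget(L=10.0, A=5.0, B=1.0, C=0.5, delta0=2.0, eps=0.1)
print(K, gamma)
```

Halving the target accuracy $\varepsilon$ multiplies the dominant term of the budget by roughly $16$, reflecting the $\varepsilon^{-4}$ oracle complexity.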
- If $f$ satisfies the Polyak–Łojasiewicz (PL) condition, i.e.,
$$\tfrac{1}{2}\|\nabla f(x)\|^2 \ge \mu\left(f(x) - f^{\inf}\right) \quad \text{for all } x,$$
strong global convergence is recovered: the number of iterations to reach error below $\varepsilon$ scales as $O(1/\varepsilon)$, up to condition-number and logarithmic factors. Theorem 6 gives a detailed rate showing that, for a suitable step size sequence, the expected suboptimality $\mathbb{E}[f(x_K) - f^{\inf}]$ is bounded by an exponentially decaying transient in $\delta_0$ plus a noise term of order $LC/(\mu^2 K)$, where $\kappa = L/\mu$ and $\delta_0 = f(x_0) - f^{\inf}$.
In both the nonconvex and PL cases, the rates are optimal with respect to first-order oracle complexity. The multiplicative and additive roles of the problem constants $(A, B, C)$, the smoothness constant $L$, and the PL parameter $\mu$ are explicit in the bounds.
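The PL-regime behavior can be visualized on a toy quadratic, where the stochastic gradient satisfies ES with $A = 0$, $B = 1$, $C = \sigma^2 d$ (an illustrative simulation under these stated assumptions, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy PL objective: f(x) = 0.5*||x||^2, so mu = L = 1 and f_inf = 0.
# The stochastic gradient g(x) = x + noise satisfies ES with A=0, B=1.
def run_sgd(K, sigma=1.0, d=10):
    x = np.full(d, 5.0)
    for k in range(K):
        g = x + sigma * rng.normal(size=d)
        gamma = 2.0 / (k + 10)          # O(1/k) step sizes suited to the PL regime
        x = x - gamma * g
    return 0.5 * np.dot(x, x)           # f(x_K) - f_inf

# Average over repetitions; the error roughly quarters as K quadruples, O(1/K).
errs = [np.mean([run_sgd(K) for _ in range(20)]) for K in (100, 400, 1600)]
print(errs)
```

The decaying step size suppresses the noise floor induced by $C > 0$, which is why the PL regime attains an $O(1/K)$ function-value rate rather than stalling at a constant error.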
3. Role of Functional Growth and Regime Comparison
The analysis delineates convex, nonconvex, and PL regimes. Under convexity (or the more general quadratic functional growth condition, QFG: $f(x) - f^{\inf} \ge \frac{\mu}{2}\,\mathrm{dist}(x, \mathcal{X}^*)^2$ for all $x$, where $\mathcal{X}^*$ is the solution set), SGD with the ES property admits even larger step sizes, no longer shrinking with the horizon $K$, and the rates show additive rather than multiplicative dependence on conditioning parameters such as $\kappa = L/\mu$. Specifically, under QFG and ES, Theorem 7 establishes a function-value rate in which the conditioning enters additively, so convergence is significantly accelerated compared to the general nonconvex setting.
A key insight is the transition from multiplicative dependence on the conditioning constants in the nonconvex regime to additive dependence in the convex/QFG/PL regimes, which sharpens sample complexity and informs practical step size selection.
4. Minibatch Sampling and Importance Sampling Strategies
The analysis is performed in a general stochastic framework that covers numerous finite-sum and sampling algorithms. Given a finite-sum structure $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$ and an unbiased sampling vector $v \in \mathbb{R}^n_+$ with $\mathbb{E}[v_i] = 1$ for all $i$, the stochastic gradient is $g(x) = \frac{1}{n}\sum_{i=1}^n v_i \nabla f_i(x)$. For several practical schemes, such as:
- Independent sampling with or without replacement,
- $\tau$-nice sampling with minibatch size $\tau$,
the paper provides explicit, computable values for the ES constants $(A, B, C)$. For example, for single-element sampling with replacement:
$$A = \max_i \frac{L_i}{n p_i}, \qquad B = 0, \qquad C = 2A\,\Delta^{\inf}, \qquad \Delta^{\inf} := f^{\inf} - \frac{1}{n}\sum_{i=1}^n f_i^{\inf},$$
with $p_i$ the sampling probability of index $i$ and $L_i$ the smoothness constant of $f_i$. These explicit formulas form the basis for designing efficient importance sampling strategies. In particular, the theoretically optimal sampling probabilities satisfy $p_i \propto L_i$.
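The optimality of $p_i \propto L_i$ can be checked numerically: these probabilities equalize all ratios $L_i/(n p_i)$ at the mean smoothness $\bar{L}$, the smallest achievable value of $A$ (an illustrative sketch; the variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
L = rng.lognormal(mean=0.0, sigma=2.0, size=n)  # heterogeneous smoothness constants L_i

def A_es(p):
    # ES constant for single-element sampling with probabilities p_i:
    # A = max_i L_i / (n * p_i)
    return np.max(L / (n * p))

p_uniform = np.full(n, 1.0 / n)
p_opt = L / L.sum()                 # importance sampling: p_i proportional to L_i

# p_opt equalizes every ratio L_i/(n*p_i) at mean(L), the minimum of A_es.
print(A_es(p_uniform), A_es(p_opt))
```

Under uniform sampling $A = \max_i L_i$, so the gap between the two strategies grows with the spread of the $L_i$: exactly the heterogeneous setting where importance sampling pays off.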
Additionally, balancing the computational cost per iteration against the reduction in the ES constants, the paper derives a closed-form expression for the optimal minibatch size $\tau^*$, obtained by minimizing the total number of stochastic gradient evaluations (iterations times batch size) as a function of $\tau$ and problem-specific constants.
5. Practical Validation and Empirical Calibration
Two experimental studies corroborate the theoretical results and the generalized smoothness framework:
- In synthetic nonconvex linear regression with strongly nonuniform smoothness (the component constants $L_i$ vary widely), importance sampling (with optimal $p_i \propto L_i$) outpaces uniform sampling, while in normalized settings both perform similarly.
- On real data (the a9a dataset), the fit of the ES model is assessed by recording the losses and the full and stochastic gradient norms along the run, and then fitting the constants $(A, B, C)$ via nonnegative least squares to the ES inequality. The fitted constants match theoretical predictions and outperform the alternative relaxed growth (RG) model: the ES relation provides a tight, data-driven description of stochastic gradient behavior.
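The calibration step can be sketched as follows: given recorded suboptimalities, gradient norms, and stochastic-gradient second moments, fit $(A, B, C)$ by nonnegative least squares. This minimal sketch uses synthetic stand-in data and `scipy.optimize.nnls`; it is not the paper's code, and the planted constants are arbitrary.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(3)

# Synthetic stand-ins for quantities recorded along an SGD run:
# suboptimality f(x_k) - f_inf, squared gradient norm, and the measured
# second moment of the stochastic gradient.
K = 200
subopt = rng.uniform(0.1, 5.0, size=K)
grad_sq = rng.uniform(0.0, 3.0, size=K)
A_true, B_true, C_true = 1.5, 0.8, 0.2          # planted constants (arbitrary)
second_moment = (2 * A_true * subopt + B_true * grad_sq + C_true
                 + 0.01 * rng.normal(size=K))   # small measurement noise

# Fit E||g||^2 ~ 2A(f - f_inf) + B||grad f||^2 + C subject to A, B, C >= 0.
design = np.column_stack([2 * subopt, grad_sq, np.ones(K)])
coefs, residual = nnls(design, second_moment)
A_fit, B_fit, C_fit = coefs
print(A_fit, B_fit, C_fit)
```

The nonnegativity constraint matters: without it, noisy measurements can produce a negative fitted $C$, which has no meaning in the ES model.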
6. Implications and Algorithmic Guidance
The generalized smoothness (ES) concept unifies theory and practice for nonconvex finite-sum stochastic optimization:
- The ES condition models realistic noise, subsampling, and data-dependent gradient variation;
- Optimal SGD convergence rates are achieved and explicitly linked to problem-dependent constants, with principled step size and sampling rules;
- The results hold in both convex/PL and nonconvex settings, with a sharp distinction in the impact of sampling and conditioning;
- The analysis provides actionable guidelines for selecting minibatch sizes and sampling distributions, directly informed by problem geometry;
- Empirical results validate the modeling both in synthetic and real datasets.
This framework robustly advances the state-of-the-art in generalized smoothness analysis for nonconvex optimization, consolidating practical algorithm design with matched theoretical guarantees (Khaled et al., 2020).