Generalized Smoothness in Nonconvex Optimization

Updated 23 September 2025
  • The paper introduces the expected smoothness (ES) framework to extend classical Lipschitz smoothness for realistic stochastic gradients in nonconvex settings.
  • It establishes optimal convergence rates for SGD in both general nonconvex and Polyak–Łojasiewicz regimes, linking performance to problem-specific constants.
  • It offers actionable guidelines for minibatch and importance sampling strategies, validated through synthetic and real data experiments.

Generalized smoothness in nonconvex optimization encompasses a spectrum of concepts and technical frameworks that extend classical smoothness assumptions, such as global Lipschitz continuity of the gradient, to better capture the geometric and algorithmic realities in large-scale machine learning and related nonconvex applications. These generalized smoothness properties underpin modern convergence analyses, facilitate robust algorithm design, and enable precise sample complexity guarantees for first-order methods—especially stochastic gradient descent (SGD)—even when the objective function is neither convex nor classically smooth.

1. Expected Smoothness: Formulation and Motivation

The expected smoothness assumption (ES) was introduced to overcome the limitations of prior variance and growth conditions (such as strong growth or bounded variance) in stochastic optimization of nonconvex functions (Khaled et al., 2020). ES focuses on controlling the second moment of the stochastic gradient $g(x)$ and is defined as follows: there exist nonnegative constants $A$, $B$, and $C$ such that for all $x$,

$$\mathbb{E}\left[\|g(x)\|^2\right] \le 2A\,(f(x) - \inf f) + B\,\|\nabla f(x)\|^2 + C.$$

This condition is strictly weaker, and hence more general, than previous assumptions: it does not force an interpolation property (i.e., it does not require $g(x)=0$ whenever $\nabla f(x)=0$), it accommodates real sampling mechanisms (subsampling, compression), and it matches how stochastic gradients behave in non-interpolating regimes or with nonconvex $f$. The ES assumption is shown to be the weakest among a hierarchy of commonly used hypotheses in the nonconvex SGD literature.
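
As a concrete toy illustration (not from the paper), the following Python sketch estimates the three quantities entering the ES inequality for a small least-squares finite sum with uniform single-sample subsampling and checks the bound for candidate constants; the problem, variable names, and constant choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
Phi = rng.normal(size=(n, d))                  # data matrix, rows phi_i
y = rng.normal(size=n)                         # targets

def f(x):                                      # f(x) = (1/n) sum_i 0.5 * (phi_i^T x - y_i)^2
    return 0.5 * np.mean((Phi @ x - y) ** 2)

def grad_full(x):
    return Phi.T @ (Phi @ x - y) / n

def grad_stoch(x, i):                          # unbiased single-sample gradient: grad f_i(x)
    return Phi[i] * (Phi[i] @ x - y[i])

x_star = np.linalg.lstsq(Phi, y, rcond=None)[0]
f_inf = f(x_star)                              # inf_x f(x) for this quadratic

# Candidate ES constants for this toy problem: a direct calculation shows that
# A = max_i ||phi_i||^2, B = 0, C = 2 * A * f_inf satisfy the ES inequality here.
A = np.max(np.sum(Phi ** 2, axis=1))
B, C = 0.0, 2.0 * A * f_inf

for _ in range(5):                             # check the inequality at a few random points
    x = rng.normal(size=d)
    second_moment = np.mean([np.linalg.norm(grad_stoch(x, i)) ** 2 for i in range(n)])
    bound = 2 * A * (f(x) - f_inf) + B * np.linalg.norm(grad_full(x)) ** 2 + C
    print(second_moment <= bound + 1e-8, round(second_moment, 3), round(bound, 3))
```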

2. Convergence Rates for SGD under Generalized Smoothness

The ES property directly enables optimal convergence rates for SGD in the nonconvex setting. Assuming $f$ is $L$-smooth:

  • For general nonconvex functions, to achieve an $\varepsilon$-stationary point (i.e., $\mathbb{E}[\|\nabla f(x)\|^2] \le \varepsilon^2$), SGD requires $O(\varepsilon^{-4})$ stochastic gradient evaluations. The precise step size and iteration requirements, from Corollary 4, are:

$$\gamma = \min\left\{ \frac{1}{\sqrt{L A K}},\; \frac{1}{L B},\; \frac{\varepsilon}{2 L C} \right\}, \qquad K \ge \frac{12 \delta_0 L}{\varepsilon^2} \cdot \max\left\{ B,\; \frac{12 \delta_0 A}{\varepsilon^2},\; \frac{2C}{\varepsilon^2} \right\},$$

where $\delta_0 = f(x_0) - \inf f$ (a small numerical helper evaluating these expressions is sketched at the end of this section).

  • If $f$ satisfies the Polyak–Łojasiewicz (PL) condition, i.e.,

$$\frac{1}{2}\,\|\nabla f(x)\|^2 \ge \mu\,(f(x) - f^*),$$

strong global convergence is recovered: the number of iterations to reach error less than $\varepsilon$ scales as $O(\varepsilon^{-1})$. Theorem 6 gives a detailed rate showing that, for a suitable step size sequence,

$$\mathbb{E}[f(x_K) - f^*] \le \frac{9 \kappa_f C}{2\mu K} + \exp\left\{-\frac{K}{2\kappa_f \max\{\kappa_S, B\}}\right\}\,\bigl(f(x_0)-f^*\bigr),$$

where $\kappa_f = L/\mu$ and $\kappa_S = A/\mu$.
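
Reading off the iteration complexity from this bound (a back-of-the-envelope step, not a verbatim statement from the paper): requiring each of the two terms to be at most $\varepsilon/2$ gives

$$K \;\gtrsim\; \max\left\{ \frac{9 \kappa_f C}{\mu \varepsilon},\;\; 2 \kappa_f \max\{\kappa_S, B\}\,\log\frac{2\,(f(x_0)-f^*)}{\varepsilon} \right\},$$

so the $O(\varepsilon^{-1})$ complexity claimed above is driven by the noise constant $C$, while the condition numbers $\kappa_f$ and $\kappa_S$ enter only through the logarithmic (exponentially decaying) term.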

In both the nonconvex and PL cases, the rates are optimal in terms of first-order oracle complexity. The multiplicative and additive roles of the problem constants $(A, B, C)$, the smoothness constant $L$, and the PL parameter $\mu$ are explicit in the bounds.
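
The following minimal Python helper (an illustrative sketch under the assumptions above, not code from the paper) evaluates the Corollary 4 step size and iteration bound, and the right-hand side of the Theorem 6 bound, for user-supplied constants; all numeric values in the example are placeholders.

```python
import math

def corollary4_schedule(L, A, B, C, delta0, eps):
    """Iteration budget K and constant step size gamma mirroring the Corollary 4 expressions."""
    K = (12.0 * delta0 * L / eps ** 2) * max(B, 12.0 * delta0 * A / eps ** 2, 2.0 * C / eps ** 2)
    K = max(int(math.ceil(K)), 1)
    candidates = []
    if A > 0:
        candidates.append(1.0 / math.sqrt(L * A * K))
    if B > 0:
        candidates.append(1.0 / (L * B))
    if C > 0:
        candidates.append(eps / (2.0 * L * C))
    gamma = min(candidates) if candidates else 1.0 / L   # degenerate noiseless case (assumption)
    return gamma, K

def theorem6_bound(K, L, A, B, C, mu, delta0):
    """Right-hand side of the Theorem 6 bound on E[f(x_K) - f*] in the PL setting."""
    kappa_f, kappa_S = L / mu, A / mu
    return (9.0 * kappa_f * C / (2.0 * mu * K)
            + math.exp(-K / (2.0 * kappa_f * max(kappa_S, B))) * delta0)

# Example with placeholder constants
gamma, K = corollary4_schedule(L=10.0, A=5.0, B=1.0, C=0.1, delta0=1.0, eps=1e-2)
print(gamma, K, theorem6_bound(K, L=10.0, A=5.0, B=1.0, C=0.1, mu=0.5, delta0=1.0))
```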

3. Role of Functional Growth and Regime Comparison

The analysis delineates convex, nonconvex, and PL regimes. Under convexity (or the more general quadratic functional growth condition, QFG: $f(x)-f^* \ge (\mu/2)\,\|x-x^*\|^2$), SGD with the ES property admits even larger step sizes ($O(1/L)$), and the rates show additive dependencies on the conditioning parameters, e.g., $(\kappa_f + \kappa_S)$. Specifically, under QFG and ES, Theorem 7 establishes

$$\mathbb{E}[f(x_K)-f^*] \le \frac{18\,\kappa_f C}{\mu K} + \frac{\kappa_f}{2}\,\exp(-K/M)\,\bigl(f(x_0)-f^*\bigr),$$

with convergence in function value significantly accelerated compared to the general nonconvex setting.

A key insight is the transition from multiplicative dependence on $A, B, C$ in the nonconvex regime to additive dependence in the convex/QFG/PL regime, which sharpens sample complexity and informs practical step size selection.

4. Minibatch Sampling and Importance Sampling Strategies

The analysis is performed in a general stochastic framework that covers numerous finite-sum and sampling algorithms. Given a finite-sum structure $f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)$ and an unbiased sampling vector $v$, the stochastic gradient is $g(x) = \nabla f_v(x)$. For several practical schemes, such as:

  • Independent sampling with or without replacement,
  • $\tau$-nice sampling with minibatch size $\tau$,

the paper provides explicit, computable values for the ES constants $A$, $B$, $C$. For example, under i.i.d. sampling with replacement:

$$A = \max_i \frac{L_i}{\tau n q_i}, \qquad B = 1 - \frac{1}{\tau}, \qquad C = 2A\,\Delta^{\inf},$$

with $q_i$ the sampling probability of index $i$ and $L_i$ the smoothness constant of $f_i$. These explicit formulas form the basis for designing efficient importance sampling strategies. In particular, the theoretically optimal sampling probabilities satisfy $q_i^* = L_i / \sum_j L_j$.

Additionally, balancing the computational cost per iteration, the optimal minibatch size is

$$\tau^* = 1 + \left\lfloor \frac{D \bar{L}}{\varepsilon^2} \right\rfloor,$$

where $\bar{L} = (1/n)\sum_i L_i$ and $D$ is a problem-specific constant (a sketch evaluating these formulas follows below).
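
The following sketch (illustrative Python, not code from the paper) evaluates these closed-form quantities for a given vector of component smoothness constants $L_i$; the numeric values, and the treatment of $\Delta^{\inf}$ and $D$ as user-supplied inputs, are assumptions for illustration.

```python
import numpy as np

def es_constants_iid_with_replacement(L_i, q_i, tau, Delta_inf):
    """ES constants (A, B, C) for i.i.d. sampling with replacement, per the formulas above."""
    n = len(L_i)
    A = np.max(np.asarray(L_i) / (tau * n * np.asarray(q_i)))
    B = 1.0 - 1.0 / tau
    C = 2.0 * A * Delta_inf
    return A, B, C

def optimal_importance_probabilities(L_i):
    """Theoretically optimal sampling probabilities q_i* = L_i / sum_j L_j."""
    L_i = np.asarray(L_i, dtype=float)
    return L_i / L_i.sum()

def optimal_minibatch_size(L_i, D, eps):
    """Optimal minibatch size tau* = 1 + floor(D * Lbar / eps^2), with Lbar the mean of the L_i."""
    return 1 + int(np.floor(D * np.mean(L_i) / eps ** 2))

# Example with placeholder values
L_i = np.array([1.0, 10.0, 100.0, 5.0])
q_star = optimal_importance_probabilities(L_i)
A, B, C = es_constants_iid_with_replacement(L_i, q_star, tau=2, Delta_inf=0.5)
print(q_star, (A, B, C), optimal_minibatch_size(L_i, D=1e-3, eps=0.1))
```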

5. Practical Validation and Empirical Calibration

Two experimental studies corroborate the theoretical results and the generalized smoothness framework:

  • In synthetic nonconvex linear regression with strongly nonuniform smoothness (the constants $L_i$ vary widely), importance sampling with the optimal $q_i^*$ outperforms uniform sampling, while in normalized settings both perform similarly.
  • On real data (the a9a dataset), the fit of the ES model is assessed by recording the losses, full and stochastic gradient norms, and then fitting $(2A, B, C)$ via nonnegative least squares to the inequality (a minimal sketch of such a fit is given below). The fitted constants match theoretical predictions and outperform the alternative relaxed growth (RG) model, so the ES relation provides a tight, data-driven description of stochastic gradient behavior.
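
A minimal sketch of such a nonnegative least-squares fit (assuming the per-iteration second moments, loss gaps, and squared full-gradient norms have already been recorded; the file names and variable names are illustrative placeholders, not from the paper):

```python
import numpy as np
from scipy.optimize import nnls

# Quantities recorded along an SGD run (placeholder files, illustrative only):
#   sq_stoch_grad[k] : estimate of E||g(x_k)||^2
#   loss_gap[k]      : f(x_k) - inf f (or f(x_k) minus the best observed loss)
#   sq_full_grad[k]  : ||grad f(x_k)||^2
sq_stoch_grad = np.load("sq_stoch_grad.npy")
loss_gap = np.load("loss_gap.npy")
sq_full_grad = np.load("sq_full_grad.npy")

# Fit E||g||^2 ~= 2A*(f - inf f) + B*||grad f||^2 + C with nonnegative coefficients.
design = np.column_stack([loss_gap, sq_full_grad, np.ones_like(loss_gap)])
coeffs, residual_norm = nnls(design, sq_stoch_grad)
two_A, B, C = coeffs
print("fitted 2A, B, C:", two_A, B, C, " residual norm:", residual_norm)
```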

6. Implications and Algorithmic Guidance

The generalized smoothness (ES) concept unifies theory and practice for nonconvex finite-sum stochastic optimization:

  • The ES condition models realistic noise, subsampling, and data-dependent gradient variation;
  • Optimal SGD convergence rates are achieved and explicitly linked to problem-dependent constants, with principled step size and sampling rules;
  • The results hold in both convex/PL and nonconvex settings, with a sharp distinction in the impact of sampling and conditioning;
  • The analysis provides actionable guidelines for selecting minibatch sizes and sampling distributions, directly informed by problem geometry;
  • Empirical results validate the model on both synthetic and real datasets.

This framework advances the state of the art in generalized smoothness analysis for nonconvex optimization, consolidating practical algorithm design with matched theoretical guarantees (Khaled et al., 2020).

References (1)

  • Khaled et al. (2020).
