Second-Order Stationary Point (SOSP)

Updated 23 October 2025
  • Second-Order Stationary Point (SOSP) is a condition in nonconvex optimization characterized by nearly zero gradients and a Hessian that is nearly positive-semidefinite within set tolerances.
  • Efficient methods such as two-stage algorithms, trust-region approaches, and negative curvature exploitation are deployed to converge to SOSP in both deterministic and stochastic settings.
  • Research shows optimal convergence rates under various constraints while addressing challenges like NP-hard verification, robust noise handling, and differential privacy.

A second-order stationary point (SOSP) is a critical notion in the analysis and algorithmic treatment of nonconvex optimization, serving as a fundamental definition of approximate local optimality. An SOSP, in contrast to a first-order stationary point (FOSP), certifies both that the gradient is nearly zero and that the Hessian is nearly positive-semidefinite, up to prescribed tolerances. In modern theoretical optimization and machine learning, achieving convergence to an SOSP is central to obtaining non-asymptotic guarantees of robust local optimality and to escaping saddle points in unconstrained and constrained, deterministic and stochastic, and even privacy-preserving settings.

1. Mathematical Definition and Optimality Conditions

Let $f:\mathbb{R}^d \to \mathbb{R}$ be a twice-differentiable function, potentially subject to constraints such as $x \in \mathcal{C}$. An $(\epsilon_g, \epsilon_H)$-second-order stationary point ($(\epsilon_g, \epsilon_H)$-SOSP) is defined as a point $x$ satisfying

$$\|\nabla f(x)\| \le \epsilon_g, \qquad \lambda_{\mathrm{min}}\!\left(\nabla^2 f(x)\right) \ge -\epsilon_H$$

in unconstrained problems. For constrained problems (e.g., $x \in \mathcal{C}$), directional and tangent-space generalizations are used:

  • First-order stationarity: for all feasible directions $d$ (i.e., $d$ satisfying $x + d \in \mathcal{C}$, often $d \in T_{\mathcal{C}}(x)$),

    $$\nabla f(x)^\top d \ge -\epsilon_g$$

  • Second-order stationarity: for all feasible $d$ with $\nabla f(x)^\top d = 0$,

    $$d^\top \nabla^2 f(x)\, d \ge -\epsilon_H \|d\|^2$$

The $(\epsilon, \sqrt{\epsilon})$-SOSP scaling and related parameter choices are prevalent in both theoretical and applied work (Nouiehed et al., 2018, He et al., 2022).
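To make the unconstrained condition concrete, the following minimal sketch checks the $(\epsilon_g, \epsilon_H)$-SOSP criterion on an explicitly formed Hessian (an illustration only: forming and factorizing a dense Hessian is feasible in low dimension, whereas the methods discussed below rely on Hessian-vector products):

```python
import numpy as np

def is_approx_sosp(grad, hess, eps_g, eps_H):
    """Check the unconstrained (eps_g, eps_H)-SOSP condition:
    ||grad f(x)|| <= eps_g and lambda_min(hess f(x)) >= -eps_H."""
    grad_ok = np.linalg.norm(grad) <= eps_g
    lam_min = np.linalg.eigvalsh(hess)[0]   # smallest eigenvalue of the symmetric Hessian
    return grad_ok and lam_min >= -eps_H

# Example: the origin is a saddle of f(x, y) = x^2 - y^2, hence an FOSP but not an SOSP
grad = np.array([0.0, 0.0])
hess = np.array([[2.0, 0.0], [0.0, -2.0]])
eps = 1e-3
print(is_approx_sosp(grad, hess, eps, np.sqrt(eps)))   # False: curvature -2 < -sqrt(eps)
```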

2. Complexity of Achieving an SOSP: Thresholds and Hardness

In unconstrained smooth finite-sum or stochastic settings, finding an $(\epsilon, \sqrt{\epsilon})$-SOSP requires $\widetilde{\mathcal{O}}(\epsilon^{-3})$ stochastic gradient and Hessian-vector product evaluations; this is unimprovable even for $p$th-order methods with $p \geq 2$ (Arjevani et al., 2020). In deterministic settings, Newton-type or cubic-regularized methods achieve similar rates for unconstrained or “simple” constrained problems.

For general constrained nonconvex problems, checking whether a point is an $(\epsilon_g, \epsilon_H)$-SOSP is NP-hard even with simple linear constraints (Nouiehed et al., 2018). Efficient algorithms (e.g., dynamic second-order Frank–Wolfe, SNAP, and Newton–CG-based barrier or augmented Lagrangian methods) exist if the constrained second-order subproblem (i.e., searching for negative curvature satisfying the constraints) can be solved (or approximated) efficiently (Nouiehed et al., 2018, Lu et al., 2019, He et al., 2022, He et al., 2023).

The following table summarizes the best-known rates for unconstrained and various constrained settings:

Setting | Complexity (iterations/oracle calls) | Key Reference
--- | --- | ---
Unconstrained, deterministic | $\mathcal{O}(\epsilon^{-3/2})$ | (He et al., 2022)
Unconstrained, stochastic | $\mathcal{O}(\epsilon^{-3})$ | (Arjevani et al., 2020)
Linearly constrained | $\widetilde{\mathcal{O}}(\max\{\epsilon_g^{-3/2}, \epsilon_H^{-3}\})$ | (Nouiehed et al., 2019)
Polyhedral constraints (SNAP algorithms) | $\mathcal{O}(1/\epsilon^{2.5})$ | (Lu et al., 2019)
Conic/equality constrained (barrier-AL/AL) | $\widetilde{\mathcal{O}}(\epsilon^{-7/2})$ (under GLICQ) | (He et al., 2022; He et al., 2023)

These rates are optimal or nearly optimal, up to logarithmic factors, where applicable.

3. Algorithmic Frameworks for Convergence to SOSP

Two-Stage (“First-Order then Second-Order”) Methods

Frameworks such as those in (Mokhtari et al., 2018) and (Nouiehed et al., 2018) iterate in two phases:

  1. First-Order Phase: Use a standard first-order method (e.g., projected/conditional gradient, gradient descent) until a first-order condition is met (i.e., $\|\nabla f(x)\| \le \epsilon$, or directional derivatives over the constraint set exceed $-\epsilon$).
  2. Second-Order Phase: Approximately solve a quadratic (often constrained) subproblem:

$$\min_{d \in \mathcal{C},\; \nabla f(x)^\top d = 0,\; \|d\| \le 1} \; d^\top \nabla^2 f(x)\, d$$

If a sufficiently negative curvature direction is found ($d^\top \nabla^2 f(x)\, d < -\gamma$), take a second-order step; otherwise, declare an (approximate) SOSP.

The principle extends to the stochastic context by switching to mini-batch gradient and Hessian estimates and relaxing the linear constraints (Mokhtari et al., 2018).
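The sketch below illustrates this two-phase template in the unconstrained case; a dense eigendecomposition stands in for the constrained negative-curvature subproblem of the cited frameworks, and the objective, step size, and tolerances are illustrative assumptions:

```python
import numpy as np

def two_stage_sosp(f, grad, hess, x0, eps_g=1e-3, eps_H=None,
                   eta=1e-2, max_iter=10_000):
    """Illustrative unconstrained two-phase scheme: run gradient descent until
    the first-order condition holds, then take a negative-curvature step if the
    Hessian has an eigenvalue below -eps_H; otherwise declare an approximate SOSP."""
    eps_H = np.sqrt(eps_g) if eps_H is None else eps_H
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) > eps_g:            # first-order phase
            x = x - eta * g
            continue
        lam, V = np.linalg.eigh(hess(x))         # second-order phase
        if lam[0] >= -eps_H:
            return x                             # approximate SOSP
        d = V[:, 0]                              # most negative curvature direction
        if f(x + eta * d) > f(x - eta * d):      # choose the descent sign
            d = -d
        x = x + eta * d
    return x

# Toy example: escape the saddle of f(x, y) = (x^2 - 1)^2 + y^2 near the origin
f = lambda z: (z[0] ** 2 - 1) ** 2 + z[1] ** 2
grad = lambda z: np.array([4 * z[0] * (z[0] ** 2 - 1), 2 * z[1]])
hess = lambda z: np.array([[12 * z[0] ** 2 - 4, 0.0], [0.0, 2.0]])
print(two_stage_sosp(f, grad, hess, x0=[1e-4, 1e-4]))   # converges to a neighborhood of the minimum (1, 0)
```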

Trust-Region and Newton-Type Methods

Trust-region approaches (e.g., LC-TRACE (Nouiehed et al., 2019), Newton–CG barrier (He et al., 2022)) solve, at each iteration,

$$\min_s \; f(x) + g^\top s + \frac{1}{2} s^\top H s, \quad \text{subject to constraints and } \|s\| \leq \delta$$

modifying the acceptance rules and step size control to handle constraints and guarantee sufficient second-order decrease. Cubic regularization (adding a term $+\frac{\sigma}{3}\|s\|^3$) can either be directly used or substituted with a more elaborate radius update (as in (Nouiehed et al., 2019)).
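As a rough illustration of the cubic-regularized model (not the dedicated Krylov or Newton-based subproblem solvers behind the stated guarantees), the sketch below minimizes the cubic model with a general-purpose routine; the inputs `g`, `H`, and `sigma` are illustrative placeholders:

```python
import numpy as np
from scipy.optimize import minimize

def cubic_regularized_step(g, H, sigma):
    """Approximately minimize the cubic model
        m(s) = g^T s + 1/2 s^T H s + (sigma/3) ||s||^3,
    which is bounded below even when H is indefinite."""
    def model(s):
        return g @ s + 0.5 * s @ H @ s + (sigma / 3.0) * np.linalg.norm(s) ** 3
    s0 = -g / (np.linalg.norm(g) + 1e-12)   # crude initialization along -g
    return minimize(model, s0, method="L-BFGS-B").x

# Point with a descent direction in the gradient and negative curvature in H
g = np.array([1.0, -0.5])
H = np.array([[2.0, 0.0], [0.0, -1.0]])
s = cubic_regularized_step(g, H, sigma=1.0)
print(s)   # a step with negative model value, combining gradient and curvature information
```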

Negative Curvature Pursuit

Algorithms such as SNAP and SNAP$^+$ (Lu et al., 2019) alternate between projected gradient descent steps and negative curvature descent in the free subspace defined by active constraints, using either explicit Hessian-vector computations or first-order estimators.
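A core building block here is extracting a negative-curvature direction restricted to a subspace using only Hessian-vector products. The sketch below uses shifted power iteration with an orthonormal basis `P` of the free subspace; it is a stand-in for the active-constraint machinery and the Lanczos or first-order estimators used in the cited methods:

```python
import numpy as np

def restricted_negative_curvature(hvp, P, shift, n_iter=300, seed=0):
    """Estimate a negative-curvature direction of the Hessian restricted to a
    subspace (columns of the orthonormal basis P), using only Hessian-vector
    products: power iteration on shift*I - P^T H P, with shift >= ||H||_2."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(P.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = shift * v - P.T @ hvp(P @ v)    # apply (shift*I - P^T H P)
        v = w / np.linalg.norm(w)
    d = P @ v                               # direction in the original space
    return d, d @ hvp(d)                    # direction and its curvature d^T H d

# Example: indefinite Hessian, free subspace spanned by the first two coordinates
H = np.diag([1.0, -2.0, 3.0])
P = np.eye(3)[:, :2]                        # orthonormal basis of the free subspace
d, curv = restricted_negative_curvature(lambda u: H @ u, P, shift=4.0)
print(curv)   # approximately -2.0, the most negative curvature within the subspace
```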

Barrier–Augmented Lagrangian and Interior Point Methods

Recent advances for general conic and nonlinear semidefinite programming settings employ interior point techniques with negative curvature steps or combine a conic barrier term with an augmented Lagrangian on the nonlinear constraints. Preconditioned Newton–CG solvers, randomized minimum eigenvalue oracles, and sophisticated line searches underpin these methods, achieving sample and operation complexity optimality (He et al., 2022, He et al., 2023, Arahata et al., 2021).

4. Extensions: Stochastic, Robust, and Private SOSP Optimization

Stochastic Optimization

Stochastic variants interleave gradient descent steps with negative curvature exploitation (e.g., via Oja’s method), using only access to stochastic gradient and Hessian-vector oracles. Sample complexity is sharply characterized, with $\mathcal{O}(\epsilon^{-3})$ or $\mathcal{O}(\max\{\epsilon^{-4}, \gamma^{-7}\})$ stochastic evaluations needed (Arjevani et al., 2020, Mokhtari et al., 2018).
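As a sketch of the negative-curvature component under a purely stochastic Hessian-vector oracle (the noise model and the spectral-norm bound `shift` below are illustrative assumptions), an Oja-style streaming update can be run on the shifted operator:

```python
import numpy as np

def oja_negative_curvature(stochastic_hvp, dim, shift, eta=0.01, n_iter=5000, seed=1):
    """Oja-style streaming estimate of the most negative Hessian eigen-direction
    from noisy Hessian-vector products: run Oja's update on shift*I - H."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = v + eta * (shift * v - stochastic_hvp(v))   # Oja step on (shift*I - H)
        v /= np.linalg.norm(v)
    return v

# Noisy Hessian-vector oracle for a fixed indefinite Hessian (noise mimics mini-batching)
H = np.diag([1.0, 0.2, -0.8])
rng = np.random.default_rng(0)
noisy_hvp = lambda u: H @ u + 0.05 * rng.standard_normal(3)
v = oja_negative_curvature(noisy_hvp, dim=3, shift=2.0)
print(v @ H @ v)   # close to the smallest eigenvalue, -0.8
```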

Robust SOSP under Data Contamination

In high-dimensional, heavy-noise regimes (e.g., strong contamination or adversarially corrupted observations), robust estimation of gradients and Hessians is essential. The sample complexity for robustly finding an $(O(\sqrt{\epsilon}), O(\sqrt{\epsilon}))$-SOSP under an $\epsilon$-fraction of corrupted data is $\widetilde{O}(D^2/\epsilon)$, where $D$ is the ambient dimension. This quadratic dependence arises from the necessity of robustly estimating the Hessian (Li et al., 12 Mar 2024); dimension-independent guarantees cannot be attained using only gradient information.

Differential Privacy and SOSPs

Achieving convergence to an SOSP under $(\epsilon, \delta)$-differential privacy is nontrivial: privacy noise can obscure negative curvature, and naive privatized model selection procedures degrade solution quality, especially in distributed or high-dimensional settings. Modern algorithms incorporate the following (a generic gradient-perturbation sketch appears after the list):

  • Adaptive batch sizes and binary/tree mechanisms for noise addition (to limit cumulative error) (Liu et al., 10 Oct 2024).
  • Perturbed SGD (Gauss-PSGD) with careful monitoring of “model drift” to ascertain escape from saddle regions without extra private selection (Tao et al., 21 May 2025).
  • Adaptive DP-SPIDER gradient estimators balancing privacy noise and variance for robust convergence. Recent advances achieve an $\alpha$-SOSP bound matching that of FOSP, $\widetilde{O}(1/n^{1/3} + (\sqrt{d}/n\epsilon)^{1/2})$, removing the previously conjectured additional complexity for SOSP relative to FOSP (Liu et al., 10 Oct 2024, Tao et al., 21 May 2025).
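As a generic illustration of the shared substrate of these methods (not the specific cited algorithms, and with privacy accounting omitted), per-example clipping plus Gaussian noise on the averaged gradient follows the standard Gaussian-mechanism template; the injected noise also acts as the perturbation that helps leave saddle regions. All names and parameters below are illustrative:

```python
import numpy as np

def dp_perturbed_gd(per_example_grad, data, x0, clip=1.0, noise_mult=1.0,
                    eta=0.1, n_iter=100, seed=0):
    """Generic sketch of noisy (DP-style) gradient descent: clip per-example
    gradients, average, add Gaussian noise scaled to the clipped sensitivity."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n = len(data)
    for _ in range(n_iter):
        grads = np.stack([per_example_grad(x, z) for z in data])
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads = grads * np.minimum(1.0, clip / np.maximum(norms, 1e-12))  # clip
        g = grads.mean(axis=0)
        g += rng.normal(0.0, noise_mult * clip / n, size=g.shape)         # Gaussian noise
        x = x - eta * g   # the noise also perturbs the iterate away from saddles
    return x

# Toy usage: f(x) = mean_i 1/2 ||x - z_i||^2, per-example gradient x - z_i
data = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
print(dp_perturbed_gd(lambda x, z: x - z, data, x0=np.zeros(2)))   # roughly the noisy mean
```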

5. SOSP in Structured Problems and Applications

Reinforcement Learning

In nonconvex policy optimization, the SOSP concept is central for avoiding convergence to suboptimal saddles or poor local minima; only under the SOSP condition (with appropriately controlled gradient and maximal Hessian eigenvalue) can a point be certified to be nearly locally optimal. For risk-sensitive policy optimization (maximizing distortion riskmetrics or coherent risk measures), cubic-regularized policy Newton algorithms with efficient sample-based DRM Hessian estimators guarantee convergence to an $\epsilon$-SOSP in $\mathcal{O}(\epsilon^{-3.5})$ trajectories (Pachal et al., 10 Aug 2025). These methods utilize likelihood ratio–based gradient/Hessian estimation and variance reduction via Hessian-vector products (bypassing importance sampling), with sample complexity advances over earlier approaches (Yang et al., 2020, Khorasani et al., 2023).

Structured Pruning and Deep Networks

In deep model pruning, saliency methods such as SOSP-H (where SOSP abbreviates Second-Order Structured Pruning, a distinct use of the acronym) use a second-order Taylor expansion and efficiently aggregate cross-structure correlations via Hessian–vector products, yielding global sensitivity scores that enable scalable, global-structure pruning and reveal architectural bottlenecks (Nonnenmacher et al., 2021).

Bounded-Rank and Manifold-Constrained Optimization

Optimization over nonsmooth sets such as the variety of bounded-rank matrices introduces difficulties: standard stationarity measures may vanish along a sequence without attaining genuine stationary points. The trust-region approach on a smooth lift (e.g., the parameterization $X = LR^\top$ on a manifold) ensures that accumulation points are second-order critical for the lifted problem, which in turn implies stationarity for the original problem (Levin et al., 2021).
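A minimal sketch of the smooth lift follows, showing only the parameterization and its chain-rule gradient (not the full trust-region method of the cited work); the quadratic objective in the example is an illustrative assumption:

```python
import numpy as np

def lifted_value_and_grad(f, grad_f, L, R):
    """Value and gradient of the lifted objective g(L, R) = f(L R^T):
    dg/dL = grad_f(X) @ R and dg/dR = grad_f(X).T @ L, with X = L R^T."""
    X = L @ R.T
    G = grad_f(X)
    return f(X), G @ R, G.T @ L

# Example: f(X) = 1/2 ||X - A||_F^2, so grad_f(X) = X - A; rank bounded by r = 2
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
L, R = rng.standard_normal((5, 2)), rng.standard_normal((4, 2))
val, gL, gR = lifted_value_and_grad(
    lambda X: 0.5 * np.linalg.norm(X - A) ** 2,
    lambda X: X - A,
    L, R,
)
print(val, gL.shape, gR.shape)   # objective value and gradient shapes (5, 2), (4, 2)
```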

6. Implications and Open Problems

  • For unconstrained and (certain) constrained problems where second-order subproblems are efficiently solvable, achieving an SOSP is just as tractable as achieving an FOSP, with optimal rates, provided Hessian-vector products or negative curvature oracles are efficiently available.
  • For general constraint sets, the NP-hardness of SOSP verification necessitates structural assumptions (e.g., a fixed number of linear constraints, strict complementarity).
  • Robust SOSP attainment in adversarial contamination is fundamentally limited by the need for high-dimensional robust Hessian estimation, reflected in the quadratic sample complexity lower bound (Li et al., 12 Mar 2024).
  • In differential privacy, adaptive variance reduction and nontrivial privacy noise management allow achieving second-order convergence rates “for free” (compared to first order) in theory, provided additional model selection steps are avoided (Liu et al., 10 Oct 2024, Tao et al., 21 May 2025).
  • The design of further scalable, robust, and black-box second-order optimization methods that sidestep the computational intractability of second-order condition verification in the broadest settings remains a prominent area for future research.

7. Summary Table: Principal SOSP Rates by Setting

Class of Problem | Best-known Rate | Source
--- | --- | ---
Unconstrained smooth (deterministic) | $\mathcal{O}(\epsilon^{-3/2})$ | (He et al., 2022)
Stochastic unconstrained | $\mathcal{O}(\epsilon^{-3})$ | (Arjevani et al., 2020)
Linear/polyhedral constraints | $\widetilde{\mathcal{O}}(\epsilon_g^{-2} + \epsilon_H^{-3})$ | (Nouiehed et al., 2018, Nouiehed et al., 2019)
General conic/equality constraints | $\widetilde{\mathcal{O}}(\epsilon^{-7/2})$ (under GLICQ) | (He et al., 2022, He et al., 2023)
Robust (adversarial contamination) | $\widetilde{O}(D^2/\epsilon)$ samples | (Li et al., 12 Mar 2024)
Differential privacy (central/distributed) | $\widetilde{O}(1/n^{1/3} + (\sqrt{d}/n\epsilon)^{1/2})$ | (Liu et al., 10 Oct 2024, Tao et al., 21 May 2025)
Risk-sensitive policy optimization | $\mathcal{O}(\epsilon^{-3.5})$ | (Pachal et al., 10 Aug 2025)
Deep model pruning (structured) | Hessian–vector product cost; matches first-order | (Nonnenmacher et al., 2021)

These results delineate the known frontiers and complexities for second-order stationary point computation in modern nonconvex optimization.
