Nonconvex Finite-Sum Optimization

Updated 7 May 2026

Nonconvex finite-sum problems minimize the average of $L$-smooth, potentially nonconvex functions and are central to large-scale applications.
Methods for these problems combine variance reduction, stochastic approximation, and adaptive sampling to converge to first-order stationary points.
Recent methods such as SCSG, VRFW, and adaptive trust-region schemes offer tighter complexity bounds and improved empirical performance.

Nonconvex finite-sum problems consist of optimization objectives expressed as averages of individual nonconvex functions, a structure central to large-scale machine learning, signal processing, and statistics. These problems exhibit challenging landscape geometry, lack of global convexity, and pronounced sensitivity to algorithmic choices and variance reduction techniques. The archetypal formulation is

$\min_{x \in \mathbb{R}^d}\; f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)$

where each $f_i:\mathbb{R}^d \rightarrow \mathbb{R}$ is possibly nonconvex and $L$-smooth. The field has advanced through algorithmic frameworks that combine stochastic approximation, variance reduction, adaptivity, and parallelism, accompanied by sharp theoretical complexity bounds and demonstrated empirical impact.
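To make the formulation concrete, the following minimal sketch builds a toy finite sum and evaluates $f$ and $\nabla f$ as averages over components. The data, the helper names (`make_toy_problem`, `component_values`, `component_grads`), and the particular component $f_i(x) = \tfrac{1}{2}(a_i^\top x - b_i)^2 + \lambda \sum_j x_j^2/(1+x_j^2)$ are illustrative assumptions, not taken from any of the cited works.

```python
import numpy as np

def make_toy_problem(n=200, d=10, lam=0.1, seed=0):
    """Hypothetical data for f_i(x) = 0.5*(a_i^T x - b_i)^2 + lam * sum_j x_j^2 / (1 + x_j^2)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    b = rng.standard_normal(n)
    return A, b, lam

def component_values(x, A, b, lam):
    """Vector (f_1(x), ..., f_n(x)); the bounded penalty makes each f_i nonconvex."""
    residual = A @ x - b
    penalty = lam * np.sum(x**2 / (1.0 + x**2))
    return 0.5 * residual**2 + penalty

def component_grads(x, A, b, lam):
    """Matrix whose i-th row is grad f_i(x)."""
    residual = A @ x - b
    penalty_grad = lam * 2.0 * x / (1.0 + x**2) ** 2
    return residual[:, None] * A + penalty_grad[None, :]

def f(x, A, b, lam):
    """Finite-sum objective: the average of the components."""
    return float(component_values(x, A, b, lam).mean())

def grad_f(x, A, b, lam):
    """Full gradient: the average of the component gradients."""
    return component_grads(x, A, b, lam).mean(axis=0)

A, b, lam = make_toy_problem()
x = np.full(A.shape[1], 0.3)
print(f(x, A, b, lam), np.linalg.norm(grad_f(x, A, b, lam)))
```

Evaluating `grad_f` touches all $n$ components, which is the cost that the stochastic and variance-reduced methods discussed in this article try to avoid.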

1. Problem Structure and Complexity Measures

In nonconvex finite-sum optimization, the central theoretical benchmark is convergence to first-order stationary points, i.e., producing an $x$ such that $\mathbb{E} \|\nabla f(x)\|^2 \leq \epsilon$. The difficulty stems from aggregating component-wise nonconvexity, and the standard setting is:

Each $f_i$ is $L$-smooth, i.e., $\|\nabla f_i(x) - \nabla f_i(y)\| \leq L \|x - y\|$ for all $x, y$.
The variance of stochastic gradients is quantified by $H^* = \sup_{x} \frac{1}{n}\sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f(x)\|^2$.
Complexity is measured as the number of component gradient evaluations (IFO calls) required to achieve $\mathbb{E} \|\nabla f(x)\|^2 \leq \epsilon$.

Tighter complexity is achieved via variance reduction, mini-batching, adaptivity in sampling, and sophisticated momentum and proximal operations.
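The three quantities above can be sketched numerically as follows. This is an illustration under stated assumptions: the sigmoid-loss components, the data, and the names `counter`, `component_grads`, and `stationarity_and_variance` are hypothetical, and the variance is evaluated at a single point, so it is only a lower bound on the supremum $H^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
A = rng.standard_normal((n, d))          # hypothetical features
y = rng.choice([-1.0, 1.0], size=n)      # hypothetical labels

counter = {"ifo": 0}                     # component-gradient evaluations so far

def component_grads(x, idx):
    """Rows are grad f_i(x) for the sigmoid loss f_i(x) = 1 / (1 + exp(y_i * a_i^T x))."""
    counter["ifo"] += len(idx)
    t = np.tanh(0.5 * y[idx] * (A[idx] @ x))
    slope = -0.25 * (1.0 - t**2)         # derivative of 1 / (1 + e^z) with respect to z
    return (slope * y[idx])[:, None] * A[idx]

def stationarity_and_variance(x):
    """Return ||grad f(x)||^2 and the gradient variance at x (costs n IFO calls)."""
    G = component_grads(x, np.arange(n))
    g = G.mean(axis=0)
    stationarity = float(g @ g)
    variance = float(np.mean(np.sum((G - g) ** 2, axis=1)))
    return stationarity, variance

x0 = np.zeros(d)
print(stationarity_and_variance(x0), counter["ifo"])
```

Counting IFO calls inside the gradient routine, rather than counting iterations, is what makes methods with different batch sizes comparable.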

2. Core Algorithmic Paradigms

2.1 Stochastically Controlled Stochastic Gradient Methods

The SCSG method (Lei et al., 2017) interpolates between pure SGD and full-batch variance-reduced methods (SVRG/SAGA), leveraging mini-batch snapshots and an adaptive geometric number of inner stochastic steps to attain tight complexity:

At each epoch $j$, a reference mini-batch $I_j$ of size $B_j$ is used to compute the reference gradient at the current reference point $\tilde{x}_{j-1}$:

$g_j = \frac{1}{B_j} \sum_{i \in I_j} \nabla f_i(\tilde{x}_{j-1})$

A random, geometrically distributed number of SVRG-style corrected stochastic steps is applied before the next reference update.
This stochastically controlled epoch structure achieves the IFO complexity

$\tilde{O}\left(\min\{\epsilon^{-5/3},\; n^{2/3}\epsilon^{-1}\}\right)$

SCSG strictly outperforms standard SGD, whose complexity is $O(\epsilon^{-2})$, and matches or improves over SVRG/SAGA except in the extremely high-accuracy regime ($\epsilon \lesssim n^{-1}$, where the second term of the bound is the smaller one).
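The epoch structure described above can be sketched as follows. This is a simplified illustration, not the authors' reference implementation: the function name `scsg`, the fixed sizes `B` and `b`, the constant step `eta`, returning the last reference point, and the toy problem are all assumptions made here.

```python
import numpy as np

def scsg(grads, x0, n, epochs=30, B=64, b=4, eta=0.05, seed=0):
    """SCSG-style loop; grads(x, idx) returns rows grad f_i(x) for i in idx."""
    rng = np.random.default_rng(seed)
    x_ref = np.array(x0, dtype=float)
    for _ in range(epochs):
        # Reference gradient on a mini-batch instead of the full data set.
        batch = rng.choice(n, size=min(B, n), replace=False)
        g_ref = grads(x_ref, batch).mean(axis=0)
        # Geometric number of inner steps on {0, 1, 2, ...} with mean B / b.
        num_inner = rng.geometric(b / (B + b)) - 1
        x = x_ref.copy()
        for _ in range(num_inner):
            idx = rng.choice(n, size=b, replace=False)
            # SVRG-style correction around the reference point.
            v = (grads(x, idx) - grads(x_ref, idx)).mean(axis=0) + g_ref
            x = x - eta * v
        x_ref = x
    return x_ref

# Hypothetical toy finite sum: least squares plus a smooth nonconvex penalty.
rng = np.random.default_rng(1)
n, d, lam = 200, 5, 0.1
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def grads(x, idx):
    r = A[idx] @ x - y[idx]
    return r[:, None] * A[idx] + lam * 2.0 * x / (1.0 + x**2) ** 2

x0 = np.full(d, 3.0)
x_out = scsg(grads, x0, n)
full = np.arange(n)
print(np.linalg.norm(grads(x0, full).mean(axis=0)),
      np.linalg.norm(grads(x_out, full).mean(axis=0)))
```

NumPy's `Generator.geometric` counts trials up to and including the first success, so subtracting one shifts the support to start at zero and gives the mean `B / b` used here.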

2.2 Variance Reduction and Acceleration

State-of-the-art methods incorporate variance reduction with adaptive mechanisms for optimal performance:

Variance Reduced Frank–Wolfe (VRFW): Projection-free optimization with variance reduction attains the gradient complexity for the nonconvex Frank–Wolfe gap established in (Reddi et al., 2016).
Adaptive Extrapolated Proximal Gradient (AAPG/AAPG–SPIDER): Integrates Nesterov-type extrapolation, adaptive stepsizes, and SPIDER variance reduction to achieve Lipschitz-free, learning-rate-free optimal complexity for composite nonconvex problems (Yuan, 28 Feb 2025).
Random Reshuffling: norm-PRR (proximal random reshuffling) substantially improves sample complexity over classical Prox-SGD in the nonsmooth nonconvex setting and attains linear convergence under PL geometry and interpolation (Qiu et al., 2023).
Superlinear/quasi-Newton increments: SPIRAL and incremental quasi-Newton approaches combine second-order information and incremental updates for rapid local acceleration (Behmandpoor et al., 2022, Yalcin et al., 2022).
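As an illustration of the projection-free idea in the first item, here is a minimal sketch of a Frank–Wolfe loop driven by an SVRG-style gradient estimate over an $\ell_1$ ball. It is not the algorithm of (Reddi et al., 2016) as published: the constant step `gamma`, the epoch lengths, the constraint set, the toy data, and all function names are assumptions.

```python
import numpy as np

def lmo_l1(g, radius):
    """Linear minimization oracle: argmin of <g, s> over the l1 ball of given radius."""
    s = np.zeros_like(g)
    j = int(np.argmax(np.abs(g)))
    s[j] = -radius * np.sign(g[j])
    return s

def fw_gap(x, g, radius):
    """Frank-Wolfe gap max_s <g, x - s> over the l1 ball; nonnegative for feasible x."""
    return float(g @ x + radius * np.max(np.abs(g)))

def svrg_frank_wolfe(grads, x0, n, radius, epochs=20, inner=20, b=8, gamma=0.05, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    all_idx = np.arange(n)
    for _ in range(epochs):
        x_snap = x.copy()
        g_snap = grads(x_snap, all_idx).mean(axis=0)   # full gradient at the snapshot
        for _ in range(inner):
            idx = rng.choice(n, size=b, replace=False)
            v = (grads(x, idx) - grads(x_snap, idx)).mean(axis=0) + g_snap
            s = lmo_l1(v, radius)
            x = x + gamma * (s - x)                    # convex combination stays feasible
    return x

# Hypothetical toy problem whose unconstrained minimizer lies outside the ball.
rng = np.random.default_rng(2)
n, d, lam, radius = 200, 5, 0.1, 1.0
A = rng.standard_normal((n, d))
x_true = np.array([2.0, 0.0, 0.0, 0.0, 0.0])
y = A @ x_true + 0.1 * rng.standard_normal(n)

def grads(x, idx):
    r = A[idx] @ x - y[idx]
    return r[:, None] * A[idx] + lam * 2.0 * x / (1.0 + x**2) ** 2

x0 = np.zeros(d)
x_out = svrg_frank_wolfe(grads, x0, n, radius)
full = np.arange(n)
gap0 = fw_gap(x0, grads(x0, full).mean(axis=0), radius)
gap1 = fw_gap(x_out, grads(x_out, full).mean(axis=0), radius)
print(gap0, gap1)
```

No projection is ever computed: feasibility follows from starting inside the ball and only taking convex combinations with vertices returned by the oracle.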

2.3 Proximal, Primal–Dual, and Block-Coordinate Approaches

Structured nonconvex finite-sum problems frequently require handling nonsmooth regularizers and constraints:

Primal–dual preconditioning enables efficient solution of composite objectives with linear operators, even when the composite (regularization) term is nonsmooth and nonconvex (Guo et al., 2023).
Block-coordinate and incremental aggregated proximal methods (including Bregman variants) achieve global convergence and, under the KL property, linear rates, without requiring separability or convexity of the nonsmooth term (Latafat et al., 2019, Latafat et al., 2021).
Stochastic ADMM with variance reduction and acceleration provides sublinear, and under KL geometry, linear convergence for nonconvex, nonsmooth, linearly constrained finite sums (Zeng et al., 2023).
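To show what handling a nonsmooth, nonconvex regularizer looks like in the simplest case, the sketch below runs plain mini-batch proximal gradient steps with an $\ell_0$ penalty, whose proximal map is hard thresholding. This is a generic baseline rather than any of the primal–dual, aggregated, or ADMM schemes cited above, and the names `prox_l0` and `prox_sgd`, the step size, and the data are assumptions.

```python
import numpy as np

def prox_l0(v, step, lam):
    """Proximal map of step * lam * ||x||_0: hard thresholding at sqrt(2 * step * lam)."""
    return np.where(np.abs(v) > np.sqrt(2.0 * step * lam), v, 0.0)

def prox_sgd(grads, x0, n, lam, step=0.02, iters=500, b=16, seed=0):
    """Mini-batch gradient step on the smooth part, then the prox of the l0 penalty."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        idx = rng.choice(n, size=b, replace=False)
        g = grads(x, idx).mean(axis=0)
        x = prox_l0(x - step * g, step, lam)
    return x

# Hypothetical sparse-recovery instance with two active coordinates.
rng = np.random.default_rng(3)
n, d = 300, 10
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[:2] = [3.0, -2.0]
y = A @ x_true + 0.1 * rng.standard_normal(n)

def grads(x, idx):
    """Rows are gradients of the smooth components 0.5 * (a_i^T x - y_i)^2."""
    return (A[idx] @ x - y[idx])[:, None] * A[idx]

x_hat = prox_sgd(grads, np.zeros(d), n, lam=0.01)
print(np.round(x_hat, 3))
```

The threshold follows from comparing, coordinate by coordinate, the cost $\lambda$ of keeping an entry $v_j$ against the cost $v_j^2 / (2\gamma)$ of zeroing it, where $\gamma$ is the step.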

2.4 Trust-Region and Adaptive Sampling

Adaptive trust-region strategies can robustly handle the nonconvex landscape and data heterogeneity:

Adaptive Sample Size Trust-Region (ASTR) methods guarantee finite transition to a full-batch regime and inherit global convergence properties from classical trust-region theory, with empirical advantages in neural network training (Mohr et al., 2019).
Additional sampling penalty methods for nonlinear equality constraints (ASPEN) adaptively tune batch size, avoiding costly projections while guaranteeing almost-sure convergence (Krejić et al., 4 Aug 2025).

2.5 Large-Scale, Parallel, and Zeroth-Order Regimes

Modern applications impose system-level constraints and operator limitations:

Freya PAGE: Achieves optimal time complexity for large-scale, nonconvex finite sums under asynchronous, heterogeneous compute resources; it ignores stragglers and remains robust to arbitrary delays (Tyurin et al., 2024).
Zeroth-order Stochastic Frank–Wolfe: Employs double variance reduction to achieve the best known query complexity in high-dimensional constrained nonconvex finite-sum problems (Ye et al., 13 Jan 2025).
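In the zeroth-order regime only function values are available, so gradients must be estimated from queries. The sketch below uses a standard coordinate-wise central-difference estimator on a mini-batch; whether the cited work uses this exact estimator is not claimed here, and the names `zo_gradient`, `fvals`, and the smoothing parameter `mu` are assumptions.

```python
import numpy as np

def zo_gradient(fvals, x, idx, mu=1e-5):
    """Estimate the mini-batch gradient from function values only.

    fvals(x, idx) returns the array of f_i(x) for i in idx.
    Uses 2 * d * len(idx) function queries.
    """
    d = x.size
    g = np.zeros(d)
    for j in range(d):
        e = np.zeros(d)
        e[j] = mu
        g[j] = np.mean(fvals(x + e, idx) - fvals(x - e, idx)) / (2.0 * mu)
    return g

# Hypothetical smooth nonconvex components f_i(x) = sin(a_i^T x).
rng = np.random.default_rng(4)
n, d = 50, 6
A = rng.standard_normal((n, d))

def fvals(x, idx):
    return np.sin(A[idx] @ x)

def true_grad(x, idx):
    """Analytic mini-batch gradient, used only to check the estimator."""
    return (np.cos(A[idx] @ x)[:, None] * A[idx]).mean(axis=0)

x = rng.standard_normal(d)
idx = rng.choice(n, size=10, replace=False)
print(np.max(np.abs(zo_gradient(fvals, x, idx) - true_grad(x, idx))))
```

The query cost grows linearly with the dimension, which is why query complexity rather than IFO complexity is the natural measure in this setting.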

3. Complexity Bounds and Theoretical Guarantees

A spectrum of tight complexity results has been established under various regularity and geometric assumptions:

Method	Complexity for $\epsilon$-stationarity	Assumptions/Regime	Reference
SCSG	$\tilde{O}(\min\{\epsilon^{-5/3}, n^{2/3}\epsilon^{-1}\})$	Smooth, nonconvex	(Lei et al., 2017)
VRFW (SAGA/SVRG)	see reference	Smooth, nonconvex, compact constraint	(Reddi et al., 2016)
AAPG–SPIDER	see reference	Composite, adaptive, Lipschitz-free	(Yuan, 28 Feb 2025)
norm-PRR (random reshuffling)	see reference	Nonsmooth, nonconvex	(Qiu et al., 2023)
RapGrad	see reference	Prox-point, negative curvature	(Lan et al., 2018)
Geom-SPIDER-EM	see reference	VR stochastic EM, latent variable	(Fort et al., 2020)
Freya PAGE (asynchronous time)	see reference	Distributed, heterogeneous	(Tyurin et al., 2024)

Under the Polyak–Łojasiewicz (PL) condition, SCSG and related methods enjoy geometric (linear) convergence in objective value (Lei et al., 2017).
Under the KL property, block-coordinate/aggregated methods guarantee either sublinear or global linear convergence, with explicit rates determined by the KL exponent (Latafat et al., 2019, Latafat et al., 2021).
For nonsmooth and composite objectives, the regularizer is handled efficiently through its proximal operator without sacrificing rate optimality (Yuan, 28 Feb 2025).
Random reshuffling methods for nonsmooth nonconvex sums match or surpass batch reference methods in nonasymptotic and last-iterate rates under PL and KL geometries (Qiu et al., 2023).
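The effect of the PL condition can be seen on a one-dimensional example. The sketch below runs plain gradient descent, not SCSG or a reshuffling method, on $f(x) = x^2 + 3\sin^2(x)$, a nonconvex function with minimum value $0$ that is commonly quoted as satisfying the PL inequality $\tfrac{1}{2}|f'(x)|^2 \geq \mu f(x)$ with $\mu = 1/32$. That constant is treated as an assumption here. With step $1/L$, the descent lemma and PL together give $f(x_{k+1}) \leq (1 - \mu/L)\, f(x_k)$.

```python
import numpy as np

def f(x):
    return x**2 + 3.0 * np.sin(x) ** 2

def df(x):
    return 2.0 * x + 3.0 * np.sin(2.0 * x)

L = 8.0         # smoothness constant: |f''(x)| = |2 + 6 cos(2x)| <= 8
mu = 1.0 / 32   # assumed PL constant for this function

x = 3.0
values = [f(x)]
for _ in range(60):
    x = x - df(x) / L      # gradient descent with step 1 / L
    values.append(f(x))

# Geometric envelope implied by the PL condition (the optimal value is 0).
bound = [(1.0 - mu / L) ** k * values[0] for k in range(len(values))]
print(values[:5], values[-1])
```

The guaranteed rate $1 - \mu/L$ is a worst-case envelope; on this example the observed decrease is much faster, especially near the minimizer.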

4. Extensions: Constraints, Nonsmoothness, and Networked Optimization

Many application settings feature additional structure:

Nonlinear equality/affine constraints: Quadratic penalty with adaptive and additional sampling provides projection-free optimization with almost sure convergence to KKT points (Krejić et al., 4 Aug 2025).
Nonsmooth summands and regularizers: Block-coordinate, incremental, and Bregman-based approaches allow handling non-Lipschitz gradients and general nonseparable nonsmooth terms, including for the popular Finito/MISO and incremental quasi-Newton frameworks (Latafat et al., 2019, Latafat et al., 2021, Yalcin et al., 2022).
Decentralized and federated settings: Decentralized stochastic minimax algorithms with variance reduction achieve the first linear convergence rates for finite-sum nonconvex–nonconcave minimax problems under networked communication and PL-type geometry (Zhang et al., 2023).
Zeroth-order oracles: Double variance reduction enables projection-free constrained nonconvex finite-sum minimization when gradients are unavailable, attaining state-of-the-art query complexity (Ye et al., 13 Jan 2025).

5. Empirical and Practical Aspects

Extensive empirical validation substantiates the theoretical developments:

SCSG and its variants consistently achieve faster reduction in training and validation loss on deep neural networks compared to standard SGD, especially with adaptive mini-batch schedules (Lei et al., 2017).
ASPEN and ASTR demonstrate cost savings and rapid initial descent by automatically tuning the sample size, particularly effective for ill-conditioned or heterogeneous data (Mohr et al., 2019, Krejić et al., 4 Aug 2025).
Incremental quasi-Newton and superlinear-type algorithms outperform state-of-the-art bundle and stochastic methods on nonsmooth, nonconvex classification tasks, while remaining computationally efficient under the problem-size conditions stated in (Yalcin et al., 2022).
Adaptive extrapolation and variance reduction strategies in AAPG–SPIDER obviate the need for manual stepsize tuning and match or improve performance over baseline VR methods on a range of sparse recovery and eigenvalue problems (Yuan, 28 Feb 2025).
Parallel and distributed methods like Freya PAGE deliver optimal time-to-stationarity in highly heterogeneous or straggler-prone hardware environments (Tyurin et al., 2024).

6. Directions and Open Challenges

Recent progress raises new directions in nonconvex finite-sum optimization:

Extending convergence and complexity guarantees to broader classes, such as non-Lipschitz smoothness (via relative smoothness) or nonconvex, nonsmooth regularizers.
Developing a full theory for momentum with overlapping mini-batches (mini-batch persistency) in stochastic line-search frameworks; current schemes show empirical gains but lack general convergence proofs (Lapucci et al., 2024).
Distributed and federated optimization under communication and privacy constraints, particularly for nonconvex–nonconcave or source-heterogeneous regimes.
Incorporation of adaptive, straggler-robust, and importance-sampling variants into large-scale practical implementations.
Exploiting higher-order or quasi-Newton techniques with variance reduction for scalable acceleration, and establishing global superlinear convergence.

The field continues to innovate at the interface of optimization theory, algorithmic design, and large-scale, real-world data science, with a persistent emphasis on rigorously quantifying performance under nonconvexity, data heterogeneity, and system complexity.