Stochastic Alternating Minimization
- Stochastic alternating minimization is a method that decomposes optimization problems into variable blocks or objectives and updates them using stochastic gradient steps.
- The approach employs alternating block-wise updates with projections, handling nonconvex, nonsmooth, and large-scale settings efficiently.
- Empirical studies demonstrate its competitive convergence rates and scalability in diverse applications like sparse phase retrieval and neural network training.
A stochastic alternating minimization algorithm is a class of iterative methods for solving composite or structured optimization problems by alternately optimizing different blocks, components, or objectives, employing stochastic (mini-batch or random-sampled) steps in place of exact, full gradients. This paradigm is distinctive for decoupling optimization variables or objectives, leveraging stochastic oracle access for scalability, and supporting a wide range of problem structures including multi-objective, nonconvex, nonsmooth, and large-scale settings.
1. Core Principle and Formal Definition
Stochastic alternating minimization (SAM) algorithms decompose an optimization problem, typically either through variable blocks or multi-objective scalarization, and update each block or objective sequentially using stochastic approximations. The general problem is expressed as:
- Multi-objective: minimize (f₁(x), …, f_m(x)), subject to x ∈ 𝒳;
- Composite objective: minimize F(x) = f(x) + g(x), with f smooth (an expectation or finite sum) and g possibly nonsmooth.
At each iteration, a predetermined or randomized number of stochastic gradient (or subgradient) steps are performed on each component, alternating between the different parts and possibly projecting onto constraints. For stochastic bi-objective problems, a typical iteration with effort parameters n₁, n₂ is as follows (Liu et al., 2022):
- Apply n₁ stochastic gradient steps for f₁;
- Apply n₂ stochastic gradient steps for f₂;
- Project the result onto 𝒳.
Let λ = n₁/(n₁ + n₂). Then, minimizing the scalarization λ f₁ + (1 − λ) f₂ characterizes a Pareto-optimal solution in convex cases.
2. Algorithmic Variants and Structure
Stochastic alternating minimization encompasses several notable algorithmic forms:
- Block-coordinate stochastic alternating minimization:
Updates variables alternately, using stochastic gradients. Each block is updated while fixing the others, employing either one or several stochastic steps per block (Yan et al., 6 Aug 2025, Guo et al., 2023).
- Multi-objective stochastic alternating minimization:
Each objective function is targeted by a separate sequence of stochastic updates, with the total “effort” per iteration partitioned according to the desired scalarization weight (Liu et al., 2022).
- Stochastic alternating direction method of multipliers (ADMM) and extensions:
Combines alternating minimization with penalty-augmented Lagrangians, solving subproblems (possibly linearized) with stochastic gradient information (Bian et al., 2020, Zhong et al., 2013, Zhao et al., 2013).
- Inertial and Bregman-proximal forms:
Enhances SAM by incorporating inertial (momentum) terms and general Bregman geometry, often with variance-reduced stochastic gradient estimators such as SAGA or SARAH (Guo et al., 2023).
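As a concrete illustration of the block-coordinate form, the following minimal sketch alternates mini-batch stochastic gradient steps over two coordinate blocks of a toy least-squares problem; the problem instance, step size, and batch size are illustrative assumptions, not taken from the cited papers:

```python
import numpy as np

# Toy problem: min_{u,v} (1/m) ||A_u u + A_v v - b||^2, with the variable
# split into two blocks u and v that are updated alternately.
rng = np.random.default_rng(0)
m, d = 200, 10
A = rng.standard_normal((m, d))
x_true = rng.standard_normal(d)
b = A @ x_true                       # consistent system, so the optimum is exact

Au, Av = A[:, :5], A[:, 5:]
u, v = np.zeros(5), np.zeros(5)
alpha, batch = 0.01, 20

for t in range(500):
    i = rng.integers(0, m, batch)              # mini-batch for the u-update
    r = Au[i] @ u + Av[i] @ v - b[i]
    u -= alpha * (2.0 / batch) * Au[i].T @ r   # stochastic step on block u, v fixed
    i = rng.integers(0, m, batch)              # fresh mini-batch for the v-update
    r = Au[i] @ u + Av[i] @ v - b[i]
    v -= alpha * (2.0 / batch) * Av[i].T @ r   # stochastic step on block v, u fixed

loss = float(np.mean((Au @ u + Av @ v - b) ** 2))
print(loss)
```

Each block sees only a cheap mini-batch gradient while the other block is frozen, which is the basic pattern the multi-block and ADMM variants below elaborate.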
A representative pseudocode for two-objective settings, with λ = n₁/(n₁ + n₂), is:

For t = 0, 1, …, T−1 do
    Set y ← xₜ
    For r = 1 to n₁ do: y ← y − αₜ g¹(y, ξ)
    For r = 1 to n₂ do: y ← y − αₜ g²(y, ξ)
    x_{t+1} ← Proj_{𝒳}(y)
End for
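Under purely illustrative assumptions (two 1-D quadratic objectives f₁(x) = x² and f₂(x) = (x − 1)², additive Gaussian gradient noise, a constant step size, and a box constraint; none of these choices come from the cited papers), the loop above can be run directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 1, 1              # effort parameters; lambda = n1 / (n1 + n2) = 1/2
alpha = 0.01               # constant step size, chosen for simplicity
lo, hi = -2.0, 2.0         # feasible box X = [lo, hi]

x = 2.0
for t in range(2000):
    y = x
    for _ in range(n1):    # stochastic gradient steps on f1(x) = x^2
        y -= alpha * (2.0 * y + 0.1 * rng.standard_normal())
    for _ in range(n2):    # stochastic gradient steps on f2(x) = (x - 1)^2
        y -= alpha * (2.0 * (y - 1.0) + 0.1 * rng.standard_normal())
    x = min(max(y, lo), hi)  # projection onto X

# With lambda = 1/2, the scalarization (f1 + f2)/2 is minimized at x = 1/2,
# so the iterate should settle near 0.5 (up to an O(alpha) bias and noise).
print(x)
```

With equal efforts n₁ = n₂, the iterates hover near the equal-weight Pareto point x = 1/2; changing the effort split moves the limit along the front.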
3. Theoretical Analysis and Convergence Rates
Convergence of stochastic alternating minimization algorithms is governed by assumptions on convexity, smoothness, and the variance properties of the gradient or subgradient oracles. The central theoretical results include:
- Strongly convex, smooth case:
Using appropriate diminishing step sizes (e.g., αₜ = Θ(1/(μt)) for strong-convexity constant μ), SA2GD achieves a sublinear convergence rate of O(1/T) in expected squared distance to x*_λ, the minimizer of the weighted sum λ f₁ + (1 − λ) f₂ (Liu et al., 2022).
- Convex or nonsmooth cases:
The convergence rate weakens to O(1/√T) if the strong convexity assumption is dropped, with analogous rates in the nonsmooth but strongly convex regime (Liu et al., 2022).
- Variance-reduced adaptation and nonconvex settings:
With variance-reduced gradient estimators (SAGA, SARAH), algorithms can obtain stronger convergence guarantees (finite-length or linear rate in the KL setting) when the objective satisfies the Kurdyka-Łojasiewicz (KL) property (Bian et al., 2020, Guo et al., 2023). Convergence to critical points is established even for nonsmooth, nonconvex objectives, under a KL property and suitable parameter selection.
- Tuning and step-size adaptation:
- Diminishing steps (e.g., αₜ ∝ 1/t or 1/√t) balance bias against variance and set the rate.
- Adaptive or meta-learned step-size strategies, as in neural SAMT, can enhance efficiency and robustness (Yan et al., 6 Aug 2025).
These rates are robust to mini-batch stochasticity and, in many cases, are provably unimpaired even if the updates per objective are interleaved or their order is randomized each iteration (Liu et al., 2022).
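The role of a diminishing schedule can be seen on a 1-D strongly convex instance; the sketch below uses f(x) = (x − 1)² (so μ = 2), unit-variance gradient noise, and αₜ = 1/(μ(t + 1)), all of which are illustrative choices rather than the paper's setup. The averaged squared error then decays roughly like O(1/T):

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 2.0

def run(T):
    """One SGD run with the diminishing schedule alpha_t = 1/(mu * (t+1))."""
    x = 5.0
    for t in range(T):
        g = 2.0 * (x - 1.0) + rng.standard_normal()   # noisy gradient oracle
        x -= g / (mu * (t + 1))
    return (x - 1.0) ** 2

# Average the terminal squared error over repeats for two horizons.
errs = {T: float(np.mean([run(T) for _ in range(200)])) for T in (100, 1000)}
print(errs[100], errs[1000])
```

Increasing the horizon tenfold shrinks the averaged error by roughly the same factor, consistent with the O(1/T) strongly convex rate quoted above.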
4. Applications and Practical Implementation
Stochastic alternating minimization algorithms have been empirically and theoretically validated in a diverse set of applications:
- Bi-objective and multi-objective optimization:
Direct approximation of the convex Pareto front by running the scheme for various effort allocations (n₁, n₂), collecting the limit point x*_λ for each combination (Liu et al., 2022).
- Sparse phase retrieval:
A two-stage stochastic alternating procedure with spectral initialization and randomized alternating minimization achieves exact sparse recovery with low iteration complexity and near-optimal sample efficiency (Cai et al., 2021).
- Large-scale composite problems (e.g., constrained empirical risk minimization, compressed sensing):
Efficient stochastic ADMM variants alternate between primal and dual variables, leveraging per-sample or per-batch surrogate gradients, with sublinear convergence guarantees in convex cases (Bian et al., 2020, Zhong et al., 2013, Zhao et al., 2013).
- Tensor recovery under Tucker-structured constraints:
Alternating stochastic minimization over core tensors and factor matrices with per-batch updates, yielding significant wall-clock speed-up over tensor IHT baselines (Li, 20 Jan 2026).
- Nonconvex and nonsmooth problems (matrix factorization, image deblurring):
Two-step inertial Bregman-proximal stochastic alternating minimization with variance-reduced gradients converges to critical points and achieves superior empirical performance in NMF and blind deblurring (Guo et al., 2023).
- Neural network optimization:
Layer-wise stochastic alternating minimization with block-wise meta-learned step sizes (SAMT) improves training stability and generalization in deep networks (Yan et al., 6 Aug 2025).
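Among the applications above, the Pareto-front approximation is simple to illustrate end to end. The sketch below sweeps effort allocations (n₁, n₂) on the toy pair f₁(x) = x², f₂(x) = (x − 1)² (for which the Pareto point at weight λ = n₁/(n₁ + n₂) is x*_λ = 1 − λ); the instance and all constants are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sa2gd(n1, n2, T=3000, alpha=0.005):
    """Alternating stochastic scheme for f1(x)=x^2 and f2(x)=(x-1)^2."""
    x = 0.0
    for t in range(T):
        y = x
        for _ in range(n1):                       # n1 noisy steps on f1
            y -= alpha * (2.0 * y + 0.05 * rng.standard_normal())
        for _ in range(n2):                       # n2 noisy steps on f2
            y -= alpha * (2.0 * (y - 1.0) + 0.05 * rng.standard_normal())
        x = min(max(y, -2.0), 2.0)                # projection onto [-2, 2]
    return x

# Sweep the effort split to trace three points of the Pareto front.
front = {(n1, n2): sa2gd(n1, n2) for (n1, n2) in [(3, 1), (1, 1), (1, 3)]}
print(front)
```

Shifting effort toward f₁ pulls the limit point toward its minimizer (here 0), so the three runs land at increasing positions along the front, with the balanced split near x = 1/2.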
Implementation considerations include:
- Inner-loop effort (n₁, n₂): balances computation per iteration against the fineness of the Pareto approximation (Liu et al., 2022).
- Mini-batch sizing: larger batches reduce variance at the cost of per-iteration work (Zhou et al., 2020).
- Use of efficient projection, proximal, or retraction operations as dictated by constraints and geometry (Guo et al., 2023, Li, 20 Jan 2026).
- Variance-reduced sampling (SAGA/SARAH): reduces stochastic error accumulation and improves empirical and theoretical rates (Bian et al., 2020, Guo et al., 2023).
- Initialization: crucial in nonconvex regimes (e.g., phase retrieval), where global convergence may only occur if initialized within a local basin (Cai et al., 2021, Okajima et al., 2024).
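The SAGA-style estimator mentioned above is easy to sketch in isolation: it replaces a fresh stochastic gradient with (new per-sample gradient − stored per-sample gradient + table average), which stays unbiased while its variance shrinks as iterates stabilize. The finite-sum instance and step size below are synthetic assumptions:

```python
import numpy as np

# Finite sum f(x) = (1/m) sum_i (a_i^T x - b_i)^2 on a consistent system.
rng = np.random.default_rng(0)
m, d = 50, 5
A = rng.standard_normal((m, d))
x_true = rng.standard_normal(d)
b = A @ x_true

x = np.zeros(d)
table = np.zeros((m, d))          # stored per-sample gradients
avg = table.mean(axis=0)          # running average of the table
alpha = 0.01                      # roughly 1/(3 L_max) for this instance

for t in range(3000):
    i = rng.integers(m)
    g_new = 2.0 * A[i] * (A[i] @ x - b[i])   # fresh gradient of sample i
    est = g_new - table[i] + avg             # SAGA estimator (unbiased)
    avg += (g_new - table[i]) / m            # keep the average consistent
    table[i] = g_new                         # refresh the stored gradient
    x -= alpha * est

err = float(np.linalg.norm(x - x_true))
print(err)
```

Because the correction term cancels stale gradient information in expectation, a constant step size suffices for convergence to the exact solution here, unlike plain SGD.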
5. Extensions: Multi-Objective, Nonconvexity, and Beyond
The stochastic alternating minimization framework extends naturally beyond two objectives and can be generalized as follows:
- General multi-objective problems (m > 2):
Assign blockwise efforts n₁, …, n_m (with n₁ + ⋯ + n_m = n) and minimize the weighted scalarization Σᵢ (nᵢ/n) fᵢ, recovering the corresponding Pareto points (Liu et al., 2022).
- Nonconvex and nonsmooth landscapes:
With appropriate inertial and Bregman geometry, as well as variance-reduced stochastic approximations, global convergence to critical points is established under KL-type desingularizing conditions, even in large-scale, nonconvex, and composite objective scenarios (Guo et al., 2023, Bian et al., 2020).
- Minimax and adversarial contexts:
Alternating (proximal-)gradient steps in stochastic minimax problems, including nonconvex-concave cases, have recently been analyzed for global convergence rates to first-order stationarity of the Moreau envelope, with explicit step-size schedules and complexity bounds (Boţ et al., 2020).
- Statistical-physics-inspired analysis:
The population dynamics of stochastic alternating minimization in high-dimensional settings (e.g., bilinear regression) admit an explicit closed-form via replica methods. The algorithm’s evolution is governed by a two-dimensional discrete stochastic process with memory kernels, revealing phase transitions in attainability and initialization sensitivity (Okajima et al., 2024).
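The multi-objective extension above is straightforward to exercise on a toy instance: with m = 3 one-dimensional quadratics fᵢ(x) = (x − cᵢ)², the minimizer of the scalarization Σᵢ (nᵢ/n) fᵢ is the effort-weighted mean of the cᵢ. All constants below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
c = [0.0, 1.0, 4.0]          # minimizers of the three quadratics
efforts = [2, 1, 1]          # n_i, so weights n_i/n = (0.5, 0.25, 0.25)
alpha = 0.005

x = 0.0
for t in range(4000):
    y = x
    for ci, ni in zip(c, efforts):
        for _ in range(ni):  # n_i noisy gradient steps on f_i(x) = (x - c_i)^2
            y -= alpha * (2.0 * (y - ci) + 0.05 * rng.standard_normal())
    x = y

# Weighted mean: 0.5*0 + 0.25*1 + 0.25*4 = 1.25; the iterate should land
# nearby, up to an O(alpha) bias from the fixed update order and noise.
target = sum((n / sum(efforts)) * ci for n, ci in zip(efforts, c))
print(x, target)
```

Note the small systematic offset from the exact weighted mean: with a fixed update order the composition of the m contraction maps has a fixed point that differs from the scalarization minimizer by O(α), which vanishes as the step size shrinks or the order is randomized.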
6. Empirical Performance and Comparative Insights
Empirical work across several domains confirms the algorithmic competitiveness of stochastic alternating minimization approaches:
- Efficiency:
In multi-objective and composite optimization, alternating stochastic approaches regularly match or beat batch methods in wall-clock time, often through variance-reduction or adaptive step-size learning (Bian et al., 2020, Yan et al., 6 Aug 2025, Li, 20 Jan 2026).
- Scalability:
Block-wise, stochastic, and adaptive-proximal extensions admit application to million-sample or high-dimensional problems, unlike naive batch or full-gradient methods (Zhong et al., 2013, Zhao et al., 2013).
- Solution Quality:
Methods such as SAMT for neural networks deliver enhanced generalization with fewer parameter updates, and in signal processing the sparse phase retrieval SAM method achieves exact sparse estimation with fewer measurements and less computation than previous algorithms (Yan et al., 6 Aug 2025, Cai et al., 2021).
- Robustness and Flexibility:
The alternating paradigm enables natural integration of problem-specific structure (e.g., tensor factorization, group-structured constraints) and supports a wide range of geometries via Bregman distances, retractions, and blockwise adaptation (Li, 20 Jan 2026, Guo et al., 2023).
7. Limitations, Open Problems, and Directions
Despite their effectiveness, stochastic alternating minimization algorithms face several ongoing challenges:
- Global convergence in nonconvex settings:
Attainability of globally optimal or Pareto solutions often requires favorable initialization; phase transitions and local minima are observed in high-dimensional or low-sample regimes (Okajima et al., 2024, Cai et al., 2021).
- Proofs of sharp convergence rates (beyond sublinear):
While O(1/T) (strongly convex) and O(1/√T) (convex) rates are established in certain cases, proving linear or superlinear rates (and understanding tight lower bounds) in nonconvex, variance-reduced, or adaptively accelerated variants remains a target (Bian et al., 2020, Guo et al., 2023).
- Hyperparameter selection:
Systematic approaches to step-size, effort allocation , block partition, and batch size remain largely empirical. Meta-learning-based step-size adaptation shows promise (Yan et al., 6 Aug 2025).
- Extension to non-Euclidean and manifold-structured variables:
Stochastic Riemannian gradient and retraction-based alternating schemes are in early development for tensor, matrix, or manifold-valued blocks (Li, 20 Jan 2026).
A plausible implication is that stochastic alternating minimization, due to its modular structure and compatibility with stochastic, block-wise, and manifold-geometric methods, will remain central in scalable nonconvex optimization, multi-objective estimation, and large-scale machine learning.
Principal references: (Liu et al., 2022, Bian et al., 2020, Zhou et al., 2020, Cai et al., 2021, Zhong et al., 2013, Fan et al., 2021, Yan et al., 6 Aug 2025, Zhao et al., 2013, Boţ et al., 2020, Okajima et al., 2024, Li, 20 Jan 2026, Guo et al., 2023).