Majorization-Minimization Algorithms
- Majorization-Minimization (MM) algorithms are iterative optimization methods that replace a complex objective with simpler surrogate functions, guaranteeing descent at every iteration.
- They construct surrogates that majorize the original objective, enabling efficient handling of nonconvex, nonsmooth, and large-scale problems.
- Variants such as higher-order, stochastic, and variance-reduced MM offer convergence rates from sublinear to superlinear, broadening their applicability in modern optimization.
Majorization-Minimization (MM) algorithms are a broad family of iterative optimization schemes that replace a difficult objective function with a succession of easier surrogates—majorizers—that tightly bound the objective at the current iterate. By design, each MM step monotonically reduces the objective, and the method is applicable to a wide range of nonconvex, nonsmooth, constrained, or large-scale settings. MM algorithms are foundational to modern optimization for statistics, signal processing, and machine learning, with theoretical guarantees and rich algorithmic variants.
1. Core Principles and Formal Definition
The classical MM framework seeks to minimize a function $f$ by generating a sequence $\{x_k\}$ such that at each iteration:
- A majorizing surrogate $g(\cdot \mid x_k)$ is constructed satisfying:
  - $g(x \mid x_k) \ge f(x)$ for all $x$ (global upper bound)
  - $g(x_k \mid x_k) = f(x_k)$ (tangency at current iterate)
- The next iterate is obtained by $x_{k+1} \in \arg\min_x g(x \mid x_k)$.
- The descent property is immediate: $f(x_{k+1}) \le g(x_{k+1} \mid x_k) \le g(x_k \mid x_k) = f(x_k)$.
Such a setup ensures monotonic decrease of $f(x_k)$ over iterations. The classical MM can be interpreted as a meta-algorithm—turning a hard minimization into a series of simpler subproblems, each easier due to the structure of $g$.
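As a concrete illustration of this meta-algorithm, the following minimal Python sketch runs MM with a quadratic majorizer for an $L$-smooth objective; the function names and the least-squares example are illustrative assumptions, not a reference implementation from the cited literature.

```python
import numpy as np

def mm_minimize(f, grad_f, L, x0, n_iters=100):
    """Generic MM loop with a quadratic majorizer for an L-smooth objective.

    Surrogate at x_k:  g(x | x_k) = f(x_k) + grad_f(x_k)^T (x - x_k) + (L/2)||x - x_k||^2,
    whose exact minimizer is the gradient step x_k - grad_f(x_k) / L.
    """
    x = np.asarray(x0, dtype=float)
    history = [f(x)]
    for _ in range(n_iters):
        # Minimize the surrogate in closed form (here: a gradient step of size 1/L).
        x_next = x - grad_f(x) / L
        # Majorization guarantees monotone descent: f(x_next) <= g(x_next | x) <= f(x).
        assert f(x_next) <= history[-1] + 1e-12, "descent property violated"
        x = x_next
        history.append(f(x))
    return x, history

# Example: least squares f(x) = 0.5 ||Ax - b||^2, which is L-smooth with L = ||A^T A||_2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
L = np.linalg.norm(A.T @ A, 2)
x_star, hist = mm_minimize(lambda x: 0.5 * np.sum((A @ x - b) ** 2),
                           lambda x: A.T @ (A @ x - b), L, np.zeros(10))
```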
2. Construction and Classes of Surrogates
2.1 Traditional Surrogate Construction
Surrogates are often created using convexity (Jensen's inequality), Taylor expansions with upper bounding of the remainder, EM-style conditional expectations, or quadratic upper bounds exploiting Lipschitz continuity of the gradient (equivalently, a uniform bound on the Hessian).
Examples include:
- Jensen-based bound for log-sum-exp in EM
- Quadratic majorization of smooth terms: $f(x) \le f(x_k) + \nabla f(x_k)^\top (x - x_k) + \tfrac{L}{2}\|x - x_k\|^2$ for $L$-smooth $f$ (see the sketch after this list)
- Linearization of nondifferentiable penalties, custom bounding for DC (difference of convex) problems
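To make the quadratic-majorization bullet concrete, here is a hedged Python sketch for the composite problem $\min_x \tfrac12\|Ax - b\|^2 + \lambda\|x\|_1$: majorizing the smooth term quadratically while keeping the penalty exact turns each MM subproblem into a soft-thresholding step (the classical ISTA iteration viewed as MM). The names and constants are illustrative.

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau * ||.||_1 (closed-form minimizer of the l1 subproblem).
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def mm_lasso(A, b, lam, n_iters=200):
    """MM for 0.5||Ax - b||^2 + lam*||x||_1 with a quadratic majorizer of the smooth term."""
    L = np.linalg.norm(A.T @ A, 2)          # Lipschitz constant of the smooth gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)
        # Surrogate: f(x_k) + grad^T (x - x_k) + (L/2)||x - x_k||^2 + lam*||x||_1,
        # minimized exactly by a soft-thresholded gradient step.
        x = soft_threshold(x - grad / L, lam / L)
    return x
```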
2.2 Higher-Order and Adaptive Majorization
Recent advances include higher-order MM, where the surrogate matches the function and its derivatives up to order $p$ at $x_k$, and the approximation error $g(\cdot \mid x_k) - f$ is $p$-times differentiable with Lipschitz $p$-th derivative. Implementation requires constructing $g(\cdot \mid x_k)$ such that $g(x \mid x_k) \ge f(x)$ for all $x$, and $\nabla^i g(x_k \mid x_k) = \nabla^i f(x_k)$ for $i = 0, \dots, p$ (Necoara et al., 2020, Lupu et al., 2021).
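For concreteness, when the $p$-th derivative of $f$ is $L_p$-Lipschitz, a standard surrogate satisfying these conditions is the $p$-th order Taylor expansion plus a power-of-the-distance regularizer (a generic construction written in illustrative notation, not necessarily the exact form used in the cited papers):

$$
g_p(x \mid x_k) \;=\; \sum_{i=0}^{p} \frac{1}{i!}\,\nabla^i f(x_k)[x - x_k]^i \;+\; \frac{L_p}{(p+1)!}\,\|x - x_k\|^{p+1},
$$

since the Lipschitz condition bounds the Taylor remainder by $\tfrac{L_p}{(p+1)!}\|x - x_k\|^{p+1}$, so $g_p(\cdot \mid x_k)$ is a global upper bound that matches $f$ and its derivatives up to order $p$ at $x_k$.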
Automatic surrogate generation has emerged, e.g., "Universal MM" algorithms leverage Taylor-mode automatic differentiation and interval bounding on Taylor remainders to construct tight surrogate upper bounds programmatically for arbitrary smooth $f$ (Streeter, 2023). This allows "black-box" MM for user-supplied objectives and eliminates the need for hand-designed surrogates.
3. Convergence Theory and Rate Results
Assuming surrogates satisfy the classical MM properties, the domain of $f$ is closed, and its level sets are compact, the MM sequence:
- Produces non-increasing objective values $f(x_k)$
- Has limit points that are stationary points, even for nonconvex $f$ (a representative one-step descent argument is sketched after this list)
- For convex $f$ with strongly convex surrogates, achieves global linear convergence, i.e., $f(x_k) - f^\star$ decreases geometrically
- For $p$-th-order MM, global sublinear rates in convex settings and local superlinear rates under uniform convexity are established (Necoara et al., 2020, Lupu et al., 2021)
- For objectives with the Kurdyka–Łojasiewicz property, precise local rates ranging from sublinear, linear, to superlinear can be shown, depending on the exponent in the KL inequality (Necoara et al., 2020, Lupu et al., 2021)
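As a representative piece of this analysis, consider the special case of a quadratic majorizer with curvature $L$ for an $L$-smooth (possibly nonconvex) $f$; this is a standard argument rather than the general results of the cited papers. One MM step decreases the objective by at least $\tfrac{1}{2L}\|\nabla f(x_k)\|^2$, and telescoping gives a sublinear stationarity rate:

$$
f(x_{k+1}) \;\le\; \min_x g(x \mid x_k) \;=\; f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2
\quad\Longrightarrow\quad
\min_{0 \le k \le K} \|\nabla f(x_k)\|^2 \;\le\; \frac{2L\bigl(f(x_0) - f^\star\bigr)}{K+1},
$$

where $f^\star = \inf f$ is assumed finite.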
Stochastic variants, including stochastic higher-order MM and stochastic majorization-minimization (SMM), feature sublinear rates for convex problems, faster sublinear rates under strong convexity, and almost-sure convergence to stationary points for broad nonconvex problems (Mairal, 2013, Lupu et al., 2021).
Variance-reduced MM algorithms (incorporating SAGA, SVRG, SARAH estimators) further reduce gradient sample complexity, with optimal rates for nonconvex finite-sum composite problems (Phan et al., 2023).
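To illustrate how variance reduction plugs into the MM subproblem, the sketch below uses an SVRG-style gradient estimator inside a quadratic surrogate for a finite-sum composite objective, which amounts to a proximal-SVRG step; `grad_fi`, `prox_r`, the inner-loop length, and the step size $1/L$ are illustrative assumptions, not the estimators and constants of Phan et al. (2023).

```python
import numpy as np

def mm_svrg(grad_fi, prox_r, n, L, x0, n_epochs=20, inner=None):
    """Sketch of variance-reduced MM for min_x (1/n) sum_i f_i(x) + r(x).

    Each inner step minimizes a quadratic surrogate of the smooth part whose
    gradient is replaced by an SVRG-style estimator, plus the (untouched)
    nonsmooth term r; that subproblem is a proximal step.
    """
    x = np.asarray(x0, dtype=float)
    m = inner or n                       # inner-loop length (illustrative choice)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        x_snap = x.copy()
        full_grad = np.mean([grad_fi(i, x_snap) for i in range(n)], axis=0)
        for _ in range(m):
            i = rng.integers(n)
            # Variance-reduced gradient estimator at the current point.
            v = grad_fi(i, x) - grad_fi(i, x_snap) + full_grad
            # Minimize the surrogate: quadratic model with curvature L plus r(x).
            x = prox_r(x - v / L, 1.0 / L)
    return x
```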
4. Algorithmic Variants and Extensions
| Variant | Distinguishing Feature | Typical Application/Advantage |
|---|---|---|
| Universal MM (Streeter, 2023) | Automatic, derivative-based bounds | Black-box optimization without tuning |
| Incremental MM (MISO) (Mairal, 2014) | Surrogates updated per sample | Large-scale sum-structure; linear rates |
| Stochastic MM (Mairal, 2013) | Surrogate per minibatch or sample | Online/streaming, memory and compute tractable |
| Variance-reduced MM (Phan et al., 2023) | SVRG/SAGA in MM subproblems | Best-known gradient complexities for finite-sums |
| Higher-order MM (Necoara et al., 2020, Lupu et al., 2021) | $p$-th-order Taylor bounding | Superlinear local convergence under regularity |
| Bregman MM (Martin et al., 13 Jan 2025) | Adaptive, potentially non-Euclidean surrogates | Accelerated convergence for composite objectives |
| MM for matrix means (Zhang, 2013) | Manifold optimization; closed-form updates | Riemannian means in SPD geometry |
| Min-max MM (MM4MM) (Saini et al., 12 Nov 2024) | Surrogate for min-max reformulations | Nonconvex-constrained signal-processing |
| Generalized MM (G-MM) (Parizi et al., 2015) | Relaxes "touching" to "progress" | Robustness to initialization, application-specific bias |
Algorithmic building blocks
- SafeRate and SafeCombination (Universal MM): Hyperparameter-free, uses local Taylor majorizers, can adapt the step size to arbitrary smooth $f$ (Streeter, 2023)
- MISO: Incremental update of per-sample surrogates, with the aggregate surrogate minimized at each iteration; achieves linear rates on large finite sums (Mairal, 2014); a minimal sketch follows this list
- Stochastic Proximal MM: Running surrogate is a weighted sum of per-sample surrogates, updated at every stochastic sample. Step-size schedule critical for rate (Mairal, 2013)
- Variance reduction: MM-SAGA/SVRG/SARAH—MM subproblems solved with variance-reduced first-order estimators (Phan et al., 2023)
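The following Python sketch illustrates the MISO building block in the simplest setting of quadratic per-sample surrogates for a smooth finite sum; the interface (`grad_fi`, a common curvature constant `L`) and the uniform sampling scheme are assumptions made for illustration, not the exact algorithm of Mairal (2014).

```python
import numpy as np

def miso_quadratic(grad_fi, n, L, x0, n_iters=1000):
    """Sketch of MISO with quadratic per-sample surrogates for min_x (1/n) sum_i f_i(x).

    Each sample i keeps an anchor z_i; its surrogate is
        f_i(z_i) + grad_fi(i, z_i)^T (x - z_i) + (L/2)||x - z_i||^2.
    The aggregate surrogate is minimized in closed form,
        x = mean(z_i) - mean(grad_fi(i, z_i)) / L,
    and one sample's anchor is refreshed per iteration.
    """
    x = np.asarray(x0, dtype=float)
    z = np.tile(x, (n, 1))                                # anchors z_i
    g = np.stack([grad_fi(i, x) for i in range(n)])       # stored gradients at anchors
    z_bar, g_bar = z.mean(axis=0), g.mean(axis=0)
    rng = np.random.default_rng(0)
    for _ in range(n_iters):
        x = z_bar - g_bar / L                             # minimize the aggregate surrogate
        i = rng.integers(n)                               # refresh one per-sample surrogate
        new_g = grad_fi(i, x)
        z_bar += (x - z[i]) / n                           # update running means incrementally
        g_bar += (new_g - g[i]) / n
        z[i], g[i] = x, new_g
    return x
```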
5. Practical Applications and Empirical Results
MM algorithms underpin a wide range of applications, including, but not limited to:
- Gaussian Mixture Regression, Multinomial Logistic Regression, and SVM: MM algorithms can build quadratic or EM-style surrogates yielding closed-form or fast iterative updates (Nguyen, 2016, Nguyen et al., 2017); a minimal logistic-regression sketch follows this list
- Nonnegative binary matrix factorization: MM with Jensen-type surrogates yields closed-form update rules competitive with logistic-PCA and interpretable factorization models (Magron et al., 2022)
- Bilevel hyperparameter optimization: MM on duality-reformulated single-level problems enables efficient conic-program subproblem solutions for otherwise intractable bilevel programs (Chen et al., 1 Mar 2024)
- Signal processing min-max problems: MM4MM leverages dual representations and majorized min-max alternation to enable hyperparameter-free, provably monotonic algorithms in phase retrieval, beamforming, sensor placement, etc. (Saini et al., 12 Nov 2024)
- High-dimensional regression with nonsmooth/nonconvex penalties: MM with iterated soft-thresholding or semismooth Newton subproblems achieves fast, scalable regression with theoretical guarantees for support recovery and convergence (Schifano et al., 2010, Tang et al., 2019)
- Deep neural network optimization: Universal MM methods can be applied layerwise to ensure safe, monotonic descent even under extreme overparameterization (Streeter, 2023)
- Dirichlet maximum-likelihood: Variable Bregman MM accelerates parameter estimation compared to Newton-type and fixed-metric methods (Martin et al., 13 Jan 2025)
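As one concrete instance of the regression-type applications above (the logistic case in the first bullet), a classical quadratic majorizer for the binary logistic log-likelihood uses the uniform Hessian bound $\nabla^2 f(w) \preceq \tfrac14 X^\top X$. The Python sketch below is a generic illustration of that bound, not the specific algorithms of the cited papers.

```python
import numpy as np

def mm_logistic(X, y, n_iters=100):
    """MM for binary logistic regression (labels y in {-1, +1}) via a fixed quadratic majorizer.

    The negative log-likelihood f(w) = sum_i log(1 + exp(-y_i * x_i^T w)) has
    Hessian bounded by B = (1/4) X^T X, so
        f(w) <= f(w_k) + grad^T (w - w_k) + 0.5 (w - w_k)^T B (w - w_k),
    and each MM step solves a linear system with the fixed matrix B.
    """
    n, d = X.shape
    B = 0.25 * X.T @ X + 1e-8 * np.eye(d)   # fixed curvature matrix (small ridge for invertibility)
    w = np.zeros(d)
    for _ in range(n_iters):
        margins = y * (X @ w)
        grad = -X.T @ (y * (1.0 / (1.0 + np.exp(margins))))   # gradient of f at w
        w = w - np.linalg.solve(B, grad)                       # minimize the quadratic surrogate
    return w
```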
6. Limitations, Challenges, and Current Directions
Despite their generality, MM algorithms face important operational and theoretical considerations:
- Surrogate construction is nontrivial for non-smooth, high-dimensional, or non-Euclidean objectives. Recent work on automatic differentiable majorizer construction and variable-metric methods has broadened their applicability (Streeter, 2023, Martin et al., 13 Jan 2025)
- The efficiency of MM steps depends critically on the tractability of the surrogate subproblems. For some models, these may require bespoke solvers or reformulations (e.g., conic programming for bilevel optimization (Chen et al., 1 Mar 2024), semismooth Newton for nonconvex regression (Tang et al., 2019)).
- MM convergence may be slow for ill-conditioned problems or when the surrogate is a loose (overly conservative) bound; acceleration methods such as quasi-Newton acceleration (e.g., SQUAREM), adaptive metric methods, variance reduction, and higher-order MM address this, but often require problem-specific tuning (arXiv:1001.4776, arXiv:2305.06848).
- Classical MM's requirement that surrogates touch the objective at the current iterate can be unnecessarily restrictive in nonconvex and latent-variable settings; the Generalized MM (G-MM) framework relaxes this via a "progress" requirement, enabling more robust optimization (Parizi et al., 2015).
- In stochastic and large-scale regimes, memory and communication constraints have led to the development of incremental, online, and variance-reduced MM variants; optimal choices of weights and batch sizes remain an active area of research (Mairal, 2013, Mairal, 2014, Phan et al., 2023).
- Extensions to non-Euclidean geometries, non-Lipschitz problems, or those with more complex constraint sets are ongoing areas of method development (Martin et al., 13 Jan 2025, Saini et al., 12 Nov 2024).
7. Summary Table: Key MM Algorithm Variants
| MM Variant / Context | Core Innovation | Notable Features |
|---|---|---|
| Universal MM (Streeter, 2023) | Automatic Taylor-remainder surrogates | Hyperparameter-free, black-box |
| Higher-Order MM (Necoara et al., 2020, Lupu et al., 2021) | $p$-th-order surrogates, fast local rates | Superlinear convergence, adaptive |
| Variance-Reduced MM (Phan et al., 2023) | SAGA/SVRG/SARAH in MM subproblems | Optimal sample complexity |
| Bregman MM (Martin et al., 13 Jan 2025) | Variable, adaptive Bregman majorizers | Accelerated convergence |
| MISO/Incremental MM (Mairal, 2014) | Per-sample surrogates, updated incrementally | Linear rates for convex |
| Generalized MM (G-MM) (Parizi et al., 2015) | Progress in place of touching constraint | Exploratory, less initialization-sensitive |
| MM for bilevel programs (Chen et al., 1 Mar 2024) | Surrogates for dual-based reformulations | Efficient conic subproblems |
| MM4MM min-max (Saini et al., 12 Nov 2024) | Surrogates on min-max reformulations | Monotonic, hyperparameter-free |
| Nonneg. binary NMF (Magron et al., 2022) | Closed-form Jensen surrogates | Interpretable factors, simple updates |
References
Key references for further reading include:
- (Streeter, 2023) Universal Majorization-Minimization Algorithms
- (Necoara et al., 2020) A systematic approach to general higher-order MM algorithms
- (Lupu et al., 2021) Convergence analysis of stochastic higher-order MM algorithms
- (Martin et al., 13 Jan 2025) Variable Bregman Majorization-Minimization Algorithm...
- (Phan et al., 2023) Stochastic Variance-Reduced Majorization-Minimization Algorithms
- (Chen et al., 1 Mar 2024) Lower-level Duality Based Reformulation and Majorization Minimization...
- (Mairal, 2014, Mairal, 2013) Incremental and stochastic MM for large-scale problems
- (Magron et al., 2022, Nguyen et al., 2017, Nguyen, 2016, Schifano et al., 2010) for classical and contemporary applications
The MM paradigm provides a versatile and well-founded approach for tackling the full complexity of contemporary machine learning, optimization, and statistical estimation problems—both theoretically and at scale.