Majorization-Minimization (MM) Theory
- Majorization-Minimization is an optimization framework that constructs surrogate functions whose iterative minimization guarantees a monotone decrease in the objective value.
- It employs quadratic, nonconvex, and locally majorant surrogates to effectively tackle challenges in statistical estimation, machine learning, and signal processing.
- Convergence theory ensures stationarity and offers sublinear or linear rates, with extensions including incremental, min–max, and universal MM methods.
Majorization-Minimization (MM) theory formalizes a class of iterative optimization algorithms fundamental to statistical estimation, signal processing, and modern machine learning. MM algorithms proceed by constructing and sequentially minimizing surrogate upper bounds—majorizers—of the objective function, thereby guaranteeing monotonic improvement and often bypassing difficulties associated with non-convexity, non-differentiability, or high dimensionality. Recent advances encompass universal MM via automatic differentiation, nonconvex and locally majorant surrogates, min–max formulations for saddle structure, and stochastic and incremental adaptations for large-scale machine learning.
1. Formal Definition and Core Properties
Given an objective $f : \mathbb{R}^d \to \mathbb{R}$, a majorizer at iteration $k$ is a function $g(\cdot \mid x_k)$ satisfying:
- Majorization: $g(x \mid x_k) \ge f(x)$ for all $x$,
- Tangency: $g(x_k \mid x_k) = f(x_k)$.
An MM algorithm generates a sequence $\{x_k\}$ via:
$$x_{k+1} \in \arg\min_x g(x \mid x_k),$$
which guarantees monotonic decrease:
$$f(x_{k+1}) \le g(x_{k+1} \mid x_k) \le g(x_k \mid x_k) = f(x_k).$$
This construction is agnostic to convexity or smoothness, provided suitable surrogates are available. Under regularity, any cluster point of $\{x_k\}$ is stationary (Streeter, 2023, Nguyen, 2016, Lange et al., 2021).
Quadratic majorizers for $L$-smooth $f$ are immediate:
$$g(x \mid x_k) = f(x_k) + \nabla f(x_k)^\top (x - x_k) + \frac{L}{2}\|x - x_k\|^2.$$
Higher-order or nonconvex surrogates can be constructed by Taylor expansion plus remainder bounds, provided the surplus term enforces global majorization.
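The descent mechanics above can be sketched in a few lines. This is a minimal illustration, not an implementation from the cited works: the objective $f(x) = \sum_i \log(1 + x_i^2)$ is an assumed example, nonconvex but $L$-smooth with $L = 2$, so the quadratic majorizer's exact minimizer is a damped gradient step.

```python
import numpy as np

def mm_quadratic(f, grad, L, x0, iters=100):
    """Generic MM with the quadratic majorizer
    g(x | x_k) = f(x_k) + <grad f(x_k), x - x_k> + (L/2)||x - x_k||^2.
    Minimizing g in closed form gives x_{k+1} = x_k - grad f(x_k)/L,
    so each step is guaranteed not to increase f."""
    x = np.asarray(x0, dtype=float)
    history = [f(x)]
    for _ in range(iters):
        x = x - grad(x) / L          # exact argmin of the quadratic surrogate
        history.append(f(x))
    return x, history

# Illustrative objective (an assumption, not from the cited papers):
# f(x) = sum(log(1 + x_i^2)); its gradient 2x/(1+x^2) is 2-Lipschitz, so L = 2.
f = lambda x: np.sum(np.log1p(x**2))
grad = lambda x: 2 * x / (1 + x**2)

x, hist = mm_quadratic(f, grad, L=2.0, x0=np.array([3.0, -1.5]))
```

The recorded `hist` is nonincreasing by construction, which is a useful runtime sanity check for any hand-derived majorizer.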
2. Surrogate Construction and Algorithmic Variants
Surrogate (majorizing) functions in MM are central to algorithmic performance. The choice depends on the structure of $f$:
- Quadratic/Taylor Bounds: For $L$-smooth $f$, quadratic upper bounds built from Hessian or Lipschitz-gradient constants.
- Jensen/Convexity Bounds: Leveraging convexity, e.g., Fenchel duality for non-differentiable $f$.
- Parameter Splitting/Latent Variable Models: EM as minorization–maximization, where the $Q$-function (the expected complete-data log-likelihood) serves as the surrogate for the observed-data log-likelihood (Nguyen, 2016).
- Bregman Majorization: For composite models (e.g., a smooth fidelity term plus a structured regularizer), Bregman divergences enable majorizers that are nonconvex but faithful to structure, leading to superior descent in practice (Geiping et al., 2018).
- Locally Majorant/Relaxed MM: Requiring only local upper-bounds and asymptotic tangency, enabling direct surrogates for non-smooth, non-convex programs (Xu et al., 2015).
- Universal MM via AD: Automatic interval extension of Taylor's theorem constructs polynomial upper bounds in arbitrary (sub)directions, automated for any function expressible in the AD calculus (Streeter, 2023).
- Block and Manifold Extensions: Block-MM partitions large variable sets, constructing surrogates per block; when constraints are manifold-valued (Grassmann, Stiefel), surrogates and convergence analysis are formulated in terms of geodesic convexity and tangent spaces (Lopez et al., 2024).
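As a concrete instance of the quadratic-bound strategy above, the following sketch applies the classical Böhning–Lindsay curvature bound for logistic regression: the Hessian of the negative log-likelihood never exceeds $B = X^\top X / 4$, so a fixed quadratic majorizer can be minimized in closed form at every iteration. The data generation and parameter values are illustrative assumptions, not from the cited works.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mm(X, y, iters=200):
    """MM for logistic regression (labels y in {0,1}) via the
    Bohning-Lindsay bound: Hessian(w) <= B = X^T X / 4 for all w, so
    g(w|w_k) = f(w_k) + g_k^T (w - w_k) + 0.5 (w - w_k)^T B (w - w_k)
    globally majorizes the negative log-likelihood f."""
    n, d = X.shape
    B = X.T @ X / 4.0 + 1e-8 * np.eye(d)   # curvature bound (tiny ridge for stability)
    B_inv = np.linalg.inv(B)
    w = np.zeros(d)

    def nll(w):
        z = X @ w
        return np.sum(np.logaddexp(0.0, z) - y * z)

    losses = [nll(w)]
    for _ in range(iters):
        g = X.T @ (sigmoid(X @ w) - y)     # gradient of the NLL at w_k
        w = w - B_inv @ g                   # exact minimizer of the surrogate
        losses.append(nll(w))
    return w, losses

# Synthetic data (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = (sigmoid(X @ np.array([1.0, -2.0, 0.5])) > rng.uniform(size=100)).astype(float)
w, losses = logistic_mm(X, y)
```

Because the curvature matrix is fixed, its inverse is computed once, which is the usual practical appeal of this surrogate over Newton-type updates.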
3. Convergence Theory
The cornerstone of MM is guaranteed monotonic descent of $f$. In detail:
- Monotonicity: Always $f(x_{k+1}) \le f(x_k)$.
- Stationarity: Under sufficient regularity and proper surrogate tangency, any limit point $x^\star$ is stationary: $\nabla f(x^\star) = 0$ (or $0 \in \partial f(x^\star)$ for nonsmooth cases).
- Nonasymptotic Rates: For convex $L$-smooth $f$, the sublinear rate $f(x_K) - f(x^\star) = O\!\left(L \|x_0 - x^\star\|^2 / K\right)$ holds; for strongly convex $f$, a linear rate in the objective gap is established (Lange et al., 2021, Streeter, 2023).
- KL Property and Non-smooth Cases: With KL-tameness and appropriately convexified surrogates, global convergence to critical points is guaranteed, subsuming nonconvex and non-smooth optimization (Lange et al., 2021, Xu et al., 2015).
- Stochastic and Incremental MM: SAGA/SVRG/SARAH-style variance-reduced MM schemes for finite-sum objectives $f = \frac{1}{n}\sum_{i=1}^n f_i$ achieve stationarity rates per gradient evaluation matching the best-known complexities of variance-reduced first-order methods (Phan et al., 2023, Mairal, 2014).
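The incremental idea can be sketched in the spirit of MISO (Mairal, 2014), though this is a simplified illustration rather than the paper's exact algorithm: each component $f_i$ keeps a quadratic surrogate with a common curvature $L$, anchored where that component was last visited; one surrogate is refreshed per iteration and the averaged surrogate is minimized in closed form. The least-squares finite sum and all parameter values are assumptions for demonstration.

```python
import numpy as np

def miso_quadratic(A, b, iters=2000, seed=0):
    """Incremental MM sketch for f(x) = (1/n) sum_i 0.5*(a_i^T x - b_i)^2.
    Surrogate i has minimizer v_i = z_i - grad f_i(z_i)/L, where z_i is the
    point at which component i was last refreshed; the averaged surrogate
    is minimized exactly by the mean of the v_i."""
    n, d = A.shape
    L = np.max(np.sum(A**2, axis=1))          # common curvature bound (max ||a_i||^2)
    grad_i = lambda i, x: A[i] * (A[i] @ x - b[i])
    x = np.zeros(d)
    v = np.array([x - grad_i(i, x) / L for i in range(n)])  # per-component minimizers
    x = v.mean(axis=0)
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        i = rng.integers(n)
        v[i] = x - grad_i(i, x) / L           # refresh surrogate i at current x
        x = v.mean(axis=0)                    # minimize the averaged surrogate
    return x

# Illustrative finite-sum least-squares problem (assumed data)
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))
b = A @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.standard_normal(20)
x_mm = miso_quadratic(A, b)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
```

Only one gradient is evaluated per iteration, which is the computational point of incremental MM: the per-step cost is independent of $n$.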
4. Generalizations and Extensions
The MM concept generalizes along several axes:
- Generalized MM (G-MM): Relaxes the classical requirement that surrogates be tangent at the previous iterate. Progress is tracked via a gap function. The framework allows introduction of problem-specific biases and leads to improved empirical performance on nonconvex and latent-variable models (Parizi et al., 2015).
- Min–Max and Primal–Dual MM: Problems naturally suited to saddle-point or dual formulations benefit from MM operating on both the primal and dual blocks. The MM4MM and PDMM strategies exploit surrogates for both variable blocks, ensuring monotonicity and convergence even on nonconvex landscapes (Saini et al., 2024, Fatima et al., 2021).
- Universal and Hyperparameter-Free MM: Advances in automatic differentiation for bounding derivatives enable black-box, hyperparameter-free universal MM solvers with guaranteed descent—without the need for hand-tuned line search or analytic majorizer derivation (Streeter, 2023).
- Learned Majorizers: Inverse problems with severely ill-posed geometry motivate learning the majorant, e.g., by constraining a recurrent neural network (RNN) to output spectral or diagonal curvature bounds at every MM step, guaranteeing descent while adapting to nontrivial loss curvature (Tran et al., 23 Jan 2026).
- Nonconvex Majorizers: For composite nonconvex problems, nonconvex surrogates that are nevertheless globally optimized at each iteration allow escape from shallow stationary points. The tradeoff is solving harder subproblems but with much closer majorizer fidelity (Geiping et al., 2018).
- Block MM on Manifolds: Convergence theory extends from Euclidean to manifold constraint sets, provided surrogates are geodesically quasi-convex and directional derivatives are properly matched in the tangent bundle (Lopez et al., 2024).
5. Applications and Algorithmic Practice
MM algorithms have become foundational in numerous domains:
- Statistical Estimation and Machine Learning: Mixture models, regularized regression (including nonconvex penalties: SCAD, MCP), SVM fitting via IRLS, and variational inference all admit MM schemes (Wang, 2019, Nguyen et al., 2017, Nguyen, 2016).
- Penalized and Sparse Estimation: MM frameworks offer stable, convergent, and interpretable procedures for nonsmooth and nonconvex regularization, often leveraging reweighted soft-thresholding or surrogate-gradient minimization (Schifano et al., 2010, Wang, 2019).
- Inverse Problems and Signal Processing: Applications range from phase retrieval and X-ray CT to massive-scale penalized least-squares, where algorithmic design of majorizers (e.g., via duality-based optimization of structured matrix surrogates) is critical for acceleration and computational tractability (McGaffin et al., 2015, Fatima et al., 2021).
- Structured and Bayesian Sparsity: MM is naturally linked to hierarchical Bayesian modeling; alternating MM and Gibbs sampling enables uncertainty quantification and mode exploration in heavily multimodal posteriors (e.g., in MEG/EEG inverse source localization) (Bekhti et al., 2017).
- Large-scale and Online ML: Incremental, mini-batch, and variance-reduced MM break the computational bottleneck for large datasets without sacrificing convergence guarantees (Mairal, 2014, Phan et al., 2023).
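The reweighted soft-thresholding pattern mentioned above can be made concrete for a nonconvex penalty. This is a minimal sketch, with an MCP-penalized denoising problem and parameter values chosen purely for illustration: since the MCP is concave on $[0,\infty)$, linearizing it around the current iterate yields a weighted-$\ell_1$ majorizer whose minimizer is a coordinatewise soft-threshold.

```python
import numpy as np

def mcp_deriv(t, lam, gamma):
    """Derivative of the MCP penalty p(t) = lam*t - t^2/(2*gamma)
    on [0, gamma*lam], and 0 beyond (penalty flattens out)."""
    return np.maximum(0.0, lam - t / gamma)

def soft(z, thresh):
    """Soft-thresholding operator, the prox of the weighted l1 surrogate."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def mcp_denoise_mm(y, lam=0.5, gamma=3.0, iters=20):
    """MM for min_x 0.5*||x - y||^2 + sum_i p(|x_i|), p the concave MCP.
    Linearizing p at |x_k| majorizes it, so each MM step reduces to
    reweighted soft-thresholding: x_i <- soft(y_i, p'(|x_i^k|))."""
    x = y.copy()                               # initialize at the noisy signal
    for _ in range(iters):
        w = mcp_deriv(np.abs(x), lam, gamma)   # per-coordinate l1 weights
        x = soft(y, w)                         # closed-form surrogate minimum
    return x

y = np.array([3.0, 0.1, -2.5, 0.05])           # assumed noisy observations
x = mcp_denoise_mm(y)                          # large entries kept unbiased, small ones zeroed
```

This illustrates the practical draw of nonconvex penalties under MM: entries above the MCP flat region incur zero shrinkage (no lasso bias), while small entries are thresholded to exactly zero.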
6. Limitations and Open Directions
While MM provides a unifying umbrella for optimization under challenging regimes, practical deployment hinges on appropriate surrogate construction:
- Majorizer Design: Nontrivial for compositions, nonconvexities, or complex regularizers. Recent structural or data-driven approaches mitigate, but do not eliminate, the challenge (Streeter, 2023, McGaffin et al., 2015, Tran et al., 23 Jan 2026).
- Global vs. Local Minima: For general nonconvex objectives, MM methods guarantee convergence only to stationary points; global optimality depends on additional problem structure, such as geodesic convexity in the manifold-constrained setting (Lopez et al., 2024).
- Computational Load: When surrogates are expensive to optimize—e.g., nonconvex or high-dimensional subproblems—per-iteration cost can surpass that of first-order approximations, unless global solvers or efficient block updates are available (Geiping et al., 2018).
- Hyperparameter Sensitivity: Universal MM mitigates dependence on step sizes and tuning, but in classical hand-designed schemes, majorizer tightness and regularization must still be carefully engineered (Streeter, 2023, Xu et al., 2015).
- Exploration vs. Exploitation: Generalized MM (G-MM) introduces an exploration dimension but open questions remain on balancing progress and diversity; algorithmic tuning and theoretical convergence rate quantification are active fronts (Parizi et al., 2015, Phan et al., 2023).
7. Reference Algorithms and Summary Table
A selection of archetypal MM algorithms and their domains is summarized below:
| Algorithm | Majorizer Class | Application Domain |
|---|---|---|
| EM | Latent-variable surrogates | Mixture models |
| IRLS MM | Block diagonal/weighted quad | SVM/logistic regression |
| Bregman-prox MM | Bregman (possibly nonconvex) | Composite imaging problems |
| PDMM, MM4MM | Min–max, blockwise | Poisson, saddle-point |
| Universal MM (AD) | Polynomial-AD, hyperparameter-free | Generic ML, deep nets |
| G-MM | Relaxed bound, application-biased | Latent SVM, k-means |
| Incremental/SVR-MM | Variance-reduced, incremental | Large-scale ML, ERM |
| Learned MM networks | Data-driven majorant, RNN | Ill-posed inverse problems |
All MM variants share the two essential steps: constructing a majorizing surrogate, and exact or approximate minimization thereof at each iteration, underpinned by monotonic descent and stationary convergence (Nguyen, 2016, Streeter, 2023, Lange et al., 2021).
References:
(Streeter, 2023, Saini et al., 2024, Xu et al., 2015, Tran et al., 23 Jan 2026, Geiping et al., 2018, Lopez et al., 2024, Nguyen, 2016, Wang, 2019, Bekhti et al., 2017, Schifano et al., 2010, Parizi et al., 2015, Fatima et al., 2021, Lange et al., 2021, Nguyen et al., 2017, Phan et al., 2023, Mairal, 2014, McGaffin et al., 2015)