
Majorization-Minimization (MM) Algorithm

Updated 30 November 2025
  • Majorization-Minimization (MM) is an iterative optimization framework that constructs surrogate functions which upper-bound the objective and ensure monotonic descent.
  • It employs various surrogate designs, such as quadratic and first-order approximations, to address high-dimensional estimation and penalized regression challenges.
  • Recent advances include generalized, incremental, and stochastic variants that accelerate convergence and broaden applications in machine learning and signal processing.

Majorization–Minimization (MM) is a general iterative framework for solving non-convex and non-smooth optimization problems by constructing at each iteration a tractable surrogate function that upper-bounds the objective and matches it at the current point. MM algorithms provide monotonic decrease of the objective and have become foundational in machine learning, statistics, signal processing, and high-dimensional estimation. This article systematically examines the core theoretical basis, key algorithmic variants, surrogate constructions, convergence properties, and representative applications of MM, including emerging generalizations such as Generalized MM, Incremental and Stochastic Variance-Reduced MM, Min–Max MM frameworks, and universal MM optimizers.

1. Formulation and Principles of Majorization–Minimization

MM seeks to minimize an objective $f:\mathbb{R}^d\to\mathbb{R}$ by iteratively constructing surrogates $g(\cdot\mid x^{(t)})$ satisfying:

  • Majorization: $g(x\mid x^{(t)}) \geq f(x)$ for all $x$,
  • Touching: $g(x^{(t)}\mid x^{(t)}) = f(x^{(t)})$.

At iteration $t$, the update is $x^{(t+1)} = \arg\min_{x} g(x\mid x^{(t)})$. The framework guarantees a non-increasing sequence $\{f(x^{(t)})\}$, since $f(x^{(t+1)}) \leq g(x^{(t+1)}\mid x^{(t)}) \leq g(x^{(t)}\mid x^{(t)}) = f(x^{(t)})$ (Nguyen, 2016, Wang, 2019, Mairal, 2014). This principle generalizes classical EM, proximal algorithms, and block coordinate descent under suitable surrogate choices (Nguyen, 2016).
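
As a concrete illustration of the loop and its descent guarantee, the following minimal Python sketch runs the generic MM iteration for a user-supplied surrogate minimizer; the toy quadratic objective and the names `mm`, `f`, and `minimize_surrogate` are illustrative choices, not notation from the cited papers.

```python
import numpy as np

def mm(f, minimize_surrogate, x0, n_iter=100, tol=1e-10):
    """Generic MM loop: x_{t+1} = argmin_x g(x | x_t).

    `minimize_surrogate(x_t)` must return the minimizer of a surrogate that
    upper-bounds f and touches it at x_t, which guarantees f(x_{t+1}) <= f(x_t).
    """
    x = np.asarray(x0, dtype=float)
    history = [f(x)]
    for _ in range(n_iter):
        x = minimize_surrogate(x)
        history.append(f(x))          # non-increasing by construction
        if history[-2] - history[-1] < tol:
            break
    return x, history

# Toy example: beta-smooth quadratic f(x) = 0.5 x^T A x - b^T x, majorized by the
# canonical quadratic surrogate with beta = ||A||_2 (spectral norm).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M.T @ M + np.eye(5)                     # positive definite Hessian
b = rng.standard_normal(5)
beta = np.linalg.norm(A, 2)                 # Lipschitz constant of grad f

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
surrogate_min = lambda x_t: x_t - grad(x_t) / beta   # argmin of the quadratic majorizer

x_star, hist = mm(f, surrogate_min, np.zeros(5))
print(hist[0], hist[-1])                    # objective decreases monotonically
```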

2. Surrogate Construction and Algorithmic Variants

Surrogate function design is crucial for the effectiveness of MM:

Quadratic Majorization: For a $\beta$-smooth $f$, $g(x\mid x^{(t)}) = f(x^{(t)}) + \nabla f(x^{(t)})^T(x-x^{(t)}) + \frac{\beta}{2}\|x-x^{(t)}\|^2$ serves as the canonical majorizer (Nguyen, 2016, Streeter, 2023).
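
Completing the square shows that minimizing this canonical majorizer recovers a gradient step with step size $1/\beta$ (exactly the surrogate minimizer used in the sketch above):

$$
x^{(t+1)} = \arg\min_{x}\, g(x\mid x^{(t)}) = \arg\min_{x}\, \frac{\beta}{2}\left\|x - \Big(x^{(t)} - \tfrac{1}{\beta}\nabla f(x^{(t)})\Big)\right\|^2 = x^{(t)} - \frac{1}{\beta}\nabla f(x^{(t)}).
$$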

First-Order Surrogates: A general framework employs first-order surrogates whose approximation error $h(x) = g(x) - f(x)$ is $L$-smooth, leading to convergence results in both non-convex and convex regimes (Mairal, 2014).

Componentwise/Separable Surrogates: For coordinate penalties (LASSO, SCAD, MCP), majorize penalties by tangents or local linear approximations, facilitating soft-thresholding updates (Schifano et al., 2010, Wang, 2019).
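
As a minimal sketch of this mechanism for the LASSO (assuming a least-squares loss; the function names `soft_threshold` and `lasso_mm` and the synthetic data are illustrative), majorizing the smooth loss by the canonical quadratic surrogate yields a separable subproblem whose exact solution is coordinatewise soft-thresholding, i.e., an ISTA-style update:

```python
import numpy as np

def soft_threshold(z, tau):
    """Closed-form solution of min_x 0.5*(x - z)^2 + tau*|x|, applied coordinatewise."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_mm(X, y, lam, n_iter=500):
    """MM for 0.5*||y - X w||_2^2 + lam*||w||_1.

    The smooth loss is majorized by a quadratic surrogate with curvature
    beta = ||X||_2^2; the penalized surrogate separates across coordinates,
    so each MM update is a single soft-thresholding step.
    """
    beta = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the loss gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - grad / beta, lam / beta)
    return w

# Usage on synthetic sparse data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.standard_normal(100)
print(np.round(lasso_mm(X, y, lam=5.0), 2))
```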

Iteratively-Reweighted Surrogates: In non-convex sparse regression or group Lasso, concave penalties such as $r^p$ ($0<p<1$) are upper bounded by linear (tangent) terms to yield reweighted convex subproblems, as in IRLS and the Adaptive Lasso (Bekhti et al., 2017).
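
The tangent-line majorization of a concave penalty can be made concrete with a short sketch; here the penalty $\sum_i |w_i|^p$ with $0<p<1$ is handled by reweighted $\ell_1$ subproblems, each solved with the soft-thresholding MM above. The smoothing constant `eps` and the function names are illustrative, and this is a schematic of the reweighting idea rather than the exact algorithms in the cited work.

```python
import numpy as np

def weighted_lasso(X, y, weights, n_iter=300):
    """Inner MM solver for 0.5*||y - X w||_2^2 + sum_i weights_i * |w_i|."""
    beta = np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = w - X.T @ (X @ w - y) / beta
        w = np.sign(z) * np.maximum(np.abs(z) - weights / beta, 0.0)
    return w

def reweighted_l1(X, y, lam, p=0.5, eps=1e-6, n_outer=10):
    """MM for 0.5*||y - X w||_2^2 + lam * sum_i |w_i|^p, with 0 < p < 1.

    The concave map u -> u^p (u >= 0) is majorized at u_t = |w_i^(t)| by its
    tangent line, so each outer iteration solves a convex weighted-L1
    subproblem with weights lam * p * (|w_i^(t)| + eps)^(p - 1).
    """
    w = weighted_lasso(X, y, lam * np.ones(X.shape[1]))   # plain L1 warm start
    for _ in range(n_outer):
        weights = lam * p * (np.abs(w) + eps) ** (p - 1.0)
        w = weighted_lasso(X, y, weights)
    return w
```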

Duality-Based Matrix Majorizers: In large-scale inverse problems, quadratic surrogates replace Hessians with structured majorizers $\Lambda \succeq H$ constructed algorithmically via dual ascent, dramatically accelerating convergence (McGaffin et al., 2015).

3. Generalizations: Relaxed, Generalized, Incremental, and Stochastic MM

Generalized MM (G-MM): The touching condition of classical MM is relaxed: the surrogate is only required to “touch” the objective at initialization, and progress is measured by the decrease of an upper-bound sequence $b_t$. This enables deterministic or stochastic surrogate selection, the incorporation of bias functions, and reduced sensitivity to initialization. Under mild regularity (compactness and strong convexity), convergence to stationary points and a vanishing surrogate–objective gap are proved (Parizi et al., 2015). G-MM demonstrates superior empirical performance in latent variable models and clustering compared to EM/CCCP.

Relaxed MM: Requires only local majorization (at the next iterate) and an asymptotically vanishing difference in directional derivatives, bypassing global majorization. This enables tighter, locally focused surrogates for non-smooth/non-convex programs (e.g., robust matrix factorization), guaranteeing descent and stationarity under weak assumptions (Xu et al., 2015).

Incremental MM (MISO): Applied to finite-sum objectives $f(x) = \frac{1}{T}\sum_{t=1}^T f^t(x)$, MM surrogates are updated and minimized incrementally, yielding per-iteration cost independent of $T$ (Mairal, 2014). For convex and strongly convex problems, MISO attains $O(T/n)$ or linear rates, and yields stationary points in non-convex settings.
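
A schematic of the incremental bookkeeping with quadratic surrogates is given below (one anchor point and one gradient stored per component); it follows the spirit of MISO, but the function name and the plain averaging are illustrative simplifications rather than the exact algorithm of the cited paper.

```python
import numpy as np

def miso_quadratic(grads, dim, beta, n_epochs=50, seed=0):
    """Incremental MM with one quadratic surrogate per component f_i.

    `grads` is a list of callables; grads[i](x) returns the gradient of f_i at x.
    Each iteration refreshes a single surrogate (anchor z_i and gradient g_i)
    and re-minimizes the average surrogate, whose minimizer is available in
    closed form: x = mean(z_i) - mean(g_i) / beta.
    """
    rng = np.random.default_rng(seed)
    T = len(grads)
    Z = np.zeros((T, dim))                              # anchor points z_i
    G = np.array([grads[i](Z[i]) for i in range(T)])    # stored gradients
    x = Z.mean(axis=0) - G.mean(axis=0) / beta
    for _ in range(n_epochs * T):
        i = rng.integers(T)
        Z[i], G[i] = x, grads[i](x)                     # refresh surrogate i only
        # Maintaining running sums of Z and G would make this step O(dim),
        # i.e., independent of T; the explicit means keep the sketch short.
        x = Z.mean(axis=0) - G.mean(axis=0) / beta
    return x
```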

Stochastic Variance-Reduced MM (SVRMM): MM is paired with SAGA, SVRG, or SARAH estimators for stochastic problems with large $n$. The update leverages a variance-reduced gradient estimator in the surrogate minimization. SVRMM preserves monotonicity, guarantees almost-sure subsequential convergence to stationary points, and achieves optimal gradient complexities of $O(n^{1/2}/\epsilon^2)$ in non-convex composite settings (Phan et al., 2023).
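
The following prox-SVRG-flavoured sketch shows how a variance-reduced estimator enters the surrogate step; it illustrates the mechanism only, and unlike the cited SVRMM schemes this plain variant does not enforce monotone descent. The names `svr_mm` and `prox` (the proximal operator of the non-smooth part) are assumptions for the sketch.

```python
import numpy as np

def svr_mm(grads, prox, x0, beta, n_outer=20, n_inner=None, seed=0):
    """Stochastic variance-reduced MM sketch with an SVRG-style estimator.

    Each inner step minimizes a quadratic surrogate of the smooth part built
    around the variance-reduced gradient
        v = grads[i](x) - grads[i](x_snap) + full_grad(x_snap),
    which reduces to a proximal step of length 1/beta on the non-smooth part.
    """
    rng = np.random.default_rng(seed)
    n = len(grads)
    n_inner = n_inner or n
    x = np.asarray(x0, dtype=float)
    for _ in range(n_outer):
        x_snap = x.copy()
        full_grad = sum(g(x_snap) for g in grads) / n    # snapshot (full) gradient
        for _ in range(n_inner):
            i = rng.integers(n)
            v = grads[i](x) - grads[i](x_snap) + full_grad
            x = prox(x - v / beta, 1.0 / beta)           # surrogate minimization
    return x
```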

4. Convergence Theory and Acceleration

Monotonicity and Limit Points: Under mild regularity (surrogate continuity, objective coercivity, boundedness below), MM yields a monotone objective sequence with accumulation points that are stationary for $f$ (Nguyen, 2016, Wang, 2019). For non-convex settings, convergence is typically to Dini or Clarke stationary points (Wang, 2019).

Local/Asymptotic Convergence Rates: For strictly convex or strongly convex surrogates and objectives, linear or geometric convergence rates are established. For Riemannian objectives (e.g., the Karcher mean of positive definite matrices), MM achieves global convergence and linear rates (Zhang, 2013).

Quasi-Newton Acceleration: Casting MM as a fixed-point iteration $\theta \mapsto M(\theta)$, Broyden-type quasi-Newton root-finding accelerates convergence via rank-one updates or limited-memory variants, with proven local linear and, in favorable cases, superlinear rates (Agarwal et al., 2022). This approach is problem-agnostic and compatible with SQUAREM variants.
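
Since the paragraph above mentions compatibility with SQUAREM, here is a sketch of one SQUAREM-style squared-extrapolation step for a fixed-point map $M$ (not the Broyden scheme of Agarwal et al.; practical implementations also add a monotonicity safeguard, omitted here for brevity):

```python
import numpy as np

def squarem_step(M, theta):
    """One SQUAREM-style acceleration step for the MM fixed-point map M.

    Uses the squared-extrapolation steplength alpha = -||r|| / ||v|| and then
    applies M once more to stabilize the extrapolated point.
    """
    theta1 = M(theta)
    theta2 = M(theta1)
    r = theta1 - theta                      # first difference
    v = (theta2 - theta1) - r               # second difference (curvature proxy)
    if np.linalg.norm(v) < 1e-12:
        return theta2                       # effectively converged; plain MM step
    alpha = -np.linalg.norm(r) / np.linalg.norm(v)
    theta_acc = theta - 2.0 * alpha * r + alpha**2 * v
    return M(theta_acc)                     # stabilizing MM step on the extrapolation
```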

Universal MM via Automatic Majorizer Derivation: Automatic Taylor-mode differentiation constructs local polynomial majorizers (“AutoBound”) for arbitrary differentiable objectives, eliminating manual surrogate design and hyperparameter tuning, underlying universal MM optimizers such as SafeRate and SafeCombination (Streeter, 2023).

5. Representative Applications in Machine Learning and Signal Processing

Penalized Estimation and Sparse Regression

MM is widely adopted for high-dimensional penalized estimation, where it enables simultaneous variable selection and estimation via tractable coordinate-wise soft-thresholding (LASSO, SCAD, MCP), block coordinate descent (group Lasso), and robust regression under contaminated data (Wang, 2019, Schifano et al., 2010, Bekhti et al., 2017).

High-Dimensional Fused Lasso and Graph Learning

Fused lasso regression benefits from MM constructions enabling GPU-parallelization, per-iteration quadratic updates, and analytic closed-form minimization, with guarantees on monotonic convergence and finite support recovery (Yu et al., 2013). MM frameworks also undergird fast sparse graph learning algorithms for smooth node signals, yielding variable elimination and hyperparameter-free operation (Fatima et al., 2022).

Signal Processing: Phase Retrieval, Sensor Placement, Array Processing

Min–Max reformulation (“MM4MM”) leverages convex–concave surrogates and dual reformulations to handle phase retrieval, beamforming, sensor placement and source localization problems, yielding parameter-free, monotonic algorithms (Saini et al., 12 Nov 2024, Scheibler et al., 2021).

Matrix Means and Non-Convex Penalties

MM enables computation of the Karcher mean for positive definite matrices with global convergence and analytic updates (Zhang, 2013). In non-convex least squares with Moreau envelope “DC” penalties, MM constructs convex majorizers via ball-constrained minimization (Mayeli, 2019).

6. Practical Guidelines, Limitations, and Software

Algorithmic choices impact MM performance: surrogate tightness (step size), stochasticity/bias in surrogate selection (G-MM), acceleration procedures, and support identification (variable elimination). MM methods require careful initialization for non-convex models, may converge to suboptimal local minima, and are sensitive to surrogate looseness (Parizi et al., 2015, Xu et al., 2015).

MM algorithms are widely available in scientific software, e.g., R package mpath for penalized estimation (Wang, 2019). GPU implementations and active-set heuristics enhance scalability in high dimensions (Yu et al., 2013). Universal MM optimizers automate majorizer construction, facilitating broad applicability without domain-specific tuning (Streeter, 2023).

7. Emerging Directions and Theoretical Developments

Recent advances focus on relaxed and generalized constraint formulations, stochastic and incremental strategies, quasi-Newton acceleration and root-finding, duality-driven matrix majorizer design, min–max reformulations for saddle problems, and automatic majorizer synthesis.

MM remains central to the development of robust, scalable algorithms for non-convex optimization, structured sparsity, inverse problems, latent variable estimation, and high-dimensional statistical modeling. Active research continues on adaptive surrogate tuning, distributed/federated MM, and rigorous complexity analysis in non-convex and high-dimensional regimes (Xu et al., 2015, Phan et al., 2023, Streeter, 2023, Saini et al., 12 Nov 2024).
