Majorization-Minimization Algorithms
- Majorization-Minimization (MM) algorithms are iterative optimization methods that replace a complex objective with simpler surrogate functions, guaranteeing descent at every iteration.
- They construct surrogates that majorize the original objective, enabling efficient handling of nonconvex, nonsmooth, and large-scale problems.
- Variants such as higher-order, stochastic, and variance-reduced MM offer convergence rates from sublinear to superlinear, broadening their applicability in modern optimization.
Majorization-Minimization (MM) algorithms are a broad family of iterative optimization schemes that replace a difficult objective function with a succession of easier surrogates—majorizers—that tightly bound the objective at the current iterate. By design, each MM step monotonically reduces the objective, and the method is applicable to a wide range of nonconvex, nonsmooth, constrained, or large-scale settings. MM algorithms are foundational to modern optimization for statistics, signal processing, and machine learning, with theoretical guarantees and rich algorithmic variants.
1. Core Principles and Formal Definition
The classical MM framework seeks to minimize a function $f$ by generating a sequence $\{x_k\}$ such that at each iteration:
- A majorizing surrogate $g(\cdot \mid x_k)$ is constructed satisfying:
  - $g(x \mid x_k) \ge f(x)$ for all $x$ (global upper bound)
  - $g(x_k \mid x_k) = f(x_k)$ (tangency at current iterate)
- The next iterate is obtained by $x_{k+1} \in \arg\min_x g(x \mid x_k)$.
- The descent property is immediate: $f(x_{k+1}) \le g(x_{k+1} \mid x_k) \le g(x_k \mid x_k) = f(x_k)$.
Such a setup ensures monotonic decrease of $f(x_k)$ over iterations. The classical MM can be interpreted as a meta-algorithm—turning a hard minimization into a series of simpler subproblems, each easier due to the structure of $g$.
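As a concrete illustration of this meta-algorithm, the following minimal Python sketch runs MM with a quadratic majorizer for an $L$-smooth objective; the function names and the least-squares example are illustrative assumptions, not a reference implementation from the cited literature.

```python
import numpy as np

def mm_minimize(f, grad_f, L, x0, n_iters=100):
    """Generic MM loop with a quadratic majorizer for an L-smooth objective.

    Surrogate at x_k:  g(x | x_k) = f(x_k) + grad_f(x_k)^T (x - x_k) + (L/2)||x - x_k||^2,
    whose exact minimizer is the gradient step x_k - grad_f(x_k) / L.
    """
    x = np.asarray(x0, dtype=float)
    history = [f(x)]
    for _ in range(n_iters):
        # Minimize the surrogate in closed form (here: a gradient step of size 1/L).
        x_next = x - grad_f(x) / L
        # Majorization guarantees monotone descent: f(x_next) <= g(x_next | x) <= f(x).
        assert f(x_next) <= history[-1] + 1e-12, "descent property violated"
        x = x_next
        history.append(f(x))
    return x, history

# Example: least squares f(x) = 0.5 ||Ax - b||^2, which is L-smooth with L = ||A^T A||_2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
L = np.linalg.norm(A.T @ A, 2)
x_star, hist = mm_minimize(lambda x: 0.5 * np.sum((A @ x - b) ** 2),
                           lambda x: A.T @ (A @ x - b), L, np.zeros(10))
```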
2. Construction and Classes of Surrogates
2.1 Traditional Surrogate Construction
Surrogates are often created using convexity (Jensen's inequality), Taylor expansions with upper bounding of the remainder, EM-style conditional expectations, or quadratic upper bounds exploiting Lipschitz continuity of the gradient (equivalently, a uniform bound on the Hessian).
Examples include:
- Jensen-based bound for log-sum-exp in EM
- Quadratic majorization of smooth terms: $f(x) \le f(x_k) + \nabla f(x_k)^\top (x - x_k) + \tfrac{L}{2}\|x - x_k\|^2$ for $L$-smooth $f$ (see the sketch after this list)
- Linearization of nondifferentiable penalties, custom bounding for DC (difference of convex) problems
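To make the quadratic-majorization bullet concrete, here is a hedged Python sketch for the composite problem $\min_x \tfrac12\|Ax - b\|^2 + \lambda\|x\|_1$: majorizing the smooth term quadratically while keeping the penalty exact turns each MM subproblem into a soft-thresholding step (the classical ISTA iteration viewed as MM). The names and constants are illustrative.

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau * ||.||_1 (closed-form minimizer of the l1 subproblem).
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def mm_lasso(A, b, lam, n_iters=200):
    """MM for 0.5||Ax - b||^2 + lam*||x||_1 with a quadratic majorizer of the smooth term."""
    L = np.linalg.norm(A.T @ A, 2)          # Lipschitz constant of the smooth gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)
        # Surrogate: f(x_k) + grad^T (x - x_k) + (L/2)||x - x_k||^2 + lam*||x||_1,
        # minimized exactly by a soft-thresholded gradient step.
        x = soft_threshold(x - grad / L, lam / L)
    return x
```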
2.2 Higher-Order and Adaptive Majorization
Recent advances include higher-order MM, where the surrogate matches the function and its derivatives up to order $p$ at $x_k$, and the approximation error $g(\cdot \mid x_k) - f$ is $p$-times differentiable with Lipschitz $p$-th derivative. Implementation requires constructing $g(\cdot \mid x_k)$ such that $g(x \mid x_k) \ge f(x)$ for all $x$, and $\nabla^i g(x_k \mid x_k) = \nabla^i f(x_k)$ for $i = 0, \dots, p$ (Necoara et al., 2020, Lupu et al., 2021).
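For concreteness, when the $p$-th derivative of $f$ is $L_p$-Lipschitz, a standard surrogate satisfying these conditions is the $p$-th order Taylor expansion plus a power-of-the-distance regularizer (a generic construction written in illustrative notation, not necessarily the exact form used in the cited papers):

$$
g_p(x \mid x_k) \;=\; \sum_{i=0}^{p} \frac{1}{i!}\,\nabla^i f(x_k)[x - x_k]^i \;+\; \frac{L_p}{(p+1)!}\,\|x - x_k\|^{p+1},
$$

since the Lipschitz condition bounds the Taylor remainder by $\tfrac{L_p}{(p+1)!}\|x - x_k\|^{p+1}$, so $g_p(\cdot \mid x_k)$ is a global upper bound that matches $f$ and its derivatives up to order $p$ at $x_k$.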
Automatic surrogate generation has emerged, e.g., "Universal MM" algorithms leverage Taylor-mode automatic differentiation and interval bounding on Taylor remainders to construct tight surrogate upper bounds programmatically for arbitrary smooth $f$ (Streeter, 2023). This allows "black-box" MM for user-supplied objectives and eliminates the need for hand-designed surrogates.
3. Convergence Theory and Rate Results
Assuming surrogates satisfy the classical MM properties, the domain of $f$ is closed, and its level sets are compact, the MM sequence:
- Produces non-increasing objective values $f(x_k)$
- Has limit points that are stationary points, even for nonconvex $f$ (a representative one-step descent argument is sketched after this list)
- For convex $f$ with strongly convex surrogates, achieves global linear convergence, i.e., $f(x_k) - f^\star$ decreases geometrically
- For $p$-th-order MM, global sublinear rates in convex settings and local superlinear rates under uniform convexity are established (Necoara et al., 2020, Lupu et al., 2021)
- For objectives with the Kurdyka–Łojasiewicz property, precise local rates ranging from sublinear, linear, to superlinear can be shown, depending on the exponent in the KL inequality (Necoara et al., 2020, Lupu et al., 2021)
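As a representative piece of this analysis, consider the special case of a quadratic majorizer with curvature $L$ for an $L$-smooth (possibly nonconvex) $f$; this is a standard argument rather than the general results of the cited papers. One MM step decreases the objective by at least $\tfrac{1}{2L}\|\nabla f(x_k)\|^2$, and telescoping gives a sublinear stationarity rate:

$$
f(x_{k+1}) \;\le\; \min_x g(x \mid x_k) \;=\; f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2
\quad\Longrightarrow\quad
\min_{0 \le k \le K} \|\nabla f(x_k)\|^2 \;\le\; \frac{2L\bigl(f(x_0) - f^\star\bigr)}{K+1},
$$

where $f^\star = \inf f$ is assumed finite.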
Stochastic variants, including stochastic higher-order MM and stochastic majorization-minimization (SMM), feature sublinear rates for convex problems, faster sublinear rates under strong convexity, and almost-sure convergence to stationary points for broad nonconvex problems (Mairal, 2013, Lupu et al., 2021).
Variance-reduced MM algorithms (incorporating SAGA, SVRG, SARAH estimators) further reduce gradient sample complexity, with optimal rates for nonconvex finite-sum composite problems (Phan et al., 2023).
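To illustrate how variance reduction plugs into the MM subproblem, the sketch below uses an SVRG-style gradient estimator inside a quadratic surrogate for a finite-sum composite objective, which amounts to a proximal-SVRG step; `grad_fi`, `prox_r`, the inner-loop length, and the step size $1/L$ are illustrative assumptions, not the estimators and constants of Phan et al. (2023).

```python
import numpy as np

def mm_svrg(grad_fi, prox_r, n, L, x0, n_epochs=20, inner=None):
    """Sketch of variance-reduced MM for min_x (1/n) sum_i f_i(x) + r(x).

    Each inner step minimizes a quadratic surrogate of the smooth part whose
    gradient is replaced by an SVRG-style estimator, plus the (untouched)
    nonsmooth term r; that subproblem is a proximal step.
    """
    x = np.asarray(x0, dtype=float)
    m = inner or n                       # inner-loop length (illustrative choice)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        x_snap = x.copy()
        full_grad = np.mean([grad_fi(i, x_snap) for i in range(n)], axis=0)
        for _ in range(m):
            i = rng.integers(n)
            # Variance-reduced gradient estimator at the current point.
            v = grad_fi(i, x) - grad_fi(i, x_snap) + full_grad
            # Minimize the surrogate: quadratic model with curvature L plus r(x).
            x = prox_r(x - v / L, 1.0 / L)
    return x
```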
4. Algorithmic Variants and Extensions
| Variant | Distinguishing Feature | Typical Application/Advantage |
|---|---|---|
| Universal MM (Streeter, 2023) | Automatic, derivative-based bounds | Black-box optimization without tuning |
| Incremental MM (MISO) (Mairal, 2014) | Surrogates updated per sample | Large-scale sum-structure; linear rates |
| Stochastic MM (Mairal, 2013) | Surrogate per minibatch or sample | Online/streaming, memory and compute tractable |
| Variance-reduced MM (Phan et al., 2023) | SVRG/SAGA in MM subproblems | Best-known gradient complexities for finite-sums |
| Higher-order MM (Necoara et al., 2020, Lupu et al., 2021) | $p$-th-order Taylor bounding | Superlinear local convergence under regularity |
| Bregman MM (Martin et al., 13 Jan 2025) | Adaptive, potentially non-Euclidean surrogates | Accelerated convergence for composite objectives |
| MM for matrix means (Zhang, 2013) | Manifold optimization; closed-form updates | Riemannian means in SPD geometry |
| Min-max MM (MM4MM) (Saini et al., 12 Nov 2024) | Surrogate for min-max reformulations | Nonconvex-constrained signal-processing |
| Generalized MM (G-MM) (Parizi et al., 2015) | Relaxes "touching" to "progress" | Robustness to initialization, application-specific bias |
Algorithmic building blocks
- SafeRate and SafeCombination (Universal MM): Hyperparameter-free, uses local Taylor majorizers, can adapt the step size to arbitrary smooth $f$ (Streeter, 2023)
- MISO: Incremental update of per-sample surrogates, with the aggregate surrogate minimized at each iteration; achieves linear rates on large finite sums (Mairal, 2014); a minimal sketch follows this list
- Stochastic Proximal MM: Running surrogate is a weighted sum of per-sample surrogates, updated at every stochastic sample. Step-size schedule critical for rate (Mairal, 2013)
- Variance reduction: MM-SAGA/SVRG/SARAH—MM subproblems solved with variance-reduced first-order estimators (Phan et al., 2023)
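The following Python sketch illustrates the MISO building block in the simplest setting of quadratic per-sample surrogates for a smooth finite sum; the interface (`grad_fi`, a common curvature constant `L`) and the uniform sampling scheme are assumptions made for illustration, not the exact algorithm of Mairal (2014).

```python
import numpy as np

def miso_quadratic(grad_fi, n, L, x0, n_iters=1000):
    """Sketch of MISO with quadratic per-sample surrogates for min_x (1/n) sum_i f_i(x).

    Each sample i keeps an anchor z_i; its surrogate is
        f_i(z_i) + grad_fi(i, z_i)^T (x - z_i) + (L/2)||x - z_i||^2.
    The aggregate surrogate is minimized in closed form,
        x = mean(z_i) - mean(grad_fi(i, z_i)) / L,
    and one sample's anchor is refreshed per iteration.
    """
    x = np.asarray(x0, dtype=float)
    z = np.tile(x, (n, 1))                                # anchors z_i
    g = np.stack([grad_fi(i, x) for i in range(n)])       # stored gradients at anchors
    z_bar, g_bar = z.mean(axis=0), g.mean(axis=0)
    rng = np.random.default_rng(0)
    for _ in range(n_iters):
        x = z_bar - g_bar / L                             # minimize the aggregate surrogate
        i = rng.integers(n)                               # refresh one per-sample surrogate
        new_g = grad_fi(i, x)
        z_bar += (x - z[i]) / n                           # update running means incrementally
        g_bar += (new_g - g[i]) / n
        z[i], g[i] = x, new_g
    return x
```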
5. Practical Applications and Empirical Results
MM algorithms underpin a wide range of applications, including, but not limited to:
- Gaussian Mixture Regression, Multinomial Logistic Regression, and SVM: MM algorithms can build quadratic or EM-style surrogates yielding closed-form or fast iterative updates (Nguyen, 2016, Nguyen et al., 2017); a minimal logistic-regression sketch follows this list
- Nonnegative binary matrix factorization: MM with Jensen-type surrogates yields closed-form update rules competitive with logistic-PCA and interpretable factorization models (Magron et al., 2022)
- Bilevel hyperparameter optimization: MM on duality-reformulated single-level problems enables efficient conic-program subproblem solutions for otherwise intractable bilevel programs (Chen et al., 1 Mar 2024)
- Signal processing min-max problems: MM4MM leverages dual representations and majorized min-max alternation to enable hyperparameter-free, provably monotonic algorithms in phase retrieval, beamforming, sensor placement, etc. (Saini et al., 12 Nov 2024)
- High-dimensional regression with nonsmooth/nonconvex penalties: MM with iterated soft-thresholding or semismooth Newton subproblems achieves fast, scalable regression with theoretical guarantees for support recovery and convergence (Schifano et al., 2010, Tang et al., 2019)
- Deep neural network optimization: Universal MM methods can be applied layerwise to ensure safe, monotonic descent even under extreme overparameterization (Streeter, 2023)
- Dirichlet maximum-likelihood: Variable Bregman MM accelerates parameter estimation compared to Newton-type and fixed-metric methods (Martin et al., 13 Jan 2025)
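As one concrete instance of the regression-type applications above (the logistic case in the first bullet), a classical quadratic majorizer for the binary logistic log-likelihood uses the uniform Hessian bound $\nabla^2 f(w) \preceq \tfrac14 X^\top X$. The Python sketch below is a generic illustration of that bound, not the specific algorithms of the cited papers.

```python
import numpy as np

def mm_logistic(X, y, n_iters=100):
    """MM for binary logistic regression (labels y in {-1, +1}) via a fixed quadratic majorizer.

    The negative log-likelihood f(w) = sum_i log(1 + exp(-y_i * x_i^T w)) has
    Hessian bounded by B = (1/4) X^T X, so
        f(w) <= f(w_k) + grad^T (w - w_k) + 0.5 (w - w_k)^T B (w - w_k),
    and each MM step solves a linear system with the fixed matrix B.
    """
    n, d = X.shape
    B = 0.25 * X.T @ X + 1e-8 * np.eye(d)   # fixed curvature matrix (small ridge for invertibility)
    w = np.zeros(d)
    for _ in range(n_iters):
        margins = y * (X @ w)
        grad = -X.T @ (y * (1.0 / (1.0 + np.exp(margins))))   # gradient of f at w
        w = w - np.linalg.solve(B, grad)                       # minimize the quadratic surrogate
    return w
```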
6. Limitations, Challenges, and Current Directions
Despite their generality, MM algorithms face important operational and theoretical considerations:
- Surrogate construction is nontrivial for non-smooth, high-dimensional, or non-Euclidean objectives. Recent work on automatic differentiable majorizer construction and variable-metric methods has broadened their applicability (Streeter, 2023, Martin et al., 13 Jan 2025)
- The efficiency of MM steps depends critically on the tractability of the surrogate subproblems. For some models, these may require bespoke solvers or reformulations (e.g., conic programming for bilevel optimization (Chen et al., 1 Mar 2024), semismooth Newton for nonconvex regression (Tang et al., 2019)).
- MM convergence may be slow for ill-conditioned problems or when the surrogate is a loose (overly conservative) bound; acceleration methods such as quasi-Newton acceleration (e.g., SQUAREM), adaptive metric methods, variance reduction, and higher-order MM address this, but often require problem-specific tuning (arXiv:1001.4776, arXiv:2305.06848).
- Classical MM's requirement that surrogates touch the objective at the current iterate can be unnecessarily restrictive in nonconvex and latent-variable settings; the Generalized MM (G-MM) framework relaxes this via a "progress" requirement, enabling more robust optimization (Parizi et al., 2015).
- In stochastic and large-scale regimes, memory and communication constraints have led to the development of incremental, online, and variance-reduced MM variants; optimal choices of weights and batch sizes remain an active area of research (Mairal, 2013, Mairal, 2014, Phan et al., 2023).
- Extensions to non-Euclidean geometries, non-Lipschitz problems, or those with more complex constraint sets are ongoing areas of method development (Martin et al., 13 Jan 2025, Saini et al., 12 Nov 2024).
7. Summary Table: Key MM Algorithm Variants
| MM Variant / Context | Core Innovation | Notable Features |
|---|---|---|
| Universal MM (Streeter, 2023) | Automatic Taylor-remainder surrogates | Hyperparameter-free, black-box |
| Higher-Order MM (Necoara et al., 2020, Lupu et al., 2021) | $p$-th-order surrogates, fast local rates | Superlinear convergence, adaptive |
| Variance-Reduced MM (Phan et al., 2023) | SAGA/SVRG/SARAH in MM subproblems | Optimal sample complexity |
| Bregman MM (Martin et al., 13 Jan 2025) | Variable, adaptive Bregman majorizers | Accelerated convergence |
| MISO/Incremental MM (Mairal, 2014) | Per-sample surrogates, updated incrementally | Linear rates for convex |
| Generalized MM (G-MM) (Parizi et al., 2015) | Progress in place of touching constraint | Exploratory, less initialization-sensitive |
| MM for bilevel programs (Chen et al., 1 Mar 2024) | Surrogates for dual-based reformulations | Efficient conic subproblems |
| MM4MM min-max (Saini et al., 12 Nov 2024) | Surrogates on min-max reformulations | Monotonic, hyperparameter-free |
| Nonneg. binary NMF (Magron et al., 2022) | Closed-form Jensen surrogates | Interpretable factors, simple updates |
References
Key references for further reading include:
- (Streeter, 2023) Universal Majorization-Minimization Algorithms
- (Necoara et al., 2020) A systematic approach to general higher-order MM algorithms
- (Lupu et al., 2021) Convergence analysis of stochastic higher-order MM algorithms
- (Martin et al., 13 Jan 2025) Variable Bregman Majorization-Minimization Algorithm...
- (Phan et al., 2023) Stochastic Variance-Reduced Majorization-Minimization Algorithms
- (Chen et al., 1 Mar 2024) Lower-level Duality Based Reformulation and Majorization Minimization...
- (Mairal, 2014, Mairal, 2013) Incremental and stochastic MM for large-scale problems
- (Magron et al., 2022, Nguyen et al., 2017, Nguyen, 2016, Schifano et al., 2010) for classical and contemporary applications
The MM paradigm provides a versatile and well-founded approach for tackling the full complexity of contemporary machine learning, optimization, and statistical estimation problems—both theoretically and at scale.