
Maximum Likelihood Estimation

Updated 14 January 2026
  • Maximum Likelihood Estimation is a method that selects parameter values by maximizing the likelihood function, ensuring consistency, asymptotic normality, and efficiency under regularity conditions.
  • It employs iterative procedures like Newton-Raphson, EM algorithms, and algebraic techniques to solve complex score equations in diverse statistical models.
  • MLE underpins applications in areas such as factor analysis, latent variable models, and time series, emphasizing challenges like non-concavity and computational feasibility.

Maximum likelihood estimation (MLE) is a foundational inferential procedure for parameter estimation in statistical modeling. It is defined by the selection of parameter values that maximize the likelihood function, which quantifies the probability (or probability density) of the observed data under a specified model. The MLE, when regularity and identifiability conditions are satisfied, exhibits desirable properties including consistency, asymptotic normality, and (local) efficiency.

1. Formal Definition and General Principles

For a parametric model $f(y;\theta)$, given i.i.d. data $y_1,\dotsc,y_n$, the log-likelihood is

$$\ell(\theta) = \sum_{i=1}^n \log f(y_i;\theta).$$

The maximum likelihood estimator $\hat\theta_{\mathrm{ML}}$ is any solution to

$$\hat\theta_{\mathrm{ML}} = \operatorname{argmax}_{\theta \in \Theta}\ \ell(\theta),$$

and a necessary first-order (stationarity) condition is given by the score equations,

$$U(\theta) = \nabla_\theta \ell(\theta) = 0.$$

The Fisher information matrix, $I(\theta) = -\mathbb{E}_\theta[\nabla^2_\theta \ell(\theta)]$ (the expected information; its sample analogue without the expectation is the observed information), measures the local curvature of the log-likelihood and governs the asymptotic dispersion of the estimator. Under regularity, $\sqrt{n}(\hat\theta_{\mathrm{ML}} - \theta_0) \to_d N(0, I(\theta_0)^{-1})$, where $I(\theta_0)$ denotes the per-observation information, and the Cramér–Rao bound is attained asymptotically (Vella, 2018).
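As a minimal illustration of these definitions (an assumed example, not drawn from the cited papers), consider the rate $\theta$ of an exponential sample: the log-likelihood $\ell(\theta) = n\log\theta - \theta\sum_i y_i$ has the closed-form maximizer $\hat\theta = 1/\bar y$, which a generic numerical optimizer recovers:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative sketch: MLE for the rate theta of an exponential sample.
# The log-likelihood is l(theta) = n*log(theta) - theta*sum(y), whose
# unique maximizer is theta_hat = 1/mean(y); we also recover it numerically.
rng = np.random.default_rng(0)
theta0 = 2.5                                    # true rate (assumed for the demo)
y = rng.exponential(scale=1 / theta0, size=5000)

def neg_loglik(theta):
    return -(len(y) * np.log(theta) - theta * y.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded")
theta_hat = res.x
print(theta_hat, 1 / y.mean())                  # the two estimates coincide
```

With $n = 5000$, $\hat\theta$ falls within a few hundredths of the true rate, consistent with the $O(n^{-1/2})$ asymptotic dispersion above.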

2. Solution Methods and Algorithmic Variants

Direct analytic solutions to the score equations are only possible for limited model classes. In practice, MLE computation typically employs iterative continuous optimization schemes such as:

  • Newton–Raphson or Fisher scoring methods (using first and second derivatives).
  • Expectation-Maximization (EM) for models with latent variables or missing data (Gonzalez et al., 2014).
  • Specialized algebraic and symbolic methods for systems with polynomial structure (Fukasaku et al., 2024).
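A Newton–Raphson iteration of the kind listed above can be sketched for logistic regression, where the score and Hessian have simple closed forms (a generic textbook example, not taken from the cited papers; here Fisher scoring and Newton–Raphson coincide):

```python
import numpy as np

# Newton-Raphson for a logistic-regression MLE (assumed example).
# Update: beta <- beta + H^{-1} U, with U the score and H the information.
rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.2])               # assumed true coefficients
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    U = X.T @ (y - p)                           # score vector
    H = X.T @ (X * (p * (1 - p))[:, None])      # information matrix
    step = np.linalg.solve(H, U)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:            # convergence in parameters
        break

print(beta)
```

Because the logistic log-likelihood is concave, the iteration converges from the zero start in a handful of steps; for non-concave likelihoods the same update requires safeguards (step halving, multiple starts).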

For multidimensional, nonlinear, or constrained models, the MLE optimization can involve additional steps:

  • Checking the negative-definiteness of the Hessian (equivalently, positive-definiteness of the observed information) at candidate solutions, as in factor analysis (Fukasaku et al., 2024).
  • Application of Lagrange multipliers for constrained parameter spaces (e.g., enforcing probabilities sum to one or ensuring variance/covariance parameters remain positive-definite).
  • Handling non-convexity, potential multimodality, and boundary solutions.

An emerging approach is to reformulate the likelihood root-finding as a maximum-entropy convex program in a higher-dimensional simplex; this can safeguard against issues such as data separation in logistic regression and avoids Hessian computations and sensitivity to initial guesses (Calcagnì et al., 2019).

3. Specialized Procedures and Model Classes

Factor Analysis and Polynomial Score Systems

In models like exploratory factor analysis, the first-order likelihood equations are inherently multivariate polynomial systems. Here, the solution proceeds by:

  1. Writing the polynomial ideal of score equations.
  2. Computing a Gröbner basis for triangularization.
  3. Extracting all real solutions (including “improper” solutions with boundary or negative variance) via root-finding or cylindrical algebraic decomposition (CAD).
  4. Screening solutions by positive-definiteness of the observed Fisher information and physical constraints on parameters (Fukasaku et al., 2024).

This “algebraic” approach delivers all stationary candidates and does not depend on initial values, unlike iterative numerical optimizers, but brings considerable symbolic complexity.
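The enumerate-then-screen idea can be shown on a toy one-parameter model (the cited paper treats factor analysis; this simpler example is an assumption for illustration). For $y_i \sim N(\theta, \theta^2)$ with $\theta > 0$, clearing denominators in the score equation yields the polynomial $n\theta^2 + S_1\theta - S_2 = 0$ with $S_1 = \sum_i y_i$, $S_2 = \sum_i y_i^2$, so all stationary points can be found exactly and then screened against the constraint $\theta > 0$:

```python
import numpy as np
import sympy as sp

# Toy algebraic MLE: for y_i ~ N(theta, theta^2), theta > 0, the score
# equation (denominators cleared) is n*theta^2 + S1*theta - S2 = 0,
# so ALL stationary candidates are enumerable without initial values.
rng = np.random.default_rng(2)
y = rng.normal(loc=2.0, scale=2.0, size=400)    # true theta = 2 (assumed)
n, S1, S2 = len(y), float(y.sum()), float((y ** 2).sum())

theta = sp.symbols("theta")
roots = sp.solve(sp.Eq(n * theta ** 2 + S1 * theta - S2, 0), theta)
# screening step: keep only real, positive roots (model constraint)
admissible = [float(r) for r in roots if r.is_real and r > 0]
theta_hat = admissible[0]
print(theta_hat)
```

Since the root product is $-S_2/n < 0$, exactly one root is positive, matching the closed form $\hat\theta = \bigl(-S_1 + \sqrt{S_1^2 + 4nS_2}\bigr)/(2n)$.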

Latent Variable and Incomplete Data Models

Incomplete data models (e.g., controlled branching processes, latent class models) often require EM algorithms for maximum likelihood estimation. The canonical workflow entails:

  1. E-step: Compute conditional expectations of missing data statistics under current parameter estimates.
  2. M-step: Maximize the expected complete-data log-likelihood (the “Q function”) with respect to the parameters.
  3. Iterate until convergence in parameter estimates or log-likelihood.
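The E/M steps above can be sketched for a two-component Gaussian mixture (an assumed example; the cited paper treats branching processes, but the alternating structure is identical):

```python
import numpy as np

# Minimal EM for a two-component Gaussian mixture (illustrative sketch).
rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

pis = np.array([0.5, 0.5])                      # mixing weights
mu = np.array([-1.0, 1.0])                      # component means
sigma = np.array([1.0, 1.0])                    # component std devs

def loglik():
    dens = np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.log(dens @ pis).sum()

lls = [loglik()]
for _ in range(200):
    # E-step: posterior responsibilities given current parameters
    dens = np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    r = dens * pis
    r /= r.sum(axis=1, keepdims=True)
    # M-step: weighted ML updates maximizing the Q function
    nk = r.sum(axis=0)
    pis = nk / len(y)
    mu = (r * y[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (y[:, None] - mu) ** 2).sum(axis=0) / nk)
    lls.append(loglik())
```

Each iteration provably does not decrease the observed-data log-likelihood, which is the standard convergence diagnostic to monitor.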

For branching processes, explicit E- and M-step formulas adapt to what is observed: full tree, generation sizes with known/unknown number of progenitors, or just generation sizes (Gonzalez et al., 2014).

Constrained and Regularized Optimization (e.g., ARMA Models)

Time series and autoregressive models (ARMA) often impose constraints (e.g., causality, invertibility). Classical procedures, such as the Jones reparametrization, enforce these via unconstrained parametrizations in partial autocorrelation space. Modern methods recast this as a box-constrained nonlinear optimization in the space of partial autocorrelations and partial MA coefficients, with optional Tikhonov ($\ell_2$) regularization to penalize boundary-adjacent solutions, improve numerical conditioning, and enhance forecasting accuracy (Gangi et al., 2022).

4. Likelihood Approximation and Intractable Models

When the likelihood is analytically intractable (e.g., models with high-dimensional integrals), maximum approximated likelihood estimation (MALE) is employed:

  • Each marginalized or integrated likelihood term is approximated by numeric quadrature, Monte Carlo, quasi–Monte Carlo, or sparse grids.
  • Accuracy of the approximation must scale with the sample size (often, the number of quadrature points $r$ must grow at least as fast as the sample size $n$ to maintain consistency and efficiency).
  • Under explicit regularity conditions, the MALE matches the asymptotic efficiency of the true MLE as long as the approximation error decays sufficiently rapidly with $n$ (Griebel et al., 2019).
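The quadrature idea can be checked on a case where the integrated likelihood is known exactly (an assumed example, not from the cited paper): the marginal density of $y = b + \varepsilon$ with latent $b \sim N(0,1)$ and noise $\varepsilon \sim N(0,1)$ is the integral $\int \varphi(y-b)\varphi(b)\,db$, which equals the $N(0,2)$ density, so the Gauss–Hermite error is directly observable:

```python
import numpy as np

# Gauss-Hermite approximation of a marginalized likelihood term (sketch).
# hermgauss gives nodes/weights for the weight e^{-x^2}; the change of
# variables b = sqrt(2)*x converts this to integration against N(0, 1).
def marginal_lik(y, r):
    x, w = np.polynomial.hermite.hermgauss(r)
    b = np.sqrt(2.0) * x
    phi = np.exp(-0.5 * (y - b) ** 2) / np.sqrt(2 * np.pi)
    return (w * phi).sum() / np.sqrt(np.pi)

y = 1.3
exact = np.exp(-y ** 2 / 4) / np.sqrt(4 * np.pi)   # N(0, 2) density at y
print(marginal_lik(y, 20), exact)
```

For this smooth one-dimensional integrand the error decays essentially exponentially in $r$, matching the Gaussian-quadrature row of the table below; with 20 nodes the approximation is already accurate to many digits.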

Table: Main Approximation Methods in MALE

| Method              | Error Rate                  | Suitability ($d$ = dimension)            |
|---------------------|-----------------------------|------------------------------------------|
| Monte Carlo         | $O(r^{-1/2})$               | High $d$, non-smooth integrands          |
| Quasi-Monte Carlo   | $O(r^{-1+\epsilon})$        | Moderate $d$, smooth integrands          |
| Gaussian quadrature | Exponential or $O(r^{-k})$  | $d = 1$, $C^k$ or analytic integrand     |
| Sparse grids        | $O(r^{-k/2})$ (approx.)     | Small-to-moderate $d$, smooth integrand  |

5. Model-Specific MLE: Examples

Wishart Processes

MLE for diffusions such as the Wishart process is built via Girsanov transformation, requiring specialized likelihood derivations, solution of coupled matrix equations for drift parameters, and asymptotics that depend on ergodicity and parameter regimes. The rates of convergence differ dramatically across ergodic and nonergodic regimes, and optimality is established via local minimax theory (Alfonsi et al., 2015).

Truncated Univariate Distributions

For distributions like the truncated normal or lognormal, maximum likelihood estimation is hindered by ill-conditioning near limiting regimes (e.g., when the lognormal approximates a pure power law). Reparameterizations (e.g., using $\beta$ and $\psi$ instead of $\mu$ and $\sigma^2$) yield numerically stable Newton-type schemes tailored to handle parameter degeneracies and ensure robust convergence (Pueyo, 2014).
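A direct numerical MLE for a left-truncated normal can be sketched as follows (a generic illustration with an assumed truncation point of 0, not the specific $\beta$/$\psi$ reparameterization of the cited paper; a log-scale parameter is used as a simple device to keep $\sigma > 0$):

```python
import numpy as np
from scipy import optimize, stats

# MLE for y ~ N(mu, sigma^2) conditioned on y > 0 (left truncation at 0).
# Truncated log-density: logpdf(y; mu, sigma) - log P(Y > 0; mu, sigma).
rng = np.random.default_rng(4)
mu0, sigma0 = 1.0, 2.0                          # assumed true parameters
raw = rng.normal(mu0, sigma0, size=20000)
y = raw[raw > 0.0][:5000]

def nll(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                   # reparameterize: sigma > 0
    return -(stats.norm.logpdf(y, mu, sigma)
             - stats.norm.logsf(0.0, mu, sigma)).sum()

start = np.array([y.mean(), np.log(y.std())])
res = optimize.minimize(nll, start, method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)
```

With mild truncation (here about 69% of the mass retained) the surface is well conditioned; as truncation becomes severe, $\mu$ and $\sigma$ become nearly unidentifiable and this naive parametrization degrades, which is precisely the regime the stabilized reparameterizations target.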

6. Practical and Computational Considerations

MLE solution procedures must address:

  • Non-concavity of the log-likelihood surface (with possible local maxima/minima).
  • Sensitivity to starting values when using iterative gradient-based optimizers.
  • Existence of “improper” solutions, especially when variances may become zero (common in factor analysis).
  • Ensuring computational feasibility in polynomial, high-dimensional, or non-linear systems (choice of method: numerical, symbolic, convex reformulation, or regularized optimization).
  • For EM and similar algorithms, convergence monitoring (tolerance levels for change in parameters/log-likelihood) and strategies for global optimization (multiple random starts, selection by highest achieved likelihood) (Gonzalez et al., 2014).
  • In approximated MLE, designing the approximation so that the error remains dominated by sampling fluctuations as $n$ increases (Griebel et al., 2019).
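The multiple-random-starts strategy can be sketched on the Cauchy location likelihood, a classical case where the log-likelihood is multimodal for small samples (an assumed toy dataset; starting one optimizer per observation is a common heuristic for this model):

```python
import numpy as np
from scipy import optimize

# Multi-start safeguard for a multimodal likelihood: Cauchy location model.
# Negative log-likelihood up to constants: sum of log(1 + (y_i - theta)^2).
y = np.array([-4.2, -1.0, 0.3, 1.1, 18.0, 19.5])

def nll(theta):
    return np.log1p((y - theta) ** 2).sum()

starts = np.concatenate([y, [np.median(y)]])    # one start per observation
fits = [optimize.minimize(nll, np.array([s]), method="BFGS") for s in starts]
best = min(fits, key=lambda r: r.fun)           # select by achieved likelihood
theta_hat = best.x[0]
print(theta_hat, best.fun)
```

Starts near the two outlying points converge to a worse local mode near 18–19; selection by highest achieved likelihood picks the global optimum near the main cluster of observations.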

7. Extensions and Alternative Formulations

Recent research explores alternative formulations of the likelihood optimization problem:

  • Reparameterization of the MLE problem in a maximum entropy framework, leveraging convexity and avoiding reliance on first/second derivative information of the original log-likelihood. This can also guarantee finite estimates in numerically challenging settings such as data separation in logistic regression (Calcagnì et al., 2019).
  • Integration of model selection criteria (AIC, BIC) and penalized likelihoods in MLE frameworks for model identification and overfitting avoidance.
  • Use of algebraic geometry and computational algebra in the explicit enumeration of all stationary points for models with polynomial score systems (Fukasaku et al., 2024).

The procedural and computational choices in applying MLE are dictated by the structure of the statistical model, its constraints, computational tractability, and the relevant asymptotic theory. The referenced literature details both general and model-specific strategies for optimizing the likelihood, ensuring both statistical rigor and practical applicability.
