Instability in High-Dimensional Likelihoods

Updated 14 October 2025
  • Numerical instability in high-dimensional likelihoods is characterized by exponential computational complexity, ill-conditioned matrices, and vanishing likelihood values.
  • Mitigation strategies such as composite likelihoods, penalized regression, variational inference, and neural importance sampling have been developed to reduce bias and computational load.
  • Techniques including structure-exploiting factorizations, restricted optimization, and advanced Monte Carlo methods enable more accurate and feasible inference in complex high-dimensional models.

Numerical instability in high-dimensional likelihoods refers to the practical difficulties—including loss of precision, computational intractability, excessive bias or variance, and failure of optimization routines—that arise when evaluating, maximizing, or marginalizing likelihood functions involving a large number of variables or parameters. These issues are particularly acute in statistical inference for models such as multivariate extreme value distributions, large-scale latent variable models, profile likelihoods in high-energy physics, and high-dimensional Bayesian analyses. Numerical challenges stem from factors including dimensional combinatorial explosions, ill-conditioned matrices, vanishingly small likelihood values (equidismality), and poorly-scaling numerical integration. Researchers have developed a range of theoretical and algorithmic strategies to mitigate or circumvent these instabilities, employing methods such as composite likelihoods, penalized regression, variational inference, advanced Monte Carlo and quasi-Monte Carlo integration, neural importance sampling, and structure-exploiting matrix factorizations.

1. Sources and Manifestations of Instability

The core source of numerical instability in high-dimensional likelihoods is the exponential growth of computational complexity with respect to dimension. In the max-stable model context, for example, the full likelihood for $m$ components involves summing over all possible partitions of the index set, whose cardinality equals the $m$th Bell number $B_m$, scaling super-exponentially with $m$ (Bienvenüe et al., 2014). Even for $m=10$, this sum exceeds $10^5$ terms, making direct computation infeasible; a short numerical illustration follows the list below. In general, high dimensions can lead to:

  • Combinatorial explosion: Likelihoods that require enumeration or summation over exponentially many terms or partitions; e.g., the density $h(z) = \exp[-V^+(z)] \sum_{T\in\mathcal{I}} \prod_{B\in T} p(B; z)$ for max-stable laws (Bienvenüe et al., 2014).
  • Equidismality and identifiability issues: In chaotic systems, the likelihood surface becomes highly irregular (chaotic likelihoods), with almost all parameter values attaining extremely low likelihoods—rendering uniform or grid-based search futile (Du et al., 2014).
  • Poorly conditioned matrices: When fitting probabilistic ODE solvers or performing high-dimensional latent variable inference, relevant covariance or Fisher matrices become nearly singular or highly anisotropic, leading to catastrophic loss of numerical precision during inversion, decomposition, or optimization (Krämer et al., 2020).
  • Bias and incorrect inference: Inference procedures relying on event occurrence times in high-dimensional regimes may exhibit substantial bias due to finite-sample effects in the probabilities of occurrence partitions, especially under weak dependence (Wadsworth, 2014).
  • Vanishing or non-informative likelihoods: For likelihood-free methods and empirical likelihood, as dimension grows, the probability of matching observed data (or sufficient statistics) drops sharply, causing rejection rates to approach one and rendering estimates unstable or undefined (Kousathanas et al., 2015, Chang et al., 2018).
  • Poor scaling of integration methods: Both classical and quasi–Monte Carlo error rates for marginal likelihoods deteriorate rapidly with increasing dimension, and the advantage of low-discrepancy sequences (e.g., Halton) is offset by rapidly growing star discrepancy bounds (Tang, 26 Jun 2025).
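
To make the combinatorial growth above concrete, the short sketch below tabulates Bell numbers with the standard Bell-triangle recurrence (plain Python, no max-stable machinery assumed): already $B_{10} = 115{,}975 > 10^5$, and $B_{20}$ exceeds $5 \times 10^{13}$.

```python
def bell_numbers(m_max):
    """Return [B_0, ..., B_m_max] computed via the Bell triangle recurrence."""
    bells = [1]          # B_0
    row = [1]            # first row of the Bell triangle
    for _ in range(m_max):
        new_row = [row[-1]]              # next row starts with the previous row's last entry
        for entry in row:
            new_row.append(new_row[-1] + entry)
        row = new_row
        bells.append(row[0])             # leftmost entry of the new row is the next Bell number
    return bells

B = bell_numbers(20)
print(B[10])   # 115975          -- partition terms for m = 10
print(B[20])   # 51724158235372  -- hopeless for direct summation
```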

2. Strategies Based on Composite and Alternative Likelihoods

To circumvent the combinatorial intractability of likelihood computation, composite likelihoods have been extensively developed:

  • Partition composite likelihoods: The full likelihood is replaced with a product over block-wise (lower-dimensional) likelihoods, e.g., $l_2(\theta; z) = \sum_{B \in \mathcal{P}} \log f_\theta(z_B)$, where each block is of manageable size. Partitioning strategies (e.g., via clustering algorithms such as PAM) exploit strong within-block dependence and approximate independence across blocks. This approach achieves a significant reduction in calculation time and memory requirements while retaining much of the dependence structure (Bienvenüe et al., 2014); a minimal sketch of the construction follows this list.
  • Pairwise and higher-order composite likelihoods: Inference is based on marginal (sub-vector) likelihoods over pairs, triples, or $k$-tuples, e.g., $l_3(\theta; z) = \sum_{i<j} \log f_\theta(z_i, z_j)$. Statistical efficiency increases with the order of the tuples, but computational demands increase combinatorially; careful selection (e.g., via distance-based weighting) is used to balance tractability and informativeness (Beranger et al., 2019).
  • Stephenson–Tawn (ST) likelihoods: Use event occurrence information to limit the partition sum to only those compatible with observed events, transforming the combinatorial sum into a single likelihood term dependent on the observed partition (Beranger et al., 2019, Wadsworth, 2014). While this can eliminate intractability for moderate $d$, bias can arise if finite-sample probabilities for occurrence partitions deviate from their asymptotic limits.
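
A minimal sketch of the partition composite log-likelihood $l_2(\theta; z)$ referenced in the first bullet above. An equicorrelated Gaussian block density is used purely as a stand-in for the max-stable block densities $f_\theta(z_B)$; the partition, the data shapes, and the single correlation parameter are illustrative assumptions, not the construction of the cited papers.

```python
import numpy as np
from scipy.stats import multivariate_normal

def partition_composite_loglik(theta, Z, blocks):
    """l_2(theta; Z): sum over blocks B and observations of log f_theta(Z[:, B]).

    Z      : (n_obs, m) array of observations.
    blocks : list of index lists partitioning {0, ..., m-1}.
    theta  : correlation parameter of the stand-in equicorrelated Gaussian block density.
    """
    total = 0.0
    for B in blocks:
        d = len(B)
        cov = (1.0 - theta) * np.eye(d) + theta * np.ones((d, d))
        total += multivariate_normal(mean=np.zeros(d), cov=cov).logpdf(Z[:, B]).sum()
    return total

# Example: 12 components handled as 3 blocks of 4 instead of one 12-dimensional density.
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 12))
blocks = [list(range(0, 4)), list(range(4, 8)), list(range(8, 12))]
print(partition_composite_loglik(0.3, Z, blocks))
```

Maximizing this objective over $\theta$ with a generic optimizer mimics the composite-likelihood estimator while only ever evaluating low-dimensional block densities.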

Alternative frameworks include the threshold approach (using censored exceedances) and block maxima methods, both of which replace fully multivariate densities with simplified forms that focus on extremes or sufficiently informative subsets of the data (Bienvenüe et al., 2014).

3. Regularization, Penalization, and Matrix Factorization for Stability

In models with very high parameter counts (often $\gtrsim 10^5$), instability can arise from overfitting, lack of identifiability, and ill-conditioned optimizations. To address this, regularization and numerical stabilization are deployed:

  • Penalized likelihood with convex regularizers: Maximum entropy regularization (MEM), $\lambda \mathcal{R}_\mathrm{MEM}(x) = 2\lambda \sum_i [1 - x_i + x_i \ln x_i]$, and $\ell_2$-norm penalties are imposed on nuisance parameters (e.g., pixel- or energy-bin modulations) in high-dimensional spatial-spectral regression. These introduce stability by penalizing departures from prior structure without eliminating the flexibility to capture model misspecification (Storm et al., 2017); a minimal optimization sketch follows this list.
  • Restricted optimization methods: L-BFGS-B quasi-Newton algorithms equipped with bound constraints (e.g., non-negativity on flux coefficients) leverage analytic gradients and only require a limited gradient history, making them scalable for problems involving $10^5$ or more parameters (Storm et al., 2017).
  • Semi-sparse Cholesky factorization: Variance estimation is stabilized by leveraging the sparsity structure of the Fisher information matrix, using fill-reducing permutations and sparse Cholesky decompositions to propagate parameter covariances without explicit large-matrix inversion (Storm et al., 2017).
  • Matrix square-root (Cholesky/QR) propagation: For time-series or ODE solvers, only square roots of covariance matrices are propagated (in the Kalman or smoothing steps), preserving positive definiteness and eliminating negative-eigenvalue artifacts caused by round-off at high orders or small step sizes (Krämer et al., 2020); see the second sketch after this list.
  • Coordinate preconditioning: In ODE solvers, a carefully chosen invertible transformation (scaling with step size and order) is used to eliminate problematic dependencies (e.g., extreme anisotropy) in state transition and noise covariance matrices, reducing their condition numbers substantially (Krämer et al., 2020).
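
The following sketch illustrates penalized, bound-constrained likelihood optimization in the spirit of the first two bullets: per-bin modulation parameters $x_i$ of a Poisson model are fitted with the MEM penalty from above via SciPy's L-BFGS-B with non-negativity bounds. The Poisson form, the template, and the value of $\lambda$ are illustrative assumptions rather than the setup of the cited analysis.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
template = rng.uniform(5.0, 50.0, size=500)   # assumed prior expectation per bin
counts = rng.poisson(template)                # simulated observed counts
lam = 2.0                                     # illustrative penalty strength

def objective(x):
    """Negative Poisson log-likelihood of the modulations x, plus the MEM penalty."""
    mu = x * template
    nll = np.sum(mu - counts * np.log(mu))
    mem = 2.0 * lam * np.sum(1.0 - x + x * np.log(x))
    return nll + mem

def gradient(x):
    """Analytic gradient, as favored by limited-memory quasi-Newton methods."""
    return (template - counts / x) + 2.0 * lam * np.log(x)

res = minimize(objective, np.ones_like(template), jac=gradient, method="L-BFGS-B",
               bounds=[(1e-8, None)] * template.size)   # non-negativity via bounds
print(res.success, float(res.x.mean()))
```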
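
And a small sketch of square-root covariance propagation for a generic linear prediction step $P^- = A P A^\top + Q$: only Cholesky-type factors are stored, and the updated factor is obtained from a QR decomposition instead of forming and re-factorizing $P^-$. The matrices below are invented for illustration; the cited probabilistic ODE solvers apply the same device inside their filtering and smoothing recursions.

```python
import numpy as np

def sqrt_predict(L, A, L_Q):
    """Return a square-root factor S with S S^T = A L L^T A^T + L_Q L_Q^T.

    L, L_Q : square-root factors of the state and process-noise covariances.
    """
    M = np.hstack([A @ L, L_Q])               # P^- = M M^T, never formed explicitly
    _, R = np.linalg.qr(M.T, mode="reduced")  # M^T = Q R  =>  M M^T = R^T R
    return R.T

# Illustrative 3-dimensional example with a badly scaled covariance.
rng = np.random.default_rng(2)
A = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
L = np.linalg.cholesky(np.diag([1.0, 1e-6, 1e-12]))
L_Q = 1e-8 * np.eye(3)
S = sqrt_predict(L, A, L_Q)
print(np.allclose(S @ S.T, A @ L @ L.T @ A.T + L_Q @ L_Q.T))  # True; S S^T is PSD by construction
```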

4. Likelihood-Free Inference, Importance Sampling, and Machine Learning Methods

Likelihood-free methods and machine learning-driven importance sampling approaches directly tackle instability arising from intractable or vanishing likelihood surfaces:

  • ABC-PaSS (Approximate Bayesian Computation with Parameter Specific Statistics): Rather than matching all summary statistics in high-dimensional space, only one parameter is updated per iteration and the acceptance step is based on sufficient, parameter-specific low-dimensional statistics. This maintains high acceptance rates and yields stable posterior estimation even for models with many parameters (Kousathanas et al., 2015).
  • Neural importance sampling (NIS) and normalizing flow models: A tractable, invertible neural network (the flow) is trained to emulate the complex high-dimensional likelihood surface, enabling efficient sampling and stable weight calculation. Samples are then reweighted using $w(x) = p(x)/q(x)$, where $q(x)$ is the learned proposal, and the ML-based optimizer smooths noisy likelihood surfaces via gradient-based maximization using automatic differentiation (Heimel et al., 1 Nov 2024). This radically reduces variance and computational time in profile likelihood evaluation (e.g., from weeks on a CPU cluster to hours on a single GPU), facilitating routine application to global particle physics fits with $\mathcal{O}(40)$ parameters, but it remains sensitive to mode collapse and hyperparameter tuning; a stand-alone sketch of the reweighting step follows this list.
  • Pseudo-orbit Data Assimilation for chaotic systems: Embedding observations in higher-dimensional sequence space and minimizing dynamical mismatch (rather than optimizing the likelihood directly) enables importance sampling of states that are dynamically plausible, thus resolving the "equidismality" in high-dimensional, chaotic likelihoods (Du et al., 2014).
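
To isolate the reweighting step $w(x) = p(x)/q(x)$ described above, the sketch below uses a broad Gaussian as a stand-in proposal $q$ in place of a trained normalizing flow; the target density, the estimated quantity, and the effective-sample-size diagnostic are illustrative choices, not the pipeline of the cited work.

```python
import numpy as np
from scipy.stats import multivariate_normal

dim = 5
# Illustrative target p: a narrow, correlated Gaussian standing in for a likelihood surface.
target = multivariate_normal(mean=np.zeros(dim),
                             cov=0.1 * np.eye(dim) + 0.05 * np.ones((dim, dim)))
# Stand-in proposal q (in practice: a normalizing flow trained to emulate p).
proposal = multivariate_normal(mean=np.zeros(dim), cov=np.eye(dim))

rng = np.random.default_rng(3)
x = proposal.rvs(size=20_000, random_state=rng)
log_w = target.logpdf(x) - proposal.logpdf(x)   # log importance weights
w = np.exp(log_w - log_w.max())                 # stabilized in log space
w /= w.sum()

print(np.sum(w * x[:, 0] ** 2))   # self-normalized estimate of E_p[x_0^2] (true value 0.15)
print(1.0 / np.sum(w ** 2))       # effective sample size: a quick check against mode collapse
```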

5. High-Dimensional Integration: Monte Carlo and Quasi–Monte Carlo Approaches

Computing normalizing constants, marginal likelihoods, or Bayesian evidence in high dimensions is fraught with numerical instability arising from poor scaling of integration error:

  • MC and QMC theoretical error bounds: For fixed dimension $p$, the MC relative error with $m$ samples and $n$ data points scales as $\mathcal{O}\!\left(\sqrt{n \log(n)^{p+1} \log(m) / m}\right)$, while for QMC with low-discrepancy grids the bound is $\mathcal{O}\!\left(\log(m)^p \log(n)^{p/2} / m\right)$ (Tang, 26 Jun 2025). For sufficiently large $m$ and moderately sized $p$, QMC outperforms MC if the condition $\log(m)^{p-1/2} / \sqrt{m\, n \log(n)} < 1$ is satisfied; a small numerical comparison follows this list.
  • Curse of dimensionality and low-discrepancy sequences: As $p \to \infty$, the constants in QMC error rates (e.g., with Halton sequences) deteriorate exponentially, e.g., $D_m^*(x_1, \ldots, x_m) = O\!\left((4e)^p\, p^{3/2} \log(p) \log(m)^p / m\right)$. Although the asymptotic convergence in $m$ is theoretically better for QMC, in high dimensions the prefactor swamps the advantage, and MC can provide more favorable practical error rates (Tang, 26 Jun 2025). QMC's advantage is thus most pronounced when $p$ is fixed or moderate and $n$ is large, and it is neutralized as $p$ increases.
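
The practical MC/QMC contrast can be probed numerically; the snippet below compares pseudorandom and scrambled Halton estimates of a simple separable integral over $[0,1]^p$ whose true value is 1. The integrand and the dimensions are chosen purely for illustration and do not reproduce the quoted error bounds.

```python
import numpy as np
from scipy.stats import qmc

def integrand(u):
    """Separable test integrand on [0,1]^p with exact integral 1."""
    return np.prod(1.0 + 0.5 * (u - 0.5), axis=1)

m = 2 ** 14
rng = np.random.default_rng(4)
for p in (2, 10, 50):
    u_mc = rng.random((m, p))                                 # plain Monte Carlo points
    u_qmc = qmc.Halton(d=p, scramble=True, seed=4).random(m)  # low-discrepancy points
    err_mc = abs(integrand(u_mc).mean() - 1.0)
    err_qmc = abs(integrand(u_qmc).mean() - 1.0)
    print(f"p={p:2d}  MC error {err_mc:.2e}  QMC error {err_qmc:.2e}")
```

In low dimensions the Halton estimate is usually markedly more accurate at the same $m$; how the comparison behaves as $p$ grows depends on the integrand, in line with the dimension-dependent constants quoted above.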

6. Empirical Likelihood, Over-Identification, and Model Testing

Empirical likelihood methods in high dimensions are destabilized by over-identification (large numbers of moment conditions), high-dimensional nuisance parameters, and associated ill-conditioned matrices:

  • Transformation and projection methods: Rather than directly optimizing over all estimating equations, a carefully chosen linear transformation projects the original estimation function onto a lower-dimensional space, emphasizing components relevant to parameters of interest while rendering the impact of high-dimensional nuisance parameter estimation asymptotically negligible (Chang et al., 2018). For example,

$f^{A_n}(X; \theta) = A_n g(X; \theta)$

where $A_n$ is selected to be nearly orthogonal to the nuisance-parameter gradients, thus stabilizing confidence region estimation; a toy numerical sketch follows this list.

  • Marginal likelihood ratio testing: For over-identified models, tests are constructed based on the maximum marginal empirical likelihood ratio (evaluated for each estimating equation individually), avoiding full-rank matrix inversion and securing stability even when the number of estimating equations or parameters grows exponentially with sample size (Chang et al., 2018).
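
A toy sketch of the projection idea from the first bullet: the full vector of estimating functions $g(X;\theta)$ is multiplied by a chosen $A_n$ so that inference uses only a low-dimensional combination. The linear moment model, the particular $A_n$, and the grid search are illustrative stand-ins for the formal construction in the cited work.

```python
import numpy as np

rng = np.random.default_rng(5)
n, r, k = 500, 40, 2        # n observations, r estimating equations, k projected components

# Illustrative over-identified linear moment model: E[z_j * (y - x * theta)] = 0 for all j.
theta_true = 1.5
x = rng.standard_normal(n)
z = np.column_stack([x] + [x + rng.standard_normal(n) for _ in range(r - 1)])
y = theta_true * x + rng.standard_normal(n)

def g(theta):
    """n x r matrix of estimating functions g(X_i; theta)."""
    return z * (y - x * theta)[:, None]

# Projection A_n (k x r), chosen here for illustration only.
A_n = np.vstack([np.ones(r) / r, np.linspace(-1.0, 1.0, r)])

def projected_moments(theta):
    """Sample mean of f^{A_n}(X; theta) = A_n g(X; theta)."""
    return A_n @ g(theta).mean(axis=0)

# Solve the k-dimensional projected moment condition by a crude grid search over theta.
grid = np.linspace(0.0, 3.0, 3001)
norms = [np.linalg.norm(projected_moments(t)) for t in grid]
print(grid[int(np.argmin(norms))])   # close to theta_true, using only k of the r equations
```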

7. Simulation Studies and Empirical Comparisons

Simulation studies are routinely used to compare the finite-sample performance and numerical stability of alternative inference procedures:

  • Partition composite likelihoods vs pairwise methods: Monte Carlo experiments confirm that smart partitioning (e.g., MpcLE) achieves lower bias and variance than pairwise composite likelihood estimators (MpmLE), with efficiency losses being limited except in the most complex dependence structures (Bienvenüe et al., 2014).
  • Bias correction via finite-sample partition probability refinements: Second-order corrections substantially reduce bias in estimation for weakly dependent high-dimensional max-stable models, and the computational overhead is manageable compared to the improvement in statistical accuracy (Wadsworth, 2014).
  • Flow-based and neural-based profile likelihood methods: Compared with grid-based or classical CPU-bound approaches, neural importance sampling with gradient-based maximization delivers orders-of-magnitude improvements in precision and timing for large-scale SMEFT LHC analyses (Heimel et al., 1 Nov 2024).

In summary, numerical instability in high-dimensional likelihoods arises from a confluence of combinatorial complexity, ill-conditioning, vanishing likelihoods, and poor scaling of error rates. State-of-the-art research leverages composite and alternative likelihoods, penalized optimization, structure-exploiting factorizations, parameter-specific or projected inference, neural samplers, and advanced integration schemes to stabilize inference, maintain computational feasibility, and preserve statistical efficiency. The balance between algorithmic tractability, statistical validity, and resource requirements continues to motivate innovations in both theory and practice across application domains.
