Best Gaussian Approximation Methods

Updated 30 December 2025

Best Gaussian approximation is the optimal strategy of selecting a Gaussian measure that minimizes divergence metrics, such as KL and Wasserstein, from a target distribution.
It leverages variational and minimax principles to derive precise error bounds and convergence rates, especially in high-dimensional or dependent data settings.
Practical algorithms—including finite mixtures, normalizing flows, and geometric mappings—ensure robust and computationally efficient Gaussian approximations.

The best Gaussian approximation refers to optimal strategies, rates, and algorithms for approximating a target probability law, process, dataset, or function by a Gaussian distribution or a mixture thereof, under rigorous metrics such as Kullback–Leibler divergence, total variation, Wasserstein distance, or $L^2$ norm. This is foundational in statistical inference, signal processing, machine learning, and Bayesian inverse problems, and exhibits deep connections to optimal transport, information geometry, approximation theory, and empirical process theory. Recent research provides minimax rates, constructive algorithms, and precise error bounds for high-dimensional, dependent, and anisotropic settings.

1. Variational and Minimax Principles for Gaussian Approximation

The canonical definition of best Gaussian approximation is the minimizer of the Kullback–Leibler divergence (KL) from a class of Gaussian laws to a target probability law $\mu$ on $\mathbb{R}^d$ or, more generally, on a Hilbert space. For a target $\mu$ with density $\propto \exp(-V_1^\varepsilon(x)/\varepsilon - V_2(x))$ , the optimal Gaussian $\nu^*=N(m^*,\Sigma^*)$ satisfies

$(m^\ast,\Sigma^\ast) = \arg\min_{(m,\Sigma)\in\mathbb{R}^d\times\mathcal{S}^+_d} D_{\mathrm{KL}}(N(m,\Sigma) \Vert \mu_\varepsilon)$

where $D_{\mathrm{KL}}$ is the KL divergence. The explicit gradient conditions yield: $\nabla_m D_{\mathrm{KL}} = \frac{1}{\varepsilon} \mathbb{E}_{N(m,\Sigma)}[\nabla V_1^\varepsilon(X)] + \mathbb{E}_{N(m,\Sigma)}[\nabla V_2(X)] = 0$

$\partial_\Sigma D_{\mathrm{KL}} = \frac{1}{2\varepsilon} \mathbb{E}_{N(m,\Sigma)}[D^2 V_1^\varepsilon(X)] + \frac{1}{2} \mathbb{E}_{N(m,\Sigma)}[D^2 V_2(X)] - \frac{1}{2}\Sigma^{-1} = 0$

These conditions generalize to infinite-dimensional function spaces, where the best Gaussian $\nu=N(m,C)$ minimizes $D_{\mathrm{KL}}$ subject to equivalence with a reference Gaussian $\mu_0$ (Pinski et al., 2014, Lu et al., 2016).

2. Rates and Optimality in High Dimensions and Dependencies

For a sequence of i.i.d. or dependent random vectors $X_1,\dots,X_n\in\mathbb{R}^p$ , best Gaussian approximation bounds sharpen classical strong-coupling results. In the i.i.d. case with $p$ th moment and short-range dependence $\chi>0$ (functional dependence measure), the minimax coupling rate interpolates between $n^{1/2}$ (slow decay) and $n^{1/p}$ (rapid decay), generalizing Komlós–Major–Tusnády and Bentkus–Chernozhukov lower bounds: $\max_{1\le i\le n} |S_i - G_i| = o_P(n^{1/r(\chi,p)})$ where $r(\chi,p)$ is explicit (Karmakar et al., 2020). In high dimensions ( $p\gg n$ ), for independent mean-zero vectors under a restricted sub-Gaussian norm, one achieves

$\|S_n - Z\| = O_\mathbb{P}(p^{3/2}/\sqrt{n})$

uniformly in $p,n$ , with explicit constants, closing the gap left by previous dimension-dependent CLT results (Buzun et al., 2021).

3. Empirical Approximation and Complexity Floor

The empirical approximation of a standard Gaussian law in $\mathbb{R}^d$ by its empirical counterpart over a (potentially highly structured) subset $A\subset S^{d-1}$ yields an optimally tight uniform bound: $\sup_{x\in A,t\in\mathbb{R}} \left| F_{m,x}(t) - \Phi(t) \right| \leq \Delta + \sigma(t) \sqrt\Delta$ where $F_{m,x}(t)$ is the empirical CDF, $\sigma(t)=\sqrt{\Phi(t)(1-\Phi(t))}$ , and $\Delta \geq \Delta_0 \approx \gamma_1(A)/m$ with $\gamma_1(A)$ Talagrand's complexity. Both the error form and the threshold are minimax optimal, firmly linking approximation rates to the geometric complexity of the target set. This analysis further yields Wasserstein–2 (𝓦₂) bounds with precise quantile–coordinate rigidity for random Gaussian embeddings (Bartl et al., 2023).

4. Best Approximation by Finite Gaussian Mixtures

For arbitrary location–mixtures of Gaussians, the best finite $m$ -component mixture approximation within $f$ –divergence error $\epsilon$ is characterized by tail properties of the mixing law:

Compactly supported mixing distribution $[-M,M]$ :

$m^*(\epsilon) \asymp \frac{\log(1/\epsilon)}{\log(1 + \frac{1}{M} \sqrt{\log(1/\epsilon)})}$

Subgaussian tail parameter $\sigma$ :

$m^*(\epsilon) \asymp \sigma\log(1/\epsilon)$

Subexponential parameter $\beta$ :

$m^*(\epsilon) \asymp \beta (\log(1/\epsilon))^{3/2}$

Attainability leverages local moment matching and Gauss quadrature, while converses derive from low-rank, spectral analysis of trigonometric moment matrices and Toeplitz operators. These rates correct prior errors in $m/\sigma^2$ exponents for Gaussian–Gaussian mixture approximation (Ma et al., 2024).

5. Geometric and Universal Gaussian Approximation

Approximating general laws via pushforwards of Gaussians under diffeomorphisms (“ReparamGA”) or Riemannian exponential maps (“RiemannGA”) yields universal expressivity: $\forall\,p(x) > 0, \exists\,\phi \text{ diffeomorphism}: \quad \phi_* \mathcal{N}(0,I) = p(x)$ The construction employs the Rosenblatt transform and is exact for smooth positive densities. While a single universal mapping for a family $\{p_\alpha\}$ is obstructed by Chentsov's theorem, minimizing the expected divergence over a family yields nearly best geometric approximations. Practical algorithms are now built around normalizing flows (learned diffeomorphisms) and geometric Laplace–approximation, balancing tractability and expressive power (Costa et al., 1 Jul 2025).

6. Gaussian Approximation for Diffusions, Processes, and Master Equations

For small-noise diffusions, the KL–optimal Gaussian approximation aligns the mean and covariance with solutions to deterministic and Lyapunov ODEs, driving the leading order KL divergence to $O(\varepsilon)$ for noise parameter $\varepsilon$ . The error in total variation is $O(\varepsilon^{1/2})$ , and practical computation leverages closed-form ODE recursion for mean and variance (Sanz-Alonso et al., 2016). Similar advantages apply to master equations for Markov jump processes, where Gaussian closure reduces mean error from $O(\Omega^0)$ (van Kampen) to $O(\Omega^{-1/2})$ for system size $\Omega$ , with variance preserved at $O(\Omega^{1/2})$ error across both methods (Lafuerza et al., 2010).

7. Approximation of Alpha–Stable and Non-Gaussian Laws

For $\alpha$ –stable distributions, the LePage series expansion yields a “truncation + Gaussian tail” approximation, minimizing Kolmogorov distance to $O(1/c)$ where $c$ is the truncation level. This leads to sharply computable error bounds for inference: $\Delta(X,\widehat{X}) \leq B_5(c,\alpha,N)$ where $\widehat{X}$ denotes the truncated series plus matched Gaussian tail. This approach uniformly outperforms pure truncation and mixture-of-normals when $\alpha<2$ (Riabiz et al., 2018).

8. Structural, Algorithmic, and Error Analysis Tools

Several algorithmic paradigms are now established:

Sum-of-exponentials rational approximations achieve near-optimal geometric error decay ( $O(7.5^{-N})$ for $N$ modes) for 1D Gaussian kernel transforms (Jiang, 2019).
Separable, area-matching plus weighted least squares fitting provides efficient and accurate Gaussian parameter estimation for sampled data, delivering closed-form $\sigma$ and robust iterative schemes (Al-Nahhal et al., 2019).
$N$ -term Gaussian mixtures in $L_2(\mathbb{R}^2)$ can universally match curvelet sparsity rates and are “universal” for anisotropic classes, via two-stage approximation and a Fourier-domain analysis that exploits vanishing moments and parabolic scaling (Erb et al., 2019).
Moment-matching and Gauss–Hermite quadrature crucially surpass naive truncation for compact approximation, reducing Laplace transform error from $e^{-\Theta(a^2)}$ to $e^{-\Theta(a^2\log a)}$ , which enables “super-flat” mixtures with uniformly bounded derivatives (Polyanskiy et al., 2020).

9. Outlook and Open Problems

Research is ongoing on explicit determination of constants in the mean and tail exponents for best approximation rates, extension to multidimensional and general location–scale mixtures (where moment–tensor complexity grows), and nonconvexity and mode-capture properties for nonuniqueness in infinite-dimensional KL minimization. Further connections to optimal transport rigidity, empirical process minimax bounds, information geometry, and scalable algorithms remain central for high-dimensional Bayesian inference, machine learning model compression, and functional data analysis.

Selected Table: Minimax Rates for Gaussian Mixture Approximation (Ma et al., 2024)

Mixing law class	Min. $m^*(\epsilon)$ for error $\leq \epsilon$	Typical application
$[-M,M]$ (compact)	$\frac{\log(1/\epsilon)}{\log(1+1/M \sqrt{\log(1/\epsilon)})}$	Signal constellations, quadrature
$\\|X\\|_{\psi_2}\leq\sigma$ (subgaussian)	$\sigma \log(1/\epsilon)$	Channel noise, robust statistics
$\\|X\\|_{\psi_1}\leq\beta$ (subexponential)	$\beta (\log(1/\epsilon))^{3/2}$	Heavy-tailed processes

This summary integrates the current state of theory, practical schemes, sharp bounds, and geometric insights for the best Gaussian approximation in measure, data, function, and process spaces.