Learned Harmonic Mean Estimator (LHME)
- LHME is a Bayesian evidence estimator that learns an internal importance sampling target via normalizing flows to achieve finite variance.
- It employs temperature scaling and support control to mitigate the infinite variance issues inherent in classical harmonic mean estimators.
- LHME is sampler-agnostic and works with posterior samples from MCMC or variational inference, providing accurate model comparison across diverse benchmarks.
The Learned Harmonic Mean Estimator (LHME) is a robust, scalable, and flexible technique for estimating Bayesian evidence (marginal likelihood) for model comparison. LHME resolves the infinite variance pathologies of classical harmonic mean estimators by learning a suitable internal importance sampling target distribution, typically parameterized using expressive normalizing flow architectures. The estimator requires only posterior samples, rendering it agnostic to the sampling strategy and enabling direct applicability to saved MCMC chains or variational inference outputs. LHME achieves finite variance and unbiasedness through temperature scaling and careful support control of the learned target, with demonstrated accuracy and computational efficiency across benchmarks from low-dimensional toy models to high-dimensional cosmological datasets (Polanska et al., 2024, Polanska et al., 2023, McEwen et al., 2021, Hu et al., 21 Jan 2026).
1. Bayesian Model Evidence and the Harmonic Mean Estimator
Given observed data $y$, model parameters $\theta$, likelihood $\mathcal{L}(\theta) = p(y \mid \theta)$, and prior $\pi(\theta)$, the Bayesian evidence is the normalizing constant of the posterior:

$$z = p(y) = \int \mathcal{L}(\theta)\,\pi(\theta)\,\mathrm{d}\theta, \qquad p(\theta \mid y) = \frac{\mathcal{L}(\theta)\,\pi(\theta)}{z}.$$

This quantity determines the relative posterior probability of competing models via the Bayes factor. Direct computation of $z$ by brute-force quadrature is intractable in moderate- to high-dimensional parameter spaces.
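For a model with a conjugate structure the integral is analytic, which makes a useful sanity check. The sketch below (a 1D Gaussian likelihood with a Gaussian prior, an assumed toy setup) compares brute-force quadrature against the closed form $z = \mathcal{N}(y;\,0,\,\sigma^2 + \tau^2)$ — feasible only because the parameter space is one-dimensional:

```python
# Evidence z = ∫ L(θ) π(θ) dθ for a 1D Gaussian likelihood N(y; θ, σ²)
# with Gaussian prior N(θ; 0, τ²), where z = N(y; 0, σ² + τ²) analytically.
import numpy as np

y, sigma, tau = 1.5, 1.0, 2.0  # observation, likelihood std, prior std

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

theta = np.linspace(-20, 20, 100_001)               # quadrature grid
integrand = gauss(y, theta, sigma) * gauss(theta, 0.0, tau)
z_quad = integrand.sum() * (theta[1] - theta[0])    # Riemann-sum evidence
z_true = gauss(y, 0.0, np.sqrt(sigma**2 + tau**2))  # analytic evidence

print(z_quad, z_true)  # the two agree to high precision on this grid
```

The same grid-based approach at, say, 20 dimensions would require $10^{5 \times 20}$ evaluations, which is the intractability the text refers to.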
The classical harmonic mean estimator (HM) follows from the identity

$$\frac{1}{z} = \mathbb{E}_{p(\theta \mid y)}\!\left[\frac{1}{\mathcal{L}(\theta)}\right],$$

which leads to the Monte Carlo estimator

$$\hat{\rho} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathcal{L}(\theta_i)}, \qquad \theta_i \sim p(\theta \mid y),$$

and $\hat{z} = 1/\hat{\rho}$. However, when the posterior tails are thinner than the prior's, the implicit importance weights become highly variable, frequently leading to infinite variance and estimator collapse (Polanska et al., 2024, Polanska et al., 2023, McEwen et al., 2021).
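The failure mode is easy to reproduce. In the sketch below (an assumed toy conjugate Gaussian model with a broad prior, with exact posterior draws standing in for an MCMC chain), the weight $1/\mathcal{L}(\theta)$ is so heavy-tailed under the posterior that repeated HM estimates scatter widely around the true evidence:

```python
# Classical harmonic mean 1/ẑ = mean(1/L(θ_i)) on exact posterior samples
# from a 1D conjugate model: likelihood N(y; θ, 1), broad prior N(0, τ²).
import numpy as np

rng = np.random.default_rng(0)
y, tau = 1.0, 10.0
s2_post = 1.0 / (1.0 + 1.0 / tau**2)     # posterior variance
mu_post = s2_post * y                    # posterior mean
z_true = np.exp(-0.5 * y**2 / (1 + tau**2)) / np.sqrt(2 * np.pi * (1 + tau**2))

def hm_estimate(n):
    theta = rng.normal(mu_post, np.sqrt(s2_post), size=n)
    inv_like = np.sqrt(2 * np.pi) * np.exp(0.5 * (y - theta) ** 2)  # 1/L(θ)
    return 1.0 / inv_like.mean()

estimates = np.array([hm_estimate(10_000) for _ in range(50)])
# The spread across repeats stays large regardless of the per-repeat sample
# count — the hallmark of the heavy-tailed, (near-)infinite-variance regime.
print(z_true, estimates.min(), estimates.max())
```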
2. Theoretical Foundation and Derivation of LHME
A generalization introduces an arbitrary positive normalized density $\varphi(\theta)$ supported within the posterior:

$$\frac{1}{z} = \mathbb{E}_{p(\theta \mid y)}\!\left[\frac{\varphi(\theta)}{\mathcal{L}(\theta)\,\pi(\theta)}\right].$$

The empirical LHME estimator is

$$\hat{\rho} = \frac{1}{N} \sum_{i=1}^{N} \frac{\varphi(\theta_i)}{\mathcal{L}(\theta_i)\,\pi(\theta_i)}, \qquad \theta_i \sim p(\theta \mid y),$$

with evidence estimate $\hat{z} = 1/\hat{\rho}$.
The key insight is that the optimal choice $\varphi^{*}(\theta) = p(\theta \mid y) = \mathcal{L}(\theta)\,\pi(\theta)/z$ yields constant importance weights and zero variance, but it requires access to $z$, the very quantity being estimated. LHME circumvents this by learning a surrogate density $\varphi(\theta) \approx p(\theta \mid y)$ using machine learning techniques, such that $\varphi$ is contained within the posterior and has sufficiently light tails (Polanska et al., 2024, Polanska et al., 2023, McEwen et al., 2021, Hu et al., 21 Jan 2026).
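On the same toy conjugate Gaussian model used above (an assumed illustration, with a hand-chosen normalized Gaussian $\varphi$ narrower than the posterior rather than a learned one), the re-targeted estimator is stable because the weights $\varphi/(\mathcal{L}\pi)$ are bounded:

```python
# Re-targeted harmonic mean: 1/z = E_post[ φ(θ) / (L(θ) π(θ)) ] for any
# normalized φ contained in the posterior; here φ is a Gaussian with half
# the posterior variance, giving bounded weights and a stable estimate.
import numpy as np

rng = np.random.default_rng(1)
y, tau = 1.0, 10.0
s2_post = 1.0 / (1.0 + 1.0 / tau**2)
mu_post = s2_post * y
z_true = np.exp(-0.5 * y**2 / (1 + tau**2)) / np.sqrt(2 * np.pi * (1 + tau**2))

def gauss(x, mu, s2):
    return np.exp(-0.5 * (x - mu) ** 2 / s2) / np.sqrt(2 * np.pi * s2)

theta = rng.normal(mu_post, np.sqrt(s2_post), size=200_000)  # posterior draws
like = gauss(y, theta, 1.0)                  # likelihood L(θ)
prior = gauss(theta, 0.0, tau**2)            # prior π(θ)
phi = gauss(theta, mu_post, 0.5 * s2_post)   # normalized target inside posterior

rho = np.mean(phi / (like * prior))          # estimates 1/z
z_hat = 1.0 / rho
print(z_hat, z_true)  # close agreement
```

Here the weight ratio $\varphi/p(\theta \mid y)$ is bounded above, so the variance is finite; a $\varphi$ with heavier tails than the posterior would reintroduce the classical pathology.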
3. LHME with Normalizing Flows: Training and Tail Control
Normalizing flows parameterize $\varphi(\theta)$ using invertible, differentiable transformations composed of simple layers, e.g. Real NVP or rational-quadratic spline blocks. The flow is trained on posterior samples by minimizing the forward KL divergence

$$D_{\mathrm{KL}}\big(p(\theta \mid y)\,\|\,\varphi(\theta)\big) = -\,\mathbb{E}_{p(\theta \mid y)}\big[\log \varphi(\theta)\big] + \mathrm{const},$$

i.e. maximum likelihood estimation is performed via stochastic gradient descent (Adam optimizer). After initial fitting, $\varphi$ is "concentrated" by reducing the temperature of the base distribution: for a flow mapping $\theta = f(u)$ with Gaussian base $q(u) = \mathcal{N}(u; 0, I)$, the base is replaced by $q_T(u) = \mathcal{N}(u; 0, T^2 I)$, and the induced density is

$$\varphi_T(\theta) = q_T\big(f^{-1}(\theta)\big)\left|\det \frac{\partial f^{-1}(\theta)}{\partial \theta}\right|.$$

Lowering $T \in (0,1)$ decreases tail thickness, ensuring all probability mass remains inside the posterior and delivering finite-variance importance weights (Polanska et al., 2024, Polanska et al., 2023).
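The effect of the temperature on the induced density can be checked on the simplest possible flow (an assumed one-layer affine map $\theta = \mu + c\,u$ standing in for a deep flow): shrinking the base standard deviation to $T < 1$ thins the tails while keeping the density normalized.

```python
# Temperature scaling of a toy affine "flow" θ = f(u) = μ + c·u with a
# standard normal base: the change-of-variables density stays normalized
# for every T, while its tail log-density drops as T decreases.
import numpy as np

mu, c = 2.0, 1.5  # flow shift and scale (a one-layer affine flow)

def flow_log_density(theta, T):
    # φ_T(θ) = q_T(f⁻¹(θ)) · |df⁻¹/dθ|, with base q_T = N(0, T²)
    u = (theta - mu) / c
    log_base = -0.5 * (u / T) ** 2 - np.log(T * np.sqrt(2 * np.pi))
    return log_base - np.log(c)  # log |df⁻¹/dθ| = -log c

theta = np.linspace(mu - 10, mu + 10, 400_001)
for T in (1.0, 0.9, 0.8):
    dens = np.exp(flow_log_density(theta, T))
    mass = dens.sum() * (theta[1] - theta[0])     # ≈ 1: still normalized
    print(T, mass, flow_log_density(mu + 5.0, T))  # tail density falls with T
```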
4. Algorithmic Implementation
The LHME pipeline proceeds as follows:
- Posterior Sampling: Obtain samples via any MCMC or variational method.
- Data Splitting: Divide the samples into a training set (used to fit the flow) and an inference set (used to estimate the evidence).
- Flow Training: Initialize the flow parameters; train by minimizing the Monte Carlo estimate of the negative log-likelihood over the training set.
- Tail Concentration: Select a temperature $T < 1$ (e.g., $T = 0.9$) and define $\varphi_T$ via temperature scaling.
- Evidence Estimation: Compute weights $w_i = \varphi_T(\theta_i) / \big(\mathcal{L}(\theta_i)\,\pi(\theta_i)\big)$ on the inference set; estimate $\hat{\rho} = \frac{1}{N}\sum_i w_i$, then $\hat{z} = 1/\hat{\rho}$.
- Error Quantification: The sample variance of the $w_i$ yields an uncertainty estimate for $\hat{\rho}$; propagate to $\hat{z}$ via $\sigma_{\hat{z}} \approx \sigma_{\hat{\rho}}/\hat{\rho}^2$.
This procedure is implemented in the open-source Python package harmonic (Polanska et al., 2024, Polanska et al., 2023, McEwen et al., 2021).
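The pipeline above can be sketched end to end in plain NumPy. This is an illustrative stand-alone implementation, not the `harmonic` package API: a fitted Gaussian with temperature scaling stands in for the normalizing flow, on an assumed 2D conjugate Gaussian model whose evidence is analytic.

```python
# End-to-end LHME sketch: sample posterior → split → fit surrogate →
# temperature-concentrate → importance weights → evidence + error bar.
import numpy as np

rng = np.random.default_rng(2)
d, tau, T = 2, 5.0, 0.9                  # dims, prior std, temperature
y = np.array([1.0, -0.5])                # "observed data"
s2 = 1.0 / (1.0 + 1.0 / tau**2)          # posterior variance (per dim)
mu_post = s2 * y                         # posterior mean

def log_gauss(x, mu, s2):
    x = np.atleast_2d(x)
    return -0.5 * np.sum((x - mu) ** 2, axis=1) / s2 - 0.5 * d * np.log(2 * np.pi * s2)

z_true = np.exp(log_gauss(y, np.zeros(d), 1 + tau**2))[0]  # analytic evidence

# 1) posterior samples (stand-in for a saved MCMC chain)
samples = rng.normal(mu_post, np.sqrt(s2), size=(100_000, d))
train, infer = samples[:50_000], samples[50_000:]          # 2) split

# 3) "flow" training: fit a Gaussian surrogate by maximum likelihood
mu_fit = train.mean(axis=0)
s2_fit = train.var(axis=0).mean()

# 4) tail concentration: shrink the surrogate with temperature T < 1
s2_T = (T**2) * s2_fit

# 5) evidence estimation on the held-out set
log_w = (log_gauss(infer, mu_fit, s2_T)
         - log_gauss(infer, y, 1.0)              # log L(θ) = log N(y; θ, I)
         - log_gauss(infer, np.zeros(d), tau**2))  # log π(θ)
w = np.exp(log_w)                        # w_i = φ_T(θ_i) / (L(θ_i) π(θ_i))
rho = w.mean()                           # estimates 1/z
z_hat = 1.0 / rho

# 6) error bar: σ(ρ̂) propagated through z = 1/ρ
rho_std = w.std(ddof=1) / np.sqrt(len(w))
z_std = rho_std / rho**2
print(z_hat, z_true, z_std)
```

In practice the Gaussian surrogate is replaced by a trained flow, but the splitting, temperature scaling, weight computation, and error propagation are exactly the steps the `harmonic` pipeline automates.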
5. Theoretical Properties and Diagnostics
LHME exhibits desirable theoretical properties:
- Unbiasedness: $\mathbb{E}[\hat{\rho}] = 1/z$ for any normalized $\varphi$ supported inside the posterior.
- Variance Control: The variance of $\hat{\rho}$ is finite if $\varphi$ (equivalently, $\varphi_T$) has support strictly inside the posterior and lighter tails.
- Consistency: As $N \to \infty$, the estimator converges in probability to the true evidence under standard Monte Carlo arguments (Polanska et al., 2024, Polanska et al., 2023, Hu et al., 21 Jan 2026).
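The unbiasedness property follows in one line from the definition of the posterior, since $\varphi$ integrates to one:

```latex
\mathbb{E}_{p(\theta \mid y)}\!\left[\frac{\varphi(\theta)}{\mathcal{L}(\theta)\,\pi(\theta)}\right]
= \int \frac{\varphi(\theta)}{\mathcal{L}(\theta)\,\pi(\theta)}
  \cdot \frac{\mathcal{L}(\theta)\,\pi(\theta)}{z}\,\mathrm{d}\theta
= \frac{1}{z}\int \varphi(\theta)\,\mathrm{d}\theta
= \frac{1}{z}.
```

Note the expectation is unbiased for $1/z$; the reciprocal $\hat{z} = 1/\hat{\rho}$ inherits consistency rather than exact finite-sample unbiasedness.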
Diagnostic strategies monitor tail behavior: empirical Pareto-$\hat{k}$ statistics on the importance weights, kurtosis of chain-wise evidence estimates, and sensitivity to train/test splits. Infinite-variance warning thresholds (e.g., Pareto $\hat{k} > 0.7$) indicate unreliable outputs.
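A minimal version of the Pareto-$\hat{k}$ check can be built with SciPy's generalized Pareto fit. This is an illustrative sketch in the spirit of the PSIS diagnostic, with the conventional 0.7 warning threshold from that literature; the function name `pareto_khat` and the 20% tail fraction are assumptions of this example, not part of LHME itself.

```python
# Pareto-k̂ tail diagnostic: fit a generalized Pareto distribution to the
# largest importance weights and flag the run when the fitted shape
# exceeds ~0.7, signalling a heavy (variance-threatening) weight tail.
import numpy as np
from scipy.stats import genpareto

def pareto_khat(weights, tail_frac=0.2):
    w = np.sort(np.asarray(weights))
    tail = w[int((1 - tail_frac) * len(w)):]  # largest 20% of weights
    u = tail[0]                               # tail threshold
    khat, _, _ = genpareto.fit(tail - u, floc=0.0)  # GPD shape, loc fixed at 0
    return khat

rng = np.random.default_rng(3)
light = rng.lognormal(0.0, 0.5, 20_000)   # well-behaved weights (all moments)
heavy = rng.pareto(0.9, 20_000) + 1.0     # infinite-variance weights

for name, w in [("light", light), ("heavy", heavy)]:
    k = pareto_khat(w)
    print(name, round(k, 2), "WARN" if k > 0.7 else "ok")
```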
6. Empirical Performance and Applicability
Numerical experiments demonstrate LHME's empirical accuracy and robustness:
| Problem | LHME Configuration | Accuracy vs. Benchmark |
|---|---|---|
| 2D Rosenbrock | Real NVP flow | Ground truth recovered; classical HM diverges |
| Normal–Gamma model | Spline/flow, variable temperature | Tracks analytic evidence; robust to choice of $T$ |
| Pima Indian logistic regression | Real NVP flow | $\log z$ matches RJMCMC |
| Radiata pine linear regressions | Spline-flow LHME | Recovers analytic evidence |
| 21D Gaussian | RQ-spline flow | Unbiased over 100 repeats |
| 10D Rosenbrock | Several flows, $T$ up to $0.8$ | Competitive with nested sampling, faster |
| DES Y1 cosmology (20–21D) | Metropolis + LHME, GPU/CPU | Log-unit-level agreement; minutes vs. hours for PolyChord |
In all cases, LHME with flows remains unbiased, yields correct error bars, and does not require fine-tuning of $T$. Benchmarks against nested sampling (PolyChord), RJMCMC, or ground-truth integration verify consistency and efficiency (Polanska et al., 2024, Polanska et al., 2023, McEwen et al., 2021, Hu et al., 21 Jan 2026).
7. Advantages, Limitations, and Potential Extensions
Advantages:
- Sampler-agnostic: Requires only posterior samples, compatible with MCMC or variational inference.
- Flexibility: Normalizing flows model complex, multimodal, highly correlated posteriors.
- Scalability: Empirical accuracy up to 21 dimensions; performance demonstrated for cosmological applications and high-dimensional Gaussians.
- Reusability: Saved posterior chains can be directly used for evidence estimation.
Limitations and Open Questions:
- Temperature Selection ($T$): Although performance is robust over a range of temperatures, automated selection could further improve ease of use.
- Expressivity vs. Overfitting: High-dimensional posteriors may challenge flow expressivity and tail control.
- Computational Cost: Large flows require significant training effort if posterior sample counts are high.
Potential Extensions:
- Exploiting more advanced flow architectures (e.g., continuous-time/residual flows) for improved tail management in very high dimensions.
- Automated joint tuning of flow parameters and temperature to minimize estimator variance.
- Integration into sequential Monte Carlo or variational inference pipelines for online model comparison.
- Application to likelihood-free inference settings via ratio estimation.
LHME fundamentally decouples evidence estimation from posterior sampling, facilitating computationally efficient prior sensitivity analyses and model selection across scientific domains (Polanska et al., 2024, Polanska et al., 2023, McEwen et al., 2021, Hu et al., 21 Jan 2026).