Annealed Importance Sampling (AIS)

Updated 24 June 2025

Annealed Importance Sampling (AIS) is a stochastic algorithm central to the estimation of intractable partition functions in complex probabilistic models, most notably Markov Random Fields (MRFs) and Restricted Boltzmann Machines (RBMs). AIS constructs a sequence of intermediate distributions bridging a tractable initial distribution and the target distribution. Through Markov chain Monte Carlo (MCMC) transitions and the accumulation of importance weights, AIS produces unbiased partition function estimators, facilitating evaluation of model likelihoods, generative model comparison, and downstream learning tasks.

1. AIS in the Evaluation of Partition Functions for MRFs and RBMs

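AIS has become a standard method for partition function estimation in energy-based models, with particular relevance to RBMs, deep Boltzmann machines (DBMs), and deep belief networks (DBNs), where direct computation is infeasible because the state space grows exponentially with the number of units. For a typical RBM with visible units $\mathbf{x}$, hidden units $\mathbf{h}$, biases $\mathbf{b}$ and $\mathbf{c}$, and weight matrix $\mathbf{W}$, the joint energy is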

$E(\mathbf{x}, \mathbf{h}) = -\mathbf{b}^T \mathbf{x} - \mathbf{c}^T \mathbf{h} - \mathbf{x}^T \mathbf{W} \mathbf{h}$

with partition function

$Z = \sum_{\mathbf{x}, \mathbf{h}} e^{-E(\mathbf{x}, \mathbf{h})}$
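To make the cost of exact computation concrete, the following is a minimal sketch that evaluates $\log Z$ by brute-force enumeration for a tiny binary RBM. The function names, the model size (4 visible, 3 hidden units), and the random parameters are illustrative assumptions, not part of the source. The hidden units are summed out analytically, so only the $2^{n_v}$ visible states are enumerated.

```python
import itertools
import numpy as np

def log_unnormalized_marginal(x, W, b, c):
    """log sum_h exp(-E(x, h)) for binary hidden units, summed analytically."""
    # b.x plus softplus(c_j + (x W)_j) summed over hidden units j
    return x @ b + np.logaddexp(0.0, c + x @ W).sum(axis=-1)

def exact_log_partition(W, b, c):
    """Enumerate all 2**n_visible visible states; feasible only for tiny models."""
    n_visible = W.shape[0]
    states = np.array(list(itertools.product([0.0, 1.0], repeat=n_visible)))
    log_f = log_unnormalized_marginal(states, W, b, c)
    m = log_f.max()  # log-sum-exp for numerical stability
    return m + np.log(np.exp(log_f - m).sum())

# Hypothetical toy model (sizes and parameters are assumptions for illustration)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 3))
b = rng.normal(scale=0.5, size=4)
c = rng.normal(scale=0.5, size=3)
log_Z = exact_log_partition(W, b, c)
```

The enumeration doubles in cost with every added visible unit, which is why sampling-based estimators such as AIS are used for models of realistic size.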

AIS defines a sequence $\{p_k\}_{k=0}^n$ of distributions such that

$p_k(\mathbf{x}) \propto p_0(\mathbf{x})^{1-\beta_k} p_n(\mathbf{x})^{\beta_k}$

for an annealing schedule $0 = \beta_0 < \ldots < \beta_n = 1$, where $p_0$ is the tractable initial distribution and $p_n$ is the target. Each AIS run generates a trajectory through these distributions, accumulating weights as

$\hat{\omega}_i = \prod_{k=1}^n \frac{\tilde{p}_k(\mathbf{x}_k^{(i)})}{\tilde{p}_{k-1}(\mathbf{x}_k^{(i)})}$

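where $\tilde{p}_k$ is the unnormalized form of $p_k$, $\mathbf{x}_1^{(i)}$ is drawn from $p_0$, and each subsequent $\mathbf{x}_{k+1}^{(i)}$ is obtained from $\mathbf{x}_k^{(i)}$ by an MCMC transition that leaves $p_k$ invariant. The partition function estimator is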

$Z_n \approx Z_0 \cdot \frac{1}{N_s} \sum_{i=1}^{N_s} \hat{\omega}_i$

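where $Z_0$ is the known partition function of $p_0$ and $N_s$ is the number of independent AIS runs. For numerical stability, the weights are accumulated as log-weights and averaged with a log-sum-exp. The choice of initial distribution and annealing schedule directly impacts variance, efficiency, and accuracy.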
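The procedure can be sketched for a tiny binary RBM as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the base distribution is taken to be an RBM with zero weights and visible biases `b0`, the intermediate distributions are formed by interpolating the energy in the joint space of visible and hidden units (a common choice for RBMs that keeps block Gibbs sampling available), and all function names, sizes, and the linear schedule are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_f(x, beta, W, b, c, b0):
    """log unnormalized marginal over visibles at inverse temperature beta."""
    return x @ ((1 - beta) * b0 + beta * b) + np.logaddexp(0.0, beta * (c + x @ W)).sum(axis=-1)

def gibbs_step(x, beta, W, b, c, b0, rng):
    """One block-Gibbs sweep (h given x, then x given h) at inverse temperature beta."""
    p_h = sigmoid(beta * (c + x @ W))
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_x = sigmoid((1 - beta) * b0 + beta * (b + h @ W.T))
    return (rng.random(p_x.shape) < p_x).astype(float)

def ais_log_partition(W, b, c, b0, betas, n_chains, rng):
    n_visible, n_hidden = W.shape
    # Z_0 of the base model: free hidden units contribute 2**n_hidden
    log_Z0 = n_hidden * np.log(2.0) + np.logaddexp(0.0, b0).sum()
    # Initial state drawn exactly from p_0 (independent Bernoulli visibles)
    x = (rng.random((n_chains, n_visible)) < sigmoid(b0)).astype(float)
    log_w = np.zeros(n_chains)
    for beta_prev, beta in zip(betas[:-1], betas[1:]):
        # Accumulate log of f_k(x_k) / f_{k-1}(x_k), then move with a p_k-invariant kernel
        log_w += log_f(x, beta, W, b, c, b0) - log_f(x, beta_prev, W, b, c, b0)
        x = gibbs_step(x, beta, W, b, c, b0, rng)
    m = log_w.max()  # log of the mean weight via log-sum-exp
    return log_Z0 + m + np.log(np.mean(np.exp(log_w - m)))

# Hypothetical toy model and schedule (assumptions for illustration)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 3))
b = rng.normal(scale=0.5, size=4)
c = rng.normal(scale=0.5, size=3)
b0 = np.zeros(4)
betas = np.linspace(0.0, 1.0, 201)
log_Z_hat = ais_log_partition(W, b, c, b0, betas, n_chains=500, rng=rng)
```

On a model this small the estimate can be checked against exhaustive enumeration; on realistic models no such check is available, which is what motivates the diagnostics discussed in the following sections.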

2. Limitations of AIS: Systematic Overestimation of Log-Likelihoods

While the AIS estimator $\hat{Z}$ of the partition function $Z$ is unbiased, its logarithm, which is what likelihood evaluation requires, is biased downward by Jensen's inequality: $E[\log \hat{Z}] \leq \log E[\hat{Z}] = \log Z$. Thus AIS tends to underestimate $\log Z$ and, when $\hat{Z}$ is substituted for $Z$ in $\log p(x) = \log f(x) - \log Z$ (with $f$ the unnormalized model density), to overestimate the log-likelihood. This optimism can mislead model selection, and the errors can go unnoticed, because extremely optimistic estimates can occur without clear diagnostic signals. This is especially problematic when comparing models whose true log-likelihoods differ by only a few nats.
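The effect of Jensen's inequality can be seen in a small simulation. The sketch below does not run AIS; it assumes, purely for illustration, that the importance weights are log-normal with mean one, so that $\hat{Z}$ is unbiased for $Z = 1$. The spread `sigma`, the number of weights per estimate, and the number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
true_log_Z = 0.0      # Z = 1 by construction
sigma = 1.5           # assumed spread of the log-weights
n_samples = 10        # weights averaged per estimate
n_repeats = 20000     # independent repetitions of the estimator

# Log-normal weights with E[w] = 1, so each Z_hat is an unbiased estimate of Z
log_w = sigma * rng.standard_normal((n_repeats, n_samples)) - 0.5 * sigma**2
Z_hat = np.exp(log_w).mean(axis=1)

mean_Z_hat = Z_hat.mean()               # close to Z = 1
mean_log_Z_hat = np.log(Z_hat).mean()   # systematically below log Z = 0
```

Averaged over repetitions, `Z_hat` stays close to the true value while `log(Z_hat)` falls below `true_log_Z`; since the log-likelihood subtracts $\log \hat{Z}$, the same bias appears there as optimism.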

3. Reverse AIS Estimator (RAISE): Stochastic Lower Bounds

To mitigate the optimistic bias of AIS, RAISE introduces a reverse-annealing approach providing a stochastic lower bound on the log-likelihood of an associated "annealing model." The key steps for models like RBMs are:

Initialization: Begin the chain at the test data point; for RBMs this is tractable because the conditional distribution of the hidden units given the data can be sampled exactly.
Reverse-Propagation: Apply MCMC transitions and reverse the annealing path (from the target back to the base distribution).
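Importance Weighting: For data point $x$, with $x_K = x$, reverse-chain states $x_{K-1}, \ldots, x_0$, and $f_k$ the unnormalized intermediate distribution at step $k$ (written $\tilde{p}_k$ above, with $K = n$), compute: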

$w = \frac{f_K(x)}{Z_0} \prod_{k=0}^{K-1} \frac{f_k(x_k)}{f_{k+1}(x_k)}$

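Averaging: The mean of these weights across independent reverse chains is an unbiased estimate of the probability that the annealing model assigns to $x$; its logarithm is therefore a stochastic lower bound on the log-likelihood of the annealing model.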

For RBMs and similar models (with tractable conditionals), initialization and transition are straightforward, while deeper models require augmented MCMC procedures.
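The steps above can be sketched for a tiny binary RBM as follows. This is a hedged illustration, not the reference implementation: it assumes a zero-weight base RBM with visible biases `b0`, intermediate distributions obtained by interpolating the joint energy, and block Gibbs transitions, which are reversible on the visible units so the same kernel can serve as the reverse transition. All names, sizes, and the test point `x_data` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_f(x, beta, W, b, c, b0):
    """log unnormalized marginal over visibles at inverse temperature beta."""
    return x @ ((1 - beta) * b0 + beta * b) + np.logaddexp(0.0, beta * (c + x @ W)).sum(axis=-1)

def gibbs_step(x, beta, W, b, c, b0, rng):
    """One block-Gibbs sweep (h given x, then x given h) at inverse temperature beta."""
    p_h = sigmoid(beta * (c + x @ W))
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_x = sigmoid((1 - beta) * b0 + beta * (b + h @ W.T))
    return (rng.random(p_x.shape) < p_x).astype(float)

def raise_log_prob(x_data, W, b, c, b0, betas, n_chains, rng):
    n_visible, n_hidden = W.shape
    K = len(betas) - 1
    log_Z0 = n_hidden * np.log(2.0) + np.logaddexp(0.0, b0).sum()
    # Every chain starts at the data point: x_K = x
    x = np.tile(x_data, (n_chains, 1))
    log_w = log_f(x, betas[K], W, b, c, b0) - log_Z0      # log of f_K(x) / Z_0
    for k in range(K - 1, -1, -1):
        # Reverse step: draw x_k from the kernel for distribution k+1, given x_{k+1}
        x = gibbs_step(x, betas[k + 1], W, b, c, b0, rng)
        # Multiply the weight by f_k(x_k) / f_{k+1}(x_k)
        log_w += log_f(x, betas[k], W, b, c, b0) - log_f(x, betas[k + 1], W, b, c, b0)
    m = log_w.max()  # log of the mean weight via log-sum-exp
    return m + np.log(np.mean(np.exp(log_w - m)))

# Hypothetical toy model, schedule, and test point (assumptions for illustration)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 3))
b = rng.normal(scale=0.5, size=4)
c = rng.normal(scale=0.5, size=3)
b0 = np.zeros(4)
betas = np.linspace(0.0, 1.0, 201)
x_data = np.array([1.0, 0.0, 1.0, 1.0])
log_p_hat = raise_log_prob(x_data, W, b, c, b0, betas, n_chains=500, rng=rng)
```

The returned value estimates the log-probability under the annealing model, which with a fine schedule is expected to be close to that of the RBM itself. Note that the whole reverse chain is rerun for each test point, which is the source of the extra per-example cost.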

AIS and RAISE differ fundamentally: AIS tends to underestimate $\log Z$ and therefore yields optimistic log-likelihoods, while RAISE yields conservative log-likelihoods. Used together, the two estimates typically bracket the true value, provided the annealing model closely approximates the original model.

4. Empirical Comparison of AIS and RAISE

Extensive experiments across RBMs, DBMs, and DBNs (on benchmarks like MNIST and Omniglot) reveal:

Small models: Both AIS and RAISE match the exact partition function and log-likelihood closely.
Large models: RAISE and AIS estimates typically agree within 1 nat, especially when a good initial distribution (e.g., one matching the base rates of the data) is chosen for AIS.
Pathological cases: Poor initialization or insufficient annealing steps induce large gaps; in some rare cases, the RAISE lower bound exceeds the AIS estimate (due to the annealing model outperforming a flawed original model).
Computational cost: RAISE requires more per-test-example computation (due to the need to "melt" from the data) but offers a trustworthy conservative bound.

A representative table (see Table 1 of the reference) summarizes these findings:

Model	RAISE	AIS	Gap (AIS - RAISE)
mnistPCD-500	-101.26	-101.28	-0.02
omniPCD-1000	-100.46	-100.46	0.00
MNIST DBM	-85.74	-85.67	0.07
Omniglot DBN	-100.78	-100.45	0.33

Agreement indicates estimator reliability; disagreement indicates estimator failure or model pathology.

5. Practical Implications for Generative Model Evaluation

RAISE enables safer evaluation of intractable generative models by providing a trustworthy lower bound on log-likelihood. The recommended best practice is to use both AIS and RAISE estimates and assess their agreement:

Close agreement: High confidence in the log-likelihood estimate and thus the evaluation of the model.
Large gap: Signals issues with the estimation process or model definition, necessitating further investigation.
Model selection: Bracketing the true log-likelihood with upper and lower stochastic estimators allows for robust model ranking without overreliance on potentially overoptimistic single estimators.
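The agreement check can be expressed as a small helper. The sketch below is an illustration only: the function name and the 1-nat tolerance are assumptions (the tolerance echoes the "within 1 nat" agreement reported above, not a prescribed threshold), and the numbers are the RAISE and AIS values copied from the table in section 4.

```python
def ais_raise_gap(ais_ll, raise_ll, tol_nats=1.0):
    """Return (gap, verdict), where gap = AIS - RAISE in nats.

    tol_nats is a hypothetical threshold chosen for illustration.
    """
    gap = ais_ll - raise_ll
    verdict = "agree" if abs(gap) <= tol_nats else "investigate"
    return round(gap, 2), verdict

# (RAISE, AIS) log-likelihood estimates copied from the table in section 4
table = {
    "mnistPCD-500": (-101.26, -101.28),
    "omniPCD-1000": (-100.46, -100.46),
    "MNIST DBM": (-85.74, -85.67),
    "Omniglot DBN": (-100.78, -100.45),
}

report = {name: ais_raise_gap(ais_ll, raise_ll)
          for name, (raise_ll, ais_ll) in table.items()}
```

A small negative gap, as for mnistPCD-500, is compatible with stochastic noise in both estimators; a large gap in either direction is the signal that warrants further investigation.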

RAISE is straightforward to implement, requiring only minimal changes to existing AIS codebases (reverse the path and MCMC transitions), and does not require any additional types of Markov chain operators.

6. Broader Research Directions and Methodological Impact

The introduction of RAISE has broader methodological implications:

General-purpose evaluation: The RAISE approach is not RBM-specific and provides a general stochastic lower-bound technique for any model where AIS is used for partition function estimation.
Diagnostics: The pairwise analysis of AIS and RAISE outputs serves as a diagnostic tool for detection of model pathology or annealing process failure.
Estimator theory: This work opens the way for further theoretical developments of unbiased stochastic bounds, including sandwiching techniques and algorithmic bracketing of likelihood estimates for more general probabilistic models.
Algorithmic innovations: RAISE may inspire new variational schemes, importance sampling pathways, or hybrid estimators aimed at robustly quantifying model goodness in the complex, high-dimensional regimes typical of modern generative modeling.

7. Summary Statement

RAISE constitutes a principled solution to the problem of optimistic bias in Annealed Importance Sampling for generative model likelihood estimation. By providing a conservative lower bound that agrees closely with standard AIS in well-behaved settings, RAISE allows practitioners to both trust their model evaluations and confidently detect cases where AIS may be misleading. The pairwise bracketing approach sets a methodological precedent for rigorous, interpretable model assessment in probabilistic machine learning and statistical modeling.
