Annealed Importance Sampling Bootstrap
- Annealed Importance Sampling Bootstrap is a family of training algorithms that embed AIS into normalizing flow optimization, enhancing mass covering by reducing the variance of importance weights.
- It employs a ladder of intermediate densities and adaptive HMC steps to iteratively refine the flow, yielding robust low-variance gradient estimates.
- FAB demonstrates state-of-the-art performance in high-dimensional multimodal density estimation and computational physics by efficiently reusing expensive target evaluations.
Flow Annealed Importance Sampling Bootstrap (FAB) refers to a family of training algorithms for normalizing flows that embed annealed importance sampling (AIS) within the flow optimization loop, specifically targeting mass-covering objectives to reduce the variance of importance weights. The central innovation is the bootstrapping interaction between the flow and the AIS sampler: AIS uses the current flow as its proposal and returns weighted samples that give low-variance $\alpha$-divergence gradient estimates, which in turn refine the flow. The method has reported state-of-the-art performance for learning unnormalized, multimodal, and high-dimensional densities from only black-box target density evaluations, with significant applications in computational physics and high-energy simulation (Midgley et al., 2021, Midgley et al., 2022, Kofler et al., 2024).
1. Principle of FAB: Annealed Importance Sampling–Flow Bootstrap
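FAB augments normalizing flow training with annealed importance sampling. A ladder of intermediate densities $p_0, p_1, \dots, p_M$ interpolates between the flow surrogate $p_0 = q_\theta$ and a final “optimal” mass-covering target $p_M$ (e.g., $p_M \propto p^2/q_\theta$ for $\alpha = 2$ in the $\alpha$-divergence). The intermediates are
$$p_i(x) \propto q_\theta(x)^{1-\beta_i}\, p_M(x)^{\beta_i}, \qquad 0 = \beta_0 < \beta_1 < \dots < \beta_M = 1.$$
Particles are initialized by drawing $x_0 \sim q_\theta$, then propagated through this sequence via Markov transition kernels $T_i$ (often HMC), where $T_i$ leaves $p_i$ invariant and maps $x_{i-1}$ to $x_i$, each stage accompanied by the incremental weight update
$$w_i = \frac{p_i(x_{i-1})}{p_{i-1}(x_{i-1})}.$$
The full AIS weight for each sample chain is
$$w_{\mathrm{AIS}} = \prod_{i=1}^{M} \frac{p_i(x_{i-1})}{p_{i-1}(x_{i-1})}.$$
Normalized weights then define an empirical (importance-weighted) estimator for any function of interest. The bootstrapping occurs because as $q_\theta$ improves, the efficiency and stability of AIS, and thus of the gradient estimation, improve with it (Midgley et al., 2022, Midgley et al., 2021, Kofler et al., 2024).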
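The AIS pass can be illustrated with a minimal sketch under stated assumptions: a fixed 1D Gaussian stands in for the flow, the target is a hypothetical bimodal density, the ladder is a geometric interpolation between the proposal and the squared-target-over-proposal density, and a random-walk Metropolis move replaces HMC for brevity. All function names (`log_q`, `log_p`, `log_p_beta`, `ais_chain`) are illustrative and not taken from any FAB implementation.

```python
import math
import random

def log_q(x, scale=3.0):
    # Stand-in for the flow density: a normalized Gaussian N(0, scale^2).
    return -0.5 * (x / scale) ** 2 - math.log(scale * math.sqrt(2.0 * math.pi))

def log_p(x):
    # Unnormalized bimodal toy target with modes at -2 and +2 (width 0.5).
    a = -0.5 * ((x + 2.0) / 0.5) ** 2
    b = -0.5 * ((x - 2.0) / 0.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_p_beta(x, beta):
    # Geometric ladder between q and p^2/q (unnormalized):
    # log p_beta = (1 - beta) * log q + beta * (2 log p - log q).
    return (1.0 - beta) * log_q(x) + beta * (2.0 * log_p(x) - log_q(x))

def metropolis_step(x, beta, rng, step=0.5):
    # Random-walk Metropolis move that leaves p_beta invariant (HMC stand-in).
    proposal = x + rng.gauss(0.0, step)
    log_accept = log_p_beta(proposal, beta) - log_p_beta(x, beta)
    if rng.random() < math.exp(min(0.0, log_accept)):
        return proposal
    return x

def ais_chain(betas, rng, n_moves=5):
    # Returns the final sample and the accumulated log AIS weight of one chain.
    x = rng.gauss(0.0, 3.0)  # initial particle drawn from the proposal
    log_w = 0.0
    last = len(betas) - 1
    for i in range(1, len(betas)):
        # Incremental weight p_i(x_{i-1}) / p_{i-1}(x_{i-1}), in log space.
        log_w += log_p_beta(x, betas[i]) - log_p_beta(x, betas[i - 1])
        if i < last:  # no transition after the final level
            for _ in range(n_moves):
                x = metropolis_step(x, betas[i], rng)
    return x, log_w

rng = random.Random(0)
betas = [i / 8 for i in range(9)]  # linear ladder with 8 levels
chains = [ais_chain(betas, rng) for _ in range(500)]

# Self-normalize the weights with a max-shift for numerical stability.
log_ws = [lw for _, lw in chains]
shift = max(log_ws)
unnorm = [math.exp(lw - shift) for lw in log_ws]
total = sum(unnorm)
norm_w = [u / total for u in unnorm]
```

Working in log space avoids overflow when the ratio between target and proposal is large, which is the typical regime early in training.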
2. Mass-Covering $\alpha$-Divergence Objective and Its Surrogate
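FAB directly targets the minimization of the $\alpha$-divergence for $\alpha = 2$,
$$D_{\alpha=2}(p \,\|\, q_\theta) \propto \int \frac{p(x)^2}{q_\theta(x)}\,dx.$$
Minimization of $D_{\alpha=2}$ is equivalent to minimizing the variance of the importance weights $w(x) = p(x)/q_\theta(x)$ under $q_\theta$.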
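The link between the divergence and the weight variance can be checked directly. For a normalized target, the importance weight (target density over flow density) has mean 1 under the flow, so its variance equals the integral of the squared target over the flow density, minus 1. The snippet below verifies this identity on a hypothetical three-state discrete example, with the probabilities chosen purely for illustration.

```python
p = [0.5, 0.3, 0.2]  # target probabilities (hypothetical)
q = [0.2, 0.3, 0.5]  # proposal probabilities (hypothetical)

# Importance weights w = p / q.
w = [pi / qi for pi, qi in zip(p, q)]

# Mean and variance of w under q; the mean is 1 for a normalized target.
mean_w = sum(qi * wi for qi, wi in zip(q, w))
var_w = sum(qi * (wi - mean_w) ** 2 for qi, wi in zip(q, w))

# Discrete analogue of the integral of p^2 / q.
d2_integral = sum(pi ** 2 / qi for pi, qi in zip(p, q))

print(var_w, d2_integral - 1.0)  # both approximately 0.63
```

Driving the integral toward its minimum of 1 therefore drives the weight variance toward 0, which is only reached when the flow matches the target.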
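Instead of drawing samples only from $q_\theta$, FAB samples from the AIS kernel targeting $p^2/q_\theta$ and uses these (together with their normalized AIS weights) to form a low-variance, mass-covering surrogate loss:
$$\mathcal{S}(\theta) = -\sum_{n=1}^{N} \bar{w}_n \log q_\theta(\bar{x}_n),$$
where $\bar{x}_n$ is the final sample of the $n$-th AIS chain and $\bar{w}_n = w_n / \sum_{m=1}^{N} w_m$ is its self-normalized AIS weight. In practice, gradients are stopped with respect to the AIS weights and samples, and only backpropagated through $\log q_\theta(\bar{x}_n)$ (Midgley et al., 2022, Midgley et al., 2021, Kofler et al., 2024).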
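A minimal sketch of this weighted surrogate, assuming a toy one-parameter "flow" (a unit-variance Gaussian with learnable mean `mu`) and hand-picked samples and log-weights. The weights and samples are plain constants here, which plays the role of the stop-gradient; the names `normalize_log_weights`, `surrogate_loss`, and `surrogate_grad` are hypothetical.

```python
import math

def normalize_log_weights(log_w):
    # Self-normalize weights from log space using a max-shift for stability.
    shift = max(log_w)
    unnorm = [math.exp(lw - shift) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def log_q(x, mu):
    # Toy stand-in for the flow: log-density of N(mu, 1).
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)

def surrogate_loss(xs, log_w, mu):
    # Negative weighted log-likelihood of the AIS samples under the flow.
    w_bar = normalize_log_weights(log_w)
    return -sum(w * log_q(x, mu) for w, x in zip(w_bar, xs))

def surrogate_grad(xs, log_w, mu):
    # Derivative of the loss w.r.t. mu with samples and weights held fixed.
    w_bar = normalize_log_weights(log_w)
    return -sum(w * (x - mu) for w, x in zip(w_bar, xs))

xs = [-1.0, 0.5, 2.0]      # pretend AIS samples
log_w = [0.0, 1.0, 3.0]    # pretend log AIS weights

mu = 0.0
for _ in range(200):
    mu -= 0.1 * surrogate_grad(xs, log_w, mu)  # plain gradient descent
```

For this toy model the minimizer is the weighted mean of the samples, so highly weighted samples pull the model toward regions the proposal under-covers.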
3. Algorithmic Procedure and Practical Implementation
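A high-level FAB iteration with an optional sample replay buffer proceeds as follows:
- Draw a batch of samples from the current flow.
- Run AIS from the flow toward the mass-covering target through the MCMC transition kernels, accumulating the log-weights.
- Without a buffer: take a gradient step on the weighted surrogate loss using the AIS samples and their normalized weights.
- With a buffer: store the AIS samples together with their log-weights and flow log-densities, then draw minibatches by priority sampling, reweight them for the change in the flow since storage, and take several gradient steps per AIS batch.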
The replay buffer lets several gradient steps reuse each batch of AIS samples, which substantially reduces the number of expensive target density evaluations required. Each kernel $T_i$ is typically a short HMC sequence, adaptively tuned (target acceptance rate $\approx 0.65$), followed by optional resampling based on weight degeneracy.
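The buffer logic can be sketched as follows. This is an illustration under assumptions, not the reference implementation: the class name `PrioritizedBuffer` and its methods are hypothetical, and the correction factor (stored flow density over current flow density) is derived from the observation that a stored sample targeted the squared target over the old flow, while the current target uses the new flow.

```python
import math
import random

class PrioritizedBuffer:
    """Toy buffer of AIS samples, log-weights, and flow log-densities."""

    def __init__(self, seed=0):
        self.x, self.log_w, self.log_q = [], [], []
        self.rng = random.Random(seed)

    def add(self, xs, log_ws, log_qs):
        self.x.extend(xs)
        self.log_w.extend(log_ws)
        self.log_q.extend(log_qs)

    def sample_indices(self, k):
        # Priority sampling: probability proportional to the stored weight.
        shift = max(self.log_w)
        priorities = [math.exp(lw - shift) for lw in self.log_w]
        return self.rng.choices(range(len(self.x)), weights=priorities, k=k)

    def corrections(self, indices, log_q_new):
        # Correction q_old(x) / q_new(x) for each drawn sample, computed
        # before any stored value is modified.
        corr = [math.exp(self.log_q[i] - lq) for i, lq in zip(indices, log_q_new)]
        # Refresh each distinct entry once so repeats are not double counted.
        updates = {i: lq for i, lq in zip(indices, log_q_new)}
        for i, lq in updates.items():
            self.log_w[i] += self.log_q[i] - lq
            self.log_q[i] = lq
        return corr

buf = PrioritizedBuffer(seed=0)
buf.add(xs=[0.1, -2.0, 1.5],
        log_ws=[0.0, -1000.0, -1000.0],
        log_qs=[-1.0, -2.0, -3.0])
idx = buf.sample_indices(4)               # the dominant entry is always drawn
corr = buf.corrections(idx, [-1.5] * 4)   # pretend the flow now gives -1.5
```

In a training loop the returned corrections would multiply the per-sample log-density terms of the loss, and no new target evaluations are needed for these extra gradient steps.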
In applications, small numbers of intermediate distributions suffice even in moderately high dimension, and flow architectures typically use coupling/spline layers with moderate capacity (e.g., 14 layers and 400 hidden units per conditioner in 8D) (Kofler et al., 2024, Midgley et al., 2022).
4. Empirical Validation and Comparative Performance
FAB demonstrates superior performance to maximum-likelihood flow training (fKLD), reverse-KL flow training (rKLD), and grid-based VEGAS+ sampling. Typical evaluation metrics include:
- Forward KL divergence $\mathrm{KL}(p \,\|\, q_\theta)$
- Importance sampling efficiency (normalized effective sample size, ESS)
- Integral estimates via weighted samples
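Two of these metrics can be computed directly from importance log-weights. The sketch below assumes the standard Kish estimator for the normalized ESS and the plain importance-sampling mean for the integral; whether the cited papers use exactly these estimators is an assumption, and the function names are hypothetical.

```python
import math

def is_efficiency(log_w):
    # Normalized effective sample size: (sum w)^2 / (N * sum w^2), in (0, 1].
    shift = max(log_w)
    w = [math.exp(lw - shift) for lw in log_w]
    s1 = sum(w)
    s2 = sum(wi * wi for wi in w)
    return s1 * s1 / (len(w) * s2)

def estimate_integral(log_w):
    # Importance-sampling estimate of the target's integral: the mean of
    # w = p(x) / q(x) over samples x drawn from q, computed stably.
    shift = max(log_w)
    mean_shifted = sum(math.exp(lw - shift) for lw in log_w) / len(log_w)
    return math.exp(shift) * mean_shifted

uniform = [0.0, 0.0, 0.0, 0.0]
degenerate = [0.0, -1000.0, -1000.0, -1000.0]
print(is_efficiency(uniform))     # 1.0: all samples contribute equally
print(is_efficiency(degenerate))  # 0.25: one of four samples dominates
```

The efficiency is invariant to a constant shift of the log-weights, so it can be evaluated without knowing the target's normalization.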
For an 8D matrix element sampling task (Kofler et al., 2024):
| Method | Integral estimate (uncertainty) | IS efficiency (%) | # Target evals |
|---|---|---|---|
| rKLD | 7.74 (0.02) | 56.5 | […] |
| FAB (w/o buf) | 7.79 (0.03) | 84.3 | […] |
| FAB (w/ buf) | 7.747 (0.002) | 90.6 | […] |
On lower-dimensional multimodal problems, only FAB and ML-trained flows cover all modes and achieve high ESS (e.g., on 25-mode 2D Gaussian mixtures and 30D Boltzmann distributions), whereas alternative methods fail with near-degenerate ESS and significant mode collapse (Midgley et al., 2021, Midgley et al., 2022, Kofler et al., 2024).
5. AIS Bootstrap Variants and Extensions
FAB admits flexible choices for the intermediate density ladder, the annealing schedule (linear or geometric spacing of the interpolation exponents $\beta_i$), and the MCMC kernel (HMC, Metropolis). Alternative $\alpha$-divergences ($\alpha \neq 2$) can be substituted, though $\alpha = 2$ yields optimal mass covering empirically.
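The two spacing choices can be sketched as below. The linear ladder is unambiguous; the geometric form shown here (a zero start followed by geometrically spaced values from a small `beta_min` up to 1) is one common convention and an assumption, as are the function names and the default `beta_min`.

```python
def linear_betas(n_levels):
    # n_levels + 1 evenly spaced exponents from 0 to 1.
    return [i / n_levels for i in range(n_levels + 1)]

def geometric_betas(n_levels, beta_min=1e-3):
    # Starts at 0, then grows by a constant ratio from beta_min up to 1,
    # placing more levels near the proposal end of the ladder.
    if n_levels == 1:
        return [0.0, 1.0]
    ratio = (1.0 / beta_min) ** (1.0 / (n_levels - 1))
    inner = [beta_min * ratio ** i for i in range(n_levels - 1)]
    return [0.0] + inner + [1.0]

lin = linear_betas(4)
geo = geometric_betas(4)
print(lin)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

A geometric ladder spends more levels where the interpolated density changes fastest relative to the proposal, which can help when the first steps away from the flow dominate the weight variance.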
Replay-buffer FAB variants further reduce the number of required target evaluations by reusing AIS-weighted samples through priority sampling and log-density reweighting. There are also suggestions for combining FAB with other flow architectures such as CRAFT and SNF, and adapting the AIS procedure to differentiable variants that further reduce gradient variance (Midgley et al., 2022, Kofler et al., 2024).
6. Limitations and Domain-Specific Constraints
FAB requires access to the target gradient $\nabla_x \log p(x)$ to enable HMC in the annealing steps. This is now feasible in differentiable matrix element modeling frameworks (e.g., MadJAX, ComPWA), but constrains applicability in non-differentiable domains (Kofler et al., 2024). HMC and flow capacity hyperparameters must be empirically tuned to balance mixing, computational cost, and acceptance rates. Numerical stability issues may arise if $q_\theta$ is too small in high-density regions of $p$; defensive regularization or early NaN filtering may be required.
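One simple form of such filtering is sketched below, assuming samples and log-weights held in plain lists. Dropping non-finite log-weights corresponds to the NaN filtering mentioned above; the optional clipping at `max_log_w` is shown as one possible defensive measure and is an assumption, not a documented part of FAB. The helper name `clean_ais_batch` is hypothetical.

```python
import math

def clean_ais_batch(xs, log_w, max_log_w=None):
    # Drop samples whose log-weight is NaN or infinite.
    kept = [(x, lw) for x, lw in zip(xs, log_w) if math.isfinite(lw)]
    # Optionally cap very large log-weights so one sample cannot dominate.
    if max_log_w is not None:
        kept = [(x, min(lw, max_log_w)) for x, lw in kept]
    xs_out = [x for x, _ in kept]
    lw_out = [lw for _, lw in kept]
    return xs_out, lw_out

xs = [0.1, 0.2, 0.3, 0.4]
log_w = [0.5, float("nan"), float("inf"), 40.0]
clean_x, clean_lw = clean_ais_batch(xs, log_w, max_log_w=10.0)
print(clean_x, clean_lw)  # [0.1, 0.4] [0.5, 10.0]
```

Clipping introduces bias in exchange for lower variance, so the threshold would need to be checked against the unclipped estimator on a held-out batch.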
Compute costs are dominated by HMC and target evaluations, making replay buffering essential in high-dimensional or expensive physical simulation tasks.
7. Theoretical Significance and Broader Context
FAB demonstrates that integrating AIS into the training loop of normalizing flows with mass-covering divergence minimization results in both enhanced robustness to mode collapse (discovering underrepresented or missing modes) and marked reductions in the variance of importance weights. Theoretical underpinnings guarantee that, under mixing, the approach yields consistent gradient estimates for loss minimization; practical surrogate biases can be made arbitrarily small with increasing minibatch size (Midgley et al., 2021).
Within computational physics, especially for tasks such as particle-physics matrix element simulation, FAB provides a scalable, sampler-independent framework for high-fidelity surrogate learning without reliance on large pre-computed MC datasets. The method’s principled combination of flow-based modeling, importance sampling, and replay reweighting outperforms conventional approaches in both sample efficiency and distributional accuracy (Kofler et al., 2024, Midgley et al., 2022, Midgley et al., 2021).