
Temperature-Annealed Sampling Overview

Updated 16 January 2026
  • Temperature-annealed sampling is a family of methods that uses controlled temperature schedules to traverse complex, multimodal energy landscapes.
  • It employs strategies like AIS, population annealing, and ensemble annealing to enhance sampling efficiency and maintain mode coverage.
  • These techniques underpin applications in physics, optimization, generative modeling, and machine learning by adapting temperature to overcome barriers in energy landscapes.

Temperature-annealed sampling refers to a broad family of stochastic methods for generating samples from complex, multimodal distributions by systematically controlling one or more temperature-like parameters. These methods exploit temperature as a global or local smoothness parameter to traverse rough energy landscapes, enhance mixing, and manage mode coverage. The concept is foundational in statistical physics, computational optimization, probabilistic inference, and generative modeling, and underlies algorithms such as simulated annealing, annealed importance sampling, population annealing, ensemble annealing, integrated tempering sampling, and temperature-scheduled generative modeling. Temperature-annealed protocols are central in Monte Carlo integration, partition function estimation, quantum and thermal annealers, molecular simulation, low-resource language modeling, and reinforcement learning with autoregressive models.

1. Mathematical Principles of Temperature-Annealed Sampling

In statistical mechanics and probabilistic modeling, the canonical distribution at inverse temperature $\beta = 1/(k_B T)$ is

$$p_\beta(x) = \frac{1}{Z(\beta)} \exp[-\beta E(x)]$$

where $E(x)$ is the energy (or negative log-probability), $Z(\beta)$ is the partition function, and $k_B$ is Boltzmann's constant. Temperature-annealed sampling generally proceeds not at fixed $\beta$, but by traversing a schedule $\beta_0 < \beta_1 < \dots < \beta_L = \beta^*$, either in discrete jumps or in continuous time.

The optimal annealing schedule can be derived from non-equilibrium statistical mechanics by minimizing the expected irreversible work, or equivalently the thermodynamic length (Fisher-information metric):

$$W[\beta(\cdot)] = \frac{1}{2} \int_0^T \dot\beta^2\, g(\beta)\, dt,$$

where $g(\beta) = \operatorname{Var}_{p_\beta}[E]$. The constant-speed geodesic condition yields

$$\dot\beta(t)\,\sqrt{g(\beta(t))} = \text{const},$$

which prescribes rapid progress where fluctuations are small and slow progress near critical points, barriers, or phase transitions. This framework generalizes to multidimensional annealing in parameter space, with a friction tensor that encodes parameter–parameter couplings and autocorrelations (Barzegar et al., 2024).
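As a concrete illustration, here is a minimal sketch (assuming $g(\beta)$ can be estimated, e.g., from short pilot runs) that spaces inverse temperatures at equal increments of thermodynamic length, so that steps automatically shrink where the energy variance spikes:

```python
import numpy as np

def geodesic_schedule(g, beta_start, beta_end, n_steps):
    """Annealing schedule with approximately constant thermodynamic speed,
    i.e. d(beta) * sqrt(g(beta)) = const along the path."""
    grid = np.linspace(beta_start, beta_end, 10_000)
    speed = np.sqrt(np.maximum([g(b) for b in grid], 1e-12))
    # Cumulative thermodynamic length L(beta) = integral of sqrt(g) d(beta).
    seg = 0.5 * (speed[1:] + speed[:-1]) * np.diff(grid)
    length = np.concatenate([[0.0], np.cumsum(seg)])
    # Invert L(beta): place the n_steps betas at equal arc-length spacing.
    targets = np.linspace(0.0, length[-1], n_steps)
    return np.interp(targets, length, grid)

# Toy g(beta) with a variance spike near beta = 1 (a stand-in for a
# critical region); the schedule concentrates steps there automatically.
g = lambda b: 1.0 + 50.0 * np.exp(-(b - 1.0) ** 2 / 0.01)
print(np.round(geodesic_schedule(g, 0.1, 2.0, 20), 3))
```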

2. Core Algorithmic Schemes: AIS, Population Annealing, and Ensemble Annealing

Several canonical algorithms instantiate temperature-annealed sampling via different forms of population management, resampling, and importance weighting.

Annealed Importance Sampling (AIS)

AIS propagates $R^*$ independent trajectories through a scheduled sequence $\{\beta_k\}$, updating importance weights:

$$w_r \leftarrow w_r \times \exp[-(\beta_{k+1}-\beta_k)\,E(x_r^{(k)})].$$

Partition function ratios are estimated as $Z(\beta_K)/Z(\beta_0) \approx \frac{1}{R^*}\sum_r w_r$, and expectation values by weighted averages (Yasuda et al., 2020, Barzegar et al., 2024). AIS is robust to mode hopping via reweighting, though rare trajectories through high barriers dominate low-temperature estimators.
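The weight update above translates directly into code. Below is a minimal AIS sketch; the Metropolis kernel and Gaussian toy target are illustrative stand-ins (not from the cited papers), and log-weights are used for numerical stability:

```python
import numpy as np

def ais(energy, x0, mcmc_step, betas, rng):
    """Annealed importance sampling along the schedule betas[0..K].
    Returns final states, log-weights, and an estimate of
    log Z(beta_K) - log Z(beta_0)."""
    x = x0
    log_w = np.zeros(len(x))
    for k in range(len(betas) - 1):
        # Weight update: w_r *= exp[-(beta_{k+1} - beta_k) * E(x_r)].
        log_w += -(betas[k + 1] - betas[k]) * energy(x)
        # Move chains toward equilibrium at the new inverse temperature.
        x = mcmc_step(x, betas[k + 1], rng)
    # log[(1/R) * sum_r w_r], computed stably via log-sum-exp.
    log_Z_ratio = np.logaddexp.reduce(log_w) - np.log(len(x))
    return x, log_w, log_Z_ratio

# Toy check on E(x) = x^2/2, where Z(beta) is proportional to beta^{-1/2}:
# annealing from beta = 0.5 to 2.0 should give log Z-ratio = -0.5*ln 4 = -0.693.
def metropolis(x, beta, rng, sweeps=5):
    for _ in range(sweeps):
        prop = x + rng.normal(0.0, 1.0, size=x.shape)
        accept = rng.random(x.shape) < np.exp(-beta * (prop**2 - x**2) / 2)
        x = np.where(accept, prop, x)
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, np.sqrt(1 / 0.5), size=2000)  # samples at beta_0 = 0.5
_, _, est = ais(lambda x: x**2 / 2, x0, metropolis,
                np.linspace(0.5, 2.0, 100), rng)
print(est)  # approximately -0.693
```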

Population Annealing (PA)

Population annealing (Gessert et al., 2023, Barzegar et al., 2024) maintains a population of $R$ replicas, resampling at each temperature step. Replicas are duplicated or discarded with expected copy counts proportional to their relative Boltzmann weights, followed by MCMC equilibration:

| Resampling Scheme | Population Size | Variance Function $\text{sv}(\tau)$ |
|---|---|---|
| Multinomial | Fixed | $\tau$ |
| Systematic / Nearest-integer | Variable | $\epsilon(1-\epsilon)$ |
| Residual | Mixed | $\epsilon$ |
| Stratified | Fixed | Varies ($1/3$ for $\tau \ge 1$) |
| Poisson | Variable | $\tau$ |

Effective population size and family-size growth metrics quantify statistical decorrelation and the cost of resampling. PA is especially powerful for equilibrating systems with nested or chaotic barriers, outperforming simple AIS in such settings (Gessert et al., 2023).
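For concreteness, here is a minimal sketch of a single PA temperature step using nearest-integer resampling (one of the low-variance schemes in the table above); the MCMC equilibration that follows each step is left abstract:

```python
import numpy as np

def pa_resample_step(x, energies, beta_old, beta_new, rng):
    """One population-annealing step: reweight, then nearest-integer resample.

    Expected copy count of replica r is tau_r = R * w_r / sum_j w_j with
    w_r = exp[-(beta_new - beta_old) * E_r]; each replica is copied
    floor(tau_r) or ceil(tau_r) times (low-variance resampling)."""
    R = len(x)
    log_w = -(beta_new - beta_old) * energies
    tau = R * np.exp(log_w - np.logaddexp.reduce(log_w))  # sums to R
    # Nearest-integer resampling: floor(tau) + Bernoulli(frac(tau)) copies.
    copies = np.floor(tau).astype(int) + (rng.random(R) < (tau - np.floor(tau)))
    new_x = np.repeat(x, copies, axis=0)
    return new_x  # follow with MCMC sweeps at beta_new to re-equilibrate
```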

Ensemble Annealing

Ensemble annealing (Habeck, 2015) refines both the temperature schedule and an on-the-fly estimate of the density of states (DOS) by adaptively advancing the ensemble so as to maintain a constant relative entropy (KL divergence) $D$ between successive distributions:

$$D(p_{k+1}\,\|\,p_k) = (\beta_k-\beta_{k+1})\,\langle E\rangle_{\beta_{k+1}} + \ln\frac{Z_k}{Z_{k+1}} = D$$

The algorithm uses nonparametric histogram reweighting ("WHAM") to update the DOS and selects the next $\beta_{k+1}$ by root-finding so as to keep ensemble overlap optimal. Ensemble annealing unifies and generalizes simulated annealing, parallel tempering (REMD), and histogram-reweighting schemes, supporting applications in physical simulation and inference (Habeck, 2015).
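A minimal sketch of the schedule-selection step, assuming energies sampled at $\beta_k$: both terms of $D$ are estimated by importance reweighting, and $\beta_{k+1}$ is found by bisection (a simple stand-in for the paper's DOS-based root-finding):

```python
import numpy as np

def next_beta(energies, beta_k, target_D, delta_max=5.0, tol=1e-6):
    """Pick beta_{k+1} > beta_k so the estimated KL divergence
    D(p_{k+1} || p_k) matches target_D, from energies sampled at beta_k."""
    def kl(delta):
        log_w = -delta * energies                       # unnormalized log-weights
        log_sum = np.logaddexp.reduce(log_w)
        log_mean_w = log_sum - np.log(len(energies))    # estimates ln(Z_{k+1}/Z_k)
        w = np.exp(log_w - log_sum)                     # normalized weights
        mean_E_new = np.dot(w, energies)                # <E> at beta_{k+1}
        return -delta * mean_E_new - log_mean_w         # estimated D
    # D(0) = 0 and D grows with delta: bisect for the target overlap.
    lo, hi = 0.0, delta_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if kl(mid) < target_D else (lo, mid)
    return beta_k + 0.5 * (lo + hi)
```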

3. Temperature-Annealed Sampling in Generative Models and Machine Learning

Temperature-annealed schedules are not restricted to physics; they find application in generative modeling, multilingual language modeling, and reinforcement learning.

Temperature-Annealed Boltzmann Generators (TA-BG)

TA-BG (Schopmans et al., 31 Jan 2025) addresses mode collapse in normalizing-flow models by pretraining at high temperature via the reverse KL divergence, then gradually lowering the temperature through importance-weighted forward-KL updates:

  • Initial fit: minimize $D_{\text{KL}}[q_\theta(x)\,\|\,p_{T_\text{high}}(x)]$
  • Annealing: iterative reweighting using $w(x) = \exp[-U(x)\,\Delta\beta]$, followed by forward-KL optimization at each scheduled $T_{i+1}$
  • Schedule: geometric progression $T_{i+1} = T_\text{high}(T_\text{low}/T_\text{high})^{i/K}$

This enhances sample coverage across metastable states, preserves effective sample size (ESS), and reduces computational cost (Schopmans et al., 31 Jan 2025).
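Below is a minimal sketch of the schedule and reweighting pieces; the flow model $q_\theta$ and the potential energy $U$ are assumed to exist externally, and only the temperature ladder and importance weights are shown:

```python
import numpy as np

def geometric_schedule(T_high, T_low, K):
    """Geometric ladder T_i = T_high * (T_low / T_high)^(i/K), i = 0..K."""
    return T_high * (T_low / T_high) ** (np.arange(K + 1) / K)

def anneal_weights(U_x, T_i, T_next, k_B=1.0):
    """Importance weights w(x) = exp[-U(x) * delta_beta] for reweighting
    samples drawn at T_i into training targets at T_next (< T_i)."""
    delta_beta = 1.0 / (k_B * T_next) - 1.0 / (k_B * T_i)
    log_w = -U_x * delta_beta
    return np.exp(log_w - np.logaddexp.reduce(log_w))  # normalized weights

# Usage sketch: at each stage, draw x ~ q_theta, evaluate U(x), reweight
# to T_{i+1}, then minimize the forward KL (weighted NLL) on those samples.
```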

Multilingual Training: Inverse-Temperature Schedules

mmBERT (Marone et al., 8 Sep 2025) employs a temperature-like exponent $\tau$ to modulate sampling probabilities over languages:

$$p_\ell^{(s)} = \frac{n_\ell^{\tau_s}}{\sum_{j=1}^L n_j^{\tau_s}}$$

with $\tau$ annealed from $0.7$ to $0.3$ across training phases. This shifts the data mixture from high-resource-dominated to near-uniform across hundreds to thousands of languages, unlocking zero-shot performance on low-resource tasks while avoiding noise-driven collapse early in training (Marone et al., 8 Sep 2025).
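A minimal sketch of the sampling rule (the three-language corpus counts below are toy values):

```python
import numpy as np

def language_probs(counts, tau):
    """Sampling probabilities p_l proportional to n_l^tau: tau = 1 samples
    in proportion to corpus size, tau -> 0 approaches uniform."""
    scaled = np.asarray(counts, dtype=float) ** tau
    return scaled / scaled.sum()

counts = [1e9, 1e7, 1e5]             # toy high/mid/low-resource token counts
print(language_probs(counts, 0.7))   # early phase: skewed to high-resource
print(language_probs(counts, 0.3))   # late phase: much closer to uniform
```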

Sequential Decoding: Exploratory Annealed Decoding (EAD)

EAD (Yang et al., 6 Oct 2025) applies per-token temperature schedules in autoregressive LLM decoding, beginning with a high temperature for exploration at the start of the sequence and cooling to a low temperature for sample quality and policy adherence:

$$\tau_t = \max\{1 + \tau_{\text{max}} - \exp(t/d),\; \tau_{\min}\}$$

Plug-and-play integration with RL-based reward optimization offers a superior exploration–exploitation balance relative to fixed-temperature inference (Yang et al., 6 Oct 2025).
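A minimal sketch of the schedule and the sampling step; the hyperparameter values and the `logits_fn` forward pass are illustrative stand-ins:

```python
import numpy as np

def ead_temperature(t, tau_max=1.5, tau_min=0.3, d=50.0):
    """Per-token schedule tau_t = max(1 + tau_max - exp(t/d), tau_min):
    starts at tau_max (exploration) and decays toward tau_min (exploitation).
    tau_max, tau_min, and d here are illustrative values."""
    return max(1.0 + tau_max - np.exp(t / d), tau_min)

def sample_token(logits, tau, rng):
    """Temperature-scaled categorical sampling from raw logits."""
    z = logits / tau
    p = np.exp(z - z.max())
    return rng.choice(len(p), p=p / p.sum())

# Decode-loop sketch; logits_fn stands in for the LLM forward pass:
#   for t in range(max_len):
#       token = sample_token(logits_fn(prefix), ead_temperature(t), rng)
#       prefix.append(token)
```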

4. Temperature-Annealed Sampling in Annealers and Molecular Simulation

Quantum and Thermal Annealers

Quantum annealing hardware (e.g., D-Wave QA) can behave as a tunable thermal Gibbs sampler at a hardware-specific effective temperature $T_{\text{eff}}$ (Nelson et al., 2021). Adjusting the input energy scale $\alpha_{\text{in}}$ and the anneal time $t_a$ tunes $T_{\text{eff}}$:

  • $\alpha_{\text{in}} \in [0.2, 0.4]$ yields optimal sampling behavior
  • The effective temperature scale $\alpha_{\text{out}}$ is extracted by minimizing the total-variation distance between hardware output and the ideal Gibbs law
  • Cumulative gauge averaging and subgraph selection are essential for noise mitigation and sampling fidelity
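For small, enumerable problems, the total-variation fit reduces to a one-dimensional scan; here is a minimal sketch (the candidate grid and the assumption of enumerable states are simplifications):

```python
import numpy as np

def effective_temperature(state_energies, empirical_probs, betas):
    """Scan candidate inverse temperatures and return the one whose Gibbs
    distribution is closest (in total-variation distance) to the empirical
    state frequencies returned by the annealer."""
    best_beta, best_tv = None, np.inf
    for beta in betas:
        log_p = -beta * state_energies
        p = np.exp(log_p - np.logaddexp.reduce(log_p))  # exact Gibbs law
        tv = 0.5 * np.abs(p - empirical_probs).sum()
        if tv < best_tv:
            best_beta, best_tv = beta, tv
    return 1.0 / best_beta, best_tv  # effective temperature, residual TV
```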

Temperature Estimation in Heuristic Annealers

Annealers may freeze out globally at a higher temperature than intended, producing both local and global discrepancies from the true Boltzmann distribution. Maximum-likelihood (ML, energy-matching), MSE (correlation-matching), and MLPL (pseudo-likelihood) estimators are used to extract the best-fit operational temperature. Lightweight post-processing (blocked Gibbs sweeps) flattens local bias, sharpening the global temperature estimate (Raymond et al., 2016).
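As an illustration of the energy-matching (ML) estimator for an enumerable model, one can solve $\langle E\rangle_\beta = \bar E$ by bisection, using the fact that the Gibbs mean energy decreases monotonically in $\beta$; this is a sketch, not the paper's exact procedure:

```python
import numpy as np

def ml_temperature(state_energies, sample_energies,
                   beta_lo=1e-3, beta_hi=100.0, tol=1e-8):
    """Maximum-likelihood (energy-matching) temperature estimate:
    find beta such that <E>_beta equals the mean sampled energy."""
    target = np.mean(sample_energies)
    def mean_energy(beta):
        log_p = -beta * state_energies
        p = np.exp(log_p - np.logaddexp.reduce(log_p))
        return np.dot(p, state_energies)
    # <E>_beta is decreasing in beta, so bisection converges.
    while beta_hi - beta_lo > tol:
        mid = 0.5 * (beta_lo + beta_hi)
        beta_lo, beta_hi = (mid, beta_hi) if mean_energy(mid) > target else (beta_lo, mid)
    return 1.0 / (0.5 * (beta_lo + beta_hi))
```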

Integrated Tempering Sampling (ITS)

ITS (Zhao et al., 2013) combines canonical distributions at multiple temperatures with non-Boltzmann prefactors $n_k$, constructing a composite bias that adaptively flattens barriers:

$$W(x) = \sum_{k=1}^N n_k\, e^{-\beta_k U(x)}$$

The temperature grid $\{\beta_k\}$ and weights $n_k$ are computed from short canonical averages, enforcing smooth energy-histogram overlap (a parameter $t$ controls exchange-like acceptance). ITS requires only a single trajectory plus posthoc reweighting for observables, providing efficient coverage with minimal computational overhead (Zhao et al., 2013).
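A minimal sketch of the composite weight and the posthoc reweighting of an ITS trajectory to a target canonical ensemble, computed in log space for stability:

```python
import numpy as np

def log_its_weight(U_x, betas, n_k):
    """log W(x) with W(x) = sum_k n_k exp(-beta_k U(x))."""
    log_terms = np.log(n_k)[:, None] - np.outer(betas, np.atleast_1d(U_x))
    return np.logaddexp.reduce(log_terms, axis=0)

def canonical_average(obs, U_traj, betas, n_k, beta_target):
    """Reweight an ITS trajectory to the canonical ensemble at beta_target:
    weights proportional to exp(-beta_target * U) / W(x)."""
    log_w = -beta_target * U_traj - log_its_weight(U_traj, betas, n_k)
    w = np.exp(log_w - np.logaddexp.reduce(log_w))  # normalized
    return np.dot(w, obs)
```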

5. Empirical Performance, Schedule Tuning, and Practical Guidelines

Empirical studies on Ising/Potts models, spin glasses, peptides, and polymer chains reveal tradeoffs in schedule design, population size, resampling method, and mixing. Key practical themes include:

  • Adaptive temperature steps based on the thermodynamic metric (variance or overlap) outperform fixed schedules (flat or fixed-overlap).
  • Nearest-integer or systematic resampling minimizes correlation and resampling noise; multinomial and Poisson resampling degrade the effective population fastest (Gessert et al., 2023).
  • Population size should scale at least as $L^{d/2}$ in $d$-dimensional systems to control family-size growth.
  • Targeting histogram overlaps of $0.7$–$0.8$ when setting the schedule enhances mixing and accuracy.
  • Monitoring effective sample size (ESS, computed as in the sketch after this list), replica family size ($\rho_t$), and observables as functions of $\beta$ enables dynamic parameter adjustment, insertion of intermediate temperatures, and increased equilibration near critical regions (Barzegar et al., 2024, Gessert et al., 2023).
  • Light post-processing is universally recommended in annealer-based workflows to correct local errors before global temperature estimation (Raymond et al., 2016).
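The ESS monitor referenced above is a one-liner from unnormalized log-weights:

```python
import numpy as np

def effective_sample_size(log_w):
    """ESS = (sum w)^2 / sum w^2, computed stably from log-weights;
    equals len(log_w) when all weights are equal."""
    w = np.exp(log_w - np.max(log_w))
    return w.sum() ** 2 / np.dot(w, w)
```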

6. Applications and Extensions

Temperature-annealed sampling is foundational in:

  • Monte Carlo integration and partition-function estimation (AIS, population annealing)
  • Quantum and thermal annealers (effective-temperature Gibbs sampling)
  • Molecular simulation (integrated tempering sampling, ensemble annealing)
  • Generative modeling (temperature-annealed Boltzmann generators)
  • Multilingual language modeling and reinforcement learning with autoregressive models (inverse-temperature data schedules, annealed decoding)

Advances in adaptive schedule design, population resampling analysis, and hybrid integration with variational flows and MCMC continue to drive performance and generalization in both statistical and computational domains.


Temperature-annealed sampling synthesizes principles from thermodynamics, algorithmic control, and statistical estimation; it underpins state-of-the-art workflows in sampling, inference, optimization, and machine learning, with a robust theoretical and empirical basis across research fields (Yasuda et al., 2020, Barzegar et al., 2024, Habeck, 2015, Gessert et al., 2023, Schopmans et al., 31 Jan 2025, Zhao et al., 2013, Raymond et al., 2016, Nelson et al., 2021, Marone et al., 8 Sep 2025, Yang et al., 6 Oct 2025).
