
Parallel-Tempered SGHMC

Updated 18 January 2026
  • Parallel-Tempered SGHMC is an MCMC method that integrates Nosé–Hoover thermostatted dynamics with parallel tempering to enhance multimodal sampling efficiency.
  • It employs a population of replicas with a geometric temperature ladder and periodic state exchanges to overcome local energy barriers.
  • Empirical results show improved effective sample size and reduced mixing time, boosting Bayesian neural network performance on complex datasets.

Parallel-Tempered Stochastic Gradient Hamiltonian Monte Carlo (Parallel-Tempered SGHMC), more precisely described as Parallel-Tempered Stochastic Gradient Nosé–Hoover Thermostat (PT-SGNHT), is a Markov Chain Monte Carlo (MCMC) method integrating parallel tempering with Nosé–Hoover (NH) thermostatted dynamics to efficiently address multimodal posterior sampling under stochastic gradients. By orchestrating a population of replicas at a geometrically spaced ladder of temperatures and enabling periodic global state swaps, PT-SGHMC achieves reliable ergodic exploration of complex posterior landscapes, crucial for high-dimensional Bayesian inference and deep neural network learning with mini-batches (Luo et al., 2018).

1. Algorithmic Structure and Continuous-Time Dynamics

PT-SGHMC maintains $R$ independent replicas, each evolving under NH dynamics at distinct temperatures $T_1 < \ldots < T_R$. Each replica follows discretized NH evolution with parameters adapted to its temperature:

  • The extended Hamiltonian for replica $i$ at inverse temperature $\beta_i = 1/T_i$ is

$$H(\theta, p, \xi) = \beta_i U(\theta) + \frac{1}{2} p^\top M^{-1} p + \frac{1}{2} \xi^2 Q,$$

where $U(\theta) = -\log \pi(\theta \mid \mathcal{D})$ and $\xi$ is the thermostat variable.

  • The evolution of each replica is governed by

$$d\theta = M^{-1} p\,dt, \qquad dp = -\beta_i \nabla U(\theta)\,dt - \xi\,p\,dt, \qquad d\xi = \frac{p^\top M^{-1} p - D}{Q}\,dt,$$

with stochastic-gradient noise incorporated as an implicit Brownian term.

  • Discretized with step size $\epsilon$ and stochastic gradient $\nabla \tilde U$, the update for replica $i$ reads:

$$\theta_i \leftarrow \theta_i + \epsilon M^{-1} p_i, \qquad p_i \leftarrow p_i - \epsilon \frac{\nabla \tilde U(\theta_i)}{T_i} - \epsilon \xi_i p_i, \qquad \xi_i \leftarrow \xi_i + \epsilon \frac{p_i^\top M^{-1} p_i - D}{Q}.$$

The algorithm alternates between independent NH updates and exchange moves that swap full $(\theta, p, \xi)$ states between adjacent replicas, preserving detailed balance on the extended state space.
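
For concreteness, here is a minimal NumPy sketch of one such discretized update for a single replica, assuming an identity mass matrix $M = I$; the names `sgnht_step` and `grad_U_tilde` (the stochastic-gradient oracle) are illustrative, not from the source:

```python
import numpy as np

def sgnht_step(theta, p, xi, grad_U_tilde, T, eps, Q):
    """One discretized Nose-Hoover update for a replica at temperature T.

    Minimal sketch with identity mass matrix (M = I); `grad_U_tilde`
    returns a stochastic estimate of the potential gradient.
    """
    D = theta.size
    theta = theta + eps * p                                # position update (M^{-1} p = p)
    p = p - eps * grad_U_tilde(theta) / T - eps * xi * p   # tempered force + thermostat friction
    xi = xi + eps * (p @ p - D) / Q                        # thermostat tracks kinetic energy
    return theta, p, xi
```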

2. Exchange Moves and Parallel Tempering Protocol

In the parallel tempering framework, configuration swaps between replicas at neighboring temperatures enhance exploration by enabling low-temperature chains to inherit modes discovered by the more mobile high-temperature chains. The acceptance probability for exchanging states between replicas $(i, i+1)$ is determined using Barker's test:

  • Compute $\Delta = (1/T_i - 1/T_{i+1})\,[U(\theta_i) - U(\theta_{i+1})]$.
  • Acceptance probability $\alpha = 1/(1 + \exp(-\Delta))$.
  • Draw $u \sim \mathrm{Uniform}(0,1)$; swap if $u < \alpha$.
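
This Barker rule preserves detailed balance for the product target because the ratio of the joint density after and before the swap is

$$\frac{\pi_i(\theta_{i+1})\,\pi_{i+1}(\theta_i)}{\pi_i(\theta_i)\,\pi_{i+1}(\theta_{i+1})} = \exp\!\left[\left(\frac{1}{T_i} - \frac{1}{T_{i+1}}\right)\bigl(U(\theta_i) - U(\theta_{i+1})\bigr)\right] = e^{\Delta},$$

and the forward-to-reverse acceptance ratio $\alpha(\Delta)/\alpha(-\Delta) = e^{\Delta}$ matches it exactly.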

Replica exchanges transfer the entire $(\theta, p, \xi)$ triplet, ensuring each configuration's adaptive thermostat tracks its temperature history, which preserves ergodicity and reversibility.
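
A matching sketch of the exchange move between adjacent replicas; `states` (a list of $(\theta, p, \xi)$ triplets), the potential evaluator `U`, and the generator `rng` are assumed, hypothetical names:

```python
import numpy as np

def try_swap(states, U, T, i, rng):
    """Barker-test exchange between replicas i and i+1.

    The full (theta, p, xi) triplet is swapped so each thermostat
    stays with its configuration as it changes temperature.
    """
    delta = (1.0 / T[i] - 1.0 / T[i + 1]) * (U(states[i][0]) - U(states[i + 1][0]))
    alpha = 1.0 / (1.0 + np.exp(-delta))   # Barker acceptance probability
    if rng.uniform() < alpha:
        states[i], states[i + 1] = states[i + 1], states[i]
        return True
    return False
```

In practice the move would be attempted every $L$ iterations (cf. Section 4).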

3. Theoretical Guarantees and Multimodal Exploration

Each individual replica's NH dynamics at temperature $T_i$ leaves the stationary distribution $\pi_i(\theta) \propto \exp(-U(\theta)/T_i)$ invariant and is ergodic. Swap proposals strictly maintain detailed balance for the joint product distribution $\prod_i \pi_i$. Thus, the overall chain remains reversible and ergodic, with the marginal at $T_1 = 1$ targeting $\pi(\theta \mid \mathcal{D})$.

High-temperature replicas facilitate movement across energy barriers, preventing multimodal trapping. Swapping propagates these diversified configurations to lower temperatures, an effect unattainable with single-chain SGHMC or SGNHT under high gradient noise.

4. Hyperparameterization and Computational Considerations

Hyperparameter selection is central for stability and efficiency:

  • A geometric temperature ladder is recommended: $T_i = T_{\max}^{(i-1)/(R-1)}$, with $T_{\max} \approx 10$–$100$ and $R = 10$–$20$.
  • Mass matrix $M$ as identity or diagonal (possibly estimated from past gradients).
  • Thermostat inertia $Q$ ideally $O(D)$ for smooth adaptation.
  • Step size $\epsilon$ via dual-averaging pilot runs.
  • Swap interval $L = 10$–$50$ iterations.
  • Computational complexity per iteration is $O(R \cdot \mathrm{cost}(\nabla \tilde U))$.
  • Swap communication involves $O(D)$ floats between adjacent worker pairs.

These guidelines ensure scalability to large models and datasets. Packing replicas into a single $[R \times D]$ tensor facilitates batched GPU execution, and state exchange is optimized using asynchronous point-to-point (MPI/NCCL) communication.
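
For illustration, a minimal sketch of the geometric ladder and the batched $[R \times D]$ packing, with NumPy standing in for a GPU tensor library; all names, shapes, and the identity mass matrix are assumptions:

```python
import numpy as np

R, D, T_max = 10, 1000, 100.0
T = T_max ** (np.arange(R) / (R - 1))   # geometric ladder: T[0] = 1, T[-1] = T_max

rng = np.random.default_rng(0)
theta = np.zeros((R, D))                # all replicas packed into one [R, D] array
p = rng.standard_normal((R, D))
xi = np.zeros(R)

def batched_step(theta, p, xi, grad_U_tilde, T, eps, Q):
    """Vectorized NH update over all replicas at once (identity mass matrix)."""
    D = theta.shape[1]
    theta = theta + eps * p
    p = p - eps * grad_U_tilde(theta) / T[:, None] - eps * xi[:, None] * p
    xi = xi + eps * ((p * p).sum(axis=1) - D) / Q
    return theta, p, xi
```

Here `grad_U_tilde` is assumed to evaluate all replicas' stochastic gradients in one batched call, which is what makes the single-tensor layout pay off on GPUs.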

5. Empirical Performance on Synthetic and Real-World Problems

On a 1D 4-component Gaussian mixture (gradient noise $\sigma^2 = 0.25$), PT-SGNHT discovers all modes within approximately $10^3$ iterations, outperforming classic SGNHT/HMC, which remain trapped. At $T = 1$, the effective sample size is tripled. In 2D, PT-SGNHT consistently finds all five modes, with mixing time halved relative to standard SGNHT.

For Bayesian neural networks (three-layer MLP, MNIST, batch size 100):

  • PT achieves test log-likelihood $-0.093$ (compared to $-0.102$ for SGHMC and $-0.115$ for SGNHT).
  • Accuracy increases by approximately $0.6\%$; predictive entropy decreases, indicating superior uncertainty quantification.

6. Implementation Details and Practical Tips

Optimizations include:

  • Packing all $R$ replicas into a single tensor to exploit batched computation.
  • Regularly estimating the gradient-noise variance $\sigma^2$ and adjusting $Q$ for robust thermostat adaptation.
  • For noisy mini-batch acceptance under $U$, a Hermite-expansion correction $p_\mathcal{C}$ (with $k \leq 4$) mitigates swap bias.
  • Non-blocking communication (asynchronous MPI/NCCL) for efficient local state swaps.
  • Maintaining per-replica random seeds and checkpointing full $(\theta, p, \xi)$ states for reproducibility and convenient warm restarts; see the sketch after this list.
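
A small checkpointing sketch in the spirit of the last point, assuming one NumPy `Generator` per replica; the `.npz` layout and field names are illustrative:

```python
import numpy as np

def save_checkpoint(path, theta, p, xi, T, rngs, step):
    """Persist full (theta, p, xi) states plus every replica's RNG state."""
    np.savez(path, theta=theta, p=p, xi=xi, T=T, step=step,
             rng_states=np.array([r.bit_generator.state for r in rngs], dtype=object))

def load_checkpoint(path):
    """Restore states and RNGs for an exact warm restart."""
    ckpt = np.load(path, allow_pickle=True)
    rngs = []
    for state in ckpt["rng_states"]:
        r = np.random.default_rng()
        r.bit_generator.state = state   # resume the stream where it left off
        rngs.append(r)
    return ckpt["theta"], ckpt["p"], ckpt["xi"], ckpt["T"], rngs, int(ckpt["step"])
```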

7. Significance and Use Cases

Parallel-Tempered SGHMC/SGNHT addresses key bottlenecks in scalable Bayesian computation, especially for models with pronounced multimodality and data-partition-induced gradient noise. By integrating parallel tempering with adaptive NH thermostats, it enables principled, reliable posterior inference central to deep Bayesian learning for large-scale datasets, where conventional single-chain MCMC strategies are inadequate (Luo et al., 2018).

References

  • Luo et al., 2018.
