RLMH: Reinforcement Learning Metropolis-Hastings

Updated 2 July 2025
  • RLMH is an adaptive MCMC method that frames Metropolis-Hastings kernel tuning as a reinforcement learning task for efficient Bayesian inference.
  • It employs state-dependent proposals and composite reward functions, such as the Contrastive Divergence Lower Bound, to optimize convergence and ensure ergodicity.
  • Empirical studies demonstrate that RLMH automates sampler tuning effectively, outperforming classical methods in diverse probabilistic models.

Reinforcement Learning Metropolis-Hastings (RLMH) is an adaptive Markov Chain Monte Carlo (MCMC) methodology in which the design or tuning of Metropolis-Hastings transition kernels is framed as a reinforcement learning (RL) task. RLMH integrates principles from RL, stochastic optimization, and MCMC to automate and enhance sampling from complex distributions, particularly in Bayesian inference and probabilistic machine learning. Recent research has established both theoretical foundations and practical effectiveness for RLMH, demonstrating that such approaches can reliably yield fast-mixing, adaptive samplers across a range of real-world scenarios.

1. Conceptual Foundation: Metropolis-Hastings as a Markov Decision Process

The theoretical underpinning of RLMH is the realization that adaptive tuning of MH samplers can be formulated as a Markov Decision Process (MDP) (2405.13574, 2507.00671). In this formalism:

  • State: Typically defined as $(X_n, X_{n+1}^*)$, where $X_n$ is the current state and $X_{n+1}^*$ is the proposed state of the chain.
  • Action: The action $a_n$ specifies the parameters of the proposal mechanism, usually realized as a mapping $a_n = [\phi(X_n), \phi(X_{n+1}^*)]$ for a learnable function $\phi$.
  • Transition and Reward: The chain transitions using a parameterized proposal and the Metropolis-Hastings acceptance step. The reward function is crafted to incentivize efficient exploration and rapid convergence toward the stationary distribution.

This MDP framing allows the use of reinforcement learning algorithms such as Deep Deterministic Policy Gradient (DDPG), placing RLMH within the broader context of automated probabilistic computation and adaptive MCMC.
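
The sketch below makes this MDP framing concrete for a Gaussian random-walk proposal whose scale is produced by a learnable map $\phi$. All names (log_target, phi, mh_mdp_step), the toy Gaussian target, and the choice of realized squared jump distance as the reward are illustrative assumptions, not the construction used in the cited papers.

```python
import numpy as np

def log_target(x):
    # Illustrative target: standard Gaussian log-density (unnormalized).
    return -0.5 * np.sum(x ** 2)

def phi(x, theta):
    # Illustrative learnable map: a positive, state-dependent proposal scale.
    return np.exp(theta[0] + theta[1] * np.tanh(np.sum(x ** 2)))

def log_q(y, x, theta):
    # Gaussian proposal density q_{phi(x)}(y | x), up to an additive constant.
    s = phi(x, theta)
    return -x.size * np.log(s) - 0.5 * np.sum((y - x) ** 2) / s ** 2

def mh_mdp_step(x, theta, rng):
    """One MH transition viewed as an MDP step: the MDP state is
    (X_n, X_{n+1}^*), the action collects [phi(X_n), phi(X_{n+1}^*)],
    and the reward here is the realized squared jump distance."""
    x_prop = x + phi(x, theta) * rng.standard_normal(x.shape)
    # Hastings ratio accounts for the state-dependent proposal scale.
    log_alpha = (log_target(x_prop) + log_q(x, x_prop, theta)
                 - log_target(x) - log_q(x_prop, x, theta))
    x_next = x_prop if np.log(rng.uniform()) < log_alpha else x
    action = np.array([phi(x, theta), phi(x_prop, theta)])
    reward = np.sum((x_next - x) ** 2)
    return x_next, (x, x_prop), action, reward

rng = np.random.default_rng(0)
x, theta = np.zeros(2), np.array([-1.0, 0.1])
for _ in range(1000):
    x, mdp_state, action, reward = mh_mdp_step(x, theta, rng)
```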

2. Policy Parameterization and Learning Algorithms

RLMH employs parameterized transition kernels, often leveraging neural networks for flexible and expressive state-dependent proposals.

  • Policy Structure: Proposals are parameterized as $q_{\phi(X_n)}(\cdot \mid X_n)$, where $\phi$ is optimized to maximize a performance criterion through policy gradient methods.
  • Reward Functions: Multiple reward signals have been considered:
    • Expected Squared Jump Distance (ESJD) and Average Acceptance Rate (AAR) quantify mixing efficiency, but have been found to provide insufficient learning signal for robust RL training in some settings (2507.00671); simple empirical estimators of both are sketched after this list.
    • Contrastive Divergence Lower Bound (CDLB): Introduced as a more informative reward, the CDLB is derived from the one-step decrease in Kullback–Leibler divergence between the chain's marginal and the target, capturing both exploitation (progress toward high probability regions) and exploration (entropy of the acceptance process).
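
For reference, the ESJD and AAR rewards admit straightforward empirical estimates from a window of chain history; the generic sketch below assumes the chain is stored as an array of states and a list of accept/reject indicators. The CDLB is omitted because it depends on the specific construction in (2507.00671).

```python
import numpy as np

def esjd(chain):
    # Expected squared jump distance, estimated from consecutive states.
    chain = np.asarray(chain)                     # shape (n_steps, dim)
    return np.mean(np.sum(np.diff(chain, axis=0) ** 2, axis=1))

def aar(accepts):
    # Average acceptance rate over a window of accept/reject indicators.
    return float(np.mean(np.asarray(accepts, dtype=float)))
```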

Policy parameters $\theta$ are adapted via stochastic policy gradient updates $\theta_{n+1} \gets \theta_n + \alpha_n \nabla_\theta J_n(\phi_\theta)$, where $J_n(\phi_\theta)$ is the empirical reward estimate and $\alpha_n$ is a decaying learning rate chosen to preserve convergence and ergodicity.
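
A crude single-parameter illustration of this update rule is sketched below. It assumes a symmetric Gaussian random-walk proposal whose log-scale is the policy parameter, uses the realized squared jump as the reward, and forms a one-sample score-function (REINFORCE-style) gradient term averaged over a window; this is a stand-in for the cited estimators, showing only the mechanics of the decaying-learning-rate update.

```python
import numpy as np

def log_target(x):
    # Illustrative target: standard Gaussian (unnormalized log-density).
    return -0.5 * np.sum(x ** 2)

rng = np.random.default_rng(1)
x = np.zeros(5)
d = x.size
theta = 0.0            # log proposal scale: the policy parameter being adapted
window, grads = 50, []

for n in range(1, 4001):
    sigma = np.exp(theta)
    x_prop = x + sigma * rng.standard_normal(x.shape)
    accept = np.log(rng.uniform()) < log_target(x_prop) - log_target(x)
    x_next = x_prop if accept else x

    # One-sample score-function term: realized squared jump distance
    # times the score of the Gaussian proposal density wrt theta.
    reward = np.sum((x_next - x) ** 2)
    score = -d + np.sum((x_prop - x) ** 2) / sigma ** 2
    grads.append(reward * score)
    x = x_next

    if n % window == 0:                     # empirical reward-gradient estimate
        m = n // window
        alpha_m = 0.05 * m ** -1.1          # decaying (summable) learning rate
        theta += alpha_m * np.mean(grads)
        grads = []
```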

3. Theoretical Guarantees: Ergodicity and Performance Metrics

A crucial requirement for adaptive MCMC is that adaptation must not compromise the correctness of the stationary distribution. RLMH achieves this through control mechanisms on learning rates and parameter updates:

  • Diminishing Adaptation: The step sizes for policy parameter updates satisfy $\sum_n \alpha_n < \infty$, and the norm of each update is capped, so that adaptation slows down and the kernel eventually stabilizes (2405.13574); a sketch of these safeguards follows this list.
  • Containment: Policy parameterization is chosen so that the family of transition kernels remains within an ergodic, well-behaved set.
  • Ergodicity: RLMH can guarantee invariance and geometric ergodicity of the sampled chain under mild assumptions, via extensions of the simultaneous minorisation and drift conditions framework.
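
A minimal sketch of the first two safeguards is given below, assuming the update direction is supplied by some gradient estimator as in Section 2; the schedule and cap values are arbitrary illustrations, not the settings used in the cited work.

```python
import numpy as np

def safeguarded_update(theta, grad_estimate, n, alpha0=0.1, eps=0.1, cap=1.0):
    """One policy update with a summable step-size schedule
    (sum_n alpha_n < inf) and a cap on the update norm, which together
    bound the total adaptation by cap * sum_n alpha_n."""
    alpha_n = alpha0 * (n + 1) ** -(1.0 + eps)    # summable in n
    g = np.asarray(grad_estimate, dtype=float)
    norm = np.linalg.norm(g)
    if norm > cap:
        g = g * (cap / norm)                      # cap the update norm
    return theta + alpha_n * g
```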

Empirical validation primarily uses metrics such as Maximum Mean Discrepancy (MMD) with respect to gold-standard posterior samples, as well as ESJD and acceptance rates.
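
For concreteness, a standard (biased, V-statistic) estimate of squared MMD under an RBF kernel can be computed as follows; the bandwidth and the stand-in sample arrays are illustrative choices.

```python
import numpy as np

def mmd_rbf(X, Y, bandwidth=1.0):
    """Squared MMD between sample sets X and Y under an RBF kernel
    (biased V-statistic estimate)."""
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

# Example: compare sampler output against gold-standard posterior draws.
rng = np.random.default_rng(0)
samples = rng.normal(size=(500, 2))        # stand-in for RLMH chain output
reference = rng.normal(size=(500, 2))      # stand-in for gold-standard draws
print(mmd_rbf(samples, reference))
```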

4. Adaptive Gradient-Based Samplers and Proposal Schemes

Modern RLMH extends beyond gradient-free adaptations by embracing gradient information and local geometry:

  • Hessian-Informed Proposals: For targets with strongly concentrated posteriors, Gaussian proposals with covariance matched to the Hessian of the negative log-posterior at the mode demonstrate spectral gap and mixing properties independent of concentration, supporting robust sampling in high-information regimes (2202.12127).
  • Position-Dependent Step Sizes: Adaptive Metropolis-adjusted Langevin algorithms (RMALA) with neural network-parameterized, state-dependent step sizes balance proposal flexibility with tractable learning and have demonstrated superior mixing for complex posteriors (2507.00671); a minimal sketch of such a proposal follows this list.
  • Score-Based MH: In scenarios where only the score function is available (e.g., diffusion models), learning the MH acceptance function using the detailed balance condition and scores allows robust sampling even from heavy-tailed or multimodal distributions (2501.00467).
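
The position-dependent step-size idea can be made concrete with the following sketch of a MALA step whose step size depends on the current state; step_size stands in for a learned network, and the Hastings correction uses the forward and reverse step sizes so the kernel remains invariant for the target. This illustrates the general construction under toy assumptions, not the RMALA implementation of (2507.00671).

```python
import numpy as np

def log_target(x):
    # Illustrative target: standard Gaussian (unnormalized).
    return -0.5 * np.sum(x ** 2)

def grad_log_target(x):
    return -x

def step_size(x, w):
    # Stand-in for a learned, state-dependent step size (e.g. a small net);
    # here just a positive function with parameters w.
    return np.exp(w[0] + w[1] * np.tanh(np.sum(x ** 2)))

def mala_log_q(y, x, eps):
    # Log-density (up to a constant) of the MALA proposal N(mu(x), eps^2 I).
    mu = x + 0.5 * eps ** 2 * grad_log_target(x)
    return -y.size * np.log(eps) - 0.5 * np.sum((y - mu) ** 2) / eps ** 2

def rmala_step(x, w, rng):
    """One MALA step with a position-dependent step size.  The Hastings
    correction uses eps(x) for the forward move and eps(x*) for the reverse."""
    eps_x = step_size(x, w)
    x_prop = (x + 0.5 * eps_x ** 2 * grad_log_target(x)
              + eps_x * rng.standard_normal(x.shape))
    eps_prop = step_size(x_prop, w)
    log_alpha = (log_target(x_prop) + mala_log_q(x, x_prop, eps_prop)
                 - log_target(x) - mala_log_q(x_prop, x, eps_x))
    return x_prop if np.log(rng.uniform()) < log_alpha else x

rng = np.random.default_rng(0)
x, w = np.zeros(3), np.array([-0.5, 0.1])
for _ in range(1000):
    x = rmala_step(x, w, rng)
```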

5. Gradient Estimation via Differentiable Metropolis-Hastings

Recent advances enable differentiating through the entire MH algorithm, using recoupling chain techniques that compute pathwise derivatives of intractable expectations with respect to sampler parameters (2406.14451). This enables continuous, unbiased gradient estimation for:

  • Automatic MCMC Tuning: Direct optimization of proposal parameters (e.g., step size, shape).
  • Sensitivity Analysis: Derivatives of posterior expectations with respect to hyperparameters, such as prior strength (a toy illustration of this quantity follows this list).
  • RL Applications: Differentiable MCMC is applicable in reinforcement learning to enable backpropagation through policy evaluation or environment models when expectations are estimated via MH.
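
To make the sensitivity-analysis use case concrete, the toy example below approximates the derivative of a posterior expectation with respect to a prior-precision hyperparameter using a naive central finite difference with common random numbers. This is not the recoupling-based pathwise estimator of (2406.14451); it only illustrates the quantity such estimators target, under an assumed toy Gaussian model.

```python
import numpy as np

def log_post(x, tau):
    # Toy posterior: Gaussian likelihood centered at 1 with a zero-mean
    # Gaussian prior of precision tau (the hyperparameter of interest).
    return -0.5 * np.sum((x - 1.0) ** 2) - 0.5 * tau * np.sum(x ** 2)

def mh_expectation(tau, seed, n_iter=20000, sigma=1.0):
    # Estimate E[x_1] under the posterior via random-walk MH.
    rng = np.random.default_rng(seed)
    x = np.zeros(2)
    total = 0.0
    for _ in range(n_iter):
        x_prop = x + sigma * rng.standard_normal(x.shape)
        if np.log(rng.uniform()) < log_post(x_prop, tau) - log_post(x, tau):
            x = x_prop
        total += x[0]
    return total / n_iter

# Naive central finite difference with common random numbers (same seed),
# approximating d/d tau of the posterior expectation (exact value: -0.25).
tau, h, seed = 1.0, 1e-2, 0
sens = (mh_expectation(tau + h, seed) - mh_expectation(tau - h, seed)) / (2 * h)
print(sens)
```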

6. Empirical Performance and Practical Considerations

RLMH has been validated on comprehensive benchmarks, notably PosteriorDB, which encompasses a suite of real Bayesian inference problems of varying complexity (2405.13574, 2507.00671).

  • Empirical Results: RLMH with gradient-free adaptation outperforms classical adaptive random-walk MH on approximately 90% of tasks in low to moderate dimensions. Gradient-based RLMH (e.g., position-dependent RMALA) further improves mixing quality, outperforming constant step-size MALA in 89% of tasks.
  • Stability: Rewards based on CDLB eliminate instability and catastrophic failures found with ESJD or AAR, especially in non-stationary phases.
  • Learning Timescale: Although training adaptive policies incurs computational overhead, this investment is justified in scenarios where accurate posterior inference is critical and repeated manual tuning is infeasible.

| Aspect | Classical Adaptive MCMC | RLMH Framework |
|---|---|---|
| Proposal Adaptation | Heuristic, often state-agnostic | Learned, state-dependent via RL |
| Reward/Objective | Acceptance rate, ESJD | CDLB, composite rewards |
| Theoretical Guarantee | Manual drift/minorisation check | Diminishing adaptation, RL containment |
| Practical Tuning | Manual, potentially error-prone | Automated, data-driven |

7. Extensions and Open Directions

RLMH defines a unifying interface between RL and MCMC, suggesting several avenues for ongoing development:

  • Scalable Architectures: Further research is directed at developing scalable acceptance networks and efficient Hessian or Fisher information approximations for high-dimensional targets.
  • Complex Sampling Tasks: The framework is extensible to Hamiltonian Monte Carlo, multiple-try MH, and delayed acceptance schemes.
  • Score-Based and Likelihood-Free Sampling: The ability to learn MH accept/reject mechanisms from scores rather than densities enables sampling in likelihood-free or simulator-based models.
  • Reinforcement Learning Applications: RLMH methodologies may be integrated within RL environments for adaptive policy sampling, exploration, and posterior RL, enabling robust, uncertainty-aware agent behavior in complex domains.

A plausible implication is that, as reinforcement learning algorithms and neural function approximation techniques continue to scale, RL-driven adaptive samplers such as RLMH will become increasingly central to automated, reliable Bayesian computation in scientific and industrial applications.