
When does Metropolized Hamiltonian Monte Carlo provably outperform Metropolis-adjusted Langevin algorithm?

Published 10 Apr 2023 in stat.CO, cs.CC, and stat.ML (arXiv:2304.04724v2)

Abstract: We analyze the mixing time of Metropolized Hamiltonian Monte Carlo (HMC) with the leapfrog integrator to sample from a distribution on $\mathbb{R}^d$ whose log-density is smooth, has Lipschitz Hessian in Frobenius norm and satisfies isoperimetry. We bound the gradient complexity to reach $\epsilon$ error in total variation distance from a warm start by $\tilde O(d^{1/4}\,\mathrm{polylog}(1/\epsilon))$ and demonstrate the benefit of choosing the number of leapfrog steps to be larger than 1. To surpass previous analysis on Metropolis-adjusted Langevin algorithm (MALA) that has $\tilde O(d^{1/2}\,\mathrm{polylog}(1/\epsilon))$ dimension dependency in Wu et al. (2022), we reveal a key feature in our proof that the joint distribution of the location and velocity variables of the discretization of the continuous HMC dynamics stays approximately invariant. This key feature, when shown via induction over the number of leapfrog steps, enables us to obtain estimates on moments of various quantities that appear in the acceptance rate control of Metropolized HMC. Moreover, to deal with another bottleneck on the HMC proposal distribution overlap control in the literature, we provide a new approach to upper bound the Kullback-Leibler divergence between push-forwards of the Gaussian distribution through HMC dynamics initialized at two different points. Notably, our analysis does not require log-concavity or independence of the marginals, and only relies on an isoperimetric inequality. To illustrate the applicability of our result, several examples of natural functions that fall into our framework are discussed.


Summary

  • The paper shows that Metropolized HMC attains gradient complexity $\tilde O(d^{1/4}\,\mathrm{polylog}(1/\epsilon))$, outperforming MALA's $\tilde O(d^{1/2}\,\mathrm{polylog}(1/\epsilon))$.
  • It demonstrates that using multiple leapfrog steps enhances proposal overlap and acceptance rates, leading to improved mixing times.
  • The analysis provides practical guidelines for parameter tuning and resource allocation in high-dimensional Bayesian inference.

Analysis of Metropolized HMC vs MALA

Introduction

The paper "When does Metropolized Hamiltonian Monte Carlo provably outperform Metropolis-adjusted Langevin algorithm?" explores the performance boundaries of Metropolized Hamiltonian Monte Carlo (HMC) compared to the Metropolis-adjusted Langevin algorithm (MALA) under specific conditions. It analyzes mixing times, focusing on the gradient complexity necessary to achieve a specified error in total variation distance when sampling from smooth target densities.

Key Contributions

  1. Mixing Time and Gradient Complexity Analysis: The paper establishes that, for Metropolized HMC with the leapfrog integrator, the gradient complexity needed to achieve $\epsilon$ error in total variation distance is $\tilde O(d^{1/4}\,\mathrm{polylog}(1/\epsilon))$ for target distributions with a smooth log-density. This improves on existing results for MALA, which scale as $\tilde O(d^{1/2}\,\mathrm{polylog}(1/\epsilon))$.
  2. Advantage of Multiple Leapfrog Steps: The study identifies scenarios where taking more than one leapfrog step ($K > 1$) in HMC leads to better performance than MALA ($K = 1$). This is due to the approximate invariance of the joint distribution of the location and velocity variables under the discretized HMC dynamics.
  3. Proposal Overlap and Acceptance Rate: Novel techniques are introduced to address the challenges of controlling the overlap of proposal distributions and the acceptance rate in HMC dynamics. A new approach upper bounds the Kullback-Leibler divergence between the push-forwards of Gaussian distributions through HMC dynamics initialized at two different points.
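The Metropolized HMC transition studied here (a leapfrog discretization followed by a Metropolis-Hastings accept/reject on the joint Hamiltonian) can be sketched as below. This is a generic textbook implementation, not code from the paper, and the function names are our own:

```python
import numpy as np

def leapfrog(x, v, grad_log_p, eta, n_steps):
    """n_steps leapfrog steps with step size eta for the Hamiltonian
    H(x, v) = -log p(x) + |v|^2 / 2, so the force on x is grad log p."""
    v = v + 0.5 * eta * grad_log_p(x)      # initial half step for velocity
    for _ in range(n_steps - 1):
        x = x + eta * v                    # full step for position
        v = v + eta * grad_log_p(x)        # full step for velocity
    x = x + eta * v                        # last position step
    v = v + 0.5 * eta * grad_log_p(x)      # final half step for velocity
    return x, v

def metropolized_hmc_step(x, log_p, grad_log_p, eta, n_steps, rng):
    """One transition: resample the velocity, run leapfrog, then accept or
    reject with the Metropolis correction that makes p exactly invariant."""
    v = rng.standard_normal(x.shape)       # velocity ~ N(0, I_d)
    x_prop, v_prop = leapfrog(x, v, grad_log_p, eta, n_steps)
    # Leapfrog is reversible and volume-preserving, so the MH acceptance
    # ratio reduces to exp(H(x, v) - H(x_prop, v_prop)).
    h_cur = -log_p(x) + 0.5 * v @ v
    h_prop = -log_p(x_prop) + 0.5 * v_prop @ v_prop
    if np.log(rng.uniform()) < h_cur - h_prop:
        return x_prop, True
    return x, False
```

Setting `n_steps = 1` recovers a velocity-resampling form of MALA; the improvement the paper analyzes comes precisely from taking `n_steps > 1`.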

Implementation Considerations

  • Leapfrog Integrator: Implementing Metropolized HMC involves careful selection of the number of leapfrog steps $K$ and the step size $\eta$. The choice of these parameters significantly affects the mixing time and, consequently, the computational efficiency.
  • Parameter Tuning: For practical applications, it is essential to start with an adequately warm initial distribution to minimize warm-up time. The optimal settings depend on the smoothness and other differential properties of the target distribution's log-density.
  • Computational Complexity: Computational resources should be allocated with the target distribution's dimensionality in mind. The $\tilde O(d^{1/4})$ scaling achieved for HMC in this framework is more favorable than the $\tilde O(d^{1/2})$ scaling for MALA, under specific regularity conditions.
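As a heavily simplified illustration of these scalings, one could seed the tuning with a step size shrinking like $d^{-1/4}$ and a leapfrog count growing like $d^{1/4}$. The exponents reflect the regime discussed above, but the constants `c_eta` and `c_k` below are arbitrary placeholders, not values from the paper; in practice one would adapt them to a target acceptance rate:

```python
import math

def suggest_hmc_params(d, smoothness=1.0, c_eta=0.5, c_k=1.0):
    """Illustrative starting values for HMC in dimension d: step size
    shrinking like d**-0.25 and more than one leapfrog step per proposal.
    Constants are placeholders; real tuning targets the acceptance rate."""
    eta = c_eta / (math.sqrt(smoothness) * d ** 0.25)
    n_steps = max(2, round(c_k * d ** 0.25))
    return eta, n_steps
```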

Trade-offs and Scalability

  • Resource Requirements: Metropolized HMC can be more resource-intensive than simpler MCMC methods due to the necessity of computing gradients multiple times, especially in high-dimensional settings.
  • Applicability: The method is particularly advantageous when sampling from complex, high-dimensional probabilistic models where traditional methods falter due to random-walk behavior.
  • Limitations: The assumptions of smoothness, Lipschitz continuity, and isoperimetry must be carefully validated for the target applications. These assumptions may restrict the applicability to certain classes of problems.

Practical Implications and Future Directions

The paper outlines several practical implications for Bayesian inference and high-dimensional statistics. Specifically, it suggests parameter settings that can reduce computational costs in practice and provides new insights into understanding the conditions under which HMC is preferred over simpler methods like MALA. The methodology employed may inspire future research into adaptive strategies for HMC, further investigation into non-log-concave distributions, and extending these insights to other sampling algorithms outside the MCMC paradigm.

