Scalability of Metropolis-within-Gibbs schemes for high-dimensional Bayesian models (2403.09416v1)

Published 14 Mar 2024 in stat.CO, math.ST, stat.ML, and stat.TH

Abstract: We study general coordinate-wise MCMC schemes (such as Metropolis-within-Gibbs samplers), which are commonly used to fit Bayesian non-conjugate hierarchical models. We relate their convergence properties to the ones of the corresponding (potentially not implementable) Gibbs sampler through the notion of conditional conductance. This allows us to study the performances of popular Metropolis-within-Gibbs schemes for non-conjugate hierarchical models, in high-dimensional regimes where both number of datapoints and parameters increase. Given random data-generating assumptions, we establish dimension-free convergence results, which are in close accordance with numerical evidences. Applications to Bayesian models for binary regression with unknown hyperparameters and discretely observed diffusions are also discussed. Motivated by such statistical applications, auxiliary results of independent interest on approximate conductances and perturbation of Markov operators are provided.

Summary

  • The paper establishes dimension-free convergence of MwG schemes in high-dimensional Bayesian models using conditional conductance analysis.
  • It derives robust theoretical bounds that link MwG performance to Gibbs samplers in non-conjugate hierarchical settings.
  • Applications to Bayesian binary regression and discretely observed diffusions confirm the practical efficiency of coordinate-wise MCMC methods.

Scalability of Metropolis-within-Gibbs Schemes for High-Dimensional Bayesian Models

Abstract Overview

The paper "Scalability of Metropolis-within-Gibbs schemes for high-dimensional Bayesian models" investigates coordinate-wise Markov Chain Monte Carlo (MCMC) schemes, specifically Metropolis-within-Gibbs (MwG) samplers used for Bayesian non-conjugate hierarchical models. The authors analyze how the convergence properties of these MwG schemes relate to those of the corresponding Gibbs samplers (GS) via the notion of conditional conductance. The work establishes dimension-free convergence results under high-dimensional conditions and discusses applications to Bayesian binary regression models with unknown hyperparameters and discretely observed diffusions.

Introduction and Motivating Example

Coordinate-wise MCMC methods, such as GS and MwG, play a pivotal role in Bayesian inference for complex structured data models. These models can include spatial, hierarchical, or temporal dependencies. While GS are well-understood theoretically, their application is limited to cases with specialized conditional conjugacy properties. Consequently, general coordinate-wise samplers like MwG are essential but remain comparatively under-studied.

The paper's motivating example is a hierarchical logistic model, used to illustrate how empirical performance matches the theoretical predictions. In high-dimensional regimes where the numbers of datapoints and parameters grow simultaneously, coordinate-wise schemes retain strong mixing performance.
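
Since the sampler is only described at a high level here, the following is a minimal, hypothetical Python sketch of the coordinate-wise structure being analysed: a toy hierarchical logistic model $y_{ij} \sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(\theta_i))$, $\theta_i \sim N(\mu, \tau^2)$, with each non-conjugate $\theta_i$ updated by a random-walk Metropolis step inside an otherwise exact Gibbs scan over $\mu$. The model, the flat prior on $\mu$, the fixed $\tau$ and the step size are illustrative assumptions, not the paper's specification.

```python
# Minimal, hypothetical Metropolis-within-Gibbs sketch for a toy hierarchical
# logistic model: y_ij ~ Bernoulli(sigmoid(theta_i)), theta_i ~ N(mu, tau^2),
# flat prior on mu, tau fixed. Illustrative only; not the paper's model or tuning.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_cond_theta(theta_i, y_i, mu, tau):
    """Log full conditional of theta_i, up to an additive constant."""
    loglik = np.sum(y_i * theta_i - np.log1p(np.exp(theta_i)))
    logprior = -0.5 * (theta_i - mu) ** 2 / tau ** 2
    return loglik + logprior

def mwg_sampler(y, n_iter=2000, tau=1.0, step=0.5):
    """y: (J, m) array of binary observations, one row per group."""
    J = y.shape[0]
    theta, mu = np.zeros(J), 0.0
    samples = np.empty((n_iter, J + 1))
    for t in range(n_iter):
        # Non-conjugate coordinates: one random-walk Metropolis step each,
        # targeting the full conditional of theta_i given (mu, y_i).
        for i in range(J):
            prop = theta[i] + step * rng.standard_normal()
            log_alpha = (log_cond_theta(prop, y[i], mu, tau)
                         - log_cond_theta(theta[i], y[i], mu, tau))
            if np.log(rng.uniform()) < log_alpha:
                theta[i] = prop
        # Conjugate coordinate: exact Gibbs draw, mu | theta ~ N(mean(theta), tau^2 / J).
        mu = rng.normal(theta.mean(), tau / np.sqrt(J))
        samples[t] = np.concatenate([theta, [mu]])
    return samples

# Toy data: J groups with m binary observations each.
J, m = 50, 10
true_theta = rng.normal(0.0, 1.0, size=J)
y = rng.binomial(1, sigmoid(true_theta), size=(m, J)).T
out = mwg_sampler(y)
print("Posterior mean of mu (after burn-in):", out[500:, -1].mean())
```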

Scalability of Coordinate-wise MCMC

Conditional Conductance and Theoretical Results

Understanding the performance of coordinate-wise MCMC schemes requires relating them to GS through the concept of conditional conductance, which measures how closely an invariant coordinate update, such as a MwG update, approximates an exact draw from the corresponding full conditional distribution. The authors provide bounds on the conductance of generic coordinate-wise schemes, with a particular focus on non-conjugate hierarchical Bayesian models:

$\Phi_s(P) \geq \kappa(P, K)\,\Phi_s(G) - \frac{\pi(K^c)}{s}\left(\frac{1}{d}\sum_{i = 1}^d\kappa_i(P_i, K)\right)$

where $\Phi_s(P)$ denotes the $s$-conductance of the coordinate-wise scheme $P$, $G$ is the corresponding Gibbs sampler, and $\kappa(P, K)$ is the minimum conditional conductance over the set $K$. This provides a powerful framework for evaluating the convergence properties of coordinate-wise schemes relative to GS, with the bound remaining informative even when conditional conductance is only controlled on a subset $K$ of the state space.
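
For reference, the conductance notion appearing in this bound is the standard $s$-conductance in the sense of Lovász and Simonovits; a minimal LaTeX statement of that definition is given below (the paper's precise notation and conventions may differ slightly).

```latex
% Standard s-conductance of a pi-invariant Markov kernel P on state space X
% (Lovasz-Simonovits style definition); the paper's exact notation may differ.
\[
  \Phi_s(P)
  \;=\;
  \inf_{\substack{A \subseteq \mathcal{X} \\ s < \pi(A) \le 1/2}}
  \frac{\int_A P(x, A^c)\,\pi(\mathrm{d}x)}{\pi(A) - s},
  \qquad 0 \le s < \tfrac{1}{2}.
\]
```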

Statistical Applications and Auxiliary Results

Bayesian Hierarchical Models

For Bayesian hierarchical models, the paper applies the developed theory to establish dimension-free convergence properties. The main contributions include identifying conditions under which MwG schemes retain good computational and mixing properties in high-dimensional settings. For instance, the authors prove dimension-free total variation mixing times for MwG schemes targeting hierarchical posteriors of the form

$\pi_J(\mathrm{d}\psi, \mathrm{d}\boldsymbol{\theta}) = \mathcal{L}(\psi, \boldsymbol{\theta} \mid Y_{1:J})$

where $\pi_J$ denotes the posterior distribution given the full dataset $Y_{1:J}$. Under suitable conditions, the results imply that the mixing times remain uniformly bounded as $J$ grows, illustrating the robustness of coordinate-wise samplers in complex models.
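
For clarity, "dimension-free mixing" is meant here in the standard total variation sense; a brief LaTeX statement of the quantity being controlled is sketched below (notation illustrative: $P_J$ denotes a MwG kernel targeting $\pi_J$, and uniformity is over the data size $J$ from suitable initialisations).

```latex
% Total variation mixing time of the kernel P_J targeting pi_J, started at x;
% "dimension-free" means this quantity stays bounded as J grows (notation illustrative).
\[
  t_{\mathrm{mix}}(\varepsilon, x; J)
  \;=\;
  \inf\bigl\{ t \ge 0 : \bigl\| P_J^{\,t}(x, \cdot) - \pi_J \bigr\|_{\mathrm{TV}} \le \varepsilon \bigr\},
  \qquad
  \sup_{J} \, t_{\mathrm{mix}}(\varepsilon, x_J; J) < \infty .
\]
```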

Conductance Bounds for Specific Schemes

The paper also considers specific forms of MwG, including independent Metropolis-Hastings (IMH) and random-walk Metropolis (RWM) updates within Gibbs. For IMH updates, given a uniform bound $M$ on the Radon-Nikodym derivative of the target conditional with respect to the proposal distribution, the authors show that the conductance of MwG relates closely to that of the GS, via the coordinate-wise bound displayed below. Similarly, for RWM within Gibbs, a conductance bound is derived under strong log-concavity assumptions on the target distribution, demonstrating that these schemes are not excessively slowed down in high dimensions.

$\kappa(P_i^{x_{-i}}) \geq \frac{1}{M}$
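
A small, hypothetical numerical check of the mechanism behind the IMH bound (toy discrete target and proposal, not taken from the paper): when $\pi(x)/q(x) \le M$ everywhere, every IMH transition probability dominates $\pi(\cdot)/M$, and this minorisation is what yields a conditional conductance of order $1/M$.

```python
# Toy check that an independent Metropolis-Hastings (IMH) kernel with
# bounded density ratio pi/q <= M satisfies P(x, y) >= pi(y)/M entrywise.
# Purely illustrative; the target, proposal and state space are made up.
import numpy as np

rng = np.random.default_rng(1)

n = 6
pi = rng.dirichlet(np.ones(n))   # toy target on {0, ..., n-1}
q = rng.dirichlet(np.ones(n))    # toy independent proposal
M = np.max(pi / q)               # uniform bound on the density ratio pi/q

# Build the IMH transition matrix P[x, y].
P = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        if y != x:
            acc = min(1.0, (pi[y] * q[x]) / (pi[x] * q[y]))  # MH acceptance
            P[x, y] = q[y] * acc
    P[x, x] = 1.0 - P[x].sum()   # proposing x itself plus all rejected mass

# Minorisation P(x, y) >= pi(y)/M holds entrywise (up to rounding),
# which is the ingredient behind kappa >= 1/M.
print("M =", round(float(M), 3),
      "| minorisation holds:", bool(np.all(P >= pi / M - 1e-12)))
```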

Implications and Future Directions

The findings have substantial practical implications, especially in terms of computational efficiency for high-dimensional Bayesian inference. The use of MwG schemes offers a scalable alternative when GS may not be feasible due to the lack of conditional conjugacy. Moreover, applications such as Bayesian binary regression with unknown hyperparameters and data augmentation schemes for diffusions underline the broad applicability of the developed theory.

The work suggests that future research could explore extending these results to other high-dimensional structured Bayesian models, potentially bridging the gap between theory and practical implementation in MCMC. Moreover, the techniques employed might inspire further studies focusing on mixed update strategies and their theoretical guarantees under various model assumptions.

Conclusion

This paper makes significant strides in understanding the scalability and efficiency of MwG schemes for high-dimensional Bayesian models. By establishing dimension-free convergence properties and providing practical applications, the authors contribute valuable insights that enhance both the theoretical foundations and practical utility of coordinate-wise MCMC methods, paving the way for faster and more reliable Bayesian inference in complex models.