On Cyclical MCMC Sampling (2403.00230v1)
Abstract: Cyclical MCMC is a novel MCMC framework recently proposed by Zhang et al. (2019) to address the challenge posed by high-dimensional multimodal posterior distributions, like those arising in deep learning. The algorithm works by generating a nonhomogeneous Markov chain that tracks -- cyclically in time -- tempered versions of the target distribution. We show in this work that cyclical MCMC converges to the desired probability distribution in settings where the Markov kernels used are fast-mixing and sufficiently long cycles are employed. However, in the far more common setting of slow-mixing kernels, the algorithm may fail to produce samples from the desired distribution. In particular, in a simple mixture example with unequal variances, we show by simulation that cyclical MCMC fails to converge to the desired limit. Finally, we show that cyclical MCMC typically estimates well the local shape of the target distribution around each mode, even when we do not have convergence to the target.
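The sampler described above can be illustrated with a minimal sketch: Langevin dynamics whose stepsize follows the cosine cyclical schedule of Zhang et al. (2019), applied to a one-dimensional two-component Gaussian mixture with unequal variances, mirroring the paper's counterexample. The mixture parameters, function names, and schedule constants below are illustrative assumptions, not the paper's exact experimental setup.

```python
import numpy as np

def log_density_grad(x, mu=(-3.0, 3.0), sigma=(1.0, 0.2), w=(0.5, 0.5)):
    """Gradient of the log-density of a 1D two-component Gaussian mixture
    (illustrative parameters: unequal variances, as in the counterexample)."""
    comps = np.array([wi * np.exp(-0.5 * ((x - mi) / si) ** 2) / si
                      for wi, mi, si in zip(w, mu, sigma)])
    dens = comps.sum()
    grad = sum(c * (mi - x) / si ** 2 for c, mi, si in zip(comps, mu, sigma))
    return grad / dens  # normalizing constants cancel in the ratio

def cyclical_stepsize(k, n_steps, n_cycles, eps0=0.05):
    """Cosine schedule: the stepsize restarts at eps0 at the start of each
    cycle (large steps = exploration) and decays toward 0 (sampling)."""
    steps_per_cycle = n_steps // n_cycles
    frac = (k % steps_per_cycle) / steps_per_cycle
    return eps0 / 2 * (np.cos(np.pi * frac) + 1)

def cyclical_langevin(n_steps=5000, n_cycles=10, seed=0):
    """Unadjusted Langevin chain with a cyclically varying stepsize,
    yielding a nonhomogeneous Markov chain as in the abstract."""
    rng = np.random.default_rng(seed)
    x, samples = 0.0, []
    for k in range(n_steps):
        eps = cyclical_stepsize(k, n_steps, n_cycles)
        x = x + eps * log_density_grad(x) + np.sqrt(2 * eps) * rng.standard_normal()
        samples.append(x)
    return np.array(samples)

samples = cyclical_langevin()
```

A histogram of `samples` against the true mixture density is one way to observe the phenomenon the paper reports: mass near each mode can be locally well shaped even when the relative weights of the modes are wrong.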
- Variational inference: A review for statisticians. arXiv preprint arXiv:1601.00670.
- Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32. ICML'14, JMLR.org.
- Markov chains. Springer International Publishing.
- High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli 25 2854–2882.
- Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine Learning (M. F. Balcan and K. Q. Weinberger, eds.), vol. 48 of Proceedings of Machine Learning Research. PMLR.
- Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Statist. Sci. 13 163–185.
- Geyer, C. (1991). Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface (E. M. Keramidas, ed.). Interface Foundation, Fairfax, 156–163.
- Annealing Markov chain Monte Carlo with applications to ancestral inference. Journal of the American Statistical Association 90 909–920.
- Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Information Processing Systems (J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira and K. Weinberger, eds.), vol. 24. Curran Associates, Inc.
- Hervé, L. (2008). Vitesse de convergence dans le théorème limite central pour des chaînes de Markov fortement ergodiques. Ann. Inst. Henri Poincaré Probab. Stat. 44 280–292.
- Exchange Monte Carlo method and application to spin glass simulations. Journal of the Physical Society of Japan 65 1604–1608.
- Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning 110 457–506.
- Geometric ergodicity of Metropolis algorithms. Stochastic Process. Appl. 85 341–361.
- Exponential bounds and stopping rules for MCMC and general Markov chains. In First International Conference on Performance Evaluation Methodologies and Tools, Pisa, Italy.
- A complete recipe for stochastic gradient MCMC. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. NIPS'15, MIT Press, Cambridge, MA, USA.
- Simulated tempering: A new Monte Carlo scheme. Europhysics Letters 19 451–458.
- Matthews, P. (1993). A slowly mixing Markov chain with implications for Gibbs sampling. Statistics & Probability Letters 17 231–236.
- Markov chains and stochastic stability. 2nd ed. Cambridge University Press, Cambridge.
- Entropic gradient descent algorithms and wide flat minima. In International Conference on Learning Representations. URL https://openreview.net/forum?id=xjXg0bnoDmS
- Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Proceedings of the 2017 Conference on Learning Theory (S. Kale and O. Shamir, eds.), vol. 65 of Proceedings of Machine Learning Research. PMLR.
- Monte Carlo statistical methods. 2nd ed. Springer Texts in Statistics, Springer-Verlag, New York.
- Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 1929–1958.
- Parallel tempering on optimized paths. In Proceedings of the 38th International Conference on Machine Learning (M. Meila and T. Zhang, eds.), vol. 139 of Proceedings of Machine Learning Research. PMLR. URL https://proceedings.mlr.press/v139/syed21a.html
- Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on International Conference on Machine Learning. ICML'11, Omnipress, USA.
- Whiteley, N. (2013). Stability properties of some particle filters. The Annals of Applied Probability 23 2500–2537.
- Sufficient conditions for torpid mixing of parallel and simulated tempering. Electron. J. Probab. 14 780–804. URL https://doi.org/10.1214/EJP.v14-638
- Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions. Ann. Appl. Probab. 19 617–640.
- Cyclical stochastic gradient MCMC for Bayesian deep learning. arXiv preprint arXiv:1902.03932.