MCMC-driven learning (2402.09598v1)

Published 14 Feb 2024 in stat.ML, cs.LG, math.ST, stat.CO, and stat.TH

Abstract: This paper is intended to appear as a chapter for the Handbook of Markov Chain Monte Carlo. The goal of this chapter is to unify various problems at the intersection of Markov chain Monte Carlo (MCMC) and machine learning, including black-box variational inference, adaptive MCMC, normalizing flow construction and transport-assisted MCMC, surrogate-likelihood MCMC, coreset construction for MCMC with big data, Markov chain gradient descent, Markovian score climbing, and more, within one common framework. By doing so, the theory and methods developed for each may be translated and generalized.

Summary

  • The paper unifies a range of methods at the intersection of MCMC and machine learning as Markovian optimization-integration (MOI) problems in computational statistics.
  • It details an iterative framework that interleaves Markov chain sampling with stochastic-approximation parameter updates, together with convergence theory governing learning-rate schedules.
  • It illustrates the framework with practical algorithms, including adapted independence Metropolis-Hastings proposals and transport-map techniques that improve sampling efficiency in high-dimensional spaces.

MCMC-Driven Learning: A Unified Framework for Modern Computational Statistics

Overview

Markov chain Monte Carlo (MCMC) methods are a foundational tool for modern computational statisticians and machine learning practitioners. Historically used for Bayesian posterior inference, MCMC methodology has evolved and integrated with machine learning to address a broader class of Markovian optimization-integration (MOI) problems. MOI problems are tasks in which optimization and integration must be performed simultaneously, typically by tuning parameters to minimize a cost estimated from Markov chain-generated samples. This chapter introduces a comprehensive framework for understanding and addressing these problems, leveraging MCMC alongside advanced machine learning techniques.
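
As a rough schematic (our notation; the chapter's may differ), an MOI problem couples optimization over a parameter $\theta$ with an expectation that can only be estimated by simulating a Markov chain:

$$\min_{\theta} f(\theta), \qquad f(\theta) = \mathbb{E}_{X \sim \pi_{\theta}}\left[g(\theta, X)\right],$$

where exact draws from $\pi_{\theta}$ are unavailable and one instead runs a Markov kernel $K_{\theta}$ that leaves $\pi_{\theta}$ invariant. Adaptive MCMC, black-box variational inference, and Markovian score climbing all instantiate this template with different choices of $g$ and $\pi_{\theta}$.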

MOI Problems and Their Significance

MOI problems are characterized by their dual focus on optimization, typically of parameters within models or algorithms, and on integration, usually representing the inferential or predictive quantities associated with those parameters. This dual focus reflects the blending of inferential statistics with predictive machine learning, and it manifests in applications including adaptive MCMC, variational inference, and normalizing flow construction.

Techniques for Solving MOI Problems

Solving MOI problems effectively requires a robust set of tools and methodologies, detailed as follows:

  1. Framework and Problem Formulation: The MOI framework leverages Markovian dynamics to navigate the parameter space. Iterative optimization steps are informed by samples drawn according to Markov processes, where the target distribution itself may adapt based on those samples; a minimal code sketch of this loop follows the list.
  2. Convergence Theory: A critical aspect of MOI problem-solving is ensuring convergence of the parameters to optimal points. This requires careful consideration of conditions such as the compactness and smoothness of the objective, the stability of the Markov kernels, and appropriate scheduling of the learning rates; the classical step-size conditions are displayed after the list.
  3. Practical Algorithms: Practical approaches to MOI problems vary with the application and desired outcome. Examples include independence Metropolis-Hastings (IMH) kernels for distribution approximation and transport maps that improve sampling efficiency. Algorithms are typically designed to balance exploration of the parameter space with exploitation of known gradients or sufficient statistics.
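
For the learning-rate scheduling in item 2, the canonical admissible schedule (standard in the stochastic approximation literature this chapter builds on) is given by the Robbins-Monro step-size conditions on $\gamma_k$:

$$\sum_{k \ge 1} \gamma_k = \infty, \qquad \sum_{k \ge 1} \gamma_k^2 < \infty,$$

satisfied, for example, by $\gamma_k = c\,k^{-\alpha}$ with $\alpha \in (1/2, 1]$.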

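To make the loop in item 1 concrete, here is a minimal, self-contained sketch (our toy illustration, not code from the chapter) of an MOI iteration: an adaptive random-walk Metropolis sampler that interleaves one MCMC move per iteration with a Robbins-Monro update steering the average acceptance rate toward 0.234.

    import numpy as np

    # Toy MOI loop: adapt the proposal scale of a random-walk Metropolis
    # sampler targeting a standard normal. Each iteration interleaves one
    # MCMC move with one stochastic-approximation update of the tuning
    # parameter. The 0.234 target is the classical high-dimensional RWM
    # optimum; we use it here in one dimension purely for illustration.

    rng = np.random.default_rng(0)

    def log_target(x):
        return -0.5 * x ** 2  # standard normal target, up to a constant

    x, log_scale = 0.0, 0.0   # chain state and log proposal scale
    target_accept = 0.234

    for k in range(1, 50_001):
        # One Metropolis step using the current proposal scale.
        proposal = x + np.exp(log_scale) * rng.normal()
        accept_prob = min(1.0, np.exp(log_target(proposal) - log_target(x)))
        if rng.random() < accept_prob:
            x = proposal
        # Robbins-Monro update; gamma_k = k**(-0.6) satisfies the
        # step-size conditions displayed above.
        log_scale += (accept_prob - target_accept) / k ** 0.6

    print("adapted proposal scale:", np.exp(log_scale))

The same skeleton covers the other MOI instances discussed in the chapter: replace the Metropolis move with any Markov kernel, and the acceptance-rate statistic with a stochastic gradient of the cost of interest.
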
Case Study: MCMC-Driven Distribution Approximation

An instructive case study within the MOI framework is the optimization of IMH proposals by minimizing the forward KL divergence. The task is to identify a parametric family of proposal distributions that approximates the target closely enough to make IMH sampling efficient. Techniques such as parallel tempering can be employed to mitigate the curse of dimensionality in high-dimensional spaces, illustrating how MOI methods adapt classical MCMC strategies to modern computational challenges.
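
A useful identity behind this case study (standard, though the chapter's exact notation may differ): writing $q_{\theta}$ for the proposal family and $\pi$ for the target, the forward KL divergence $\mathrm{KL}(\pi \,\|\, q_{\theta}) = \mathbb{E}_{X \sim \pi}[\log \pi(X) - \log q_{\theta}(X)]$ has gradient

$$\nabla_{\theta}\, \mathrm{KL}(\pi \,\|\, q_{\theta}) = -\,\mathbb{E}_{X \sim \pi}\left[\nabla_{\theta} \log q_{\theta}(X)\right],$$

which can be estimated with samples produced by the Markov chain itself; this is precisely the mechanism exploited by Markovian score climbing.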

Advancements and Extensions

The MOI framework is not static; it continues to evolve as it absorbs findings and methodology from neighboring disciplines. Recent advances include the use of unadjusted Langevin dynamics to speed convergence during burn-in and machine learning models that yield more expressive transport maps. Methods to stabilize learning, such as the introduction of multiple references, or "legs," in tempering paths, further illustrate the ongoing refinement of MOI strategies.
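
For reference (a standard formulation, independent of this chapter), the unadjusted Langevin approach iterates the Euler-Maruyama discretization of the Langevin diffusion,

$$x_{k+1} = x_k + \gamma\, \nabla \log \pi(x_k) + \sqrt{2\gamma}\,\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I),$$

which foregoes the Metropolis correction, trading asymptotic exactness for cheaper and often faster-mixing iterations.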

Conclusion

The MOI framework offers a powerful lens through which to view a wide range of problems at the intersection of MCMC methods and machine learning. By formalizing these problems within a unified theoretical and practical framework, researchers and practitioners can leverage the strengths of both inferential statistics and predictive modeling. This blended approach holds promise for tackling the increasingly complex computational challenges encountered in modern data analysis.