MCMC-driven learning (2402.09598v1)
Published 14 Feb 2024 in stat.ML, cs.LG, math.ST, stat.CO, and stat.TH
Abstract: This paper is intended to appear as a chapter for the Handbook of Markov Chain Monte Carlo. The goal of this chapter is to unify various problems at the intersection of Markov chain Monte Carlo (MCMC) and machine learning (which includes black-box variational inference, adaptive MCMC, normalizing flow construction and transport-assisted MCMC, surrogate-likelihood MCMC, coreset construction for MCMC with big data, Markov chain gradient descent, Markovian score climbing, and more) within one common framework. By doing so, the theory and methods developed for each may be translated and generalized.
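As a concrete illustration of the shared structure behind the methods listed in the abstract, the sketch below interleaves one Metropolis step targeting a fixed unnormalized distribution with one stochastic gradient update of a Gaussian approximation, using the current chain state in place of an i.i.d. sample. This is a minimal sketch of the generic "sample with MCMC, then update parameters" pattern, not the chapter's formal framework; the target density, the Gaussian family, the `log_pi` and `grad_log_q` helpers, and all step sizes are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions throughout): one MCMC move targeting pi,
# followed by one stochastic gradient step on the parameters of a Gaussian
# approximation q_theta, ascending a Monte Carlo estimate of E_pi[log q_theta(x)].
import numpy as np

rng = np.random.default_rng(0)

def log_pi(x):
    # Unnormalized log density of an assumed 1-D target: equal mixture of N(-2, 1) and N(2, 1).
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def grad_log_q(x, mu, log_sigma):
    # Gradient of log N(x; mu, sigma^2) with respect to (mu, log_sigma).
    sigma2 = np.exp(2.0 * log_sigma)
    d_mu = (x - mu) / sigma2
    d_log_sigma = (x - mu) ** 2 / sigma2 - 1.0
    return np.array([d_mu, d_log_sigma])

x = 0.0                       # current Markov chain state
theta = np.array([0.0, 0.0])  # (mu, log_sigma) of the learned approximation
step = 0.01                   # learning rate for the parameter update

for k in range(5000):
    # The "MCMC" half: one random-walk Metropolis step targeting pi.
    prop = x + rng.normal(scale=1.0)
    if np.log(rng.uniform()) < log_pi(prop) - log_pi(x):
        x = prop
    # The "learning" half: a stochastic gradient step driven by the Markovian sample.
    theta = theta + step * grad_log_q(x, theta[0], theta[1])

print("learned mu, sigma:", theta[0], np.exp(theta[1]))
```

Under these assumptions the loop performs stochastic approximation with Markovian noise: the gradient estimate is biased at any finite iteration because consecutive states are correlated, yet the parameters still converge toward the moment-matching Gaussian for the target, which is the same mechanism exploited (in far more general forms) by adaptive MCMC, Markovian score climbing, and Markov chain gradient descent.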