Characterizing Dependence of Samples along the Langevin Dynamics and Algorithms via Contraction of $\Phi$-Mutual Information (2402.17067v3)
Abstract: The mixing time of a Markov chain determines how fast its iterates converge to the stationary distribution; however, it does not control the dependence between samples along the chain. In this paper, we study how fast the samples become approximately independent along popular Markov chains for continuous-space sampling: the Langevin dynamics in continuous time, and the Unadjusted Langevin Algorithm and the Proximal Sampler in discrete time. We measure the dependence between samples via the $\Phi$-mutual information, a broad generalization of the standard mutual information which is equal to $0$ if and only if the samples are independent. We show that along these Markov chains, the $\Phi$-mutual information between the first and the $k$-th iterate decreases to $0$ exponentially fast in $k$ when the target distribution is strongly log-concave. Our proof technique is based on showing that Strong Data Processing Inequalities (SDPIs) hold along the Markov chains. To prove fast mixing of the Markov chains, it suffices to show the SDPI holds for the stationary distribution; in contrast, to prove the contraction of $\Phi$-mutual information, we need the SDPIs to hold along the entire trajectories of the Markov chains. We prove this when the iterates of the Markov chains satisfy the corresponding $\Phi$-Sobolev inequality, which is implied by the strong log-concavity of the target distribution.
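To make the statement concrete, the following is a minimal sketch using the standard definitions of the $\Phi$-divergence and $\Phi$-mutual information; the contraction rate $\rho$ and kernel $K$ below are placeholder symbols, and the precise conditions under which the inequality holds along the trajectory (strong log-concavity of the target, the $\Phi$-Sobolev inequality for the iterates, step-size restrictions) are those stated in the paper. For a convex function $\Phi \colon (0,\infty) \to \mathbb{R}$ with $\Phi(1) = 0$,
\[
D_\Phi(\mu \,\|\, \nu) = \int \Phi\!\left(\frac{d\mu}{d\nu}\right) d\nu,
\qquad
I_\Phi(X; Y) = D_\Phi\bigl(P_{X,Y} \,\|\, P_X \otimes P_Y\bigr) = \mathbb{E}_{x \sim P_X}\bigl[D_\Phi(P_{Y \mid X = x} \,\|\, P_Y)\bigr],
\]
so $\Phi(x) = x \log x$ recovers the standard mutual information, and $I_\Phi(X;Y) = 0$ if and only if $X$ and $Y$ are independent. A strong data processing inequality for the Markov kernel $K$ mapping iterate $k-1$ to iterate $k$ asserts that, for the distributions actually arising along the chain,
\[
D_\Phi(\mu K \,\|\, \nu K) \le (1 - \rho)\, D_\Phi(\mu \,\|\, \nu) \quad \text{for some } \rho \in (0,1].
\]
Since $P_{X_k \mid X_0 = x_0} = P_{X_{k-1} \mid X_0 = x_0} K$ and $P_{X_k} = P_{X_{k-1}} K$, applying the SDPI inside the conditional-expectation form of $I_\Phi$ gives
\[
I_\Phi(X_0; X_k) \le (1 - \rho)\, I_\Phi(X_0; X_{k-1}) \le \cdots \le (1 - \rho)^{k-1}\, I_\Phi(X_0; X_1),
\]
which is the exponential decay of dependence between the first and the $k$-th iterate described above.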