Fit Like You Sample: Sample-Efficient Generalized Score Matching from Fast Mixing Diffusions (2306.09332v3)
Abstract: Score matching is an approach to learning probability distributions parametrized up to a constant of proportionality (e.g., Energy-Based Models). The idea is to fit the score of the distribution rather than the likelihood, thus avoiding the need to evaluate the constant of proportionality. While there is a clear algorithmic benefit, the statistical "cost" can be steep: recent work by Koehler et al. 2022 showed that for distributions with poor isoperimetric properties (a large Poincaré or log-Sobolev constant), score matching is substantially less statistically efficient than maximum likelihood. However, many natural, realistic distributions, e.g., multimodal distributions as simple as a mixture of two Gaussians in one dimension, have a poor Poincaré constant. In this paper, we show a close connection between the mixing time of a broad class of Markov processes with generator $\mathcal{L}$ and an appropriately chosen generalized score matching loss that tries to fit $\frac{\mathcal{O} p}{p}$ for a suitably chosen operator $\mathcal{O}$. This allows us to translate techniques for speeding up Markov chains into constructions of better score matching losses. In particular, "preconditioning" the diffusion translates to an appropriate "preconditioning" of the score loss. Lifting the chain by adding a temperature, as in simulated tempering, can be shown to yield a Gaussian-convolution annealed score matching loss, similar to Song and Ermon, 2019. Moreover, we show that if the distribution being learned is a finite mixture of Gaussians in $d$ dimensions with a shared covariance, the sample complexity of annealed score matching is polynomial in the ambient dimension, the diameter of the means, and the smallest and largest eigenvalues of the covariance, obviating the Poincaré constant-based lower bounds for basic score matching shown in Koehler et al. 2022.
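To make the objects concrete, recall the standard score matching loss of Hyvärinen 2005 and the Gaussian-convolution annealed variant in the style of Song and Ermon, 2019; the noise-level set $S$ and weights $w(\sigma)$ below are illustrative notation rather than the paper's exact choices, and the paper's generalized loss further replaces the score $\frac{\nabla p}{p}$ by $\frac{\mathcal{O} p}{p}$:
$$L_{\mathrm{SM}}(\theta) = \frac{1}{2}\,\mathbb{E}_{x \sim p}\left\|\nabla_x \log p_\theta(x) - \nabla_x \log p(x)\right\|^2 = \mathbb{E}_{x \sim p}\left[\frac{1}{2}\left\|\nabla_x \log p_\theta(x)\right\|^2 + \Delta_x \log p_\theta(x)\right] + C,$$
where $C$ does not depend on $\theta$; the integration-by-parts identity on the right is what makes the loss estimable from samples of $p$ alone, without the normalizing constant. The annealed variant instead fits the scores of the smoothed distributions $p_\sigma = p * \mathcal{N}(0, \sigma^2 I_d)$ across noise levels:
$$L_{\mathrm{ASM}}(\theta) = \sum_{\sigma \in S} w(\sigma)\,\mathbb{E}_{x \sim p_\sigma}\left\|\nabla_x \log p_{\theta,\sigma}(x) - \nabla_x \log p_\sigma(x)\right\|^2.$$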
References:
- Daniel L. Alspach and Harold W. Sorenson. Nonlinear Bayesian estimation using Gaussian sum approximations. IEEE Transactions on Automatic Control, 17(4):439–448, 1972.
- Dominique Bakry and Michel Émery. Diffusions hypercontractives. In Séminaire de Probabilités XIX 1983/84: Proceedings, pages 177–206. Springer, 2006.
- Ainesh Bakshi, Ilias Diakonikolas, He Jia, Daniel M. Kane, Pravesh K. Kothari, and Santosh S. Vempala. Robustly learning mixtures of k arbitrary Gaussians. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 1234–1247, 2022.
- Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017.
- Alessandro Barp, Francois-Xavier Briol, Andrew B. Duncan, Mark Girolami, and Lester Mackey. Minimum Stein discrepancy estimators. Advances in Neural Information Processing Systems, 32, 2019.
- Mario Bebendorf. A note on the Poincaré inequality for convex domains. Zeitschrift für Analysis und ihre Anwendungen, 22(4):751–756, 2003.
- Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 103–112. IEEE, 2010.
- Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
- Yu Cao, Jianfeng Lu, and Lihan Wang. On explicit $L^2$-convergence rate estimate for underdamped Langevin dynamics. Archive for Rational Mechanics and Analysis, 247(5):90, 2023.
- Hong-Bin Chen, Sinho Chewi, and Jonathan Niles-Weed. Dimension-free log-Sobolev inequalities for mixture distributions. Journal of Functional Analysis, 281(11):109236, 2021.
- Hongrui Chen, Holden Lee, and Jianfeng Lu. Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. arXiv preprint arXiv:2211.01916, 2022.
- Zongchen Chen and Santosh S. Vempala. Optimal convergence rate of Hamiltonian Monte Carlo for strongly logconcave distributions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
- G. Constantine and T. Savits. A multivariate Faà di Bruno formula with applications. Transactions of the American Mathematical Society, 348(2):503–520, 1996.
- Sanjoy Dasgupta. Learning mixtures of Gaussians. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039), pages 634–644. IEEE, 1999.
- Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of EM suffice for mixtures of two Gaussians. In Conference on Learning Theory, pages 704–710. PMLR, 2017.
- Persi Diaconis and Daniel Stroock. Geometric bounds for eigenvalues of Markov chains. The Annals of Applied Probability, pages 36–61, 1991.
- Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Score-based generative modeling with critically-damped Langevin diffusion. arXiv preprint arXiv:2112.07068, 2021.
- David J. Earl and Michael W. Deem. Parallel tempering: Theory, applications, and new perspectives. Physical Chemistry Chemical Physics, 7(23):3910–3916, 2005.
- Rong Ge, Holden Lee, and Andrej Risteski. Simulated tempering Langevin Monte Carlo II: An improved proof using soft Markov chain decomposition. arXiv preprint arXiv:1812.00793, 2018.
- Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society Series B: Statistical Methodology, 73(2):123–214, 2011.
- Natalie Grunewald, Felix Otto, Cédric Villani, and Maria G. Westdickenberg. A two-scale approach to logarithmic Sobolev inequalities and the hydrodynamic limit. In Annales de l’IHP Probabilités et statistiques, volume 45, pages 302–351, 2009.
- Björn Holmquist. The d-variate vector Hermite polynomial of order k. Linear Algebra and its Applications, 237:155–190, 1996.
- Samuel B. Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1021–1034, 2018.
- Koji Hukushima and Koji Nemoto. Exchange Monte Carlo method and application to spin glass simulations. Journal of the Physical Society of Japan, 65(6):1604–1608, 1996.
- Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
- Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Score function features for discriminative learning: Matrix and tensor framework. arXiv preprint arXiv:1412.2863, 2014.
- Frederic Koehler, Alexander Heckett, and Andrej Risteski. Statistical efficiency of score matching: The view from isoperimetry. arXiv preprint arXiv:2210.00726, 2022.
- Rong Ge, Holden Lee, and Andrej Risteski. Beyond log-concavity: Provable guarantees for sampling multi-modal distributions using simulated tempering Langevin Monte Carlo. Advances in Neural Information Processing Systems, 31, 2018.
- Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence for score-based generative modeling with polynomial complexity. arXiv preprint arXiv:2206.06227, 2022.
- Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence of score-based generative modeling for general data distributions. In International Conference on Algorithmic Learning Theory, pages 946–985. PMLR, 2023.
- Tony Lelièvre. A general two-scale criteria for logarithmic Sobolev inequalities. Journal of Functional Analysis, 256(7):2211–2221, 2009.
- Chunyuan Li, Changyou Chen, David Carlson, and Lawrence Carin. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
- Siwei Lyu. Interpretation and generalization of score matching. arXiv preprint arXiv:1205.2629, 2012.
- Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. Advances in Neural Information Processing Systems, 28, 2015.
- Neal Madras and Dana Randall. Markov chain decomposition for convergence rate analysis. The Annals of Applied Probability, pages 581–606, 2002.
- Enzo Marinari and Giorgio Parisi. Simulated tempering: A new Monte Carlo scheme. Europhysics Letters, 19(6):451, 1992.
- Chenlin Meng, Yang Song, Wenzhe Li, and Stefano Ermon. Estimating high order gradients of the data distribution by denoising. Advances in Neural Information Processing Systems, 34:25359–25369, 2021.
- Ankur Moitra and Andrej Risteski. Fast convergence for Langevin diffusion with manifold structure. arXiv preprint arXiv:2002.05576, 2020.
- Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 93–102. IEEE, 2010.
- Radford M. Neal. Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6:353–366, 1996.
- Felix Otto and Maria G. Reznikoff. A new criterion for the logarithmic Sobolev inequality and two applications. Journal of Functional Analysis, 243(1):121–157, 2007.
- Chirag Pabbaraju, Dhruv Rohatgi, Anish Sevekari, Holden Lee, Ankur Moitra, and Andrej Risteski. Provable benefits of score matching. arXiv preprint arXiv:2306.01993, 2023.
- L. C. G. Rogers and David Williams. Diffusions, Markov Processes and Martingales: Volume 2, Itô Calculus. Cambridge University Press, 2000.
- Yasumasa Saisho. Stochastic differential equations for multi-dimensional domain with reflecting boundary. Probability Theory and Related Fields, 74(3):455–477, 1987.
- Sanjeev Arora and Ravi Kannan. Learning mixtures of arbitrary Gaussians. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pages 247–257, 2001.
- Umut Şimşekli, Roland Badeau, A. Taylan Cemgil, and Gaël Richard. Stochastic quasi-Newton Langevin Monte Carlo. In International Conference on Machine Learning, pages 642–651. PMLR, 2016.
- Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
- Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
- Robert H. Swendsen and Jian-Sheng Wang. Replica Monte Carlo simulation of spin-glasses. Physical Review Letters, 57(21):2607, 1986.
- Henry Teicher. Identifiability of finite mixtures. The Annals of Mathematical Statistics, pages 1265–1269, 1963.
- Alexis Akira Toda. Operator reverse monotonicity of the inverse. The American Mathematical Monthly, 118(1):82–83, 2011.
- Aad W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
- Dawn B. Woodard, Scott C. Schmidler, and Mark Huber. Sufficient conditions for torpid mixing of parallel and simulated tempering. Electronic Journal of Probability, 2009a.
- Dawn B. Woodard, Scott C. Schmidler, and Mark Huber. Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions. The Annals of Applied Probability, 2009b.
- Sidney J. Yakowitz and John D. Spragins. On the identifiability of finite mixtures. The Annals of Mathematical Statistics, 39(1):209–214, 1968.
- Yun Yang and David B. Dunson. Sequential Markov chain Monte Carlo. arXiv preprint arXiv:1308.3861, 2013.