Efficient Training of Energy-Based Models Using Jarzynski Equality (2305.19414v2)
Abstract: Energy-based models (EBMs) are generative models inspired by statistical physics with a wide range of applications in unsupervised learning. Their performance is best measured by the cross-entropy (CE) of the model distribution relative to the data distribution. Using the CE as the training objective is, however, challenging because computing its gradient with respect to the model parameters requires sampling the model distribution. Here we show how results from nonequilibrium thermodynamics based on the Jarzynski equality, together with tools from sequential Monte Carlo sampling, can be used to perform this computation efficiently and avoid the uncontrolled approximations made by the standard contrastive divergence (CD) algorithm. Specifically, we introduce a modification of the unadjusted Langevin algorithm (ULA) in which each walker acquires a weight that enables the estimation of the gradient of the cross-entropy at any step during gradient descent (GD), thereby bypassing sampling biases induced by slow mixing of ULA. We illustrate these results with numerical experiments on Gaussian mixture distributions as well as the MNIST dataset. We show that the proposed approach outperforms methods based on the contrastive divergence algorithm in all the considered situations.
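To make the weighted-walker idea concrete, the sketch below implements the general scheme the abstract describes: Langevin walkers evolve alongside gradient descent on the parameters, each carrying a Jarzynski-style importance weight so that weighted averages estimate model expectations before the walkers have mixed. This is a minimal illustration, not the paper's exact algorithm: it uses the standard annealed-importance-sampling weight increment, assumes each Langevin step approximately leaves the current model distribution invariant (the paper's update additionally corrects for the ULA discretization and can include sequential-Monte-Carlo resampling), and the toy quadratic energy and all function names (`energy`, `grad_energy_x`, `grad_energy_theta`, `train`) are illustrative assumptions.

```python
# Hedged sketch of Jarzynski/AIS-weighted Langevin walkers for estimating the
# cross-entropy gradient of an EBM during training. Toy quadratic energy;
# the exact weight update of the paper differs (ULA correction, resampling).
import numpy as np

rng = np.random.default_rng(0)

def energy(theta, x):
    # Illustrative energy U_theta(x) = 0.5 * ||x - theta||^2
    return 0.5 * np.sum((x - theta) ** 2, axis=-1)

def grad_energy_x(theta, x):
    return x - theta                      # grad_x U_theta(x)

def grad_energy_theta(theta, x):
    return -(x - theta)                   # grad_theta U_theta(x), shape (n, d)

def train(data, n_walkers=256, n_steps=2000, h=1e-2, lr=1e-2):
    d = data.shape[1]
    theta = np.zeros(d)
    x = rng.standard_normal((n_walkers, d))   # walkers from a simple prior
    logw = np.zeros(n_walkers)                # Jarzynski/AIS log-weights

    for _ in range(n_steps):
        # Self-normalized weighted estimate of E_{rho_theta}[grad_theta U_theta]
        w = np.exp(logw - logw.max())
        w /= w.sum()
        model_term = (w[:, None] * grad_energy_theta(theta, x)).sum(axis=0)
        data_term = grad_energy_theta(theta, data).mean(axis=0)

        # Gradient of the cross-entropy: E_data[grad U] - E_model[grad U]
        grad_ce = data_term - model_term
        theta_new = theta - lr * grad_ce

        # AIS/Jarzynski weight increment for the target change theta -> theta_new,
        # assuming the Langevin kernel below is (approximately) invariant for the
        # current model distribution
        logw += energy(theta, x) - energy(theta_new, x)
        theta = theta_new

        # One unadjusted Langevin step toward the new model distribution
        x = x - h * grad_energy_x(theta, x) \
            + np.sqrt(2 * h) * rng.standard_normal(x.shape)

    return theta

if __name__ == "__main__":
    data = rng.standard_normal((1000, 2)) + np.array([2.0, -1.0])
    print(train(data))   # for this toy energy, theta should approach the data mean
```

The key design point the abstract emphasizes is visible here: the gradient estimate is available at every GD step because the weights, not the walkers' mixing, are responsible for correcting the mismatch between the walker distribution and the current model distribution.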