Training Energy-Based Models with Diffusion Contrastive Divergences (2307.01668v1)
Abstract: Energy-Based Models (EBMs) have been widely used for generative modeling. Contrastive Divergence (CD), a prevailing training objective for EBMs, requires sampling from the EBM with Markov chain Monte Carlo (MCMC) methods, which leads to an irreconcilable trade-off between computational cost and the validity of CD: running MCMC until convergence is computationally intensive, while short-run MCMC introduces an extra, non-negligible parameter-gradient term that is difficult to handle. In this paper, we provide a general interpretation of CD, viewing it as a special instance of our proposed Diffusion Contrastive Divergence (DCD) family. By replacing the Langevin dynamics used in CD with other EBM-parameter-free diffusion processes, we propose a more efficient divergence. We show that the proposed DCDs are both more computationally efficient than CD and free of the troublesome extra gradient term. We conduct extensive experiments, covering synthetic data modeling as well as high-dimensional image denoising and generation, to demonstrate the advantages of the proposed DCDs. On synthetic data learning and image denoising, the proposed DCD outperforms CD by a large margin. In image generation, the proposed DCD is able to train an energy-based model for generating CelebA $32\times 32$ images with quality comparable to existing EBMs.
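For context, the sketch below illustrates the baseline setup the abstract describes: standard CD training where the negative samples come from a short-run Langevin chain. This is a minimal, assumed implementation, not the paper's DCD; the toy energy network, step size, and step count are illustrative placeholders rather than settings taken from the paper.

```python
# Minimal sketch of contrastive divergence (CD) training with a short-run
# Langevin chain -- the baseline the abstract describes, NOT the paper's DCD.
# Network size, step size, and step count are illustrative assumptions.
import torch
import torch.nn as nn

class Energy(nn.Module):
    """Toy scalar energy E(x) for 2-D data."""
    def __init__(self, dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.SiLU(),
                                 nn.Linear(128, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def short_run_langevin(energy, x, n_steps=20, step_size=0.01):
    """Short-run Langevin dynamics targeting p(x) proportional to exp(-E(x)).
    The chain is truncated and detached; this truncation is the source of the
    extra parameter-gradient term mentioned in the abstract."""
    x = x.detach().clone()
    for _ in range(n_steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        x = (x - 0.5 * step_size * grad
             + step_size ** 0.5 * torch.randn_like(x)).detach()
    return x

def cd_loss(energy, data):
    """CD objective: lower energy on data, raise it on short-run MCMC samples."""
    negatives = short_run_langevin(energy, torch.randn_like(data))
    return energy(data).mean() - energy(negatives).mean()

# Usage: a single optimization step on toy 2-D data.
energy = Energy()
optimizer = torch.optim.Adam(energy.parameters(), lr=1e-4)
loss = cd_loss(energy, torch.randn(256, 2))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The DCD family proposed in the paper replaces this Langevin chain with an EBM-parameter-free diffusion process, which is what removes the dependence on the truncated-chain gradient term.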