Learning Unnormalized Statistical Models via Compositional Optimization (2306.07485v1)

Published 13 Jun 2023 in cs.LG and math.OC

Abstract: Learning unnormalized statistical models (e.g., energy-based models) is computationally challenging due to the complexity of handling the partition function. To eschew this complexity, noise-contrastive estimation (NCE) has been proposed, which formulates the objective as the logistic loss of the real data and the artificial noise. However, as found in previous works, NCE may perform poorly in many tasks due to its flat loss landscape and slow convergence. In this paper, we study a direct approach for optimizing the negative log-likelihood of unnormalized models from the perspective of compositional optimization. To tackle the partition function, a noise distribution is introduced such that the log partition function can be written as a compositional function whose inner function can be estimated with stochastic samples. Hence, the objective can be optimized by stochastic compositional optimization algorithms. Despite being a simple method, we demonstrate that it is more favorable than NCE by (1) establishing a fast convergence rate and quantifying its dependence on the noise distribution through the variance of stochastic estimators; (2) developing better results for one-dimensional Gaussian mean estimation by showing our objective has a much more favorable loss landscape and hence our method enjoys faster convergence; (3) demonstrating better performance on multiple applications, including density estimation, out-of-distribution detection, and real image generation.
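
The key computational idea in the abstract can be made concrete: for an unnormalized model p_theta(x) proportional to exp(f_theta(x)) and any noise distribution q, the negative log-likelihood is -E_data[f_theta(x)] + log E_q[exp(f_theta(z)) / q(z)], i.e., an outer log composed with an inner expectation that admits unbiased stochastic estimates. The snippet below is a minimal sketch, not the authors' implementation: it applies this idea to one-dimensional Gaussian mean estimation with a standard normal noise distribution and a simple moving-average estimator of the inner expectation, in the spirit of stochastic compositional optimization. All function names and hyperparameters are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    true_mean = 2.0
    data = rng.normal(true_mean, 1.0, size=10_000)  # samples from the data distribution

    def f(theta, x):
        # Unnormalized log-density of N(theta, 1): f_theta(x) = theta*x - x^2/2
        return theta * x - 0.5 * x ** 2

    def log_q(x):
        # Log-density of the standard normal noise distribution q
        return -0.5 * x ** 2 - 0.5 * np.log(2.0 * np.pi)

    theta = 0.0      # model parameter to estimate
    u = 1.0          # running estimate of the inner expectation E_q[exp(f_theta(z)) / q(z)] = Z(theta)
    beta = 0.1       # moving-average weight for the inner estimator
    lr = 0.02        # step size (illustrative)
    batch, n_noise = 128, 512

    for it in range(3000):
        x = rng.choice(data, size=batch)          # minibatch of real data
        z = rng.normal(0.0, 1.0, size=n_noise)    # minibatch of noise samples

        # Inner function: unbiased importance-weighted estimate of Z(theta)
        w = np.exp(f(theta, z) - log_q(z))
        u = (1.0 - beta) * u + beta * w.mean()

        # Gradient of -E_data[f_theta(x)] + log Z(theta) w.r.t. theta,
        # using d f_theta / d theta = x and the running estimate u in place of Z(theta)
        grad = -x.mean() + (w * z).mean() / u
        theta -= lr * grad

    print(f"estimated mean: {theta:.2f}  (true mean: {true_mean})")

The variance of the importance weights w grows as the noise distribution q moves away from the model distribution; as the abstract notes, the convergence rate of this approach depends on the noise distribution precisely through the variance of such stochastic estimators.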
