A Unified Theory of Exact Inference and Learning in Exponential Family Latent Variable Models (2404.19501v1)
Abstract: Bayes' rule describes how to infer posterior beliefs about latent variables given observations, and inference is a critical step in learning algorithms for latent variable models (LVMs). Although there are exact algorithms for inference and learning in certain LVMs such as linear Gaussian models and mixture models, researchers must typically develop approximate inference and learning algorithms when applying novel LVMs. In this paper we study the line that separates LVMs that rely on approximation schemes from those that do not, and develop a general theory of exponential family latent variable models for which inference and learning may be implemented exactly. First, under mild assumptions about the exponential family form of a given LVM, we derive necessary and sufficient conditions under which the LVM prior is in the same exponential family as its posterior, such that the prior is conjugate to the posterior. We show that all models satisfying these conditions are constrained forms of a particular class of exponential family graphical model. We then derive general inference and learning algorithms, and demonstrate them on a variety of example models. Finally, we show how to compose our models into graphical models that retain tractable inference and learning. In addition to our theoretical work, we have implemented our algorithms in a collection of libraries, which we use to provide numerous demonstrations of our theory and with which researchers may apply our theory in novel statistical settings.
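The central claim, that inference is exact when the prior and posterior lie in the same exponential family, is easiest to see in a familiar special case. Below is a minimal sketch (plain NumPy, not the paper's libraries) of a Gaussian mixture model: the categorical prior over the latent component is conjugate to the posterior, so Bayes' rule amounts to adding the per-component log-likelihoods to the prior's natural (log-probability) parameters and renormalizing. All parameter values are illustrative.

```python
# Minimal sketch of exact inference in an exponential-family LVM:
# a Gaussian mixture model, where p(z | x) is available in closed form.
import numpy as np

rng = np.random.default_rng(0)

# Model: z ~ Categorical(pi), x | z=k ~ Normal(mu[k], sigma[k]^2)
pi = np.array([0.5, 0.3, 0.2])      # prior mixture weights (illustrative)
mu = np.array([-2.0, 0.0, 3.0])     # component means
sigma = np.array([0.5, 1.0, 0.8])   # component standard deviations

def log_likelihood(x):
    """Per-component Gaussian log-densities log p(x | z=k)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def posterior(x):
    """Exact posterior p(z | x): add the log-likelihoods to the prior's
    natural parameters (log pi) and renormalize; no approximation needed."""
    logits = np.log(pi) + log_likelihood(x)
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

x = rng.normal(loc=3.0, scale=0.8)  # an observation near the third component
print("observation:", round(float(x), 3))
print("posterior over z:", np.round(posterior(x), 3))
```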