Approximating mutual information of high-dimensional variables using learned representations (2409.02732v1)
Abstract: Mutual information (MI) is a general measure of statistical dependence with widespread application across the sciences. However, estimating MI between multi-dimensional variables is challenging because the number of samples necessary to converge to an accurate estimate scales unfavorably with dimensionality. In practice, existing techniques can reliably estimate MI in up to tens of dimensions, but fail in higher dimensions, where sufficient sample sizes are infeasible. Here, we explore the idea that underlying low-dimensional structure in high-dimensional data can be exploited to faithfully approximate MI in high-dimensional settings with realistic sample sizes. We develop a method that we call latent MI (LMI) approximation, which applies a nonparametric MI estimator to low-dimensional representations learned by a simple, theoretically motivated model architecture. Using several benchmarks, we show that unlike existing techniques, LMI can approximate MI well for variables with $> 10^3$ dimensions if their dependence structure has low intrinsic dimensionality. Finally, we showcase LMI on two open problems in biology. First, we approximate MI between protein language model (pLM) representations of interacting proteins, and find that pLMs encode non-trivial information about protein-protein interactions. Second, we quantify cell fate information contained in single-cell RNA-seq (scRNA-seq) measurements of hematopoietic stem cells, and find a sharp transition during neutrophil differentiation when fate information captured by scRNA-seq increases dramatically.
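The two-step idea in the abstract (compress each variable to a low-dimensional representation, then apply a nonparametric MI estimator to the representations) can be illustrated with a toy sketch. This is not the paper's model: as a stated assumption, the learned representation is replaced here by a projection onto the leading principal component, and the nonparametric estimator is a plain-Python k-nearest-neighbor (KSG-style) estimator. The synthetic data, the function names (`simulate`, `top_component_scores`, `ksg_mi`), and all parameter values are hypothetical choices made for illustration; the dimensionality is kept at 30 only so that the pure-Python code runs quickly.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer n: psi(n) = -gamma + sum_{k=1}^{n-1} 1/k."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, n))


def simulate(n=400, dim=30, rho=0.8, noise=0.1, seed=0):
    """Toy data (assumption): X and Y each embed one latent Gaussian scalar
    in `dim` dimensions; the two latents have correlation `rho`."""
    rng = random.Random(seed)
    u = [rng.gauss(0, 1) for _ in range(dim)]
    w = [rng.gauss(0, 1) for _ in range(dim)]
    X, Y = [], []
    for _ in range(n):
        zx = rng.gauss(0, 1)
        zy = rho * zx + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        X.append([zx * a + noise * rng.gauss(0, 1) for a in u])
        Y.append([zy * b + noise * rng.gauss(0, 1) for b in w])
    return X, Y


def top_component_scores(rows, n_iter=20):
    """1-D representation: scores on the leading principal axis, found by
    power iteration. A stand-in for a learned encoder, not the paper's model."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        scores = [sum(a * b for a, b in zip(r, v)) for r in centered]
        v = [sum(s * r[j] for s, r in zip(scores, centered)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    return [sum(a * b for a, b in zip(r, v)) for r in centered]


def ksg_mi(xs, ys, k=3):
    """k-nearest-neighbor MI estimate in nats for paired 1-D samples
    (first estimator of Kraskov et al., brute-force O(n^2) neighbor search)."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        dx = [abs(xs[i] - xs[j]) for j in range(n) if j != i]
        dy = [abs(ys[i] - ys[j]) for j in range(n) if j != i]
        # Max-norm distance to the k-th nearest neighbor in the joint space.
        eps = sorted(max(a, b) for a, b in zip(dx, dy))[k - 1]
        nx = sum(1 for a in dx if a < eps)  # marginal neighbors strictly inside eps
        ny = sum(1 for b in dy if b < eps)
        total += digamma_int(nx + 1) + digamma_int(ny + 1)
    return digamma_int(k) + digamma_int(n) - total / n


X, Y = simulate()
estimate = ksg_mi(top_component_scores(X), top_component_scores(Y))
# MI of the two latent Gaussians with correlation 0.8 (the simulate() default).
latent_mi = -0.5 * math.log(1 - 0.8 ** 2)
print(f"estimate: {estimate:.3f} nats, latent MI: {latent_mi:.3f} nats")
```

In this toy setting the dependence between X and Y lives entirely in one latent scalar per variable, so a 1-D representation loses little. A linear projection would not suffice for nonlinear structure, which is where a learned representation of the kind the abstract describes would be needed.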