MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data (2310.02275v1)
Abstract: Discovering genes with similar functions across diverse biomedical contexts is a significant challenge in gene representation learning because of data heterogeneity. In this study, we resolve this problem by introducing the Multimodal Similarity Learning Graph Neural Network (MuSe-GNN), a model that combines Multimodal Machine Learning and Deep Graph Neural Networks to learn gene representations from single-cell sequencing and spatial transcriptomic data. Leveraging 82 training datasets from 10 tissues, three sequencing techniques, and three species, we create informative graph structures for model training and gene representation generation, and we regularize training with weighted similarity learning and contrastive learning to learn cross-data gene-gene relationships. This design yields gene representations that encode functional similarity across different contexts in a joint space. Comprehensive benchmarking shows that our model effectively captures gene function similarity across multiple modalities, outperforming state-of-the-art methods in gene representation learning by up to 97.5%. Moreover, we combine bioinformatics tools with the gene representations to uncover pathway enrichment, causal regulatory networks, and functions of disease-associated or dosage-sensitive genes. Our model therefore efficiently produces unified gene representations for the analysis of gene functions, tissue functions, diseases, and species evolution.
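The abstract names contrastive learning as one of the regularizers used to learn cross-data gene-gene relationships, but does not give its form. The following is a minimal sketch of a generic InfoNCE-style contrastive loss, assuming that row i of two embedding matrices represents the same gene observed in two datasets. The function name `info_nce`, the temperature value, and the toy data are illustrative assumptions and are not taken from the paper; the actual MuSe-GNN objective may differ.

```python
import numpy as np

def info_nce(emb_a, emb_b, temperature=0.1):
    """Generic InfoNCE loss (illustrative, not the paper's exact objective).

    Row i of emb_a and row i of emb_b are treated as a positive pair
    (the same gene in two datasets); all other rows act as negatives.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average negative log-probability assigned to the matching gene.
    return float(-np.mean(np.diag(log_prob)))

# Toy data (hypothetical): 8 genes with 16-dimensional embeddings.
rng = np.random.default_rng(0)
genes_a = rng.normal(size=(8, 16))
aligned_b = genes_a + 0.01 * rng.normal(size=(8, 16))  # same genes, small noise
shuffled_b = aligned_b[::-1]                           # pairing deliberately broken

loss_aligned = info_nce(genes_a, aligned_b)
loss_shuffled = info_nce(genes_a, shuffled_b)
print(loss_aligned, loss_shuffled)
```

In this toy setting the loss is lower when matching genes have nearby embeddings than when the pairing is broken, which is the behaviour a contrastive regularizer is meant to encourage in a joint embedding space.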