Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity (2405.05998v2)
Abstract: Leveraging the vast genetic diversity within microbiomes offers unparalleled insights into complex phenotypes, yet the task of accurately predicting and understanding such traits from genomic data remains challenging. We propose a framework that takes advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences. Based on our model, we develop attribution techniques to elucidate gene interaction effects that drive microbial adaptation to diverse environments. We train and validate our approach on a large dataset of high-quality microbiome genomes from different habitats. We demonstrate not only solid predictive performance but also how sequence-level information from entire genomes allows us to identify gene associations underlying complex phenotypes. Our attribution recovers known important interaction networks and proposes new candidates for experimental follow-up.