PhyloGFN: Phylogenetic inference with generative flow networks (2310.08774v2)
Abstract: Phylogenetics is a branch of computational biology that studies the evolutionary relationships among biological entities. Its long history and numerous applications notwithstanding, inference of phylogenetic trees from sequence data remains challenging: the high complexity of tree space poses a significant obstacle for the current combinatorial and probabilistic techniques. In this paper, we adopt the framework of generative flow networks (GFlowNets) to tackle two core problems in phylogenetics: parsimony-based and Bayesian phylogenetic inference. Because GFlowNets are well-suited for sampling complex combinatorial structures, they are a natural choice for exploring and sampling from the multimodal posterior distribution over tree topologies and evolutionary distances. We demonstrate that our amortized posterior sampler, PhyloGFN, produces diverse and high-quality evolutionary hypotheses on real benchmark datasets. PhyloGFN is competitive with prior works in marginal likelihood estimation and achieves a closer fit to the target distribution than state-of-the-art variational inference methods. Our code is available at https://github.com/zmy1116/phylogfn.
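To make the parsimony objective mentioned in the abstract concrete, the following is a minimal sketch of how a parsimony score can be computed for one fixed tree using Fitch's algorithm, a standard textbook method. It is not taken from the PhyloGFN code base: the nested-tuple tree encoding, the function names `fitch_site` and `parsimony_score`, and the toy sequences are all assumptions made for illustration. The abstract does not specify how PhyloGFN scores or constructs trees, so this only shows the kind of quantity that parsimony-based inference seeks to minimize.

```python
from typing import Dict, Set, Tuple, Union

# Hypothetical encoding: a rooted binary tree as nested 2-tuples,
# with leaves given by species names (strings).
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_site(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Fitch's bottom-up pass for a single alignment column.

    Returns the candidate state set at the root of `tree` and the
    minimum number of state changes needed within the subtree.
    """
    if isinstance(tree, str):
        # Leaf: the observed character, zero changes.
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # Children can agree on a state: no extra change.
        return common, left_cost + right_cost
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree: Tree, seqs: Dict[str, str]) -> int:
    """Sum the per-site Fitch costs over all alignment columns."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(length)
    )


# Toy aligned sequences (made up for illustration only).
seqs = {"A": "ACGT", "B": "ACGA", "C": "TCGA", "D": "TCTA"}
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))

score1 = parsimony_score(tree1, seqs)
score2 = parsimony_score(tree2, seqs)
```

In this toy example `tree1` needs fewer changes than `tree2`, so it is the more parsimonious hypothesis. Finding the best tree over all topologies is the hard combinatorial part that motivates a learned sampler.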