PhyloGFN: Phylogenetic inference with generative flow networks (2310.08774v2)

Published 12 Oct 2023 in q-bio.PE, cs.LG, and stat.ML

Abstract: Phylogenetics is a branch of computational biology that studies the evolutionary relationships among biological entities. Its long history and numerous applications notwithstanding, inference of phylogenetic trees from sequence data remains challenging: the high complexity of tree space poses a significant obstacle for the current combinatorial and probabilistic techniques. In this paper, we adopt the framework of generative flow networks (GFlowNets) to tackle two core problems in phylogenetics: parsimony-based and Bayesian phylogenetic inference. Because GFlowNets are well-suited for sampling complex combinatorial structures, they are a natural choice for exploring and sampling from the multimodal posterior distribution over tree topologies and evolutionary distances. We demonstrate that our amortized posterior sampler, PhyloGFN, produces diverse and high-quality evolutionary hypotheses on real benchmark datasets. PhyloGFN is competitive with prior works in marginal likelihood estimation and achieves a closer fit to the target distribution than state-of-the-art variational inference methods. Our code is available at https://github.com/zmy1116/phylogfn.
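To make the parsimony objective mentioned in the abstract concrete, here is a minimal sketch of how a parsimony score can be computed for a fixed tree topology. It uses the classic Fitch algorithm on a rooted binary tree, which is an assumption chosen for illustration: the abstract does not say how PhyloGFN scores trees. The nested-tuple tree encoding, the function name `fitch_score`, and the toy sequences are all hypothetical.

```python
# Illustrative only: Fitch parsimony scoring on a rooted binary tree.
# The tree encoding and the toy data are assumptions, not taken from the paper.

def fitch_score(tree, sequences):
    """Return the minimum number of substitutions needed to explain the
    aligned sequences on the given tree (the Fitch parsimony score).

    tree: a leaf name (str) or a (left, right) tuple of subtrees.
    sequences: dict mapping each leaf name to an aligned sequence.
    """
    n_sites = len(next(iter(sequences.values())))
    total = 0
    for site in range(n_sites):
        def visit(node):
            # Leaf: the candidate state set is the observed character.
            if isinstance(node, str):
                return {sequences[node][site]}, 0
            (left_set, left_cost) = visit(node[0])
            (right_set, right_cost) = visit(node[1])
            common = left_set & right_set
            if common:
                # Children agree on at least one state: no extra substitution.
                return common, left_cost + right_cost
            # Children disagree: take the union and count one substitution.
            return left_set | right_set, left_cost + right_cost + 1

        _, cost = visit(tree)
        total += cost
    return total


# Toy alignment with four taxa and three sites.
seqs = {"A": "AAG", "B": "AAG", "C": "ACG", "D": "ACT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(fitch_score(tree_1, seqs))  # lower score: more parsimonious topology
print(fitch_score(tree_2, seqs))
```

A parsimony-based sampler would favor topologies with lower scores, such as `tree_1` here. Searching over all topologies is the hard part, since the number of possible trees grows super-exponentially with the number of taxa.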
