Exploring gene content with pangene graphs (2402.16185v3)
Abstract: Motivation: The gene content regulates the biology of an organism. It varies between species and between individuals of the same species. Although tools have been developed to identify gene content changes in bacterial genomes, none is applicable to collections of large eukaryotic genomes such as the human pangenome. Results: We developed pangene, a computational tool to identify gene orientation, gene order and gene copy-number changes in a collection of genomes. Pangene aligns a set of input protein sequences to the genomes, resolves redundancies between protein sequences and constructs a gene graph with each genome represented as a walk in the graph. It additionally finds subgraphs, which we call bibubbles, that capture gene content changes. Applied to the human pangenome, pangene identifies known gene-level variations and reveals complex haplotypes that have not been well studied before. Pangene also works with high-quality bacterial pangenomes and reports numbers of core and accessory genes similar to those from existing tools. Availability and implementation: Source code is available at https://github.com/lh3/pangene; pre-built pangene graphs can be downloaded from https://zenodo.org/records/8118576 and visualized at https://pangene.bioinweb.org.
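To make the idea of "a gene graph with each genome represented as a walk" concrete, the following is a minimal sketch in Python. It is not pangene's implementation or file format: the toy walks, the gene names, and the helpers `flip`, `canonical_arc` and `build_gene_graph` are all hypothetical, and "core" is assumed here to mean simply "present in every genome". The sketch treats each genome as a list of oriented genes, merges an arc with its reverse complement so that an inverted segment reuses the same edge, and records per-genome copy numbers.

```python
from collections import Counter, defaultdict

# Hypothetical toy input: each genome is a walk of (gene, strand) pairs.
walks = {
    "genomeA": [("g1", "+"), ("g2", "+"), ("g3", "+"), ("g4", "+")],
    # g2-g3 segment inverted relative to genomeA
    "genomeB": [("g1", "+"), ("g3", "-"), ("g2", "-"), ("g4", "+")],
    # g2 duplicated, g3 absent
    "genomeC": [("g1", "+"), ("g2", "+"), ("g2", "+"), ("g4", "+")],
}


def flip(strand):
    """Return the opposite strand."""
    return "-" if strand == "+" else "+"


def canonical_arc(a, b):
    """Pick one representative for an arc and its reverse complement.

    In a bidirected graph, a->b and rev(b)->rev(a) describe the same edge,
    so we keep the lexicographically smaller of the two.
    """
    rc = ((b[0], flip(b[1])), (a[0], flip(a[1])))
    return min((a, b), rc)


def build_gene_graph(walks):
    """Collect arcs with supporting genomes and per-genome gene copy numbers."""
    arc_support = defaultdict(set)       # arc -> genomes whose walk uses it
    copy_number = defaultdict(Counter)   # gene -> genome -> number of copies
    for genome, walk in walks.items():
        for gene, _strand in walk:
            copy_number[gene][genome] += 1
        for a, b in zip(walk, walk[1:]):
            arc_support[canonical_arc(a, b)].add(genome)
    return arc_support, copy_number


arc_support, copy_number = build_gene_graph(walks)

# Assumed definition for illustration: core = present in all genomes.
core = sorted(g for g, c in copy_number.items() if len(c) == len(walks))
accessory = sorted(g for g, c in copy_number.items() if len(c) < len(walks))

print("core:", core)
print("accessory:", accessory)
print("g2 copies in genomeC:", copy_number["g2"]["genomeC"])
```

In this toy example the inversion in `genomeB` shows up as new arcs into and out of the `g2`/`g3` segment, while the internal `g2`-`g3` arc is shared with `genomeA`. Finding bibubbles, as the abstract describes, would be a further step on top of such a graph and is not attempted here.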