
Exploring gene content with pangene graphs (2402.16185v3)

Published 25 Feb 2024 in q-bio.GN

Abstract: Motivation: The gene content regulates the biology of an organism. It varies between species and between individuals of the same species. Although tools have been developed to identify gene content changes in bacterial genomes, none is applicable to collections of large eukaryotic genomes such as the human pangenome. Results: We developed pangene, a computational tool to identify gene orientation, gene order and gene copy-number changes in a collection of genomes. Pangene aligns a set of input protein sequences to the genomes, resolves redundancies between protein sequences and constructs a gene graph with each genome represented as a walk in the graph. It additionally finds subgraphs, which we call bibubbles, that capture gene content changes. Applied to the human pangenome, pangene identifies known gene-level variations and reveals complex haplotypes that are not well studied before. Pangene also works with high-quality bacterial pangenome and reports similar numbers of core and accessory genes in comparison to existing tools. Availability and implementation: Source code at https://github.com/lh3/pangene; pre-built pangene graphs can be downloaded from https://zenodo.org/records/8118576 and visualized at https://pangene.bioinweb.org


Summary

  • The paper introduces pangene, a novel computational method that uses bidirected graphs to capture variations in gene orientation, order, and copy number.
  • The paper applies pangene to the human pangenome, accurately identifying polymorphic regions and clinically relevant gene variations.
  • The paper demonstrates pangene’s flexibility by also achieving robust results in bacterial genomes, matching outcomes from established tools.

Exploring Gene Content with Pangene Graphs

The paper presents pangene, a computational tool for identifying changes in gene orientation, gene order, and gene copy number across a collection of genomes. Tools for detecting gene content changes already exist for bacterial genomes, but none has been applicable to collections of large eukaryotic genomes such as the human pangenome. The paper argues that such a tool is timely given the resolution and assembly quality that recent sequencing technologies provide.

Methodology and Key Features

Pangene aligns a set of input protein sequences to each genome, resolves redundancies between the protein sequences, and constructs a gene graph in which each genome is represented as a walk. It also finds subgraphs, called "bibubbles," that capture local gene content changes such as orientation and copy-number differences. Applied to the human pangenome, this framework identifies known gene-level variations and reveals complex haplotypes that had not been well studied.
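To make the "genome as a walk" idea concrete, here is a minimal sketch in Python. It is not pangene's implementation: the sample names, gene names, and the helper functions `build_gene_graph` and `copy_numbers` are hypothetical, and the protein alignment and redundancy-resolution steps are assumed to have already produced an ordered list of oriented genes per genome.

```python
from collections import Counter, defaultdict

# Hypothetical toy input: each genome is a walk, i.e. an ordered list of
# (gene, strand) pairs. Names and strands are made up for illustration.
walks = {
    "sampleA": [("g1", "+"), ("g2", "+"), ("g3", "+")],
    "sampleB": [("g1", "+"), ("g2", "+"), ("g2", "+"), ("g3", "+")],  # g2 duplicated
    "sampleC": [("g1", "+"), ("g3", "+")],                            # g2 absent
}

def build_gene_graph(walks):
    """Return (nodes, arcs). Nodes are gene names; arcs map each pair of
    consecutive oriented genes to the set of genomes that traverse it."""
    nodes = set()
    arcs = defaultdict(set)
    for genome, walk in walks.items():
        nodes.update(gene for gene, _ in walk)
        for left, right in zip(walk, walk[1:]):
            arcs[(left, right)].add(genome)
    return nodes, dict(arcs)

def copy_numbers(walks):
    """Count how many times each gene occurs in each genome's walk."""
    return {genome: Counter(gene for gene, _ in walk)
            for genome, walk in walks.items()}

nodes, arcs = build_gene_graph(walks)
cn = copy_numbers(walks)
```

In this toy data, the arc from `g1+` to `g3+` is traversed only by `sampleC` (where `g2` is missing), and `cn["sampleB"]["g2"]` is 2, so both a presence/absence change and a copy-number change are visible directly from the walks.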

A distinctive aspect of pangene is its use of a bidirected graph rather than a plain directed graph, which allows gene orientation, and therefore events such as inversions and segmental duplications, to be represented naturally. Protein-to-genome alignment is performed with miniprot, which helps make the approach robust to sequencing errors. Pangene also works with high-quality bacterial pangenomes, where it reports numbers of core and accessory genes similar to those from established tools.
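The following Python sketch illustrates why a bidirected representation suits inversions. It is an illustration under stated assumptions, not pangene's code: the walks are invented, and `flip`, `canonical_arc`, `arc_set`, and `reverse_walk` are hypothetical helpers. The key property is that an adjacency from `a` to `b` is the same as the adjacency from flipped `b` to flipped `a`, so reading a haplotype from the opposite strand yields the same set of arcs.

```python
FLIP = {"+": "-", "-": "+"}

def flip(oriented_gene):
    """Return the same gene on the opposite strand."""
    gene, strand = oriented_gene
    return (gene, FLIP[strand])

def canonical_arc(left, right):
    """In a bidirected graph, left->right equals flip(right)->flip(left).
    Use the smaller of the two tuples as a canonical key."""
    forward = (left, right)
    backward = (flip(right), flip(left))
    return min(forward, backward)

def arc_set(walk):
    """Set of canonical arcs between consecutive oriented genes in a walk."""
    return {canonical_arc(a, b) for a, b in zip(walk, walk[1:])}

def reverse_walk(walk):
    """The same haplotype read from the opposite strand."""
    return [flip(g) for g in reversed(walk)]

# Hypothetical reference-like walk and a walk with the g2-g3 segment inverted.
ref = [("g1", "+"), ("g2", "+"), ("g3", "+"), ("g4", "+")]
inv = [("g1", "+"), ("g3", "-"), ("g2", "-"), ("g4", "+")]

strand_invariant = arc_set(ref) == arc_set(reverse_walk(ref))
inversion_arcs = arc_set(inv) - arc_set(ref)
```

Here `strand_invariant` is `True`, and `inversion_arcs` contains exactly the two breakpoint adjacencies (`g1+` to `g3-` and `g2-` to `g4+`). The internal `g2`/`g3` adjacency is shared with the reference walk because it is the same arc traversed in the opposite direction.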

Results and Implications

When applied to assemblies from the Human Pangenome Reference Consortium, pangene identified polymorphic regions and known gene-level variations. These included genes associated with known genomic disorders and traits, which suggests that pangene can identify clinically and evolutionarily relevant variation. On bacterial genomes, it reported numbers of core and accessory genes similar to those from existing tools, demonstrating its flexibility across domains.
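For readers unfamiliar with the core/accessory distinction used in the bacterial comparison, the Python sketch below shows one common way to classify genes from a presence table. The strain and gene names are hypothetical, and the `min_frac` threshold is an assumption for illustration; the source does not state which threshold pangene or the compared tools use.

```python
def classify_genes(presence, min_frac=1.0):
    """presence maps genome name -> set of genes found in that genome.
    A gene is 'core' if it occurs in at least min_frac of the genomes
    (an assumed convention); every other observed gene is 'accessory'."""
    n_genomes = len(presence)
    counts = {}
    for genes in presence.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    core = {g for g, c in counts.items() if c >= min_frac * n_genomes}
    accessory = set(counts) - core
    return core, accessory

# Hypothetical presence table for three strains.
presence = {
    "strain1": {"geneA", "geneB", "geneC"},
    "strain2": {"geneA", "geneB"},
    "strain3": {"geneA", "geneB", "geneC"},
}

core, accessory = classify_genes(presence)
relaxed_core, relaxed_accessory = classify_genes(presence, min_frac=0.6)
```

With the strict default, `geneA` and `geneB` are core and `geneC` is accessory. Relaxing `min_frac` to 0.6 moves `geneC` into the core set, which shows how reported counts depend on the chosen threshold.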

Pangene's implications are both conceptual and practical. Conceptually, it shows how gene-level variation between individuals and populations can be studied with graph-based representations rather than a single linear reference. Practically, it offers a scalable tool for investigating gene content, with potential applications in personalized medicine, studies of evolution and biodiversity, and antimicrobial resistance research.

Challenges and Future Directions

While pangene provides a substantial framework, the paper acknowledges challenges, particularly regarding the precise modeling of complex genomic phenomena over evolutionary timescales. The current algorithm relies on specific heuristics, leaving scope for optimization and expansion of applicable domains. Future work might aim at formulating a global optimization problem to enhance pangenomic graph construction, potentially improving precision in capturing structural variants.

Moreover, integrating pangene with broader genomic data sources could support evolutionary analyses across species. The method's adaptability to cross-species datasets will depend heavily on improving input protein sets and alignment strategies to reduce noise in complex assemblies. The identification of generalized bibubbles also remains a topic for further exploration, particularly in larger and more heterogeneous datasets.

Conclusion

Overall, pangene fills a gap in genomic studies by enabling gene-level analysis of content variation across large collections of genomes. It provides both a tool for immediate use in pangenomics and a conceptual framework for further research in genomic variation and evolutionary biology.
