Exploring gene content with pangene graphs (2402.16185v3)

Published 25 Feb 2024 in q-bio.GN

Abstract: Motivation: The gene content regulates the biology of an organism. It varies between species and between individuals of the same species. Although tools have been developed to identify gene content changes in bacterial genomes, none is applicable to collections of large eukaryotic genomes such as the human pangenome. Results: We developed pangene, a computational tool to identify gene orientation, gene order and gene copy-number changes in a collection of genomes. Pangene aligns a set of input protein sequences to the genomes, resolves redundancies between protein sequences and constructs a gene graph with each genome represented as a walk in the graph. It additionally finds subgraphs, which we call bibubbles, that capture gene content changes. Applied to the human pangenome, pangene identifies known gene-level variations and reveals complex haplotypes that are not well studied before. Pangene also works with high-quality bacterial pangenome and reports similar numbers of core and accessory genes in comparison to existing tools. Availability and implementation: Source code at https://github.com/lh3/pangene; pre-built pangene graphs can be downloaded from https://zenodo.org/records/8118576 and visualized at https://pangene.bioinweb.org

Summary

The paper introduces pangene, a novel computational method that uses bidirected graphs to capture variations in gene orientation, order, and copy number.
The paper applies pangene to the human pangenome, accurately identifying polymorphic regions and clinically relevant gene variations.
The paper demonstrates pangene’s flexibility by also achieving robust results in bacterial genomes, matching outcomes from established tools.

Exploring Gene Content with Pangene Graphs

The paper presents "pangene," a computational method for identifying variations in gene orientation, order, and copy number across collections of large eukaryotic genomes. Tools for assessing gene content changes are well established for bacterial genomes, but comparable tools for collections of eukaryotic genomes such as the human pangenome have been lacking. The paper argues that such a tool is timely given the improved resolution and assembly quality offered by recent advances in sequencing technologies.

Methodology and Key Features

Pangene aligns a set of input protein sequences to each genome, resolves redundancies between the protein sequences, and constructs a gene graph from the result. Each genome is represented as a walk through this graph, so differences in gene order, orientation, and copy number show up as walks that diverge. Pangene also identifies subgraphs called "bibubbles," which capture local gene content changes, including orientation and copy-number differences. Applied to the human pangenome, this framework recovers known gene-level variations and reveals complex haplotypes that had not been well studied.
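The idea of a gene graph with genomes as walks can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the haplotype names, the gene labels, the (gene, strand) encoding, and the helper functions are hypothetical and are not pangene's data structures or its bibubble-finding algorithm. The sketch records which genomes support each gene adjacency and groups genomes by the sub-walk they carry between two flanking genes, which is the kind of local variation a bibubble is meant to capture.

```python
from collections import defaultdict

# Hypothetical toy input: each genome is an ordered list of (gene, strand)
# pairs, as might be derived from protein-to-genome alignments.
genomes = {
    "hapA": [("G1", "+"), ("G2", "+"), ("G3", "+")],
    "hapB": [("G1", "+"), ("G2", "+"), ("G2", "+"), ("G3", "+")],  # G2 duplicated
    "hapC": [("G1", "+"), ("G3", "+")],                            # G2 absent
}

def build_gene_graph(genomes):
    """Return (nodes, arcs): the genes seen, and the genomes supporting each adjacency."""
    nodes = set()
    arcs = defaultdict(set)
    for name, walk in genomes.items():
        nodes.update(gene for gene, _ in walk)
        # Consecutive genes on a walk contribute one arc to the graph.
        for a, b in zip(walk, walk[1:]):
            arcs[(a, b)].add(name)
    return nodes, dict(arcs)

def copy_number(genomes, gene):
    """Count how many times a gene occurs on each genome's walk."""
    return {name: sum(1 for g, _ in walk if g == gene)
            for name, walk in genomes.items()}

def alleles_between(genomes, left, right):
    """Group genomes by the sub-walk found between two flanking genes."""
    groups = defaultdict(list)
    for name, walk in genomes.items():
        genes = [g for g, _ in walk]
        if left in genes and right in genes:
            i, j = genes.index(left), genes.index(right)
            if i < j:
                groups[tuple(walk[i + 1:j])].append(name)
    return dict(groups)

nodes, arcs = build_gene_graph(genomes)
print(sorted(nodes))
print(copy_number(genomes, "G2"))
print(alleles_between(genomes, "G1", "G3"))
```

In this toy example the three haplotypes carry three different sub-walks between G1 and G3 (one copy of G2, two copies, or none), so a copy-number difference appears as divergence between walks rather than as a separate annotation.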

A distinctive aspect of pangene is its use of bidirected graphs rather than the directed graphs used in many genomic analyses, which allows a natural representation of complexities such as inversions and segmental duplications. Protein-to-genome alignment is performed with miniprot, which provides robustness to sequencing errors. Pangene also works with bacterial genomes, where it reports numbers of core and accessory genes comparable to those from established tools.
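Why a bidirected graph suits inversions can be shown with a small Python sketch. It assumes a simple encoding that is not taken from the paper: a vertex is a (gene, strand) pair, and an arc from a to b is treated as identical to the arc from flipped b to flipped a, since reading the same DNA on the opposite strand reverses the order and flips every orientation. The function names and toy walks are hypothetical.

```python
def flip(vertex):
    """Return the same gene on the opposite strand."""
    gene, strand = vertex
    return (gene, "-" if strand == "+" else "+")

def canonical_arc(a, b):
    """Pick one representative for an arc and its reverse-complement twin."""
    forward = (a, b)
    backward = (flip(b), flip(a))
    return min(forward, backward)

def arcs_of(walk):
    """Set of canonical arcs traversed by a walk."""
    return {canonical_arc(a, b) for a, b in zip(walk, walk[1:])}

def reverse_complement(walk):
    """The same haplotype read from the opposite strand."""
    return [flip(v) for v in reversed(walk)]

reference = [("G1", "+"), ("G2", "+"), ("G3", "+")]
inverted = [("G1", "+"), ("G2", "-"), ("G3", "+")]  # G2 inverted in place

# Reading a haplotype from either strand yields the same arcs.
print(arcs_of(reference) == arcs_of(reverse_complement(reference)))

# An inversion of G2 changes the arcs, so it is visible in the graph.
print(arcs_of(inverted) - arcs_of(reference))
```

With this canonical form, an assembly reported on the reverse strand adds no new arcs, while a true inversion of a single gene does. A plain directed graph over gene names alone would not distinguish the two cases.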

Results and Implications

When applied to human assemblies from the Human Pangenome Reference Consortium (HPRC), pangene identified polymorphic regions and gene-level variations, including ones associated with known genomic disorders and traits. This suggests that the tool can surface clinically and evolutionarily relevant variation. On high-quality bacterial pangenomes, it reported numbers of core and accessory genes similar to those from existing tools, which indicates that the approach carries over across domains.
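The core versus accessory distinction used in the bacterial comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in the paper: the strain names and gene sets are invented, and the threshold parameter min_fraction is hypothetical. Here a gene is "core" if it is present in at least that fraction of genomes (all of them by default) and "accessory" otherwise.

```python
def classify_genes(presence, min_fraction=1.0):
    """Split genes into core and accessory sets.

    presence: dict mapping genome name -> iterable of gene names it contains.
    min_fraction: assumed threshold; a gene is core if it is found in at
                  least this fraction of genomes.
    """
    n_genomes = len(presence)
    counts = {}
    for genes in presence.values():
        # Use a set so duplicated genes count once per genome.
        for gene in set(genes):
            counts[gene] = counts.get(gene, 0) + 1
    core = {g for g, c in counts.items() if c >= min_fraction * n_genomes}
    accessory = set(counts) - core
    return core, accessory

# Hypothetical presence/absence data for three strains.
presence = {
    "strain1": ["geneA", "geneB", "geneC"],
    "strain2": ["geneA", "geneB", "geneB"],
    "strain3": ["geneA", "geneC", "geneD"],
}

core, accessory = classify_genes(presence)
print(sorted(core), sorted(accessory))
```

Lowering min_fraction relaxes the definition of core, which matters when comparing counts across tools that may apply different thresholds.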

Pangene's implications are both conceptual and practical. Conceptually, it shows how gene-level variation between individuals and populations can be described with a graph in which each genome is a walk, rather than by comparison against a single linear reference. Practically, it offers a scalable tool for investigating gene content, with potential uses in personalized medicine, the study of evolution and biodiversity, and antimicrobial resistance research.

Challenges and Future Directions

While pangene provides a substantial framework, the paper acknowledges challenges, particularly regarding the precise modeling of complex genomic phenomena over evolutionary timescales. The current algorithm relies on specific heuristics, leaving scope for optimization and expansion of applicable domains. Future work might aim at formulating a global optimization problem to enhance pangenomic graph construction, potentially improving precision in capturing structural variants.

Moreover, integrating pangene with broader genomic data sources could help trace gene content changes across species over longer evolutionary timescales. The method's adaptability to cross-species datasets will depend heavily on improving the input protein sets and alignment strategies to reduce noise in complex datasets. Furthermore, the identification of generalized bibubbles remains a topic for further exploration, particularly in larger and more heterogeneous datasets.

Conclusion

Overall, pangene addresses a gap in genomic studies: the lack of tools for analyzing gene content variation across collections of large eukaryotic genomes. It provides both a tool for immediate use in pangenomics and a conceptual framework, built on gene graphs and bibubbles, for further research in genomic variation and evolutionary biology.