DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies (1410.2801v4)

Published 10 Oct 2014 in q-bio.GN

Abstract: (An updated version of this manuscript has been accepted to Scientific Reports in 2016, please refer to http://www.nature.com/articles/srep31900) The highly anticipated transition from next generation sequencing (NGS) to third generation sequencing (3GS) has been difficult primarily due to high error rates and excessive sequencing cost. The high error rates make the assembly of long erroneous reads of large genomes challenging because existing software solutions are often overwhelmed by error correction tasks. Here we report a hybrid assembly approach that simultaneously utilizes NGS and 3GS data to address both issues. We gain advantages from three general and basic design principles: (i) Compact representation of the long reads leads to efficient alignments. (ii) Base-level errors can be skipped; structural errors need to be detected and corrected. (iii) Structurally correct 3GS reads are assembled and polished. In our implementation, preassembled NGS contigs are used to derive the compact representation of the long reads, which establishes an algorithmic conversion from a de Bruijn graph to an overlap graph, the two major assembly paradigms. Moreover, since NGS and 3GS data can compensate for each other, our hybrid assembly approach reduces both of their sequencing requirements. Experiments show that our software is able to assemble mammalian-sized genomes orders of magnitude more efficiently in time than existing methods, while saving about half of the sequencing cost.

Citations (254)

Summary

The paper introduces a hybrid assembly method that combines high-accuracy NGS contigs with long 3GS reads to effectively overcome sequencing errors.
It details an innovative algorithm that converts de Bruijn graphs to overlap graphs using lossy compression for efficient error handling and reduced computational burden.
Experimental results demonstrate the method assembles a human genome in 3 CPU days, significantly lowering resource demands compared to traditional approaches.

Overview of DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies

The paper addresses the challenge of assembling large genomes from long, error-prone reads produced by third-generation sequencing (3GS) technologies. 3GS platforms such as PacBio and Oxford Nanopore produce much longer reads than next-generation sequencing (NGS) platforms, but high error rates and sequencing costs have slowed their adoption. The proposed hybrid assembly approach, DBG2OLC, uses 3GS and NGS data together so that each compensates for the other's weaknesses, which makes assembly more efficient.

Key Contributions and Methodological Innovations

Hybrid Assembly Approach: The paper introduces a hybrid method that combines the high accuracy of NGS data with the long reads of 3GS technologies. The errors inherent in 3GS reads are mitigated by anchoring the reads to accurate contigs preassembled from NGS data.
Algorithmic Conversion from de Bruijn Graphs to Overlap Graphs: A central contribution is an algorithm that converts the de Bruijn graph (DBG) representation typical of NGS assembly into an overlap graph, which is better suited to the longer reads provided by 3GS. The conversion lets the pipeline draw on the advantages of both paradigms.
Efficient Data Compression and Error Handling: The method applies a lossy compression that maps each 3GS read to a list of NGS-derived contig identifiers. This drastically reduces data size and computational burden, so overlaps and alignments can be computed efficiently. Base-level errors are skipped at this stage, while structural errors are detected and corrected.
Computational and Cost Efficiency: Because NGS and 3GS data compensate for each other, the approach needs less sequencing coverage of each, saving about half of the sequencing cost. Experiments demonstrated orders-of-magnitude improvements in computational time for large mammalian genomes compared to existing methods.
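The compression and overlap steps described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not DBG2OLC's actual implementation: the function names, the k-mer size, the `min_hits` threshold, and the sequences are all hypothetical, and read orientation (reverse complements) is ignored for brevity.

```python
from collections import defaultdict


def build_kmer_index(contigs, k):
    """Map each k-mer found in exactly one contig to that contig's id."""
    owners = defaultdict(set)
    for contig_id, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(contig_id)
    # Drop k-mers shared by several contigs: they cannot anchor a read.
    return {kmer: next(iter(ids)) for kmer, ids in owners.items() if len(ids) == 1}


def compress_read(read, index, k, min_hits=2):
    """Replace a long read by the ordered list of contig ids it touches.

    The compression is lossy: bases (and base-level errors) are discarded,
    only the order of sufficiently supported contig ids is kept.
    """
    hits = defaultdict(int)
    first_seen = {}
    for i in range(len(read) - k + 1):
        contig_id = index.get(read[i:i + k])
        if contig_id is not None:
            hits[contig_id] += 1
            first_seen.setdefault(contig_id, i)
    kept = [c for c in hits if hits[c] >= min_hits]
    return sorted(kept, key=first_seen.get)


def suffix_prefix_overlap(a, b):
    """Length of the longest suffix of list a that equals a prefix of list b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


# Hypothetical toy data: three short "NGS contigs" and two "long reads".
contigs = {"c1": "AAACCCGGGT", "c2": "TTGACTGACA", "c3": "GGATCCATCG"}
K = 4
index = build_kmer_index(contigs, K)

read_a = "AAACCTGGGT" + "TTGACTGACA"  # spans c1 (with one substitution) and c2
read_b = "TTGACTGACA" + "GGATCCATCG"  # spans c2 and c3

compressed_a = compress_read(read_a, index, K)
compressed_b = compress_read(read_b, index, K)
shared = suffix_prefix_overlap(compressed_a, compressed_b)
print(compressed_a, compressed_b, shared)
```

In this toy run the substitution in `read_a` removes a few k-mer hits but does not change the compressed list, which is the sense in which base-level errors can be skipped. The overlap is then found by comparing two short identifier lists rather than two long noisy sequences.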

Experimental Results and Numerical Insights

The experimental evaluation of DBG2OLC covers a range of genomes, from small yeast genomes to large mammalian genomes such as human. The paper reports that the pipeline can assemble a 3 Gbp human genome in just 3 CPU days using 30x 3GS and 50x NGS data, in stark contrast to the hundreds of thousands of CPU hours required by existing 3GS-only methods. For example, a draft assembly with an N50 of 6 Mbp was achieved with significantly fewer resources, which makes the method practical in resource-constrained environments.
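For readers unfamiliar with the metrics quoted here, the following Python sketch shows how N50 is conventionally computed and puts the CPU-time comparison into the same units. The contig lengths are made-up toy values, and the baseline of 100,000 CPU hours is an assumed conservative placeholder, since this summary gives no exact figure for the 3GS-only methods.

```python
def n50(lengths):
    """Return the N50 of a set of contig lengths.

    N50 is the largest length L such that contigs of length >= L
    together cover at least half of the total assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Toy contig lengths (hypothetical, not from the paper).
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)

# Unit conversion for the reported run time: 3 CPU days expressed in hours.
cpu_hours = 3 * 24

# Assumed placeholder for "hundreds of thousands of CPU hours".
assumed_baseline_hours = 100_000
speedup_floor = assumed_baseline_hours / cpu_hours

print(toy_n50, cpu_hours, round(speedup_floor))
```

Under that assumed baseline, the ratio is above one thousand, which is consistent with the "orders of magnitude" wording used for the time comparison.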

On a yeast dataset, DBG2OLC produced assemblies with fewer structural errors and high contiguity compared to established assemblers such as HGAP, Falcon, and PacBioToCA. The software also handled Oxford Nanopore data, assembling an E. coli genome with an error rate below 0.23%.

Theoretical and Practical Implications

On the theoretical side, computing overlaps between compressed reads bridges the two major assembly paradigms, de Bruijn graph (DBG) and overlap-layout-consensus (OLC). On the practical side, the reduced computational demand makes large-scale genome assembly more accessible and keeps pace with the increasing throughput of sequencing technologies.

The results for DBG2OLC underscore the potential of hybrid methods for genome assembly as 3GS technologies become more prevalent. By easing the error correction burden and reducing sequencing requirements, the method supports broader genomic analyses and applications in fields such as biotechnology and medical research.

Future Directions

As sequencing platforms continue to evolve and costs decline, methods like DBG2OLC are likely to become more significant. Future work may include improved error detection algorithms and integration with emerging sequencing technologies. The paper shows quantitatively that detecting and correcting structural sequencing errors, rather than exhaustively correcting base-level errors, can substantially improve practical genome assembly, leaving room for further gains in assembly accuracy and efficiency.
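To make the idea of structural error detection concrete, here is a minimal Python sketch of one simple heuristic that works on reads already compressed into lists of contig identifiers. It is an illustrative adjacency-support check, not the detection procedure used in the paper; the function names, the `min_support` threshold, and the example reads are hypothetical.

```python
from collections import Counter


def adjacency_support(reads):
    """Count, for each ordered pair of neighbouring contig ids,
    how many compressed reads contain that adjacency."""
    counts = Counter()
    for read in reads:
        # set() so that a read supports each adjacency at most once.
        counts.update(set(zip(read, read[1:])))
    return counts


def flag_suspect_junctions(read, counts, min_support=2):
    """Return adjacencies in `read` seen in fewer than `min_support` reads.

    A junction supported only by the read itself is a candidate
    structural error (for example a chimeric read).
    """
    return [pair for pair in zip(read, read[1:]) if counts[pair] < min_support]


# Hypothetical compressed reads: integers stand for contig identifiers.
reads = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [3, 4, 5, 6],
    [1, 2, 9, 10],   # joins two unrelated regions at (2, 9)
    [8, 9, 10, 11],
]

counts = adjacency_support(reads)
suspects = flag_suspect_junctions(reads[3], counts)
print(suspects)
```

A known limitation of this simplified check is that it also flags junctions in low-coverage regions, such as the last adjacency of the third read, so a real pipeline would need additional evidence before splitting or discarding a read.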