Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
139 tokens/sec
GPT-4o
47 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

SparseAssembler: de novo Assembly with the Sparse de Bruijn Graph (1106.2603v1)

Published 14 Jun 2011 in cs.DS and q-bio.GN

Abstract: de Bruijn graph-based algorithms are one of the two most widely used approaches for de novo genome assembly. A major limitation of this approach is the large computational memory space requirement to construct the de Bruijn graph, which scales with k-mer length and total diversity (N) of unique k-mers in the genome expressed in base pairs or roughly (2k+8)N bits. This limitation is particularly important with large-scale genome analysis and for sequencing centers that simultaneously process multiple genomes. We present a sparse de Bruijn graph structure, based on which we developed SparseAssembler that greatly reduces memory space requirements. The structure also allows us to introduce a novel method for the removal of substitution errors introduced during sequencing. The sparse de Bruijn graph structure skips g intermediate k-mers, therefore reducing the theoretical memory space requirement to ~(2k/g+8)N. We have found that a practical value of g=16 consumes approximately 10% of the memory required by standard de Bruijn graph-based algorithms but yields comparable results. A high error rate could potentially derail the SparseAssembler. Therefore, we developed a sparse de Bruijn graph-based denoising algorithm that can remove more than 99% of substitution errors from datasets with a \leq 2% error rate. Given that substitution error rates for the current generation of sequencers is lower than 1%, our denoising procedure is sufficiently effective to safeguard the performance of our algorithm. Finally, we also introduce a novel Dijkstra-like breadth-first search algorithm for the sparse de Bruijn graph structure to circumvent residual errors and resolve polymorphisms.

Citations (8)

Summary

We haven't generated a summary for this paper yet.