A New Lightweight Algorithm to compute the BWT and the LCP array of a Set of Strings (1607.08342v1)

Published 28 Jul 2016 in cs.DS

Abstract: Indexing of very large collections of strings, such as those produced by widespread sequencing technologies, relies heavily on multi-string generalizations of the Burrows-Wheeler Transform (BWT), and various in-memory algorithms have been proposed for this problem. The rapid growth of routinely processed data, such as in bioinformatics, requires a large amount of main memory, which has motivated the development of algorithms that compute the BWT almost entirely in external memory. The related problem of computing the Longest Common Prefix (LCP) array is often instrumental in algorithms on collections of strings, such as those that compute the suffix-prefix overlaps among strings, an essential step in many genome assembly algorithms. The best current lightweight approach to compute the BWT and LCP array of a set of $m$ strings, each $k$ characters long, has I/O complexity $O(mk^2 \log |\Sigma|)$ (where $|\Sigma|$ is the size of the alphabet), and is thus not optimal. In this paper we propose a novel approach to build the BWT and LCP array simultaneously with $O(kmL(\log k + \log \sigma))$ I/O complexity, where $\sigma = |\Sigma|$ and $L$ is the length of the longest substring that appears at least twice in the input strings.
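To make the two objects in the abstract concrete, the following is a minimal in-memory sketch of the BWT and LCP array of a small string collection. It is not the paper's external-memory algorithm: it simply sorts all suffixes, which is quadratic in space and only suitable for toy inputs. The function name `bwt_lcp`, the `$` end-marker, and the convention that end-markers of different strings are distinct (ordered by string index and never counted as a common prefix) are assumptions made for illustration.

```python
def bwt_lcp(strings):
    """Naive BWT and LCP array of a collection of strings (illustration only)."""
    suffixes = []
    for i, s in enumerate(strings):
        t = s + "$"  # assumed end-marker; must sort before every input symbol
        for j in range(len(t)):
            # The symbol preceding the suffix; the full string is preceded
            # by its own end-marker (cyclic convention, assumed here).
            preceding = t[j - 1] if j > 0 else "$"
            suffixes.append((t[j:], i, preceding))

    # Sort by suffix text; equal texts are ordered by string index, which
    # models distinct end-markers $_0 < $_1 < ...
    suffixes.sort(key=lambda x: (x[0], x[1]))

    bwt = "".join(p for _, _, p in suffixes)

    def common_prefix(a, b):
        # Compare without the trailing end-marker, so that end-markers of
        # different strings never match each other.
        a, b = a[:-1], b[:-1]
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    lcp = [0]
    for (a, _, _), (b, _, _) in zip(suffixes, suffixes[1:]):
        lcp.append(common_prefix(a, b))
    return bwt, lcp


if __name__ == "__main__":
    print(bwt_lcp(["banana"]))    # ('annb$aa', [0, 0, 1, 3, 0, 0, 2])
    print(bwt_lcp(["ab", "ab"]))  # ('bb$$aa', [0, 0, 0, 2, 0, 1])
```

Under these conventions, the largest value in the LCP array is the length of the longest substring occurring at least twice in the collection, which corresponds to the parameter $L$ in the stated I/O bound.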

Authors (5)

Paola Bonizzoni
Gianluca Della Vedova
Serena Nicosia
Marco Previtali
Raffaella Rizzi
