Computing all-vs-all MEMs in grammar-compressed text (2306.16815v1)
Abstract: We describe a compression-aware method to compute all-vs-all maximal exact matches (MEMs) among the strings of a repetitive collection $\mathcal{T}$. The key concept in our work is the construction of a fully-balanced grammar $\mathcal{G}$ from $\mathcal{T}$ that meets a property we call \emph{fix-free}: the expansions of the nonterminals that have the same height in the parse tree form a fix-free set (i.e., a set that is both prefix-free and suffix-free). The fix-free property allows us to compute the MEMs of $\mathcal{T}$ incrementally over $\mathcal{G}$ using a standard suffix-tree-based MEM algorithm, which runs on a subset of the grammar rules at a time and does not decompress nonterminals. By modifying the locally-consistent grammar of Christiansen et al. (2020), we show how to build $\mathcal{G}$ from $\mathcal{T}$ in linear time and space. We also demonstrate that our MEM algorithm runs on top of $\mathcal{G}$ in $O(G + occ)$ time and uses $O(\log G(G+occ))$ bits, where $G$ is the grammar size and $occ$ is the number of MEMs in $\mathcal{T}$. In the conclusions, we discuss how our idea can be modified to implement approximate pattern matching in compressed space.
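To make the two definitions used in the abstract concrete, the following Python sketch checks whether a set of strings is fix-free and enumerates the MEMs between two plain strings by brute force. It is only an illustration on uncompressed text under our own assumptions: the function names `is_fix_free` and `maximal_exact_matches` and the `min_len` parameter are hypothetical, and the quadratic-time table stands in for, and is not, the grammar-based $O(G + occ)$ algorithm described above.

```python
from itertools import combinations


def is_fix_free(strings):
    """Return True if no string is a prefix or a suffix of another one.

    Illustrative helper (hypothetical name): checks the prefix-free and
    suffix-free conditions on a plain set of strings.
    """
    items = sorted(set(strings))  # duplicates are the same set element
    for a, b in combinations(items, 2):
        short, long_ = (a, b) if len(a) <= len(b) else (b, a)
        if long_.startswith(short) or long_.endswith(short):
            return False
    return True


def maximal_exact_matches(x, y, min_len=1):
    """Return (i, j, length) triples such that x[i:i+length] == y[j:j+length]
    and the match cannot be extended to the left or to the right.

    Brute-force O(|x| * |y|) sketch on uncompressed strings.
    """
    # ext[i][j] = length of the longest common prefix of x[i:] and y[j:]
    ext = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) - 1, -1, -1):
        for j in range(len(y) - 1, -1, -1):
            if x[i] == y[j]:
                ext[i][j] = ext[i + 1][j + 1] + 1

    mems = []
    for i in range(len(x)):
        for j in range(len(y)):
            length = ext[i][j]  # right-maximal by construction
            left_maximal = i == 0 or j == 0 or x[i - 1] != y[j - 1]
            if left_maximal and length >= max(1, min_len):
                mems.append((i, j, length))
    return mems


if __name__ == "__main__":
    print(is_fix_free({"ab", "ba", "cc"}))
    print(maximal_exact_matches("banana", "ananas", min_len=2))
```

An all-vs-all computation over a collection would call `maximal_exact_matches` on every pair of strings; avoiding that pairwise work on the decompressed text is what the grammar-based method is designed for.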
- Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1):53–86, 2004.
- Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.
- Oblivious string embeddings and edit distance approximations. In Proc. 17th Symposium on Discrete Algorithms (SODA), pages 792–801, 2006.
- PHONI: Streamed matching statistics with multi-genome references. In Proc. 21st Data Compression Conference (DCC), pages 193–202, 2021.
- Sublinear approximate string matching and biological applications. Algorithmica, 12(4):327–344, 1994.
- The smallest grammar problem. IEEE Transactions on Information Theory, 51(7):2554–2576, 2005.
- Optimal-time dictionary-compressed indexes. ACM Transactions on Algorithms, 17(1):1–39, 2020.
- Improved grammar-based compressed indexes. In Proc. 19th SPIRE, pages 180–192, 2012.
- Grammar-compressed indexes with logarithmic search time. Journal of Computer and System Sciences, 118:53–74, 2021.
- Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms. In Proc. 18th Annual Symposium on Theory of Computing (STOC), pages 206–219, 1986.
- A grammar compressor for collections of reads with applications to the construction of the BWT. In Proc. 31st Data Compression Conference (DCC), pages 83–92, 2021.
- Johannes Fischer. Optimal succinctness for range minimum queries. In Proc. 9th Latin American Symposium, pages 158–169, 2010.
- Fully-functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM, 67(1):article 2, 2020.
- Artur Jeż. Approximation of grammar-based compression via recompression. Theoretical Computer Science, 592:115–134, 2015.
- W. James Kent. BLAT—the BLAST-like alignment tool. Genome Research, 12(4):656–664, 2002.
- Grammar-based codes: a new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737–754, 2000.
- Versatile and open software for comparing large genomes. Genome Biology, 5:1–9, 2004.
- Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proc. 12th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 181–192, 2001.
- Fast gapped-read alignment with bowtie 2. Nature Methods, 9(4):357–359, 2012.
- Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997, 2013.
- Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–3100, 2018.
- Genome-Scale Algorithm Design. Cambridge University Press, 2015.
- Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993.
- Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262–272, 1976.
- Gonzalo Navarro. Computing MEMs on repetitive text collections. In Proc. 34th Annual Symposium on Combinatorial Pattern Matching (CPM), article 22, 2023.
- Linear suffix array construction by almost pure induced-sorting. In Proc. 19th Data Compression Conference (DCC), pages 193–202, 2009.
- A grammar compression algorithm based on induced suffix sorting. In Proc. 28th Data Compression Conference (DCC), pages 42–51, 2018.
- CST++. In Proc. 17th International Symposium on String Processing and Information Retrieval (SPIRE), pages 322–333, 2010.
- Finding maximal exact matches using the r-index. Journal of Computational Biology, 29(2):188–194, 2022.
- MONI: A pangenomic index for finding maximal exact matches. Journal of Computational Biology, 29(2):169–187, 2022.
- Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory of Computing Systems, 41(4):589–607, 2007.
- On a parallel-algorithms method for string matching problems (overview). Algorithms and Complexity, pages 22–32, 1994.
- Peter Weiner. Linear pattern matching algorithms. In Proc. 14th Annual Symposium on Switching and Automata Theory (SWAT), pages 1–11, 1973.