
Computing all-vs-all MEMs in grammar-compressed text (2306.16815v1)

Published 29 Jun 2023 in cs.IR and cs.DS

Abstract: We describe a compression-aware method to compute all-vs-all maximal exact matches (MEMs) among the strings of a repetitive collection $\mathcal{T}$. The key concept in our work is the construction of a fully-balanced grammar $\mathcal{G}$ from $\mathcal{T}$ that meets a property that we call \emph{fix-free}: the expansions of the nonterminals that have the same height in the parse tree form a fix-free set (i.e., prefix-free and suffix-free). The fix-free property allows us to compute the MEMs of $\mathcal{T}$ incrementally over $\mathcal{G}$ using a standard suffix-tree-based MEM algorithm, which runs on a subset of grammar rules at a time and does not decompress nonterminals. By modifying the locally-consistent grammar of Christiansen et al. (2020), we show how to build $\mathcal{G}$ from $\mathcal{T}$ in linear time and space. We also demonstrate that our MEM algorithm runs on top of $\mathcal{G}$ in $O(G + occ)$ time and uses $O(\log G(G+occ))$ bits, where $G$ is the grammar size and $occ$ is the number of MEMs in $\mathcal{T}$. In the conclusions, we discuss how our idea can be modified to implement approximate pattern matching in compressed space.
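
The two definitions the abstract relies on can be illustrated with a small brute-force sketch in Python. This is not the paper's grammar-based algorithm: it works on plain, uncompressed strings, takes quadratic time or worse, and the function names `is_fix_free` and `naive_mems` are hypothetical names chosen for this illustration. It only shows what a fix-free set is (no string is a prefix or a suffix of another) and what a maximal exact match is (a common substring occurrence that cannot be extended to the left or to the right).

```python
from itertools import permutations


def is_fix_free(strings):
    """Return True if the set is both prefix-free and suffix-free.

    Illustrative check only: no distinct string in the set may be a
    prefix or a suffix of another string in the set.
    """
    distinct = set(strings)
    for a, b in permutations(distinct, 2):
        if b.startswith(a) or b.endswith(a):
            return False
    return True


def naive_mems(x, y, min_len=1):
    """Brute-force MEMs between x and y as (i, j, length) triples.

    A match x[i:i+l] == y[j:j+l] is reported only if it cannot be
    extended to the left or to the right.
    """
    mems = []
    for i in range(len(x)):
        for j in range(len(y)):
            if x[i] != y[j]:
                continue
            # Skip positions that are not left-maximal.
            if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
                continue
            # Extend to the right as far as possible (right-maximal).
            length = 0
            while (i + length < len(x) and j + length < len(y)
                   and x[i + length] == y[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems
```

For example, `is_fix_free(["ab", "abc"])` is false because `ab` is a prefix of `abc`, and `naive_mems("abcd", "xbcy")` reports the single match `bc` at position 1 in both strings.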

References (33)
  1. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1):53–86, 2004.
  2. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.
  3. Oblivious string embeddings and edit distance approximations. In Proc. 17th Symposium on Discrete Algorithms (SODA), pages 792–801, 2006.
  4. PHONI: Streamed matching statistics with multi-genome references. In Proc. 21st Data Compression Conference (DCC), pages 193–202, 2021.
  5. Sublinear approximate string matching and biological applications. Algorithmica, 12(4):327–344, 1994.
  6. The smallest grammar problem. IEEE Transactions on Information Theory, 51(7):2554–2576, 2005.
  7. Optimal-time dictionary-compressed indexes. ACM Transactions on Algorithms, 17(1):1–39, 2020.
  8. Improved grammar-based compressed indexes. In Proc. 19th SPIRE, pages 180–192, 2012.
  9. Grammar-compressed indexes with logarithmic search time. Journal of Computer and System Sciences, 118:53–74, 2021.
  10. Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms. In Proc. 18th Annual Symposium on Theory of Computing (STOC), pages 206–219, 1986.
  11. A grammar compressor for collections of reads with applications to the construction of the BWT. In Proc. 31st Data Compression Conference (DCC), pages 83–92, 2021.
  12. Johannes Fischer. Optimal succinctness for range minimum queries. In Proc. 9th Latin American Symposium, pages 158–169, 2010.
  13. Fully-functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM, 67(1):article 2, 2020.
  14. Artur Jeż. Approximation of grammar-based compression via recompression. Theoretical Computer Science, 592:115–134, 2015.
  15. W. James Kent. BLAT—the BLAST-like alignment tool. Genome Research, 12(4):656–664, 2002.
  16. Grammar-based codes: a new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737–754, 2000.
  17. Versatile and open software for comparing large genomes. Genome Biology, 5:1–9, 2004.
  18. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proc. 12th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 181–192, 2001.
  19. Fast gapped-read alignment with bowtie 2. Nature Methods, 9(4):357–359, 2012.
  20. Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997, 2013.
  21. Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–3100, 2018.
  22. Genome-Scale Algorithm Design. Cambridge University Press, 2015.
  23. Suffix arrays: a new method for on–line string searches. SIAM Journal on Computing, 22(5):935–948, 1993.
  24. Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262–272, 1976.
  25. Gonzalo Navarro. Computing MEMs on repetitive text collections. In Proc. 34th Annual Symposium on Combinatorial Pattern Matching (CPM), article 22, 2023.
  26. Linear suffix array construction by almost pure induced-sorting. In Proc. 19th Data Compression Conference (DCC), pages 193–202, 2009.
  27. A grammar compression algorithm based on induced suffix sorting. In Proc. 28th Data Compression Conference (DCC), pages 42–51, 2018.
  28. CST++. In Proc. 17th International Symposium on String Processing and Information Retrieval (SPIRE), pages 322–333, 2010.
  29. Finding maximal exact matches using the r-index. Journal of Computational Biology, 29(2):188–194, 2022.
  30. MONI: A pangenomic index for finding maximal exact matches. Journal of Computational Biology, 29(2):169–187, 2022.
  31. Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory of Computing Systems, 41(4):589–607, 2007.
  32. On a parallel-algorithms method for string matching problems (overview). Algorithms and Complexity, pages 22–32, 1994.
  33. Peter Weiner. Linear pattern matching algorithms. In Proc. 14th Annual Symposium on Switching and Automata Theory (SWAT), pages 1–11, 1973.
