
Rank and Select on Degenerate Strings (2310.19702v3)

Published 30 Oct 2023 in cs.DS

Abstract: A 'degenerate string' is a sequence of subsets of some alphabet; it represents any string obtainable by selecting one character from each set from left to right. Recently, Alanko et al. generalized the rank-select problem to degenerate strings, where given a character $c$ and position $i$ the goal is to find either the $i$th set containing $c$ or the number of occurrences of $c$ in the first $i$ sets [SEA 2023]. The problem has applications to pangenomics; in another work by Alanko et al. they use it as the basis for a compact representation of 'de Bruijn Graphs' that supports fast membership queries. In this paper we revisit the rank-select problem on degenerate strings, introducing a new, natural parameter and reanalyzing existing reductions to rank-select on regular strings. Plugging in standard data structures, the time bounds for queries are improved exponentially while essentially matching, or improving, the space bounds. Furthermore, we provide a lower bound on space that shows that the reductions lead to succinct data structures in a wide range of cases. Finally, we provide implementations; our most compact structure matches the space of the most compact structure of Alanko et al. while answering queries twice as fast. We also provide an implementation using modern vector processing features; it uses less than one percent more space than the most compact structure of Alanko et al. while supporting queries four to seven times faster, and has competitive query time with all the remaining structures.
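
As a rough illustration of the two queries described in the abstract, the sketch below answers them by a plain linear scan over a degenerate string stored as a list of Python sets. It is not the data structure from the paper or from Alanko et al.; the function names `subset_rank` and `subset_select`, the 1-based positions, and the example string `X` are assumptions made here for illustration only.

```python
from typing import List, Optional, Set


def subset_rank(sets: List[Set[str]], i: int, c: str) -> int:
    """Count how many of the first i sets contain character c (naive scan)."""
    return sum(1 for s in sets[:i] if c in s)


def subset_select(sets: List[Set[str]], i: int, c: str) -> Optional[int]:
    """Return the 1-based position of the i-th set containing c, or None."""
    seen = 0
    for pos, s in enumerate(sets, start=1):
        if c in s:
            seen += 1
            if seen == i:
                return pos
    return None  # fewer than i sets contain c


# Hypothetical degenerate string over the DNA alphabet: four sets.
X = [{"A", "C"}, {"G"}, {"A", "T"}, {"C"}]

print(subset_rank(X, 3, "A"))    # sets 1 and 3 contain "A"
print(subset_select(X, 2, "C"))  # the second set containing "C" is set 4
```

Both queries take time linear in the number of sets here; the point of the structures discussed in the abstract is to answer the same queries much faster in compact space.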

References (28)
  1. Karl R. Abrahamson. Generalized string matching. SIAM J. Comput., 16(6):1039–1051, 1987. doi:10.1137/0216067.
  2. Subset wavelet trees. In Proc. 21st SEA, pages 4:1–4:14, 2023. doi:10.4230/LIPIcs.SEA.2023.4.
  3. Small searchable $\kappa$-spectra via subset rank queries on the spectral Burrows-Wheeler transform. In Proc. ACDA, 2023, pages 225–236, 2023. doi:10.1137/1.9781611977714.20.
  4. Comparing degenerate strings. Fundam. Informaticae, 175(1-4):41–58, 2020. doi:10.3233/FI-2020-1947.
  5. Efficient fully-compressed sequence representations. Algorithmica, 69(1):232–268, 2014. doi:10.1007/S00453-012-9726-3.
  6. Succinct indexes for strings, binary relations and multilabeled trees. ACM Trans. Algorithms, 7(4):52:1–52:27, 2011. doi:10.1145/2000807.2000820.
  7. Access, rank, and select in grammar-compressed strings. In Proc. 23rd ESA, pages 142–154, 2015. doi:10.1007/978-3-662-48350-3_13.
  8. Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms, 11(4):31:1–31:21, 2015. doi:10.1145/2629339.
  9. Efficient suffix trees on secondary storage (extended abstract). In Proc. 7th SODA, pages 383–391, 1996. URL: http://dl.acm.org/citation.cfm?id=313852.314087.
  10. Covering problems for partial words and for indeterminate strings. Theor. Comput. Sci., 698:25–39, 2017. doi:10.1016/J.TCS.2017.05.026.
  11. Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms, 3(2):20, 2007. doi:10.1145/1240233.1240243.
  12. Travis Gagie. Rank and select operations on sequences. In Encyclopedia of Algorithms, pages 1776–1780. 2016. doi:10.1007/978-1-4939-2864-4_638.
  13. From theory to practice: Plug and play with succinct data structures. In Proc. 13th SEA, pages 326–337, 2014.
  14. Rank/select operations on large alphabets: a tool for text indexing. In Proc. 15th SODA, pages 368–373, 2006. URL: http://dl.acm.org/citation.cfm?id=1109557.1109599.
  15. Optimal indexes for sparse bit vectors. Algorithmica, 69(4):906–924, 2014. doi:10.1007/S00453-013-9767-2.
  16. High-order entropy-compressed text indexes. In Proc. 14th SODA, pages 841–850, 2003.
  17. Dynamic compressed strings with random access. In Proc. 40th ICALP, pages 504–515, 2013. doi:10.1007/978-3-642-39206-1_43.
  18. Succinct representations of dynamic strings. In Proc. 17th SPIRE, pages 334–346, 2010. doi:10.1007/978-3-642-16321-0_35.
  19. A new approach to pattern matching in degenerate DNA/RNA sequences and distributed pattern matching. Math. Comput. Sci., 1(4):557–569, 2008. doi:10.1007/S11786-007-0029-Z.
  20. Guy Jacobson. Space-efficient static trees and graphs. In Proc. FOCS, pages 549–554, 1989. doi:10.1109/SFCS.1989.63533.
  21. Differences in fecal microbiomes and metabolomes of people with vs without irritable bowel syndrome and bile acid malabsorption. Gastroenterology, 158(4):1016–1028, 2020.
  22. On Elias-Fano for rank queries in FM-indexes. In Proc. DCC, 2021, pages 223–232, 2021.
  23. Rank and select revisited and extended. Theor. Comput. Sci., 387(3):332–347, 2007. doi:10.1016/J.TCS.2007.07.013.
  24. Optimal dynamic sequence representations. SIAM J. Comput., 43(5):1781–1806, 2014. doi:10.1137/130908245.
  25. Fully functional static and dynamic succinct trees. ACM Trans. Algorithms, 10(3):16:1–16:39, 2014. doi:10.1145/2601073.
  26. Practical Entropy-Compressed Rank/Select Dictionary. In Proc. 9th ALENEX, 2007. doi:10.1137/1.9781611972870.6.
  27. Grammar compressed sequences with rank/select support. J. Discrete Algorithms, 43:54–71, 2017. doi:10.1016/J.JDA.2016.10.001.
  28. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms, 3(4):43, 2007. doi:10.1145/1290672.1290680.