Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast (2310.09023v2)
Abstract: Sparse suffix sorting is the problem of sorting $b=o(n)$ suffixes of a string of length $n$. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient algorithms for sparse suffix array construction. We provide two new algorithms for constructing the sparse suffix and LCP arrays that are simultaneously simple, direct, small, and fast. In particular, our algorithms are: simple in the sense that they can be implemented using only basic data structures; direct in the sense that the output arrays are not a byproduct of constructing the sparse suffix tree or an LCE data structure; fast in the sense that they run in $\mathcal{O}(n\log b)$ time, in the worst case, or in $\mathcal{O}(n)$ time, when the total number of suffixes with an LCP value greater than $2^{\lfloor \log \frac{n}{b} \rfloor + 1}-1$ is in $\mathcal{O}(b/\log b)$, matching the time of the optimal yet much more complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et al., SODA 2020]; and small in the sense that they can be implemented using only $8b+o(b)$ machine words. Our algorithms are non-trivial space-efficient adaptations of the Monte Carlo algorithm by I et al. for constructing the sparse suffix tree in $\mathcal{O}(n\log b)$ time [STACS 2014]. We provide extensive experiments to justify our claims on simplicity and on efficiency.
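To make the two output arrays concrete, the following is a minimal sketch of a naive baseline, not the algorithms of the paper. It sorts the $b$ chosen suffixes by direct comparison and then computes the LCP of each adjacent pair by scanning characters. The function name `sparse_suffix_and_lcp`, the argument names, and the convention that the first LCP entry is 0 are assumptions made for illustration only. This baseline copies suffixes and can take far more than $\mathcal{O}(n\log b)$ time on repetitive inputs.

```python
def sparse_suffix_and_lcp(text, positions):
    """Naive baseline (hypothetical helper, not the paper's method).

    Returns (ssa, slcp) where ssa lists the given starting positions in
    lexicographic order of their suffixes, and slcp[k] is the length of
    the longest common prefix of the suffixes starting at ssa[k-1] and
    ssa[k], with slcp[0] = 0 by convention.
    """
    n = len(text)

    # Sort the sampled positions by the suffix that starts at each one.
    # Slicing copies the suffix, so this is simple but not space-efficient.
    ssa = sorted(positions, key=lambda i: text[i:])

    # Compute adjacent LCP values by direct character comparison.
    slcp = [0] * len(ssa)
    for k in range(1, len(ssa)):
        a, b = ssa[k - 1], ssa[k]
        length = 0
        while a + length < n and b + length < n and text[a + length] == text[b + length]:
            length += 1
        slcp[k] = length

    return ssa, slcp


if __name__ == "__main__":
    # Suffixes of "banana" at positions 0, 2, 4 are "banana", "nana", "na".
    print(sparse_suffix_and_lcp("banana", [0, 2, 4]))
```

For `"banana"` with positions 0, 2, and 4, the sorted order is "banana", "na", "nana", so the sparse suffix array is `[0, 4, 2]` and the sparse LCP array is `[0, 0, 2]`. The algorithms described in the abstract aim to produce the same two arrays within $8b+o(b)$ machine words and without materializing suffix copies.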
- Backyard cuckoo hashing: Constant worst-case operations with a succinct representation. In 51th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, October 23-26, 2010, Las Vegas, Nevada, USA, pages 787–796. IEEE Computer Society, 2010.
- Text indexing for long patterns: Anchors are all you need. Proc. VLDB Endow., 16(9):2117–2131, 2023.
- Time-space tradeoffs for finding a long common substring. In Inge Li Gørtz and Oren Weimann, editors, 31st Annual Symposium on Combinatorial Pattern Matching, CPM 2020, June 17-19, 2020, Copenhagen, Denmark, volume 161 of LIPIcs, pages 5:1–5:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
- Substring complexity in sublinear space. CoRR, abs/2007.08357, 2020.
- Sparse text indexing in small space. ACM Trans. Algorithms, 12(3):39:1–39:19, 2016.
- Locally consistent parsing for text indexing in small space. In Shuchi Chawla, editor, Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 607–626. SIAM, 2020.
- Longest common extension. Eur. J. Comb., 68:242–248, 2018.
- Selection and sorting in the “restore” model. ACM Trans. Algorithms, 14(2):11:1–11:18, 2018.
- Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms, 17(1):8:1–8:39, 2021.
- Polynomial hash functions are reliable (extended abstract). In Werner Kuich, editor, Automata, Languages and Programming, 19th International Colloquium, ICALP92, Vienna, Austria, July 13-17, 1992, Proceedings, volume 623 of Lecture Notes in Computer Science, pages 235–246. Springer, 1992.
- Deterministic sparse suffix sorting in the restore model. ACM Trans. Algorithms, 16(4):50:1–50:53, 2020.
- Radix sorting with no extra space. In Lars Arge, Michael Hoffmann, and Emo Welzl, editors, Algorithms - ESA 2007, 15th Annual European Symposium, Eilat, Israel, October 8-10, 2007, Proceedings, volume 4698 of Lecture Notes in Computer Science, pages 194–205. Springer, 2007.
- Sparse suffix tree construction in optimal time and space. In Philip N. Klein, editor, Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19, pages 425–439. SIAM, 2017.
- Sampled suffix array with minimizers. Softw. Pract. Exp., 47(11):1755–1771, 2017.
- Faster sparse suffix sorting. In Ernst W. Mayr and Natacha Portier, editors, 31st International Symposium on Theoretical Aspects of Computer Science, STACS 2014, March 5-8, 2014, Lyon, France, volume 25 of LIPIcs, pages 386–396. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2014.
- Linear work suffix array construction. J. ACM, 53(6):918–936, 2006.
- Sparse suffix trees. In Jin-yi Cai and C. K. Wong, editors, Computing and Combinatorics, Second Annual International Conference, COCOON ’96, Hong Kong, June 17-19, 1996, Proceedings, volume 1090 of Lecture Notes in Computer Science, pages 219–230. Springer, 1996.
- Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249–260, 1987.
- Linear-time longest-common-prefix computation in suffix arrays and its applications. In Amihood Amir and Gad M. Landau, editors, Combinatorial Pattern Matching, 12th Annual Symposium, CPM 2001 Jerusalem, Israel, July 1-4, 2001 Proceedings, volume 2089 of Lecture Notes in Computer Science, pages 181–192. Springer, 2001.
- Practical in-place mergesort. Nord. J. Comput., 3(1):27–40, 1996.
- Bidirectional string anchors: A new string sampling mechanism. In Petra Mutzel, Rasmus Pagh, and Grzegorz Herman, editors, 29th Annual European Symposium on Algorithms, ESA 2021, September 6-8, 2021, Lisbon, Portugal (Virtual Conference), volume 204 of LIPIcs, pages 64:1–64:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
- Universal compressed text indexing. Theor. Comput. Sci., 762:41–50, 2019.
- Three partition refinement algorithms. SIAM J. Comput., 16(6):973–989, 1987.
- Nicola Prezza. Optimal substring equality queries with applications to sparse text indexing. ACM Trans. Algorithms, 17(1):7:1–7:23, 2021.
- Simplified stable merging tasks. J. Algorithms, 8(4):557–571, 1987.