Height-bounded Lempel-Ziv encodings (2403.08209v2)
Abstract: We introduce height-bounded LZ encodings (LZHB), a new family of compressed representations that are variants of Lempel-Ziv parsings with a focus on bounding the worst-case access time to arbitrary positions in the text directly via the compressed representation. An LZ-like encoding is a partitioning of the string into phrases that are either of length $1$ and encoded literally, or of length at least $2$ with a previous occurrence in the string and encoded by the position and length of that occurrence. An LZ-like encoding induces an implicit referencing forest on the set of positions of the string. An LZHB encoding is an LZ-like encoding where the height of the implicit referencing forest is bounded. An LZHB encoding with height constraint $h$ allows access to an arbitrary position of the underlying text using $O(h)$ predecessor queries. While computing the smallest LZHB encoding efficiently seems to be difficult [Cicalese & Ugazio 2024, arXiv], we give the first linear-time algorithm for strings over a constant-size alphabet that computes the greedy LZHB encoding, i.e., the string is processed from beginning to end, and the longest prefix of the remaining string that can satisfy the height constraint is taken as the next phrase. Our algorithms significantly improve, both theoretically and practically, on the very recently and independently proposed algorithms by Lipták et al. (arXiv, to appear at CPM 2024). We also analyze the size of height-bounded LZ encodings in the context of repetitiveness measures, and show that for some constant $c$, the size $z_{HB}$ of the optimal LZHB encoding with height bound $c\log n$ is $O(g_{rl})$, where $g_{rl}$ is the size of the smallest run-length grammar. We also show $z_{HB} = o(g_{rl})$ for some family of strings, making $z_{HB}$ one of the smallest known repetitiveness measures for which $O({\sf polylog} n)$ time access is possible using linear space.
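The definitions in the abstract can be made concrete with a small sketch. The Python below is not the linear-time algorithm of the paper; it is a naive quadratic-time illustration of the greedy rule (take the longest prefix of the remaining string that keeps every copied position within height $h$), together with random access by following references, using one predecessor query per hop. The function names `greedy_lzhb_naive` and `access`, the 4-tuple phrase layout, and the convention that literal positions have height 0 while a copied position has the height of its source plus 1 are assumptions made for this illustration.

```python
from bisect import bisect_right


def greedy_lzhb_naive(text, h):
    """Naive greedy height-bounded parse (illustrative, quadratic time or worse).

    Returns (phrases, height). Each phrase is a hypothetical 4-tuple
    (start, length, source, char): literals have source=None and carry
    their character; copies have char=None and point at `source`.
    """
    n = len(text)
    height = [0] * n  # height of each position in the referencing forest
    phrases = []
    i = 0
    while i < n:
        best_len, best_src, best_heights = 1, None, [0]
        for s in range(i):  # try every earlier position as a source
            tmp = []  # tentative heights of the phrase copied from s
            k = 0
            while i + k < n and text[s + k] == text[i + k]:
                src = s + k
                # The source may overlap the phrase being built; in that
                # case its height is one of the tentative values.
                src_height = height[src] if src < i else tmp[src - i]
                if src_height + 1 > h:
                    break  # extending further would violate the bound
                tmp.append(src_height + 1)
                k += 1
            if k >= 2 and k > best_len:
                best_len, best_src, best_heights = k, s, tmp
        height[i:i + best_len] = best_heights
        char = text[i] if best_src is None else None
        phrases.append((i, best_len, best_src, char))
        i += best_len
    return phrases, height


def access(phrases, starts, pos):
    """Return (character at pos, number of hops) using only the encoding.

    `starts` is the sorted list of phrase start positions; each hop does
    one predecessor query on it to find the phrase containing `pos`.
    """
    hops = 0
    while True:
        j = bisect_right(starts, pos) - 1  # predecessor query
        start, _length, source, char = phrases[j]
        if source is None:
            return char, hops  # reached a literal (a root of the forest)
        pos = source + (pos - start)  # jump to the referenced position
        hops += 1


if __name__ == "__main__":
    text, h = "abababab", 2
    phrases, height = greedy_lzhb_naive(text, h)
    starts = [p[0] for p in phrases]
    decoded = "".join(access(phrases, starts, p)[0] for p in range(len(text)))
    print(phrases)
    print(decoded == text, max(height) <= h)
```

In this sketch the number of hops made by `access` for a position equals that position's height, so a bound of $h$ on the forest height translates directly into at most $h + 1$ predecessor queries per access. A smaller $h$ gives faster access at the cost of more phrases.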