BAT-LZ Out of Hell (2403.09893v2)
Abstract: Despite consistently yielding the best compression on repetitive text collections, the Lempel-Ziv parsing has resisted all attempts at offering relevant guarantees on the cost to access an arbitrary symbol. This makes it less attractive for use on compressed self-indexes and other compressed data structures. In this paper we introduce a variant we call BAT-LZ (for Bounded Access Time Lempel-Ziv) where the access cost is bounded by a parameter given at compression time. We design and implement a linear-space algorithm that, in time $O(n\log^3 n)$, obtains a BAT-LZ parse of a text of length $n$ by greedily maximizing each next phrase length. The algorithm builds on a new linear-space data structure that solves 5-sided orthogonal range queries in rank space, allowing updates to the coordinate where the one-sided queries are supported, in $O(\log^3 n)$ time for both queries and updates. This time can be reduced to $O(\log^2 n)$ if $O(n\log n)$ space is used. We design a second algorithm that chooses the sources for the phrases in a clever way, using an enhanced suffix tree, albeit no longer guaranteeing longest possible phrases. This algorithm is much slower in theory, but in practice it is comparable to the greedy parser, while achieving significantly superior compression. We then combine the two algorithms, resulting in a parser that always chooses the longest possible phrases, and the best sources for those. Our experimentation shows that, on most repetitive texts, our algorithms reach an access cost close to $\log_2 n$ on texts of length $n$, while incurring almost no loss in the compression ratio when compared with classical LZ-compression. Several open challenges are discussed at the end of the paper.
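To make the idea of a bounded access cost concrete, here is a minimal Python sketch of a greedy parse under a cost bound. It is not the paper's algorithm: it scans every candidate source by brute force rather than using the range-query structure, so it takes polynomial time. It also relies on assumptions the abstract does not state: each phrase is a copied substring followed by one explicit character, an explicit character costs 0, and a copied character costs one more than its source character. The names `bat_lz_greedy`, `decode`, and `max_cost` are hypothetical.

```python
from typing import List, Optional, Tuple

Phrase = Tuple[Optional[int], int, str]  # (source position, copy length, explicit char)


def _extend(text: str, cost: List[int], i: int, src: int, max_cost: int) -> int:
    """Longest copy from src to i whose characters all stay within max_cost.

    Writes tentative costs into cost[i:], which lets a source overlap the
    phrase it produces. One position is always reserved for the explicit char.
    """
    n = len(text)
    length = 0
    while (i + length < n - 1
           and text[src + length] == text[i + length]
           and cost[src + length] < max_cost):
        cost[i + length] = cost[src + length] + 1  # one hop more than its source
        length += 1
    return length


def bat_lz_greedy(text: str, max_cost: int) -> Tuple[List[Phrase], List[int]]:
    """Brute-force greedy parse (assumed cost model; see the lead-in)."""
    n = len(text)
    cost = [0] * n
    phrases: List[Phrase] = []
    i = 0
    while i < n:
        best_len, best_src = 0, None
        for src in range(i):  # try every earlier starting position
            length = _extend(text, cost, i, src, max_cost)
            if length > best_len:
                best_len, best_src = length, src
        if best_src is not None:
            _extend(text, cost, i, best_src, max_cost)  # finalize the costs
        cost[i + best_len] = 0  # the explicit character is stored directly
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases, cost


def decode(phrases: List[Phrase]) -> str:
    """Rebuild the text by replaying copies and explicit characters."""
    out: List[str] = []
    for src, length, ch in phrases:
        for k in range(length):
            out.append(out[src + k])
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    sample = "abababababababab"
    parsed, costs = bat_lz_greedy(sample, max_cost=2)
    assert decode(parsed) == sample
    print(len(parsed), max(costs))
```

Under this cost model, `max_cost=0` forbids copying altogether, so every character becomes its own phrase, while a very large bound makes the sketch behave like an unconstrained greedy LZ-style parse.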