At the Roots of Dictionary Compression: String Attractors (1710.10964v4)

Published 30 Oct 2017 in cs.DS

Abstract: A well-known fact in the field of lossless text compression is that high-order entropy is a weak model when the input contains long repetitions. Motivated by this, decades of research have generated myriads of so-called dictionary compressors: algorithms able to reduce the text's size by exploiting its repetitiveness. Lempel-Ziv 77 is one of the most successful and well-known tools of this kind, followed by straight-line programs, run-length Burrows-Wheeler transform, macro schemes, collage systems, and the compact directed acyclic word graph. In this paper, we show that these techniques are different solutions to the same, elegant, combinatorial problem: to find a small set of positions capturing all text's substrings. We call such a set a string attractor. We first show reductions between dictionary compressors and string attractors. This gives the approximation ratios of dictionary compressors with respect to the smallest string attractor and uncovers new relations between the output sizes of different compressors. We show that the $k$-attractor problem: deciding whether a text has a size-$t$ set of positions capturing substrings of length at most $k$, is NP-complete for $k\geq 3$. We provide several approximation techniques for the smallest $k$-attractor, show that the problem is APX-complete for constant $k$, and give strong inapproximability results. To conclude, we provide matching lower and upper bounds for the random access problem on string attractors. The upper bound is proved by showing a data structure supporting queries in optimal time. Our data structure is universal: by our reductions to string attractors, it supports random access on any dictionary-compression scheme. In particular, it matches the lower bound also on LZ77, straight-line programs, collage systems, and macro schemes, and therefore closes (at once) the random access problem for all these compressors.

Citations (119)

Summary

An Examination of String Attractors in Dictionary Compression

This paper investigates lossless text compression and demonstrates that various established dictionary-based compression algorithms are, at their core, solutions to a shared combinatorial problem. The central concept introduced, termed a 'string attractor', makes this problem precise: a set of text positions such that every substring of the text has at least one occurrence containing a position of the set, with the goal of finding such a set of minimal size.
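
To make the definition concrete, here is a minimal brute-force sketch (in Python, not part of the paper) that checks whether a candidate set of positions is a string attractor: every substring of the text must have an occurrence containing at least one of the chosen positions.

```python
def is_string_attractor(text: str, positions: set[int]) -> bool:
    """Check whether `positions` (0-based indices into `text`) form a
    string attractor: every substring of `text` must have at least one
    occurrence containing a position of the set.

    Brute force and polynomial-time, intended only to illustrate the
    definition on short strings.
    """
    n = len(text)
    # Collect every substring that has an occurrence crossing an attractor position.
    covered = set()
    for p in positions:
        for i in range(p + 1):
            for j in range(p, n):
                covered.add(text[i:j + 1])
    # Every substring of the text must appear in that collection.
    return all(text[i:j + 1] in covered for i in range(n) for j in range(i, n))


print(is_string_attractor("banana", {0, 3, 4}))  # True: 'b', 'a', 'n' and all longer substrings are covered
print(is_string_attractor("banana", {0, 1}))     # False: no occurrence of 'n' contains position 0 or 1
```

The set of all positions is trivially an attractor; the interesting question, which the paper formalizes, is how small such a set can be.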

Key Results and Contributions

Dominik Kempa and Nicola Prezza establish that prominent dictionary compression techniques, such as the Lempel-Ziv 77 algorithm and its successors, straight-line programs (SLPs), run-length Burrows-Wheeler Transform (RLBWT), and others, can be unified through the lens of string attractors. The paper demonstrates that the challenge of identifying the smallest string attractor underpins these diverse methods.

  1. Reductions and Approximation Ratios:
    • The paper gives reductions between dictionary compressors and string attractors, which yield approximation ratios of dictionary compressors relative to the smallest string attractor. In one direction, every compressor's output induces an attractor of comparable size; for instance, the size $\gamma^*$ of the smallest string attractor never exceeds $z$, the number of phrases of the LZ77 parsing, since the phrase boundaries themselves form an attractor (a sketch of this construction follows the list below). In the other direction, the outputs of the compressors considered are bounded by $\gamma^*$ times polylogarithmic factors in $n$.
    • These reductions not only unify existing approaches but also yield new asymptotic relations between the output sizes of different compression schemes.
  2. Computational Complexity:
    • The paper studies the $k$-attractor problem: deciding whether a text has a size-$t$ set of positions capturing all substrings of length at most $k$. This problem is shown to be NP-complete for $k \geq 3$. The authors further show that the problem is in APX for constant $k$ by providing a $2k$-approximation computable in linear time, prove it APX-complete (so there is no PTAS unless P = NP), and show that it is NP-hard to approximate within a factor of $11809/11808 - \epsilon$ for any $\epsilon > 0$. (A simple set-cover-style heuristic for small $k$-attractors is sketched after this list.)
  3. Data Structures and Random Access:
    • On the practical side, the authors address the random access problem with a data structure that supports character extraction queries in optimal time, resolving a long-standing question for dictionary compressors. They achieve optimal query time within $O(\gamma \log(n/\gamma) \log^\epsilon n)$ space, and by the reductions above the structure is universal, applying to SLPs, RLSLPs, LZ77, and the other schemes considered (a simplified grammar-access sketch is given below).
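
As a concrete instance of the reduction mentioned in item 1, the sketch below (illustrative, not the paper's construction) computes a simple greedy LZ77-style factorization, where each phrase is either the longest prefix of the remaining suffix that also occurs starting earlier in the text or a single fresh character, and returns the last position of every phrase. By the paper's argument, these $z$ positions form a string attractor, which can be confirmed with the `is_string_attractor` checker above.

```python
def lz77_attractor(text: str) -> set[int]:
    """Greedy LZ77-style factorization (longest previous factor, or a
    single fresh character) and the attractor it induces: the last
    position of every phrase.  A real parser would use suffix-array
    machinery instead of the quadratic `str.find` scan used here.
    """
    n, i = len(text), 0
    attractor = set()
    while i < n:
        # Longest prefix of text[i:] that also occurs starting before i
        # (overlapping sources are allowed, as in self-referential LZ77).
        l = 0
        while i + l < n and text.find(text[i:i + l + 1]) < i:
            l += 1
        phrase_len = max(l, 1)                 # fresh character when l == 0
        attractor.add(i + phrase_len - 1)      # last position of the phrase
        i += phrase_len
    return attractor


text = "abababababba"
gamma = lz77_attractor(text)
print(sorted(gamma))                           # one position per LZ77 phrase
print(is_string_attractor(text, gamma))        # True: phrase ends capture all substrings
```
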
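The $k$-attractor problem of item 2 can be read as a covering problem: each text position covers exactly those substrings of length at most $k$ that have an occurrence crossing it, and a $k$-attractor is a set of positions covering all of them. The sketch below is not the paper's linear-time $2k$-approximation; it is a plain greedy set-cover heuristic over this universe, included only to make the covering view concrete.

```python
def greedy_k_attractor(text: str, k: int) -> set[int]:
    """Greedy set-cover heuristic for a small k-attractor.  Each position p
    covers every substring of length <= k that has an occurrence crossing p;
    repeatedly pick the position covering the most uncovered substrings.
    (Not the paper's algorithm; an illustrative heuristic only.)
    """
    n = len(text)
    covers = [set() for _ in range(n)]   # covers[p] = substrings crossing p
    universe = set()
    for i in range(n):
        for j in range(i, min(i + k, n)):
            sub = text[i:j + 1]
            universe.add(sub)
            for p in range(i, j + 1):
                covers[p].add(sub)
    attractor, uncovered = set(), set(universe)
    while uncovered:
        best = max(range(n), key=lambda p: len(covers[p] & uncovered))
        attractor.add(best)
        uncovered -= covers[best]
    return attractor


print(sorted(greedy_k_attractor("abababababba", 3)))  # a small (not necessarily optimal) 3-attractor
```
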
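For item 3, the paper's universal structure is substantially more involved than what fits here; the sketch below only shows the textbook idea behind random access on a straight-line program, one of the schemes the universal structure covers: annotate every nonterminal with the length of its expansion and descend from the root, so extracting one character costs time proportional to the grammar's depth. All names here are illustrative.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Terminal:
    ch: str


@dataclass
class Rule:
    left: "Node"
    right: "Node"
    length: int = 0          # expansion length, filled in by annotate()


Node = Union[Terminal, Rule]


def annotate(node: Node) -> int:
    """Store expansion lengths bottom-up so access() can descend in O(depth)."""
    if isinstance(node, Terminal):
        return 1
    node.length = annotate(node.left) + annotate(node.right)
    return node.length


def access(node: Node, i: int) -> str:
    """Return the i-th character (0-based) of the string the SLP derives."""
    while isinstance(node, Rule):
        left_len = node.left.length if isinstance(node.left, Rule) else 1
        if i < left_len:
            node = node.left
        else:
            i -= left_len
            node = node.right
    return node.ch


# SLP deriving "abab":  X -> a b,  S -> X X
a, b = Terminal("a"), Terminal("b")
x = Rule(a, b)
s = Rule(x, x)
annotate(s)
print("".join(access(s, i) for i in range(s.length)))  # abab
```

A balanced grammar keeps the depth, and hence the per-character access time, logarithmic; the paper's structure instead achieves its query bounds within space tied to the attractor size $\gamma$.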

Broader Implications

The presented work has significant implications for the field of algorithmic text compression:

  • Theoretical Insight: By reframing established methods within the unifying framework of string attractors, the paper deepens theoretical understanding and enables a clearer comparison of the compression power of diverse methods.
  • Algorithm Design: The framework opens pathways for dictionary compression algorithms that more closely approximate the smallest string attractor, potentially yielding better compression ratios on highly repetitive strings.
  • Efficient Data Structures: The proposed universal data structure paves the way for more efficient computational operations on compressed texts, particularly in supporting fast random access.

Future Directions

The research encourages further theoretical and practical explorations. Future studies could delve into optimal approximation algorithms for string attractors or extend the utility of the attractor paradigm to complex queries like pattern matching or indexing within compressed data structures.

In summary, the paper makes a substantial contribution by highlighting a fundamental combinatorial problem shared among dictionary compressors, supported by rigorous proofs and practical data structure implementations, hence offering a cohesive narrative binding the disparate techniques of text compression. This novel framing through string attractors may inspire the next generation of compression algorithms and data handling strategies.
