Random Access to Grammar Compressed Strings (1001.1565v3)

Published 11 Jan 2010 in cs.DS

Abstract: Grammar based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string. Let $S$ be a string of length $N$ compressed into a context-free grammar $\mathcal{S}$ of size $n$. We present two representations of $\mathcal{S}$ achieving $O(\log N)$ random access time, and either $O(n\cdot \alpha_k(n))$ construction time and space on the pointer machine model, or $O(n)$ construction time and space on the RAM. Here, $\alpha_k(n)$ is the inverse of the $k^{th}$ row of Ackermann's function. Our representations also efficiently support decompression of any substring in $S$: we can decompress any substring of length $m$ in the same complexity as a single random access query and additional $O(m)$ time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern $P$ with at most $k$ errors in time $O(n(\min\{|P|k, k^4 + |P|\} + \log N) + occ)$, where $occ$ is the number of occurrences of $P$ in $S$. Finally, we generalize our results to navigation and other operations on grammar-compressed ordered trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.

Citations (160)


Summary

The paper introduces novel representations allowing efficient O(log N) time random access to individual characters in grammar-compressed strings without full decompression.
The proposed methods enable fast substring decompression and improve approximate string matching algorithms on compressed data.
The research extends these efficient techniques to support navigation and operations on grammar-compressed trees, generalizing the benefits to tree-structured data.

Overview of the Paper: Random Access to Grammar-Compressed Strings and Trees

The paper addresses efficient random access to massive datasets stored in grammar-compressed form. Grammar-based compression replaces a long string (or tree) by a small context-free grammar (CFG) that generates exactly that string. The paradigm is general enough to capture several popular schemes, such as Lempel-Ziv, Run-Length Encoding, Byte-Pair Encoding, Sequitur, and Re-Pair. It can reduce space requirements significantly, which matters in settings such as biological databases and web data repositories.
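To make the idea of a grammar that generates a single string concrete, here is a minimal sketch in Python. The rule names (`X1` to `X6`), the dictionary layout, and the `expand` helper are illustrative assumptions, not notation from the paper. Each nonterminal has exactly one rule, so the grammar derives exactly one string.

```python
# Hypothetical toy grammar: every nonterminal has exactly one rule, which is
# either a single terminal character or a pair of previously defined symbols.
RULES = {
    "X1": "b",
    "X2": "a",
    "X3": ("X2", "X1"),
    "X4": ("X3", "X2"),
    "X5": ("X4", "X3"),
    "X6": ("X5", "X4"),
}


def expand(symbol, rules):
    """Fully decompress the string generated by `symbol` (naive recursion)."""
    rule = rules[symbol]
    if isinstance(rule, str):
        # Terminal rule: the expansion is the character itself.
        return rule
    left, right = rule
    # Binary rule: concatenate the expansions of both children.
    return expand(left, rules) + expand(right, rules)


if __name__ == "__main__":
    text = expand("X6", RULES)
    print(text, len(text))
```

Six rules describe a string of length 8 here; adding one rule of the same shape roughly multiplies the length by 1.6, which is why the string length $N$ can be exponential in the grammar size $n$ and why full decompression is worth avoiding.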

Main Contributions:

Grammar Representation and Random Access:
- The authors introduce two novel representations of grammar-compressed strings that allow direct random access to any given character or substring without full decompression. Accessing any character of a string $S$ of length $N$ compressed into a grammar $\mathcal{S}$ of size $n$ takes $O(\log N)$ time.
- Both representations achieve $O(\log N)$ access time and differ in construction cost: (1) $O(n \cdot \alpha_k(n))$ construction time and space on the pointer machine model, where $\alpha_k(n)$ is the inverse of the $k^{th}$ row of Ackermann's function, and (2) $O(n)$ construction time and space on the RAM model.
Efficient Substring Decompression:
- The paper extends the grammar representation strategy to decompression of substrings. Any substring of length $m$ can be decompressed in a complexity matching a single random access query plus $O(m)$ time, thus allowing for rapid extraction of parts of the dataset.
Approximate String Matching:
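- Combining efficient random access and substring decompression with fast algorithms for uncompressed approximate string matching, the authors obtain improved algorithms for approximate matching on grammar-compressed strings. For instance, all approximate occurrences of a pattern $P$ with at most $k$ errors can be found in $O(n(\min\{|P|k, k^4 + |P|\} + \log N) + occ)$ time, where $occ$ is the number of occurrences of $P$ in $S$. This avoids an initial full decompression, which can be computationally prohibitive.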
Tree Navigation:
- The research further generalizes these results to ordered trees represented in compressed formats. Here, the paper showcases advancements in navigating grammar-compressed trees and performing operations like parent, lca, and subtree manipulations efficiently.
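The random access and substring operations listed above can be illustrated with a deliberately naive sketch. It assumes a toy grammar in which every rule is either a single character or a pair of symbols, and it stores the expansion length of every nonterminal. A query then walks one root-to-leaf path of the parse tree. This takes time proportional to the grammar's height, which may be far larger than $\log N$; reaching the paper's $O(\log N)$ bound needs the additional data structures it describes, which are not implemented here. All names (`RULES`, `expansion_sizes`, `access`, `substring`) are hypothetical.

```python
# Hypothetical toy grammar generating the string "abaababa".
RULES = {
    "X1": "b",
    "X2": "a",
    "X3": ("X2", "X1"),
    "X4": ("X3", "X2"),
    "X5": ("X4", "X3"),
    "X6": ("X5", "X4"),
}


def expansion_sizes(rules):
    """Map each symbol to the length of the string it generates."""
    memo = {}

    def size(sym):
        if sym not in memo:
            rule = rules[sym]
            memo[sym] = 1 if isinstance(rule, str) else size(rule[0]) + size(rule[1])
        return memo[sym]

    for sym in rules:
        size(sym)
    return memo


def access(root, i, rules, size):
    """Return character i (0-based) by descending a single parse-tree path."""
    if not 0 <= i < size[root]:
        raise IndexError(i)
    sym = root
    while not isinstance(rules[sym], str):
        left, right = rules[sym]
        if i < size[left]:
            sym = left           # position lies in the left expansion
        else:
            i -= size[left]      # skip the whole left expansion
            sym = right
    return rules[sym]


def substring(root, i, m, rules, size):
    """Return S[i:i+m], visiting only subtrees that overlap the range."""
    out = []

    def visit(sym, lo, hi):
        # Collect characters [lo, hi) of sym's expansion.
        if lo >= hi:
            return
        rule = rules[sym]
        if isinstance(rule, str):
            out.append(rule)
            return
        left, right = rule
        ls = size[left]
        visit(left, lo, min(hi, ls))
        visit(right, max(lo - ls, 0), hi - ls)

    visit(root, max(i, 0), min(i + m, size[root]))
    return "".join(out)


if __name__ == "__main__":
    SIZE = expansion_sizes(RULES)
    print(access("X6", 3, RULES, SIZE), substring("X6", 2, 4, RULES, SIZE))
```

The substring routine touches the two boundary paths plus the subtrees fully inside the requested range, so its cost in this sketch is roughly the parse-tree height plus $m$, mirroring the shape of the paper's "one random access plus $O(m)$" bound without matching it.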

To achieve these bounds, the paper introduces several techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.
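The intuition behind heavy paths can be sketched as follows. The sketch assumes one common definition, which may differ in detail from the paper's: at each binary rule, the child with the larger expansion is "heavy" and the other is "light". Stepping to a light child at least halves the remaining expansion length, so any root-to-leaf path crosses at most $\log_2 N$ light edges. The code only counts light edges on a query path; it does not build the compact heavy-path representation from the paper, and all names are hypothetical.

```python
import math

# Hypothetical toy grammar generating the string "abaababa" (N = 8).
RULES = {
    "X1": "b",
    "X2": "a",
    "X3": ("X2", "X1"),
    "X4": ("X3", "X2"),
    "X5": ("X4", "X3"),
    "X6": ("X5", "X4"),
}


def expansion_sizes(rules):
    """Map each symbol to the length of the string it generates."""
    memo = {}

    def size(sym):
        if sym not in memo:
            rule = rules[sym]
            memo[sym] = 1 if isinstance(rule, str) else size(rule[0]) + size(rule[1])
        return memo[sym]

    for sym in rules:
        size(sym)
    return memo


def light_edges_on_path(root, i, rules, size):
    """Count light edges on the parse-tree path from root to position i."""
    sym, light = root, 0
    while not isinstance(rules[sym], str):
        children = rules[sym]
        # Assumed definition: the heavy side has the larger expansion
        # (ties go to the left child).
        heavy_side = 0 if size[children[0]] >= size[children[1]] else 1
        go_side = 0 if i < size[children[0]] else 1
        if go_side == 1:
            i -= size[children[0]]
        if go_side != heavy_side:
            light += 1           # the remaining length at least halves here
        sym = children[go_side]
    return light


if __name__ == "__main__":
    SIZE = expansion_sizes(RULES)
    n_chars = SIZE["X6"]
    bound = math.floor(math.log2(n_chars))
    counts = [light_edges_on_path("X6", i, RULES, SIZE) for i in range(n_chars)]
    print(counts, "bound:", bound)
```

Since light edges are rare, the remaining cost of a query lies in moving quickly along heavy paths, which is where, per the abstract, the predecessor and biased weighted ancestor structures come in.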

Implications and Future Developments:

The work presents promising implications for theoretical advancements and practical applications alike. The ability to efficiently access and process data in its compressed form represents a major advancement in data management, especially pertinent in contexts involving large-scale dynamically structured data like XML and genomic databases.

Future developments in this area could aim at refining these algorithms further to reduce construction time and space even more, possibly harnessing newer computational models or hardware developments. Additionally, hybrid approaches integrating other compression schemes might evolve, offering even greater efficiency and functionality.

Overall, the paper makes a substantial contribution to computation over compressed data: it shows that grammar-compressed strings and trees can support random access, substring decompression, and approximate matching without full decompression, and it sets a baseline for further work in this direction.
