Compressed Text Indexes: From Theory to Practice! (0712.3360v1)

Published 20 Dec 2007 in cs.DS

Abstract: A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications. The goal of this paper is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective testbeds and scripts for their automatic validation and test. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel and exciting technology.

Authors (4)

Paolo Ferragina (21 papers)
Rodrigo Gonzalez (1 paper)
Gonzalo Navarro (121 papers)
Rossano Venturini (24 papers)

Citations (173)

Summary

Compressed Text Indexes: From Theory to Practice

The paper "Compressed Text Indexes: From Theory to Practice" examines what it takes to implement compressed full-text self-indexes (compressed indexes for short) in practical applications. It aims to bridge the gap between theoretical results and practical deployment, chiefly by introducing the Pizza&Chili site, which offers tuned implementations, a standardized API, and testbeds for benchmarking these structures.

Overview of Compressed Indexes

Compressed indexes represent a text in compressed form while still efficiently supporting operations such as searching for patterns and extracting text segments. Traditional text indexes such as suffix trees and suffix arrays provide fast search at the cost of significant additional space, often 4-20 times the text size. Compressed self-indexes, by contrast, replace the original text entirely: they support the same queries as their uncompressed counterparts while occupying space close to that of the compressed text, with theoretical bounds expressed in terms of the k-th order empirical entropy of the text.
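
The entropy bound mentioned above can be made concrete with a small calculation. The sketch below computes the empirical k-th order entropy H_k of a string under the standard definition: for each length-k context, take the zero-order entropy of the symbols that follow it, weight it by how many symbols follow that context, and divide the total by the text length. The function names are illustrative and are not taken from the paper; the code assumes a non-empty text.

```python
import math
from collections import Counter, defaultdict


def zero_order_entropy(counts):
    """H_0 in bits per symbol, given a Counter of symbol frequencies."""
    total = sum(counts.values())
    return sum(c / total * math.log2(total / c) for c in counts.values())


def kth_order_entropy(text, k):
    """Empirical k-th order entropy H_k(text) in bits per symbol.

    Assumes text is non-empty and k >= 0.
    """
    if k == 0:
        return zero_order_entropy(Counter(text))
    # followers[w] counts the symbols that appear right after context w.
    followers = defaultdict(Counter)
    for i in range(len(text) - k):
        followers[text[i:i + k]][text[i + k]] += 1
    total_bits = sum(
        sum(counts.values()) * zero_order_entropy(counts)
        for counts in followers.values()
    )
    return total_bits / len(text)


if __name__ == "__main__":
    sample = "mississippi"
    for k in range(3):
        print(k, kth_order_entropy(sample, k))
```

H_k never increases as k grows, which is why an index whose size is bounded in terms of H_k can be much smaller than the plain text on repetitive or highly predictable inputs. For example, `kth_order_entropy("abab", 0)` is 1.0 bit per symbol, while `kth_order_entropy("abababab", 1)` is 0 because each symbol fully determines the next.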

Key Contributions

Algorithmic and Implementation Challenges: The paper highlights the complexities involved in implementing compressed text indexes. Turning theoretical models into practical tools is a significant challenge that requires deep algorithmic knowledge and extensive programming effort.
Pizza&Chili Site: A noteworthy contribution is the establishment of the Pizza&Chili site, which gives developers access to tuned implementations of several compressed indexes. These implementations conform to a unified API, which eases their integration into other applications.
Experimental Evaluation: Using the Pizza&Chili testbeds, the authors evaluate these indexes empirically to demonstrate their practical utility. The evaluation covers different data types, such as DNA sequences, natural language text, and structured data, and analyzes the space-time tradeoffs of each index.
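
To illustrate what a self-index with a small, uniform query interface might look like, here is a minimal sketch of a toy FM-index, a well-known family of self-indexes built on the Burrows-Wheeler transform. The class name `TinyFMIndex` and the method names `count`, `locate`, and `extract` are assumptions chosen for this example; they are not the Pizza&Chili API. The sketch is also not compressed: it keeps a full suffix array and plain occurrence tables, where real implementations would use sampling and compressed rank structures. What it does show is that queries can be answered without storing the text itself.

```python
class TinyFMIndex:
    """Toy FM-index (hypothetical interface): keeps the BWT, not the text."""

    def __init__(self, text, terminator="\0"):
        if terminator in text:
            raise ValueError("text must not contain the terminator")
        s = text + terminator
        # Naive suffix sorting; acceptable only for small demo inputs.
        self.sa = sorted(range(len(s)), key=lambda i: s[i:])
        # BWT: the symbol preceding each suffix in sorted order.
        self.bwt = "".join(s[i - 1] for i in self.sa)
        # first[c] = number of symbols in s that sort before c.
        self.first, total = {}, 0
        for c in sorted(set(s)):
            self.first[c] = total
            total += s.count(c)
        # occ[c][i] = number of occurrences of c in bwt[:i].
        self.occ = {c: [0] * (len(s) + 1) for c in self.first}
        for i, ch in enumerate(self.bwt):
            for c, row in self.occ.items():
                row[i + 1] = row[i] + (c == ch)

    def _range(self, pattern):
        """Backward search: suffix-array interval of suffixes starting with pattern."""
        if not pattern:
            raise ValueError("pattern must be non-empty")
        lo, hi = 0, len(self.bwt)
        for c in reversed(pattern):
            if c not in self.first:
                return 0, 0
            lo = self.first[c] + self.occ[c][lo]
            hi = self.first[c] + self.occ[c][hi]
            if lo >= hi:
                return 0, 0
        return lo, hi

    def count(self, pattern):
        """Number of occurrences of pattern in the text."""
        lo, hi = self._range(pattern)
        return hi - lo

    def locate(self, pattern):
        """Sorted starting positions of pattern in the text."""
        lo, hi = self._range(pattern)
        return sorted(self.sa[lo:hi])

    def extract(self, start, end):
        """Recover text[start:end] from the BWT by walking the LF-mapping."""
        if not 0 <= start <= end < len(self.bwt):
            raise ValueError("invalid range")
        row = self.sa.index(end)  # linear scan; real indexes sample the SA
        chars = []
        for _ in range(end - start):
            c = self.bwt[row]
            chars.append(c)
            row = self.first[c] + self.occ[c][row]
        return "".join(reversed(chars))


if __name__ == "__main__":
    idx = TinyFMIndex("mississippi")
    print(idx.count("ssi"))    # 2
    print(idx.locate("ssi"))   # [2, 5]
    print(idx.extract(4, 8))   # issi
```

Keeping the three query types behind one interface is what lets an application swap one index for another without changing its own code, which is the kind of re-use a standardized API is meant to enable.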

Practical and Theoretical Implications

Compressed indexes represent a natural evolution from theoretical constructs to practical utility, potentially transforming the approach to text processing in environments where storage efficiency is critical, including bioinformatics and linguistics. The implications are both theoretical and practical:

Theoretical Implications: As compressed indexes are shown to be proportional in size to the compressed text, further advancements in understanding text compressibility and its limits will directly impact the development of more efficient indexes.
Practical Implications: From a practical standpoint, the benchmarking and APIs offered via the Pizza&Chili site significantly lower the barrier for implementing compressed indexes in real-world applications. Furthermore, by reducing the space complexity traditionally linked to full-text indexing, compressed indexes allow for efficient data handling without sacrificing search time efficiency—a critical factor in large-scale data processing tasks.

Future Directions

The paper suggests that ongoing algorithmic improvements and a better understanding of the memory hierarchy could mitigate the slowdowns caused by cache misses and I/O operations, thereby pushing the practical performance of compressed indexes closer to their theoretical ideals. Furthermore, the continuing evolution of compression algorithms promises further gains in indexing efficiency.

In conclusion, "Compressed Text Indexes: From Theory to Practice" maps out a clear path from theoretical innovation to practical reality, setting a foundation for further research and development in compressed data structures. Through this work, the authors have not only demonstrated the viability of compressed indexes but also provided the necessary tools and evaluations for their broad adoption and further advancement in various text processing domains.