Compressed Indexing Structures Overview

Updated 1 April 2026
  • Compressed indexing structures are specialized data structures that represent text near its entropy while supporting efficient search operations.
  • They leverage both statistical and repetitive-aware compression methods, including FM-index, RLBWT, and grammar-based techniques, to minimize space usage.
  • They are crucial for applications in genomics, versioned documents, and web archives, yet challenges remain in dynamic updates and 2D data indexing.

Compressed indexing structures are specialized data structures designed to represent text or document collections in space close to their empirical entropy or intrinsic repetitiveness, while simultaneously providing efficient support for fundamental search operations such as pattern matching, counting, and locating. The rapid proliferation of massive, highly repetitive datasets in domains such as genomics, versioned document repositories, and large web archives has driven the development of a new generation of indexes that go beyond classical statistical compression, actively exploiting repetition at multiple structural levels (Navarro, 2020). These structures form the backbone of state-of-the-art text retrieval, search, and bioinformatics pipelines.

1. Compression Paradigms in Indexing

Two major paradigms underpin compressed indexing structures: statistical compression and repetitive-aware (dictionary-based) compression.

  • Statistical Compression: Utilizes methods such as entropy coding, Burrows-Wheeler Transform (BWT)-based indexes, and empirical entropy measures $H_k$ to achieve a space bound of $nH_k(T)+o(n\log\sigma)$ bits for text $T$ of length $n$ over alphabet $\Sigma$. The FM-index exemplifies this approach, supporting pattern search and random access within compressed space, where $H_k$ is the $k$th-order entropy (Kärkkäinen et al., 2011). Fixed block compression boosting has further simplified and improved practical FM-indexes, achieving these bounds in streaming settings and with block-wise construction.
  • Repetitive-Aware Compression: Exploits repetitions intrinsic to highly redundant data, achieving substantially better compression on, e.g., genomic datasets or document collections with extensive versioning. Key representations include LZ77/LZ78 parsing, grammar-based compression (SLPs, RLSLPs), the run-length compressed BWT (RLBWT) and associated "r-index", as well as attractor-based universal indexing (Navarro et al., 2018, Christiansen et al., 2017). These structures achieve space proportional to parameters like the number of phrases ($z$), number of grammar rules ($g$), or the number of BWT runs ($r$), which can be orders of magnitude smaller than $n$ in practical scenarios.
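
The difference between the two paradigms can be seen on a toy input. The following is a minimal sketch added for illustration, not code from the cited works: the helper names `empirical_entropy` and `lz77_phrase_count` are hypothetical, the LZ77 variant assumed here is a greedy parse that allows self-overlapping sources and emits a single literal for a character not seen before, and both routines are quadratic-time and intended only for short strings.

```python
from collections import Counter, defaultdict
from math import log2


def h0(symbols):
    """Zeroth-order empirical entropy (bits per symbol) of a sequence."""
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    return sum(c / n * log2(n / c) for c in counts.values())


def empirical_entropy(text, k):
    """k-th order empirical entropy H_k(text) in bits per symbol."""
    if k == 0:
        return h0(text)
    # Group every symbol by the k symbols that precede it.
    followers = defaultdict(list)
    for i in range(k, len(text)):
        followers[text[i - k:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)


def lz77_phrase_count(text):
    """Number of phrases z in a greedy LZ77-style parse (toy version)."""
    i, z, n = 0, 0, len(text)
    while i < n:
        length = 0
        # Extend while text[i:i+length+1] also occurs starting before i.
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += max(length, 1)  # unseen character: emit one literal
        z += 1
    return z


unit = "GATTACA"
for copies in (1, 64):
    text = unit * copies
    for k in (0, 1):
        bits = len(text) * empirical_entropy(text, k)
        print(f"copies={copies} n={len(text)} k={k} n*H_k={bits:.1f}")
    print(f"copies={copies} z={lz77_phrase_count(text)}")
```

Since $H_0(TT)=H_0(T)$, the zeroth-order bound $nH_0$ doubles whenever the text is duplicated, whereas the greedy parse of a repeated string needs at most one phrase beyond those of a single copy.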

2. Fundamental Algorithmic Ideas and Structures

Compressed indexes are constructed and operated upon via a set of core algorithmic primitives. The most notable are:

  • Burrows-Wheeler Transform (BWT) and Run-Length Encoded BWT (RLBWT): BWT forms the basis of statistical indexes as well as compressed self-indexes, transforming the original text so that repetitions become immediately compressible. The RLBWT encodes the BWT output as runs, and indexes based on the RLBWT ("r-index") exploit this property to achieve $O(r)$ space for $r$ runs (Navarro, 2020).
  • Suffix Automata and Automata-Based Indexes: Structures such as the Compact Directed Acyclic Word Graph (CDAWG) enable efficient enumeration and conversion between various compressed arrays (RLBWT, irreducible PLCP, LPF arrays, LZ77 parse) with $O(e)$ worst-case time and space, where $e$ is the number of CDAWG edges (Arimura et al., 2023).
  • Dictionary and Grammar-Based Compression: Universal compressed indexes can leverage any dictionary-compressed representation, including LZ77, LZ78, macro schemes, and string attractors. Recent advances allow for construction of compressed indexes in $O(\gamma\log(n/\gamma))$ space and … time, where $\gamma$ is the attractor size (Navarro et al., 2018).
  • Compressed Suffix Arrays (CSA) and Suffix Trees: These provide full suffix functionality (e.g., suffix array (SA) and inverse suffix array (ISA) queries) in compressed space. The latest results collapse the compressed-index "hierarchy", showing that $O\left(\delta\log\frac{n\log\sigma}{\delta\log n}\right)$ space suffices for full suffix-array functionality, where $\delta$ is the substring complexity (Kempa et al., 2023).
  • Variable-Length Blocking and Cache-Aware Layouts: Adaptively partitioning the BWT or related arrays according to local compressibility enables improved space-time trade-offs, as in the variable-length blocking (VLB) technique for BWT-based CSA structures (Díaz-Domínguez et al., 19 Feb 2026).
  • Learning-Based Compression: Recent approaches target static indexes (e.g., the compressed PGM-index (Ferragina et al., 2019)) and inverted indexes (Oosterhuis et al., 2018), replacing or supplementing classic data structures with learned models. This yields compact representations with competitive or superior performance for predecessor, rank, and postings-list queries.
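
The BWT-based primitives in the first item of the list can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a practical index: the function names `bwt`, `run_count`, and `count_occurrences` are hypothetical, the BWT is built by naive suffix sorting with a `$` sentinel assumed to be smaller than every text symbol, and rank is computed by a linear scan rather than by the succinct rank structures a real FM-index or r-index would use.

```python
from collections import Counter


def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via naive suffix sorting (toy version)."""
    s = text + sentinel
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT symbol is the one preceding each suffix (wrapping around).
    return "".join(s[i - 1] for i in suffix_array)


def run_count(bwt_str):
    """Number r of maximal equal-symbol runs in the BWT."""
    return sum(1 for i, c in enumerate(bwt_str) if i == 0 or c != bwt_str[i - 1])


def count_occurrences(bwt_str, pattern):
    """Count pattern occurrences by FM-index style backward search."""
    counts = Counter(bwt_str)
    # first[c] = number of symbols in the text strictly smaller than c.
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    def rank(c, i):
        # Occurrences of c in bwt_str[:i]; linear scan for clarity only.
        return bwt_str.count(c, 0, i)

    lo, hi = 0, len(bwt_str)  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + rank(c, lo)
        hi = first[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


b = bwt("banana")
print(b, run_count(b), count_occurrences(b, "ana"))
```

Each backward-search step narrows a suffix-array interval using only rank queries on the BWT, which is why the text itself does not need to be stored. Run-length encoding the BWT then reduces the space toward the number of runs on repetitive inputs.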

3. Theoretical and Practical Trade-Offs

Compressed indexes are evaluated across a spectrum of theoretical and empirical criteria. The key dimensions include:

Index Paradigm | Space Complexity | Query Time | Update Support
FM-index (Statistical) | $nH_k(T)+o(n\log\sigma)$ bits | … | Static (dynamic via Munro et al., 2015)
RLBWT/r-index (Repetitive-aware) | $O(r)$ | … | Static
LZ77/grammar-based | … / … | … | Static
Attractor-based (Universal) | … | … | Static
PGM-/Learned-index | … ($m$ = segments) | … | Static
VLB (BWT/CSA) | … (with tuning) | … | Static
Dynamic Framework | … | Near-static | Incremental

Here $m$ is the pattern/query length (except in the learned-index row, where it counts segments), $occ$ is the number of pattern occurrences, and the other parameters are as defined above.

Notably, the framework of Munro et al. (2015) converts any static compressed index into a dynamic one at the cost of a polylogarithmic overhead in time and a negligible space blowup, circumventing lower bounds for dynamic rank through careful organization of data and background rebuilds.
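
The general idea of wrapping a static index to obtain a dynamic one can be illustrated with the classic logarithmic method. The sketch below is an assumption-laden simplification and not the construction of Munro et al.: it supports insertions only, rebuilds in the foreground, and uses a hypothetical `StaticIndex` class that merely scans its documents as a stand-in for a real static compressed index. All class and function names are invented for this example.

```python
def count_overlapping(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


class StaticIndex:
    """Stand-in for any static index: built once, never modified."""

    def __init__(self, docs):
        self.docs = tuple(docs)

    def count(self, pattern):
        return sum(count_overlapping(d, pattern) for d in self.docs)


class LogMethodIndex:
    """Insert-only dynamic wrapper: level i holds 2**i documents or is empty."""

    def __init__(self):
        self.levels = []

    def insert(self, doc):
        carry = [doc]
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                # Rebuild one static index from everything carried so far.
                self.levels[i] = StaticIndex(carry)
                return
            # Level is full: absorb its documents and move one level up.
            carry.extend(self.levels[i].docs)
            self.levels[i] = None
            i += 1

    def count(self, pattern):
        # A query is answered by combining the answers of all levels.
        return sum(lv.count(pattern) for lv in self.levels if lv is not None)


index = LogMethodIndex()
for doc in ["banana", "bandana", "ana", "cabana", "nab"]:
    index.insert(doc)
print(index.count("ana"))
```

With this scheme each document takes part in a logarithmic number of rebuilds and each query touches a logarithmic number of static indexes. Frameworks of the kind described above refine the same principle to handle deletions and to spread rebuild work over time.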

4. Extensions and Applications

Compressed indexing structures extend far beyond basic text search:

  • Graph and Binary Relation Indexing: Techniques for compressed dynamic graphs and binary relations adapt the static→dynamic conversion and entropy-aware encoding (Munro et al., 2015).
  • RDF/Semantic Data: Compressed trie layouts with cross-compression and permutation reduction optimize index size and query throughput for large-scale RDF datasets and SPARQL workloads (Perego et al., 2019).
  • 2D Compressed Indexing: Extensions to 2D datasets (matrices, images) have been developed with optimal random access, but conditional lower bounds separate the 2D case from 1D for pattern matching and related queries (De et al., 22 Oct 2025).
  • Compressed Pattern Queries: Some indexes efficiently answer queries whose pattern is itself given in compressed form (e.g., as an LZ77 parse), as arises in client-server search (Bille et al., 2019).

5. Empirical Performance and Implementation Insights

Empirical studies report substantial improvements in both space and time for state-of-the-art compressed indexes.

6. Challenges and Open Problems

Despite strong progress, key challenges persist:

  • Optimality and Hierarchy Collapse: Only recently has it been proved that the hierarchy from random access through LCE to full suffix array queries can be collapsed to the fundamental substring complexity $\delta$ (cf. Kempa et al., 2023), eliminating previous space gaps for powerful queries.
  • Update Complexity: While static-to-dynamic frameworks exist, supporting efficient updates in the presence of highly compressed representations remains nontrivial, especially for sophisticated dictionary-based schemes.
  • Generality and Universality: Attractor-based universal indexes suggest a deep relation between compression and indexing, but practical universality over all dictionary compressors and beyond remains an area for further research (Navarro et al., 2018).
  • 2D and Multimodal Indexing: The development of compressed indexing structures for structured, multidimensional, and heterogeneous data is ongoing (De et al., 22 Oct 2025).
  • Learned Indexes: Robustness, error bounds, and dynamization of learned-index approaches are still active topics of research (Oosterhuis et al., 2018, Ferragina et al., 2019).
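
To make the substring complexity from the first item above concrete, the following sketch evaluates its usual definition, $\delta = \max_{k \ge 1} d_k(T)/k$, where $d_k(T)$ is the number of distinct length-$k$ substrings of $T$. The function name `substring_complexity` is hypothetical, and the brute-force enumeration is an assumption made for readability; it is only suitable for short strings.

```python
def substring_complexity(text):
    """Brute-force delta = max over k of (distinct length-k substrings) / k."""
    n = len(text)
    best = 0.0
    for k in range(1, n + 1):
        distinct = len({text[i:i + k] for i in range(n - k + 1)})
        best = max(best, distinct / k)
    return best


for sample in ("abab" * 8, "abcdefgh" * 4, "the quick brown fox jumps"):
    print(repr(sample), len(sample), substring_complexity(sample))
```

Highly periodic inputs keep $d_k$ small for every $k$, so $\delta$ stays small even as the text grows, which is the sense in which it measures repetitiveness.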

7. Comparative Context and Future Directions

Compressed indexes now provide a mature foundation for massive text-centric data systems, with applications in genomics, web indexing, data archival, and semantic search. The field is characterized by a continuous interplay between theoretical advances (e.g., in entropy bounds, lower bounds, and universality), new algorithmic paradigms (e.g., attractor-based, learning-based), and pragmatic engineering (e.g., recompression, SIMD optimization, variable-length blocking) (Navarro, 2020, Adudodla et al., 13 Jun 2025, Díaz-Domínguez et al., 19 Feb 2026).

Future work will likely focus on further reducing construction time to optimality in compressed space, aligning complexity bounds across diverse query models, lifting advances to 2D and data-rich settings, and unifying the plethora of compressed indexing mechanisms under entropy, repetitiveness, or even learning-theoretic measures (Kempa et al., 2023, De et al., 22 Oct 2025).

The comprehensive survey by Navarro (2020) details the above developments, strategies for capitalizing on repetitiveness beyond statistical entropy, practical aspects of index construction, and the challenges that remain in scaling and extending compressed indexing to new data forms and application domains.
