Run-Length Compressed BWTs
- RLBWT is a compressed data structure that run-length encodes the Burrows-Wheeler Transform, drastically reducing storage requirements for repetitive texts.
- It enables efficient algorithms with complexities scaling with the number of runs rather than the full text size, enhancing index construction and LZ77 parsing.
- RLBWT techniques are central to applications like genomic data analysis and large-scale text indexing, offering scalable solutions for terabyte-scale datasets.
A run-length compressed Burrows-Wheeler transform (RLBWT) is a compact, repetition-aware data structure that encodes the Burrows-Wheeler transform (BWT) of a string or a collection of strings by run-length encoding (RLE) its maximal blocks of identical symbols. RLBWTs have become a central mechanism in compressed indexing, genomic data analysis, and dictionary compression, and they serve as a bridge between BWT-based and LZ77-based representations. Their efficiency rests on the observation that in highly repetitive texts the number of runs is orders of magnitude smaller than the input length. This enables near-optimal storage and compressed algorithms whose working memory, construction time, and query performance scale with the number of BWT runs rather than the raw text size.
1. Formal Definition and Key Properties
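Given a string $S[0..n-1]$ terminated by a unique end-marker (e.g., $\$$), its suffix array $SA$ orders all suffixes of $S$ lexicographically. The BWT, $L$, is defined as $L[i]=S[SA[i]-1]$, with $L[i]=\$$ when $SA[i]=0$. A run is a maximal block $L[i..j]=a^e$ of a single symbol $a$, where $e=j-i+1$, $L[i-1]\neq a$, and $L[j+1]\neq a$. Writing $R$ for the number of runs, the RLBWT stores $L$ as the sequence $\langle(c_1,\ell_1),\ldots,(c_R,\ell_R)\rangle$, where $c_k$ is the symbol of the $k$-th run and $\ell_k$ is its length, so that $\sum_{k=1}^R \ell_k = n$. When $R \ll n$, this takes $O(R)$ words, i.e. $O(R\log n)$ bits, which can be exponentially smaller than naïve representations.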
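As a minimal illustration of the definition, the sketch below builds the BWT by naively sorting suffixes and then run-length encodes it. The function names `bwt` and `run_length_encode` are chosen here for illustration, and the code assumes the end-marker `$` is smaller than every character of the input. It is a quadratic-space teaching aid, not one of the construction algorithms discussed later.

```python
from itertools import groupby


def bwt(text, end="$"):
    """Return the BWT of text + end using naive suffix sorting.

    Assumes `end` does not occur in `text` and sorts before all its characters.
    """
    assert end not in text
    s = text + end
    # Suffix array: start positions of the suffixes in lexicographic order.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # L[i] is the character preceding the i-th smallest suffix (cyclically).
    return "".join(s[i - 1] for i in sa)


def run_length_encode(L):
    """Encode L as a list of (symbol, run length) pairs."""
    return [(c, len(list(group))) for c, group in groupby(L)]


L = bwt("banana")
runs = run_length_encode(L)
print(L)     # annb$aa
print(runs)  # [('a', 1), ('n', 2), ('b', 1), ('$', 1), ('a', 2)]
```

Here $n = 7$ and $R = 5$; the run lengths sum to $n$, as the definition requires.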
2. Construction Techniques and Algorithms
Efficient RLBWT construction must meet two main objectives: minimize working space (ideally scaling with $R$) and minimize, where possible, dependence on $n$, the text length.
Dynamic RLBWT Data Structures:
The construction algorithm in (Prezza et al., 2015) maintains a dynamic RLBWT of reverse(#$S$), supporting rank, select, access, and insert in $O(\log R)$ time, using $O(R\log n)$ bits. It reads $S$ left-to-right, inserting each new character at the position determined by the LF-mapping, and maintains bit-vectors marking run starts and per-character run boundaries.
Complexity:
- Time: $O(n\log R)$
- Space: $O(R\log n)$ bits (working space)
- In highly repetitive cases ($R = O(1)$), the space can be $O(\log n)$ bits, exponentially smaller than $n$.
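The left-to-right insertion scheme can be sketched with a plain Python list standing in for the dynamic run-length structure. This is an assumption-laden simplification: the function name `online_bwt_of_reverse` is hypothetical, a single end-marker `$` (assumed smaller than all input characters) replaces the `#` convention, and every step costs time linear in the current length rather than logarithmic in the number of runs. Only the update rule is meant to match the idea: the new character replaces the end-marker, and the end-marker moves to the position given by an LF-style computation.

```python
def online_bwt_of_reverse(text, end="$"):
    """Build the BWT of reversed(text) + end, reading text left to right.

    Reading one more character c of text prepends c to the reversed text.
    """
    assert end not in text
    L = [end]   # BWT of the empty text followed by the end-marker
    p = 0       # current position of the end-marker in L
    for c in text:
        # Number of existing suffixes that start with a character smaller than c.
        smaller = sum(1 for x in L if x < c)
        # Number of suffixes that start with c and sort before the new suffix.
        rank_c = L[:p].count(c)
        L[p] = c                 # the old full suffix is now preceded by c
        p = smaller + rank_c     # rank of the new full suffix
        L.insert(p, end)         # the new full suffix is preceded by the end-marker
    return "".join(L)


print(online_bwt_of_reverse("banana"))  # bnn$aaa
```

The result equals the BWT of `ananab$` computed by sorting suffixes directly. A dynamic RLBWT performs the same two counts with rank queries over runs instead of scanning the list.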
Further improvements replace dynamic structures with static arrays and table abstractions, reducing working memory further in practical settings (Nishimoto et al., 2022). The r-comp algorithm achieves optimal $O(n + R\log R)$ time and $O(R\log n)$ bits, and supports construction for very large (terabyte-scale) genomes or pangenomic collections.
3. Combinatorial Bounds and Compressiveness
The compressiveness of RLBWT is governed by upper bounds relating $R$ to external measures of repetitiveness, in particular the size $z$ of the LZ77 factorization.
Core Theorems:
- For all strings $S$ of length $n$ and LZ77 size $z$, $R = O(z\log^2 n)$ (Kempa et al., 2019).
- For $q$-th power-free strings $S$ of LZ77 size $z$, $R = O(z^2\log n)$ (Pape-Lange, 2020).
- $R$ and $z$ are always within a polylogarithmic factor of each other.
- For any string $S$ (with $\rho$ the number of runs in $S$ itself), $R \le 2\rho$: the BWT never creates more than twice as many runs as the original run-count (Bannai et al., 2024).
These combinatorial results demonstrate that for any highly repetitive string (where $z \ll n$), RLBWT delivers a succinct, near-optimal compressed representation. This enables compressed indexes (e.g., the $r$-index) to store and query data using only $O(R)$ space.
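To make the two measures concrete, the sketch below computes $R$ and $z$ for a small repetitive string. It is a brute-force illustration under stated assumptions: `bwt_runs` and `lz77_phrases` are names chosen here, the BWT uses a `$` end-marker smaller than all input characters, and the LZ77 variant used allows self-overlapping copies and emits a single fresh character when no earlier match exists. Other LZ77 variants can give slightly different phrase counts.

```python
def bwt(text, end="$"):
    """Naive BWT of text + end (end assumed smaller than all characters)."""
    s = text + end
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    return "".join(s[i - 1] for i in sa)


def bwt_runs(text):
    """Number of maximal equal-symbol runs R in the BWT of text."""
    L = bwt(text)
    return 1 + sum(1 for a, b in zip(L, L[1:]) if a != b)


def lz77_phrases(s):
    """Greedy LZ77 parse: each phrase is the longest match starting earlier,
    or a single new character. Sources may overlap the phrase itself."""
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best = 0
        for j in range(i):                      # candidate source position
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)
        phrases.append(s[i:i + step])
        i += step
    return phrases


s = "ab" * 8
print(bwt(s))                 # bbbbbbbb$aaaaaaaa
print(bwt_runs(s))            # 3
print(lz77_phrases(s))        # ['a', 'b', 'ababababababab']
```

For this string both $R$ and $z$ equal 3 while $n = 16$, which is the regime the bounds above are about.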
4. RLBWT in LZ77 Computation and Self-Indexing
A central application of RLBWTs is computing the LZ77 factorization in compressed space. The key insight from (Prezza et al., 2015) is that, after constructing the RLBWT for reverse(#$S$), one can compute the LZ77 parsing by:
- Maintaining the current phrase-prefix length and BWT interval of the reversed prefix.
- Using at most two SA samples per run (a “suffix-array-sample” structure), enabling extension and location of previous prefixes.
- Performing all necessary checks and updates in $O(\log R)$ time per step, with $O(R\log n)$ bits of workspace.
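The BWT-interval maintenance in these steps relies on a rank-based extension primitive. The sketch below shows that primitive on a static run-length encoded BWT: `rank` is answered by binary search over run starts, and `count` extends an interval one character at a time. It is not the parsing algorithm itself. The class name `RunLengthBWT` and its layout are assumptions made for illustration, and suffix-array samples (needed to locate occurrences) are omitted.

```python
from bisect import bisect_left
from collections import defaultdict
from itertools import groupby


class RunLengthBWT:
    """Static run-length encoded BWT supporting rank and pattern counting."""

    def __init__(self, bwt):
        self.n = len(bwt)
        self.starts = defaultdict(list)   # per symbol: start position of each run
        self.lengths = defaultdict(list)  # per symbol: length of each run
        self.before = defaultdict(list)   # per symbol: occurrences before each run
        totals = defaultdict(int)
        pos = 0
        for c, group in groupby(bwt):
            length = len(list(group))
            self.starts[c].append(pos)
            self.lengths[c].append(length)
            self.before[c].append(totals[c])
            totals[c] += length
            pos += length
        self.num_runs = sum(len(v) for v in self.starts.values())
        # C[c] = number of symbols in the text that are smaller than c.
        self.C, acc = {}, 0
        for c in sorted(totals):
            self.C[c] = acc
            acc += totals[c]

    def rank(self, c, i):
        """Number of occurrences of c in bwt[0:i]."""
        starts = self.starts.get(c)
        if not starts:
            return 0
        k = bisect_left(starts, i) - 1    # last run of c starting before i
        if k < 0:
            return 0
        return self.before[c][k] + min(self.lengths[c][k], i - starts[k])

    def count(self, pattern):
        """Number of occurrences of pattern, via backward search."""
        lo, hi = 0, self.n
        for c in reversed(pattern):
            if c not in self.C:
                return 0
            lo = self.C[c] + self.rank(c, lo)
            hi = self.C[c] + self.rank(c, hi)
            if lo >= hi:
                return 0
        return hi - lo


index = RunLengthBWT("annb$aa")   # BWT of "banana$"
print(index.num_runs)             # 5
print(index.count("ana"))         # 2
```

Each `rank` call touches only the runs of one symbol, so the cost depends on the number of runs rather than on $n$. A dynamic variant would replace the sorted lists with balanced search structures.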
Consequences:
- LZ77 parsing is available in $O(n\log R)$ time and $O(R\log n)$ bits, so both parsing and indexing are possible in compressed, repetition-aware space.
- Self-indexes that combine an RLBWT with LZ77 and $O(z)$ supplemental pointers can be built in $O(R+z)$ words, which is asymptotically optimal (the construction outputs $z$ phrases and retains $R$ runs).
- For repetitive data, $R$ and $z$ remain small, and both indexing and parsing remain efficient.
5. Large-Scale Merging and Scalable Implementation
Handling aggregate datasets (e.g., terabase-scale collections) requires scalable merging of multiple RLBWTs:
- High-throughput merging: The algorithm in (Sirén, 2015) partitions a collection into 7 subcollections, builds the RLBWT of each independently, and then merges them using a succinct, bitvector-mediated merging process. The total time per merge is 8 where 9 is the time to answer a single rank query; overall the merging is 0.
- Practical implementation: Utilizing block alignment, two-level arrays, memory-mapped buffers, and multithreading allows for the merging of 1 Gbp/day with only 2 GB memory overhead, supporting terabase-scale FM-indexes on commodity hardware.
- Adaptive merging: More recent advances incorporate measures such as the sum of LCPs at block boundaries to achieve merge times of 3, where 4 reflects the true overlap between subcollections and can be small even for large input (Gagie, 21 Nov 2025).
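The role of the merging bitvector can be illustrated on two tiny single-string collections. In the sketch below the interleave bitvector is obtained by naively sorting all suffixes, purely so the example is self-checking; the algorithms described above derive it with rank queries instead. The helper names `sorted_suffixes` and `merge_by_interleave`, the example strings, and the use of distinct terminators `#` and `$` are all assumptions made for this illustration.

```python
from itertools import groupby


def sorted_suffixes(text, tag):
    """Sorted (suffix, preceding character, collection tag) triples of text."""
    return sorted((text[i:], text[i - 1], tag) for i in range(len(text)))


def merge_by_interleave(bwt_a, bwt_b, interleave):
    """Merge two BWTs according to a bitvector (0 = take from A, 1 = from B),
    run-length encoding the output on the fly."""
    it_a, it_b = iter(bwt_a), iter(bwt_b)
    merged = (next(it_b) if bit else next(it_a) for bit in interleave)
    return [(c, len(list(group))) for c, group in groupby(merged)]


a, b = "GATTACA#", "GATTAGA$"          # distinct terminators keep suffixes unique
rows_a, rows_b = sorted_suffixes(a, 0), sorted_suffixes(b, 1)
bwt_a = "".join(prev for _, prev, _ in rows_a)
bwt_b = "".join(prev for _, prev, _ in rows_b)

# Reference answer: sort the suffixes of both strings together.
all_rows = sorted(rows_a + rows_b)
interleave = [tag for _, _, tag in all_rows]
expected = "".join(prev for _, prev, _ in all_rows)

runs = merge_by_interleave(bwt_a, bwt_b, interleave)
assert "".join(c * length for c, length in runs) == expected
print(len(expected), len(runs))   # merged length and number of merged runs
```

The merge itself is a single pass over the bitvector, which is why the cost of merging is dominated by how the bitvector is computed.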
Table: Complexity Comparison of RLBWT Construction/Merging
| Algorithm | Time Complexity | Space Complexity | Applicability |
|---|---|---|---|
| Dynamic online (Prezza et al., 2015) | $O(n\log R)$ | $O(R\log n)$ bits | Streaming input, repetitive texts |
| r-comp (Nishimoto et al., 2022) | $O(n + R\log R)$ | $O(R\log n)$ bits | Pan-genomic, large-scale inputs |
| Sirén merging (Sirén, 2015) | 9 | 0 bits | Terabase-scale collections |
| Adaptive merge (Gagie, 21 Nov 2025) | 1 | 2 | Sets of circular/repetitive strings |
6. Influence of Alphabet Ordering and Heuristics
The alphabet ordering used during BWT computation strongly affects the number of runs—and hence the compressibility—of the RLBWT. The minimal-run ordering problem is NP-complete and APX-hard (Major et al., 2024).
Key findings:
- For small alphabets, exhaustive search is possible; for large alphabet size $\sigma$, heuristic search is necessary.
- First-improvement local search (using Swap or Insert neighborhoods and a variety of initializations such as ASCII, frequency, or first-appearance order) rapidly improves compressibility, often reducing the number of runs by 1–3 percentage points compared to naive ASCII orderings.
- In practical pipelines, sampling 4 permutations on small text samples can provide near-optimal alphabet orderings, making a significant impact at scale for large datasets.
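A first-improvement local search over the Swap neighborhood can be sketched as follows. This is a minimal illustration, not the procedure of the cited work: the names `bwt_runs` and `first_improvement_swap` are hypothetical, the BWT is recomputed from scratch by naive suffix sorting for every candidate ordering, and the starting point is plain sorted (ASCII) order.

```python
from itertools import combinations


def bwt_runs(text, order):
    """Number of BWT runs of text when symbols are ranked according to `order`."""
    rank = {c: i + 1 for i, c in enumerate(order)}   # 0 is kept for the end-marker
    codes = [rank[c] for c in text] + [0]
    sa = sorted(range(len(codes)), key=lambda i: codes[i:])
    L = [codes[i - 1] for i in sa]
    return 1 + sum(1 for x, y in zip(L, L[1:]) if x != y)


def first_improvement_swap(text, order):
    """Swap pairs of symbols in the ordering, accepting the first swap that
    strictly lowers the run count, until no swap improves it."""
    order = list(order)
    best = bwt_runs(text, order)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(order)), 2):
            order[i], order[j] = order[j], order[i]
            runs = bwt_runs(text, order)
            if runs < best:
                best, improved = runs, True
                break                                    # restart from the new ordering
            order[i], order[j] = order[j], order[i]      # undo the swap
    return order, best


text = "mississippi_river_missouri"
start = sorted(set(text))
order, runs = first_improvement_swap(text, start)
print(bwt_runs(text, start), runs)   # run count before and after the search
```

The search terminates because the run count strictly decreases with every accepted swap. It returns a local optimum only, consistent with the hardness result above.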
7. Practical Applications and Broader Impacts
RLBWTs underpin state-of-the-art compressed indexes for pan-genomics, large document versioning systems, and other massively repetitive corpora:
- Reference-free genomics: Store and index tens of billions of sequencing reads efficiently in-memory (Sirén, 2015).
- Compressed self-indexes: Combine RLBWT and LZ77 parsing plus minimal auxiliary data to support efficient locate/extract queries in $O(R+z)$ space (Prezza et al., 2015).
- Streaming and online processing: RLBWTs allow LZ77 parsing and other compressed computations in streaming settings, suitable for one-pass algorithms (Prezza et al., 2015).
- Integration with grammar-based indexes: Hybrid approaches leveraging grammar compression (e.g., GCIS) followed by RLBWT significantly reduce run-count and improve query times, especially for long pattern matches on repetitive data (Deng et al., 2021).
The robust relationship between the RLBWT run-count $R$ and LZ77 size $z$ (and other repetitiveness measures) ensures that RLBWT-based methods are provably efficient on all compressible inputs. Theoretical advances (e.g., Kempa et al., 2019; Bannai et al., 2024) provide strong guarantees: no more than a polylogarithmic overhead is incurred in the worst case when transforming between BWT and dictionary-based compressors.
References:
- Prezza et al., 2015
- Sirén, 2015
- Nishimoto et al., 2022
- Kempa et al., 2019
- Pape-Lange, 2020
- Deng et al., 2021
- Gagie, 21 Nov 2025
- Major et al., 2024
- Bannai et al., 2024