
Run-Length Compressed BWTs

Updated 28 November 2025
  • RLBWT is a compressed data structure that run-length encodes the Burrows-Wheeler Transform, drastically reducing storage requirements for repetitive texts.
  • It enables efficient algorithms with complexities scaling with the number of runs rather than the full text size, enhancing index construction and LZ77 parsing.
  • RLBWT techniques are central to applications like genomic data analysis and large-scale text indexing, offering scalable solutions for terabyte-scale datasets.

A run-length compressed Burrows-Wheeler transform (RLBWT) is a succinct, repetition-aware data structure that encodes the Burrows-Wheeler transform (BWT) of a string or a collection of strings via run-length encoding (RLE) of maximal blocks of identical symbols. RLBWTs have become a central mechanism in compressed indexing, genomic data analysis, and dictionary compression, and they serve as a bridge between BWT-based and LZ77-based representations. Their efficiency arises from the observation that in highly repetitive texts the number of runs can be orders of magnitude smaller than the input length. This enables near-optimal storage and compressed algorithms whose working memory, construction time, and query performance scale with the number of BWT runs rather than the raw text size.

1. Formal Definition and Key Properties

Given a string $S\in\Sigma^n$ terminated by a unique end-marker (e.g., $\$$), its suffix array $SA[1..n]$ orders all suffixes of $S$ lexicographically. The BWT, $L[1..n]$, is defined as $L[i] = S[SA[i]-1]$, with $S[0]=\$$. A run is a maximal substring $L[i..j]=a^e$ with $e=j-i+1$, $L[i-1]\neq a$, and $L[j+1]\neq a$. The RLBWT stores the $R$ runs of $L$ as the sequence $\langle(c_1,\ell_1),...,(c_R,\ell_R)\rangle$, where $c_k$ is the symbol of the $k$-th run and $\ell_k$ is its length, so that $\sum_{k=1}^R \ell_k = n$ (Prezza et al., 2015).

In highly repetitive data, $R \ll n$, and this compressiveness means that both the representation and downstream computation can often be effected in $O(R)$ words, that is, $O(R\log n)$ bits, which can be exponentially smaller than naïve representations.
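The definitions above can be illustrated with a minimal Python sketch. It builds the suffix array by plain sorting (quadratic or worse, for illustration only) and then run-length encodes $L$. The function names `bwt` and `run_length_encode` are chosen here for illustration and do not come from any cited implementation.

```python
def bwt(text: str, end: str = "$") -> str:
    """Naive BWT via an explicit suffix array (illustration only)."""
    assert end not in text, "the end-marker must be unique"
    s = text + end
    # Sort suffix start positions; the end-marker is the smallest symbol.
    key = lambda i: [(-1 if ch == end else ord(ch)) for ch in s[i:]]
    sa = sorted(range(len(s)), key=key)
    # L[i] is the symbol that precedes the i-th smallest suffix
    # (s[-1] is the end-marker, covering the suffix that starts at 0).
    return "".join(s[i - 1] for i in sa)


def run_length_encode(bwt_string: str) -> list[tuple[str, int]]:
    """Encode L as the sequence of (symbol, length) pairs of its maximal runs."""
    runs: list[tuple[str, int]] = []
    for ch in bwt_string:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


L = bwt("banana")
runs = run_length_encode(L)
print(L, runs, len(runs))
```

For `"banana"` the transform has 7 symbols in 5 runs, so the run lengths sum to $n = 7$ as required by the definition.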

2. Construction Techniques and Algorithms

Efficient RLBWT construction must meet two main objectives: minimize working space (ideally scaling with $R$) and minimize, where possible, dependence on $n$, the text length.

Dynamic RLBWT Data Structures:

The construction algorithm in (Prezza et al., 2015) maintains a dynamic RLBWT for reverse(#$S$), supporting rank, select, access, and insert in $O(\log R)$ time, using $O(R\log n)$ bits. It reads $S$ left-to-right, inserting each new character at the position given by the LF-mapping, and maintains run boundaries and bit-vectors marking run starts and per-character run boundaries.

Complexity:

  • Time: $O(n\log R)$
  • Space: $O(R\log n)$ bits (working space)
  • In highly repetitive cases ($R \ll n$), the working space can be exponentially smaller than the text itself.
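The left-to-right insertion scheme can be sketched as follows. This is a simplified illustration, not the cited data structure: it keeps the BWT in a plain Python list, so each step costs linear time rather than the logarithmic time of a dynamic run-length structure. The function name `online_bwt_of_reverse` and the use of `$` as end-marker (instead of `#`) are assumptions made for the example.

```python
def online_bwt_of_reverse(text: str, end: str = "$") -> str:
    """BWT of reverse(text) + end, built while reading text left to right."""
    assert end not in text, "the end-marker must be unique"
    bwt = [end]   # BWT of the empty text followed by the end-marker
    pos = 0       # current index of the end-marker inside bwt
    for c in text:
        # Prepending c to the indexed text T creates one new suffix, c.T.end.
        # Its rank among the existing suffixes is given by the LF-mapping:
        # suffixes starting with a smaller symbol ...
        smaller = sum(1 for x in bwt if x == end or x < c)
        # ... plus suffixes starting with c whose remainder sorts before T.end.
        equal_before = sum(1 for x in bwt[:pos] if x == c)
        bwt[pos] = c                 # the old full suffix is now preceded by c
        pos = smaller + equal_before
        bwt.insert(pos, end)         # the new full suffix is preceded by end
    return "".join(bwt)


print(online_bwt_of_reverse("banana"))
```

Because each character read is prepended to the indexed text, the structure always holds the BWT of the reversed prefix seen so far. Replacing the list with a dynamic run-length encoded sequence is what brings the space down to a function of the number of runs.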

Further improvements leverage static arrays and table abstractions to replace dynamic structures, achieving additional reductions in working memory in practical settings (Nishimoto et al., 2022). The r-comp algorithm runs in $O(n + R\log R)$ time, which is optimal (i.e., $O(n)$) when $R = O(n/\log n)$, and uses $O(R\log n)$ bits of working space. It supports construction for very large (terabyte-scale) genomes or pangenomic collections.

3. Combinatorial Bounds and Compressiveness

The compressiveness of RLBWT is governed by upper bounds relating $R$ to external measures of repetitiveness, in particular the size $z$ of the LZ77 factorization.

Core Theorems:

  • For all $S$ of length $n$ and LZ77 size $z$, $R = O(z\log^2 n)$ (Kempa et al., 2019).
  • For power-free $S$ of LZ77 size $z$, a separate upper bound on $R$ in terms of $z$ and $n$ holds (Pape-Lange, 2020).
  • $R$ and $z$ are always within a polylogarithmic (in $n$) factor of each other.
  • For any string $S$ (with $\rho$ denoting the number of runs in the original string), $R \le 2\rho$: the RLBWT never creates more than twice as many runs as the original run-count (Bannai et al., 2024).

These combinatorial results demonstrate that for any highly repetitive string (where $R \ll n$), RLBWT delivers a succinct, near-optimal compressed representation. This enables compressed indexes (e.g., the r-index) to store and query data using only $O(R)$ space.
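To make the relationship between the two measures concrete, the following sketch computes both the number of BWT runs and the number of LZ77 phrases for a small repetitive input. It assumes one common LZ77 variant (greedy, sources may overlap the phrase, no trailing literal); other variants give slightly different counts. Both routines are naive and meant only for short strings, and the names `bwt_runs` and `lz77_phrases` are illustrative.

```python
from itertools import groupby


def bwt_runs(text: str, end: str = "$") -> int:
    """Number of maximal equal-symbol runs in the BWT of text + end."""
    s = text + end
    key = lambda i: [(-1 if ch == end else ord(ch)) for ch in s[i:]]
    sa = sorted(range(len(s)), key=key)
    bwt = [s[i - 1] for i in sa]
    return sum(1 for _ in groupby(bwt))


def lz77_phrases(text: str) -> int:
    """Greedy LZ77 phrase count; a phrase is the longest prefix of the
    remaining text that also starts at an earlier position, or one new symbol."""
    n, i, z = len(text), 0, 0
    while i < n:
        longest = 0
        for j in range(i):                      # candidate earlier start
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            longest = max(longest, length)
        i += max(longest, 1)
        z += 1
    return z


text = "ab" * 50
print(len(text), bwt_runs(text), lz77_phrases(text))
```

For this input of length 100, both measures come out at 3, while a string with no repeated symbols has as many phrases as characters.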

4. RLBWT in LZ77 Computation and Self-Indexing

A central application of RLBWTs is computing the LZ77 factorization in compressed space. The key insight from (Prezza et al., 2015) is that, after constructing the RLBWT for reverse(#$S$), one can compute the LZ77 parsing by:

  • Maintaining the current phrase-prefix length and BWT interval of the reversed prefix.
  • Using at most two SA samples per run (a “suffix-array-sample” structure), enabling extension and location of previous prefixes.
  • Performing all necessary checks and updates in $O(\log R)$ time per step, with $O(R\log n)$ bits of workspace.

Consequences:

  • LZ77 parsing is available in $O(n\log R)$ time and $O(R\log n)$ bits, so both parsing and indexing are possible in compressed, repetition-aware space.
  • Self-indexes that combine an RLBWT with LZ77 and supplemental pointers can be built in $O(R+z)$ words of working space, which is asymptotically optimal (the construction outputs $z$ phrases and retains $R$ runs).
  • For repetitive data, $R$ and $z$ remain small, and both indexing and parsing remain efficient.
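Maintaining a BWT interval, as in the first step above, rests on rank queries answered directly on the runs. The sketch below shows that building block only: a rank structure over the runs (binary search per symbol) and a backward search that narrows the interval one character at a time. It is not the LZ77 algorithm of the cited paper, which additionally works on the reversed text and uses suffix-array samples. The class name `RunLengthBWT` and its layout are assumptions for illustration.

```python
from bisect import bisect_left
from itertools import groupby


class RunLengthBWT:
    def __init__(self, bwt: str, end: str = "$"):
        self.n = len(bwt)
        self.starts = {}   # symbol -> start positions of its runs in L
        self.before = {}   # symbol -> occurrences of it before each of its runs
        self.lengths = {}  # symbol -> lengths of its runs
        counts, pos = {}, 0
        for ch, group in groupby(bwt):
            length = len(list(group))
            self.starts.setdefault(ch, []).append(pos)
            self.before.setdefault(ch, []).append(counts.get(ch, 0))
            self.lengths.setdefault(ch, []).append(length)
            counts[ch] = counts.get(ch, 0) + length
            pos += length
        # C[c] = number of symbols in L that sort before c (end-marker first).
        self.C, total = {}, 0
        for ch in sorted(counts, key=lambda ch: (ch != end, ch)):
            self.C[ch] = total
            total += counts[ch]

    def rank(self, c: str, i: int) -> int:
        """Occurrences of c in L[0:i], via binary search over the runs of c."""
        starts = self.starts.get(c, [])
        k = bisect_left(starts, i)          # runs of c that start before i
        if k == 0:
            return 0
        covered = min(i - starts[k - 1], self.lengths[c][k - 1])
        return self.before[c][k - 1] + covered

    def count(self, pattern: str) -> int:
        """Number of occurrences of pattern, by backward search."""
        lo, hi = 0, self.n
        for c in reversed(pattern):
            if c not in self.C:
                return 0
            lo = self.C[c] + self.rank(c, lo)
            hi = self.C[c] + self.rank(c, hi)
            if lo >= hi:
                return 0
        return hi - lo


index = RunLengthBWT("annb$aa")   # BWT of "banana$"
print(index.count("ana"), index.count("nab"))
```

Each rank query touches only the run boundaries of one symbol, so its cost depends on the number of runs rather than on the text length.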

5. Large-Scale Merging and Scalable Implementation

Handling aggregate datasets (e.g., terabase-scale collections) requires scalable merging of multiple RLBWTs:

  • High-throughput merging: The algorithm in (Sirén, 2015) partitions a collection into subcollections, builds the RLBWT of each independently, and then merges them using a succinct, bitvector-mediated merging process. The stated time bounds, both per merge and overall, are expressed in terms of the time to answer a single rank query.
  • Practical implementation: Block alignment, two-level arrays, memory-mapped buffers, and multithreading allow merging at rates on the order of Gbp/day with only GB-scale memory overhead, supporting terabase-scale FM-indexes on commodity hardware.
  • Adaptive merging: More recent advances incorporate measures such as the sum of LCPs at block boundaries, so that merge time depends on a quantity reflecting the true overlap between subcollections, which can be small even for large inputs (Gagie, 21 Nov 2025).
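The last stage of a bitvector-mediated merge can be sketched in a few lines. The example assumes the interleave bitvector has already been computed (the expensive part, which the cited algorithms obtain through rank queries, is not shown) and only demonstrates how two run-length encoded BWTs are woven together while adjacent equal runs are coalesced. The function name `merge_runs` and the symbol-by-symbol expansion are simplifications for illustration.

```python
def merge_runs(runs_a, runs_b, interleave):
    """Merge two run-length encoded sequences.

    interleave[i] is False when merged position i is taken from A and
    True when it is taken from B (assumed to be given).
    """
    def symbols(runs):
        for ch, length in runs:
            for _ in range(length):
                yield ch

    it_a, it_b = symbols(runs_a), symbols(runs_b)
    merged = []
    for from_b in interleave:
        ch = next(it_b if from_b else it_a)
        if merged and merged[-1][0] == ch:
            merged[-1][1] += 1          # extend the current run
        else:
            merged.append([ch, 1])      # start a new run
    return [tuple(run) for run in merged]


runs_a = [("a", 2), ("c", 1)]
runs_b = [("a", 1), ("b", 2)]
bits = [False, True, False, True, True, False]
print(merge_runs(runs_a, runs_b, bits))
```

A production implementation would copy whole runs at a time instead of single symbols, so that the work tracks the number of runs and the number of alternations in the bitvector.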

Table: Complexity Comparison of RLBWT Construction/Merging

Algorithm | Time Complexity | Space Complexity | Applicability
Dynamic online (Prezza et al., 2015) | $O(n\log R)$ | $O(R\log n)$ bits | Streaming input, repetitive texts
r-comp (Nishimoto et al., 2022) | $O(n + R\log R)$ | $O(R\log n)$ bits | Pan-genomic, large-scale inputs
Sirén merging (Sirén, 2015) | […] | […] bits | Terabase-scale collections
Adaptive merge (Gagie, 21 Nov 2025) | […] | […] | Sets of circular/repetitive strings

6. Influence of Alphabet Ordering and Heuristics

The alphabet ordering used during BWT computation strongly affects the number of runs—and hence the compressibility—of the RLBWT. The minimal-run ordering problem is NP-complete and APX-hard (Major et al., 2024).

Key findings:

  • For small alphabets, exhaustive search is possible; for large alphabets, heuristic search is necessary.
  • First-improvement local search (using Swap or Insert neighborhoods and a variety of initializations such as ASCII, frequency, or first-appearance order) rapidly improves compressibility, often reducing the number of runs by 1–3 percentage points compared to naive ASCII orderings.
  • In practical pipelines, sampling permutations on small text samples can provide near-optimal alphabet orderings, making a significant impact at scale for large datasets.
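A first-improvement local search with the Swap neighborhood can be sketched as below. The sketch evaluates each candidate ordering by recomputing the BWT naively, which is only feasible on small samples, and it starts from whatever initial ordering the caller passes in. The function names `runs_under_order` and `first_improvement_swap` are illustrative and not taken from the cited work.

```python
from itertools import combinations, groupby


def runs_under_order(text: str, order) -> int:
    """Number of BWT runs of text when symbols compare according to order."""
    rank = {ch: r for r, ch in enumerate(order)}
    codes = [rank[ch] for ch in text] + [-1]   # -1 acts as the end-marker
    sa = sorted(range(len(codes)), key=lambda i: codes[i:])
    bwt = [codes[i - 1] for i in sa]
    return sum(1 for _ in groupby(bwt))


def first_improvement_swap(text: str, order):
    """Swap two symbols of the ordering whenever that lowers the run count."""
    order = list(order)
    best = runs_under_order(text, order)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(order)), 2):
            order[i], order[j] = order[j], order[i]
            candidate = runs_under_order(text, order)
            if candidate < best:
                best, improved = candidate, True
                break                            # accept and restart the scan
            order[i], order[j] = order[j], order[i]   # undo the swap
    return order, best


sample = "mississippi_banana"
start = sorted(set(sample))                      # ASCII initialization
print(runs_under_order(sample, start), first_improvement_swap(sample, start))
```

The search stops at a local optimum of the Swap neighborhood, so the returned ordering is never worse than the starting one but is not guaranteed to be globally minimal.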

7. Practical Applications and Broader Impacts

RLBWTs underpin state-of-the-art compressed indexes for pan-genomics, large document versioning systems, and other massively repetitive corpora:

  • Reference-free genomics: Store and index tens of billions of sequencing reads efficiently in-memory (Sirén, 2015).
  • Compressed self-indexes: Combine RLBWT and LZ77 parsing plus minimal auxiliary data to support efficient locate/extract queries in $O(R+z)$ space (Prezza et al., 2015).
  • Streaming and online processing: RLBWTs allow LZ77 parsing and other compressed computations in streaming settings, suitable for one-pass algorithms (Prezza et al., 2015).
  • Integration with grammar-based indexes: Hybrid approaches leveraging grammar compression (e.g., GCIS) followed by RLBWT significantly reduce run-count and improve query times, especially for long pattern matches on repetitive data (Deng et al., 2021).

The robust relationship between the RLBWT run-count $R$ and LZ77 size $z$ (and other repetitiveness measures) ensures that RLBWT-based methods are provably efficient on all compressible inputs. Theoretical advances (e.g., Kempa et al., 2019; Bannai et al., 2024) provide strong guarantees: no more than a polylogarithmic overhead is incurred in the worst case when transforming between BWT and dictionary-based compressors.

