
Run-Length Compressed BWTs

Updated 28 November 2025
  • RLBWT is a compressed data structure that run-length encodes the Burrows-Wheeler Transform, drastically reducing storage requirements for repetitive texts.
  • It enables efficient algorithms with complexities scaling with the number of runs rather than the full text size, enhancing index construction and LZ77 parsing.
  • RLBWT techniques are central to applications like genomic data analysis and large-scale text indexing, offering scalable solutions for terabyte-scale datasets.

A run-length compressed Burrows-Wheeler transform (RLBWT) is a compact, repetition-aware data structure that encodes the Burrows-Wheeler transform (BWT) of a string or a collection of strings by run-length encoding (RLE) its maximal blocks of identical symbols. RLBWTs have become a central mechanism in compressed indexing, genomic data analysis, and dictionary compression, and they serve as a bridge between BWT-based and LZ77-based representations. Their efficiency rests on the observation that in highly repetitive texts the number of runs can be orders of magnitude smaller than the input length. This allows near-optimal storage and supports compressed algorithms whose working memory, construction time, and query performance scale with the number of BWT runs rather than the raw text size.
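As a concrete illustration of the run-length idea described above, the following is a minimal Python sketch. It assumes a naive suffix-sorting construction (quadratic or worse, suitable only for tiny inputs) and an end-marker `$` that sorts before every text character; the function names `bwt` and `run_length_encode` are illustrative and are not taken from any of the cited papers.

```python
from itertools import groupby


def bwt(text: str, end_marker: str = "$") -> str:
    """Naive BWT of text + end_marker via direct suffix sorting (illustration only)."""
    if end_marker in text:
        raise ValueError("end marker must not occur in the text")
    s = text + end_marker
    # Suffix array by sorting suffix start positions.
    # Assumption: every text character sorts after '$' (true for ASCII letters and digits).
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # Each BWT symbol is the character preceding a sorted suffix;
    # for the suffix starting at 0 this wraps around to the end marker.
    return "".join(s[i - 1] for i in sa)


def run_length_encode(transformed: str) -> list[tuple[str, int]]:
    """Collapse maximal blocks of identical symbols into (symbol, length) pairs."""
    return [(symbol, sum(1 for _ in block)) for symbol, block in groupby(transformed)]


text = "abababababab"
L = bwt(text)
runs = run_length_encode(L)
print(L)     # bbbbbb$aaaaaa
print(runs)  # [('b', 6), ('$', 1), ('a', 6)]
```

Here a 13-symbol BWT collapses to 3 runs, while the less repetitive input `banana` yields `annb$aa` with 5 runs over 7 symbols. Practical constructions avoid materializing the suffix array in this way.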

1. Formal Definition and Key Properties

Given a string $S \in \Sigma^n$ terminated by a unique end-marker (e.g., $\$$), its suffix array $SA[1..n]$ orders all suffixes of $S$ lexicographically. The BWT, $L[1..n]$, is defined as $L[i] = S[SA[i]-1]$, with $S[0] = \$$. A run is a maximal block $L[i..j] = a^e$ of length $e = j-i+1$ such that $L[i-1] \neq a$ and $L[j+1] \neq a$ (where those positions exist). If $L$ has $R$ runs, the RLBWT is the sequence $\langle (c_1,\ell_1), \ldots, (c_R,\ell_R) \rangle$, where $c_k$ is the symbol of the $k$-th run and $\ell_k$ is its length, so that $\sum_{k=1}^R \ell_k = n$ (Prezza et al., 2015).

In highly repetitive data, $R \ll n$, and this compressiveness means that both the representation and downstream computation can often be effected in $O(R)$ or $O(R \log n)$ bits, which can be exponentially smaller than naive representations.

2. Construction Techniques and Algorithms

Efficient RLBWT construction must meet two main objectives: minimize working space (ideally scaling with $R$) and minimize, where possible, dependence on $n$, the text length.

Dynamic RLBWT Data Structures:

The construction algorithm in (Prezza et al., 2015) maintains a dynamic RLBWT for $\widetilde{S} = \mathrm{reverse}(\#T)$, supporting rank, select, access, and insert in $O(\log n)$ time, using $O(R \log n)$ bits. It reads $S$ left-to-right, inserting each new character at position $\mathrm{LF}^j(0)$ (using LF-mapping), and maintains run boundaries and bit-vectors marking run starts and per-character run boundaries.

Complexity:
  • Time: $O(n \log R)$
  • Space: $O(R \log n)$ bits (working space)
  • In highly repetitive cases ($R = O(1)$), the space can be $O(\log n)$ bits, exponentially smaller than $n$.

Further improvements leverage static arrays and table abstractions to replace dynamic structures, achieving additional reductions in working memory in practical settings (Nishimoto et al., 2022). The r-comp algorithm achieves optimal $O(n + r \log r)$ time and $O(r \log n)$ bits, where $r$ denotes the number of BWT runs (written $R$ elsewhere in this article), and supports construction for very large (terabyte-scale) genomes or pangenomic collections.

3. Combinatorial Bounds and Compressiveness

The compressiveness of RLBWT is governed by upper bounds relating $R$ to external measures of repetitiveness, in particular, the size $z$ of the LZ77 factorization.

Core Theorems:
  • For all $T$ of length $n$ and LZ77 size $z$, $R = O(z (\log n)^2)$ (Kempa et al., 2019).
  • For $q$-th power-free $T$ of LZ77 size $z$, $R \leq 73 \cdot (\log_2 n) \cdot (z+2)^2$ (Pape-Lange, 2020).
  • $R$ and $z$ are always within an $O(\mathrm{polylog}\, n)$ factor of each other.
  • For any string $w$ (with $\rho(w)$ the number of original runs), $\rho(\mathrm{BWT}(w)) \le 2 \rho(w)$: the BWT never has more than twice as many runs as the original string (Bannai et al., 18 Nov 2024).

These combinatorial results demonstrate that for any highly repetitive string (where $z \ll n$), RLBWT delivers a succinct, near-optimal compressed representation. This enables compressed indexes (e.g., the r-index) to store and query data using only $O(R\,\mathrm{polylog}\,n)$ space.

4. RLBWT in LZ77 Computation and Self-Indexing

A central application of RLBWTs is computing the LZ77 factorization in compressed space. The key insight from (Prezza et al., 2015) is that, after constructing the RLBWT for $\mathrm{reverse}(\#T)$, one can compute the LZ77 parsing by:
  • Maintaining the current phrase-prefix length and BWT interval of the reversed prefix.
  • Using at most two SA samples per run (a “suffix-array-sample” structure), enabling extension and location of previous prefixes.
  • Performing all necessary checks and updates in $O(\log n)$ time per step, with $O(R \log n)$ bits of workspace.

Consequences:
  • LZ77 parsing is available in $O(n \log R)$ time and $O(R \log n)$ bits, so both parsing and indexing are possible in compressed, repetition-aware space.
  • Self-indexes that combine an RLBWT with LZ77 and $O(z)$ supplemental pointers can be built in $O(R+z)$ words, which is asymptotically optimal (outputs $z$ phrases and retains $R$ runs).
  • For repetitive data, $R$ and $z$ remain small, and both indexing and parsing remain efficient.

5. Large-Scale Merging and Scalable Implementation

Handling aggregate datasets (e.g., terabase-scale collections) requires scalable merging of multiple RLBWTs:
  • High-throughput merging: The algorithm in (Sirén, 2015) partitions a collection into $p$ subcollections, builds the RLBWT of each independently, and then merges them using a succinct, bitvector-mediated merging process. The total time per merge is $O(n t_r)$, where $t_r$ is the time to answer a single rank query; overall the merging is $O((p+t_r) n)$.
  • Practical implementation: Utilizing block alignment, two-level arrays, memory-mapped buffers, and multithreading allows for the merging of 600 Gbp/day with only 30 GB memory overhead, supporting terabase-scale FM-indexes on commodity hardware.
  • Adaptive merging: More recent advances incorporate measures such as the sum of LCPs at block boundaries to achieve merge times of $\tilde{O}(L + \sigma + R)$, where $L$ reflects the true overlap between subcollections and can be small even for large input (Gagie, 21 Nov 2025).

Table: Complexity Comparison of RLBWT Construction/Merging

| Algorithm | Time Complexity | Space Complexity | Applicability |
|---|---|---|---|
| Dynamic online (Prezza et al., 2015) | $O(n \log R)$ | $O(R \log n)$ bits | Streaming input, repetitive texts |
| r-comp (Nishimoto et al., 2022) | $O(n + r \log r)$ | $O(r \log n)$ bits | Pan-genomic, large-scale inputs |
| Sirén merging (Sirén, 2015) | $O((p + t_r) n)$ | $O(r \log n)$ bits | Terabase-scale collections |
| Adaptive merge (Gagie, 21 Nov 2025) | $\tilde{O}(L + \sigma + R)$ | $O(R)$ | Sets of circular/repetitive strings |

6. Influence of Alphabet Ordering and Heuristics

The alphabet ordering used during BWT computation strongly affects the number of runs, and hence the compressibility, of the RLBWT. The minimal-run ordering problem is NP-complete and APX-hard (Major et al., 26 Jan 2024).

Key findings:
  • For small alphabets, exhaustive search is possible; for large $\sigma$, heuristic search is necessary.
  • First-improvement local search (using Swap or Insert neighborhoods and a variety of initializations such as ASCII, frequency, or first-appearance order) rapidly improves compressibility, often reducing the number of runs by 1–3 percentage points compared to naive ASCII orderings.
  • In practical pipelines, sampling on the order of $10^3$ permutations on small text samples can provide near-optimal alphabet orderings, making a significant impact at scale for large datasets.

7. Practical Applications and Broader Impacts

RLBWTs underpin state-of-the-art compressed indexes for pan-genomics, large document versioning systems, and other massively repetitive corpora:
  • Reference-free genomics: Store and index tens of billions of sequencing reads efficiently in-memory (Sirén, 2015).
  • Compressed self-indexes: Combine RLBWT and LZ77 parsing plus minimal auxiliary data to support efficient locate/extract queries in $O(R+z)$ space (Prezza et al., 2015).
  • Streaming and online processing: RLBWTs allow LZ77 parsing and other compressed computations in streaming settings, suitable for one-pass algorithms (Prezza et al., 2015).
  • Integration with grammar-based indexes: Hybrid approaches leveraging grammar compression (e.g., GCIS) followed by RLBWT significantly reduce run-count and improve query times, especially for long pattern matches on repetitive data (Deng et al., 2021).

The robust relationship between the RLBWT run-count $R$ and LZ77 size $z$ (and other repetitiveness measures) ensures that RLBWT-based methods are provably efficient on all compressible inputs.
Theoretical advances (e.g., Kempa et al., 2019; Bannai et al., 18 Nov 2024) provide strong guarantees: at most a polylogarithmic overhead is incurred in the worst case when converting between BWT-based and dictionary-based compressors.
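The relationship between these measures can be observed on small inputs. The following is a minimal Python sketch, not an implementation from the cited papers. It assumes naive quadratic-time routines, a `$` terminator that sorts before all text characters, and one simple greedy LZ77 variant (each phrase is the longest match starting at an earlier position, overlaps allowed, or a single new character); definitions of $z$ differ slightly between papers. All function names are illustrative.

```python
from itertools import groupby


def count_runs(s: str) -> int:
    """Number of maximal blocks of identical symbols (rho for a text, R for a BWT)."""
    return sum(1 for _ in groupby(s))


def naive_bwt(s: str) -> str:
    """BWT by direct suffix sorting; s must end with a unique, smallest terminator."""
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    return "".join(s[i - 1] for i in sa)


def lz77_phrase_count(s: str) -> int:
    """Greedy LZ77 variant (assumption: overlapping sources allowed, no trailing literal)."""
    i, z, n = 0, 0, len(s)
    while i < n:
        longest = 0
        for j in range(i):  # every earlier starting position is a candidate source
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            longest = max(longest, k)
        i += max(longest, 1)  # a symbol never seen before forms a phrase of length 1
        z += 1
    return z


def repetitiveness_report(text: str) -> dict[str, int]:
    """Return n, rho (runs in the text), R (runs in the BWT), and z for text + '$'."""
    s = text + "$"
    return {
        "n": len(s),
        "rho": count_runs(s),
        "R": count_runs(naive_bwt(s)),
        "z": lz77_phrase_count(s),
    }


print(repetitiveness_report("abababababab"))
# {'n': 13, 'rho': 13, 'R': 3, 'z': 4}
```

On this input both $R$ and $z$ stay far below $n$, and $R \le 2\rho$ holds as the run bound above states. The sketch only illustrates the quantities involved; it says nothing about the constants hidden in the asymptotic bounds.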

