RLBWT is a compressed data structure that run-length encodes the Burrows-Wheeler Transform, drastically reducing storage requirements for repetitive texts.
It enables efficient algorithms with complexities scaling with the number of runs rather than the full text size, enhancing index construction and LZ77 parsing.
RLBWT techniques are central to applications like genomic data analysis and large-scale text indexing, offering scalable solutions for terabyte-scale datasets.
A run-length compressed Burrows-Wheeler transform (RLBWT) is a succinct, repetition-aware data structure that encodes the Burrows-Wheeler transform (BWT) of a string or a collection of strings by run-length encoding (RLE) its maximal blocks of identical symbols. RLBWTs have become a central mechanism in compressed indexing, genomic data analysis, and dictionary compression, and they serve as a bridge between BWT-based and LZ77-based representations. Their efficiency arises from the observation that in highly repetitive texts the number of runs is often orders of magnitude smaller than the input length. This enables near-optimal storage and compressed algorithms whose working memory, construction time, and query performance scale with the number of BWT runs rather than the raw text size.
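As a minimal sketch of the idea described above, the following Python snippet builds a BWT by naively sorting suffixes and then run-length encodes it. The function names `bwt` and `run_length_encode` are illustrative choices, not taken from any cited work, and the quadratic suffix sort is only suitable for small demonstrations; practical RLBWT construction uses the compressed-space algorithms discussed later.

```python
from itertools import groupby


def bwt(text: str, end_marker: str = "$") -> str:
    """Naive BWT: sort all suffixes of text + end_marker, emit preceding chars."""
    if end_marker in text:
        raise ValueError("end marker must not occur in the text")
    s = text + end_marker
    # Suffix array by direct sorting of suffixes (demo only, not efficient).
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # For i == 0, s[i - 1] wraps around to the end marker.
    return "".join(s[i - 1] for i in sa)


def run_length_encode(bwt_string: str) -> list[tuple[str, int]]:
    """Collapse maximal blocks of identical symbols into (symbol, length) pairs."""
    return [(symbol, len(list(block))) for symbol, block in groupby(bwt_string)]


# A small repetitive text: eight copies of the same 7-letter word.
text = "GATTACA" * 8
L = bwt(text)
runs = run_length_encode(L)
print(f"BWT length n = {len(L)}, number of runs R = {len(runs)}")
```

On repetitive inputs such as this one, the number of `(symbol, length)` pairs is much smaller than the length of the BWT, which is the property the RLBWT exploits.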
<h2 class='paper-heading' id='formal-definition-and-key-properties'>1. Formal Definition and Key Properties</h2>
<p>Given a string $S\in\Sigma^n$ terminated by a unique end-marker (e.g., $\$$), its suffix array $SA[1..n]$ orders all suffixes of $S$ lexicographically. The BWT, $L[1..n]$, is defined as $L[i]=S[SA[i]-1]$, with $S[0]=\$$. A run is a maximal substring $L[i..j]=a^e$ of length $e=j-i+1$, such that $L[i-1]\neq a$ and $L[j+1]\neq a$ (whenever these positions exist). Writing $R$ for the number of runs, the RLBWT stores the sequence $\langle(c_1,\ell_1),...,(c_R,\ell_R)\rangle$, where $c_k$ is the symbol of the $k$-th run and $\ell_k$ is its length, so that $\sum_{k=1}^R \ell_k = n$ (<a href="/papers/1510.06257" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Prezza et al., 2015</a>).</p>
<p>In highly repetitive data, $R \ll n$, and this compressiveness means that both the representation and downstream computation can often be effected in $O(R)$ or $O(R \log n)$ bits, exponentially smaller than naive representations.</p>
<h2 class='paper-heading' id='construction-techniques-and-algorithms'>2. Construction Techniques and Algorithms</h2>
<p>Efficient RLBWT construction must meet two main objectives: minimize working space (ideally scaling with $R$) and minimize, where possible, dependence on the text length $n$.</p>
<p><strong>Dynamic RLBWT Data Structures:</strong></p>
<p>The construction algorithm in (<a href="/papers/1510.06257" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Prezza et al., 2015</a>) maintains a dynamic RLBWT for $\widetilde{S} =$ reverse(#$T$), supporting rank, select, access, and insert in $O(\log n)$ time, using $O(R \log n)$ bits. It reads the input text left-to-right, inserting each new character at position $\mathrm{LF}^j(0)$ (using LF-mapping), and maintains run boundaries and bit-vectors marking run starts and per-character run boundaries.</p>
<p><strong>Complexity:</strong></p>
<ul>
<li>Time: $O(n \log R)$</li>
<li>Space: $O(R \log n)$ bits (working space)</li>
<li>In highly repetitive cases ($R = O(1)$), the space can be $O(\log n)$ bits, exponentially smaller than $n$.</li>
</ul>
<p>Further improvements leverage static arrays and table abstractions to replace dynamic structures, achieving additional reductions in working memory in practical settings (<a href="/papers/2202.07885" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Nishimoto et al., 2022</a>). The r-comp algorithm achieves optimal $O(n + r \log r)$ time and $O(r \log n)$ bits, where $r$ denotes the number of BWT runs (written $R$ elsewhere in this article), and supports construction for very large (terabyte-scale) genomes or pangenomic collections.</p>
<h2 class='paper-heading' id='combinatorial-bounds-and-compressiveness'>3. Combinatorial Bounds and Compressiveness</h2>
<p>The compressiveness of the RLBWT is governed by upper bounds relating $R$ to external measures of repetitiveness, in particular the size $z$ of the LZ77 factorization.</p>
<p><strong>Core Theorems:</strong></p>
<ul>
<li>For all $T$ of length $n$ and LZ77 size $z$, $R = O(z(\log n)^2)$ (<a href="/papers/1910.10631" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Kempa et al., 2019</a>).</li>
<li>For $q$-th power-free $T$ of LZ77 size $z$, $R \leq 73\cdot(\log_2 n)\cdot(z+2)^2$ (<a href="/papers/2002.06265" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Pape-Lange, 2020</a>).</li>
<li>$R$ and $z$ are always within an $O(\mathrm{polylog}\, n)$ factor of each other.</li>
<li>For any string $w$ (with $\rho(w)$ the number of runs in $w$ itself), $\rho(\mathrm{BWT}(w)) \le 2 \rho(w)$; that is, the BWT never has more than twice as many runs as the original string (<a href="/papers/2411.11298" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Bannai et al., 18 Nov 2024</a>).</li>
</ul>
<p>These combinatorial results demonstrate that for any highly repetitive string (where $z \ll n$), the RLBWT delivers a succinct, near-optimal compressed representation. This enables compressed indexes (e.g., the r-index) to store and query data using only $O(R\,\mathrm{polylog}\,n)$ space.</p>
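<p>The relationship between $R$ and $z$ can be explored empirically on small inputs. The sketch below is illustrative only: <code>bwt_runs</code> and <code>lz77_phrases</code> are hypothetical helper names, the BWT is built by naive suffix sorting, and the LZ77 variant used (each phrase is the longest previously occurring prefix of the remaining text, overlap allowed, or a single fresh character) is one of several common definitions and may differ from the one used in the cited papers.</p>

```python
import random
from itertools import groupby


def bwt_runs(text: str, end_marker: str = "$") -> int:
    """Count runs R in the BWT of text + end_marker (naive suffix sorting)."""
    s = text + end_marker
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    last_column = (s[i - 1] for i in sa)
    return sum(1 for _ in groupby(last_column))


def lz77_phrases(text: str) -> list[str]:
    """Greedy LZ77 parse: longest earlier-starting match, else one new character."""
    phrases = []
    n = len(text)
    i = 0
    while i < n:
        best = 0
        for j in range(i):  # candidate source positions strictly before i
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1  # the source may overlap the phrase being built
            best = max(best, k)
        length = max(best, 1)  # a fresh character forms a phrase of length 1
        phrases.append(text[i:i + length])
        i += length
    return phrases


# Assumed demo inputs: a periodic text and a pseudo-random text of equal length.
repetitive = "ACGTTGCA" * 32
rng = random.Random(0)
random_text = "".join(rng.choice("ACGT") for _ in range(len(repetitive)))

for name, t in [("repetitive", repetitive), ("random", random_text)]:
    print(f"{name}: n = {len(t)}, z = {len(lz77_phrases(t))}, R = {bwt_runs(t)}")
```

<p>On the periodic input both $z$ and $R$ stay far below $n$, while on the pseudo-random input neither measure compresses well, in line with the bounds listed above.</p>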
<h2 class='paper-heading' id='rlbwt-in-lz77-computation-and-self-indexing'>4. RLBWT in LZ77 Computation and Self-Indexing</h2>
<p>A central application of RLBWTs is computing the LZ77 factorization in compressed space. The key insight from (<a href="/papers/1510.06257" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Prezza et al., 2015</a>) is that, after constructing the RLBWT for reverse(#$T$), one can compute the LZ77 parsing by:</p>
<ul>
<li>Maintaining the current phrase-prefix length and the BWT interval of the reversed prefix.</li>
<li>Using at most two SA samples per run (a "suffix-array-sample" structure), enabling extension and location of previous prefixes.</li>
<li>Performing all necessary checks and updates in $O(\log n)$ time per step, with $O(R \log n)$ bits of workspace.</li>
</ul>
<p><strong>Consequences:</strong></p>
<ul>
<li>LZ77 parsing is available in $O(n \log R)$ time and $O(R \log n)$ bits, so both parsing and indexing are possible in compressed, repetition-aware space.</li>
<li>Self-indexes that combine an RLBWT with LZ77 and $O(z)$ supplemental pointers can be built in $O(R+z)$ words, which is asymptotically optimal (the index outputs $z$ phrases and retains $R$ runs).</li>
<li>For repetitive data, $R$ and $z$ remain small, and both indexing and parsing remain efficient.</li>
</ul>
<h2 class='paper-heading' id='large-scale-merging-and-scalable-implementation'>5. Large-Scale Merging and Scalable Implementation</h2>
<p>Handling aggregate datasets (e.g., terabase-scale collections) requires scalable merging of multiple RLBWTs:</p>
<ul>
<li><strong>High-throughput merging:</strong> The algorithm in (<a href="/papers/1511.00898" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Sirén, 2015</a>) partitions a collection into $p$ subcollections, builds the RLBWT of each independently, and then merges them using a succinct, bitvector-mediated merging process. The total time per merge is $O(n t_r)$, where $t_r$ is the time to answer a single rank query; overall the merging is $O((p+t_r) n)$.</li>
<li><strong>Practical implementation:</strong> Block alignment, two-level arrays, memory-mapped buffers, and multithreading allow merging at 600 Gbp/day with only 30 GB of memory overhead, supporting terabase-scale FM-indexes on commodity hardware.</li>
<li><strong>Adaptive merging:</strong> More recent advances incorporate measures such as the sum of LCPs at block boundaries to achieve merge times of $\tilde{O}(L + \sigma + R)$, where $L$ reflects the true overlap between subcollections and can be small even for large input (<a href="/papers/2511.16953" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Gagie, 21 Nov 2025</a>).</li>
</ul>
<p>Table: Complexity Comparison of RLBWT Construction/Merging</p>
<div class='overflow-x-auto max-w-full my-4'><table class='table border-collapse w-full' style='table-layout: fixed'>
<thead><tr><th>Algorithm</th><th>Time Complexity</th><th>Space Complexity</th><th>Applicability</th></tr></thead>
<tbody>
<tr><td>Dynamic online (<a href="/papers/1510.06257" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Prezza et al., 2015</a>)</td><td>$O(n \log R)$</td><td>$O(R \log n)$ bits</td><td>Streaming input, repetitive texts</td></tr>
<tr><td>r-comp (<a href="/papers/2202.07885" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Nishimoto et al., 2022</a>)</td><td>$O(n + r\log r)$</td><td>$O(r \log n)$ bits</td><td>Pan-genomic, large-scale inputs</td></tr>
<tr><td>Sirén merging (<a href="/papers/1511.00898" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Sirén, 2015</a>)</td><td>$O((p + t_r) n)$</td><td>$O(r\log n)$ bits</td><td>Terabase-scale collections</td></tr>
<tr><td>Adaptive merge (<a href="/papers/2511.16953" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Gagie, 21 Nov 2025</a>)</td><td>$\tilde{O}(L + \sigma + R)$</td><td>$O(R)$</td><td>Sets of circular/repetitive strings</td></tr>
</tbody>
</table></div>
<h2 class='paper-heading' id='influence-of-alphabet-ordering-and-heuristics'>6. Influence of Alphabet Ordering and Heuristics</h2>
<p>The alphabet ordering used during BWT computation strongly affects the number of runs, and hence the compressibility, of the RLBWT. The minimal-run ordering problem is NP-complete and APX-hard (<a href="/papers/2401.16435" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Major et al., 26 Jan 2024</a>).</p>
<p><strong>Key findings:</strong></p>
<ul>
<li>For small alphabets, exhaustive search is possible; for large $\sigma$, heuristic search is necessary.</li>
<li>First-improvement local search (using Swap or Insert neighborhoods and a variety of initializations such as ASCII, frequency, or first-appearance order) rapidly improves compressibility, often reducing the number of runs by 1–3 percentage points compared to naive ASCII orderings.</li>
<li>In practical pipelines, sampling on the order of $10^3$ permutations on small text samples can provide near-optimal alphabet orderings, making a significant impact at scale for large datasets.</li>
</ul>
<h2 class='paper-heading' id='practical-applications-and-broader-impacts'>7. Practical Applications and Broader Impacts</h2>
<p>RLBWTs underpin state-of-the-art compressed indexes for pan-genomics, large document versioning systems, and other massively repetitive corpora:</p>
<ul>
<li><strong>Reference-free genomics:</strong> Store and index tens of billions of sequencing reads efficiently in memory (<a href="/papers/1511.00898" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Sirén, 2015</a>).</li>
<li><strong>Compressed self-indexes:</strong> Combine RLBWT and LZ77 parsing plus minimal auxiliary data to support efficient locate/extract queries in $O(R+z)$ space (<a href="/papers/1510.06257" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Prezza et al., 2015</a>).</li>
<li><strong>Streaming and online processing:</strong> RLBWTs allow LZ77 parsing and other compressed computations in streaming settings, suitable for one-pass algorithms (<a href="/papers/1510.06257" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Prezza et al., 2015</a>).</li>
<li><strong>Integration with grammar-based indexes:</strong> Hybrid approaches leveraging grammar compression (e.g., GCIS) followed by RLBWT significantly reduce run-count and improve query times, especially for long pattern matches on repetitive data (<a href="/papers/2110.01181" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">Deng et al., 2021</a>).</li>
</ul>
<p>The robust relationship between the RLBWT run-count $R$ and LZ77 size $z$ (and other repetitiveness measures) ensures that RLBWT-based methods are provably efficient on all compressible inputs. Theoretical advances (e.g., Kempa et al., 2019; Bannai et al., 18 Nov 2024) provide strong guarantees: no more than a polylogarithmic overhead is incurred in the worst case when transforming between BWT and dictionary-based compressors.</p>
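<p>The effect of alphabet ordering on the run count (Section 6) can also be checked directly on small inputs. The following is a minimal sketch under stated assumptions: the helper names <code>bwt_runs_under_order</code> and <code>best_order_exhaustive</code> are hypothetical, the sample string is arbitrary, and the exhaustive search over all $\sigma!$ orderings is only feasible for very small alphabets. It is not the local-search heuristic of the cited work.</p>

```python
from itertools import groupby, permutations


def bwt_runs_under_order(text: str, order) -> int:
    """Count BWT runs when symbols are compared according to `order`."""
    # Rank 0 is reserved for the implicit end marker, which sorts first.
    rank = {symbol: r + 1 for r, symbol in enumerate(order)}
    codes = [rank[c] for c in text] + [0]
    # Naive suffix sorting over the re-ranked sequence (demo only).
    sa = sorted(range(len(codes)), key=lambda i: codes[i:])
    last_column = (codes[i - 1] for i in sa)
    return sum(1 for _ in groupby(last_column))


def best_order_exhaustive(text: str) -> tuple[int, str]:
    """Try every alphabet permutation; return (fewest runs, ordering)."""
    alphabet = sorted(set(text))
    return min(
        (bwt_runs_under_order(text, perm), "".join(perm))
        for perm in permutations(alphabet)
    )


# Arbitrary small sample over a 4-letter alphabet (24 orderings to try).
sample = "ACGTTGCAAGCTTACGGATC" * 3
ascii_runs = bwt_runs_under_order(sample, sorted(set(sample)))
best_runs, best_order = best_order_exhaustive(sample)
print(f"ASCII order: {ascii_runs} runs; best order {best_order}: {best_runs} runs")
```

<p>For larger alphabets the same scoring function could serve as the objective inside a sampling or local-search loop, since enumerating all permutations quickly becomes infeasible.</p>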