
Near-Supermaximal Repeats (NSMRs)

Updated 18 November 2025
  • NSMRs are context-sensitive repeats defined by net occurrences not covered by longer repeats, distinguishing them from traditional repeat measures.
  • Efficient RLBWT-based algorithms enable O(1) LF-mapping and range-distinct queries, ensuring fast analysis in highly repetitive texts.
  • Breadth-first enumeration with net occurrence detection achieves O(n) time and O(r) space, establishing novel theoretical bounds in string analysis.

A near-supermaximal repeat (NSMR) is a context-sensitive repeat in a string whose net frequency is positive, where net occurrences are defined as occurrences of a repeat not covered by any strictly longer repeat. NSMRs generalize the supermaximal repeat concept by focusing on the "net frequency" of a repeat, which counts only the occurrences that are not subsumed by other, longer repeats. Efficient enumeration and query frameworks for NSMRs have recently been developed to leverage highly repetitive structure in massive texts, providing new theoretical bounds and practical data structures (Kimura et al., 14 Nov 2025).

1. Formal Definitions and Notation

Let $T[1..n]$ be a string over the alphabet $\Sigma_0 = \Sigma \cup \{\$\}$, where $\$$ is a sentinel appearing only at $T[0]$ and $T[n]$. For any string $x$ of length $m$, define its set of occurrences as:

$$\mathrm{Occ}(x) = \{\, i \in [1..n - m + 1] \mid T[i..i+m-1] = x \,\}.$$

A string $x$ is a repeat if $|\mathrm{Occ}(x)| \geq 2$.

  • Net occurrence: An occurrence $i \in \mathrm{Occ}(x)$ is called a net occurrence if it is not covered by any strictly longer repeat.
  • Net frequency: For a repeat $x$, $\mathrm{NF}(x)$ is the number of its net occurrences:

$$\mathrm{NF}(x) = |\{\, i \in \mathrm{Occ}(x) \mid i \text{ is a net occurrence of } x \,\}|.$$

  • Near-supermaximal repeat (NSMR): A repeat $x$ is an NSMR if and only if $\mathrm{NF}(x) > 0$; that is, $x$ has at least one net occurrence.

This framework distinguishes between simple multiplicity of substrings and the subset of those occurrences that are not contained within occurrences of longer repeats, thus capturing "context-sensitive" maximality.
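The definitions above can be checked by brute force on small strings. The following is a minimal sketch, not the algorithm from the cited work. It assumes that "not covered by any strictly longer repeat" means that both one-symbol extensions of the occurrence, $T[i-1..i+m-1]$ and $T[i..i+m]$, are unique, which is the usual formulation of net occurrences. The function names are illustrative only.

```python
def occurrences(text, x):
    """1-based starting positions of x in text (overlaps allowed)."""
    m = len(x)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == x]


def count(haystack, s):
    """Number of (possibly overlapping) occurrences of s in haystack."""
    return sum(haystack.startswith(s, k) for k in range(len(haystack) - len(s) + 1))


def net_occurrences(text, x):
    """Net occurrences of x, assuming text ends with a unique sentinel '$'.

    Assumption: an occurrence is net when x is a repeat and both its left
    and right one-symbol extensions occur exactly once.
    """
    padded = text[-1] + text          # padded[k] == T[k] for k = 0..n, with T[0] = '$'
    m = len(x)
    occ = occurrences(text, x)
    if len(occ) < 2:                  # x is not a repeat
        return []
    return [i for i in occ
            if count(padded, padded[i - 1:i + m]) == 1     # left extension T[i-1..i+m-1]
            and count(padded, padded[i:i + m + 1]) == 1]   # right extension T[i..i+m]


def net_frequency(text, x):
    return len(net_occurrences(text, x))
```

For example, `net_occurrences("abcbbcbcabc$", "bc")` returns `[7]`: the occurrences at 2 and 10 are preceded by `a` (and `abc` is a repeat), and the occurrence at 5 extends to the repeat `bcb`.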

2. Core Data Structures: RLBWT and Associated Operations

NSMR enumeration and query algorithms operate efficiently via data structures based on the run-length encoded Burrows-Wheeler Transform (RLBWT) of $T$. The BWT, $L[1..n]$, is run-length encoded as

$$L = c_1^{d_1} c_2^{d_2} \ldots c_r^{d_r}$$

where $c_j \neq c_{j+1}$, with $r$ the number of runs (maximal sequences of the same symbol). The pair $(c[1..r], d[1..r])$ forms the RLBWT.

Key RLBWT operations include:

  • LF-mapping and its inverse (FL-mapping) in $O(1)$ time and $O(r)$ space, leveraging the "move" data structure.
  • Range-distinct queries, which enumerate all distinct symbols within $L[p..q]$, in $O(k)$ time, with $k$ the number of distinct symbols and $O(r)$ preprocessing.

Maintaining all active structures in $O(r)$ space is essential, as $r \ll n$ for highly repetitive texts.
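To make the run-length encoding concrete, here is a small sketch that builds the BWT by naive suffix sorting and then groups it into runs. It only illustrates the objects $L$, $(c, d)$ and $r$. It does not implement the move data structure or an $O(k)$ range-distinct index, and all function names are hypothetical.

```python
from itertools import groupby


def bwt(text):
    """BWT of text via naive suffix sorting.

    Assumes text ends with a unique sentinel '$' that sorts before all
    other symbols (true for ASCII letters).
    """
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # text[i - 1] wraps around to the sentinel when i == 0
    return "".join(text[i - 1] for i in sa)


def run_length_encode(s):
    """List of (symbol, run length) pairs, i.e. the pairs (c_j, d_j)."""
    return [(c, len(list(group))) for c, group in groupby(s)]


def range_distinct(s, p, q):
    """Distinct symbols of s[p..q] (1-based, inclusive) by a plain linear scan."""
    return sorted(set(s[p - 1:q]))
```

With `L = bwt("abcbbcbcabc$")`, `len(run_length_encode(L))` gives the number of runs $r$, and `range_distinct(L, p, q)` plays the role of the range-distinct query in the steps that follow, at linear rather than output-sensitive cost.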

3. Enumeration Algorithm for NSMRs

The enumeration of NSMRs alongside their net occurrences proceeds as a breadth-first traversal of all right-maximal repeats using Weiner (left-extension) links. Each repeat $x$ is represented as:

$$\operatorname{repr}(x) = (I(x), \text{rlist}(x), |x|)$$

where $I(x) = [p..q]$ is the SA-interval of $x$, $\text{rlist}(x) = \{ (c, I(xc)) \mid c \in rc(x) \}$ is the list of right extensions, and $|x|$ is the length.

Algorithm steps:

  1. Preprocess the RLBWT for $O(1)$ LF/FL and range-distinct queries.
  2. Use a queue initialized with $\operatorname{repr}(\varepsilon) = ([1..n], \text{rlist}(\varepsilon), 0)$.
  3. For each element $\operatorname{repr}(x)$ in the queue, with $I(x) = [p..q]$ and $\ell = |x|$:
    a. Detect net occurrences: for every singleton right-extension interval $[p_c..q_c]$ with $p_c = q_c = i$, if $L[i]$ is unique in $L[p..q]$ and $T[\mathrm{SA}[i]..\mathrm{SA}[i]+\ell]$ is unique, record $\mathrm{SA}[i]$ as a net occurrence. NSMRs are precisely repeats with positive net frequency.
    b. Generate child repeats by extending left with all possible $a \in \Sigma$ found by range-distinct queries.
    c. Push any child repeat $\operatorname{repr}(ax)$ with $|\text{rlist}(ax)| > 1$ or positive net frequency into the next queue.

Each edge in the suffix tree is traversed once, with each operation per edge in $O(1)$ time, resulting in $O(n)$ overall time and $O(r)$ space complexity.
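The traversal can be sketched with plain arrays in place of the RLBWT machinery. The code below is an illustrative approximation under several assumptions: it uses a naively built suffix array and BWT, a linear-scan `rank` instead of $O(1)$ LF-mapping, half-open 0-based SA intervals, and it skips the empty string when recording net occurrences. It therefore does not achieve the $O(n)$ time or $O(r)$ space bounds. It only mirrors the breadth-first structure, the net-occurrence test, and the pruning of repeats that are not right-maximal.

```python
from collections import Counter, deque


def enumerate_nsmrs(text):
    """Return {repeat: sorted 1-based net occurrences} for every NSMR of text.

    Assumes text ends with a unique sentinel '$' that sorts before all other symbols.
    """
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])   # naive suffix array, 0-based
    bwt = [text[i - 1] for i in sa]                 # text[-1] wraps around to '$'
    first = {}                                      # first[c] = number of symbols smaller than c
    total = 0
    for c, k in sorted(Counter(text).items()):
        first[c] = total
        total += k

    def rank(c, i):
        # Occurrences of c in bwt[0:i]; a linear scan stands in for an O(1) structure.
        return bwt[:i].count(c)

    result = {}
    queue = deque([(0, n, 0)])                      # (p, q, |x|): SA interval [p, q) of x
    while queue:
        p, q, ell = queue.popleft()
        right = Counter(text[sa[i] + ell] for i in range(p, q))
        if len(right) < 2:
            continue                                # x is not right-maximal: prune
        left = Counter(bwt[p:q])
        if ell > 0:                                 # assumption: the empty string is skipped
            # Net occurrence: preceding and following symbols are both unique in the interval.
            net = [sa[i] + 1 for i in range(p, q)
                   if left[bwt[i]] == 1 and right[text[sa[i] + ell]] == 1]
            if net:
                result[text[sa[p]:sa[p] + ell]] = sorted(net)
        for a in left:                              # naive range-distinct over bwt[p:q]
            if a == "$":
                continue
            lo, hi = first[a] + rank(a, p), first[a] + rank(a, q)
            if hi - lo >= 2:                        # ax occurs at least twice
                queue.append((lo, hi, ell + 1))
    return result
```

Pruning on right-maximality is safe in this sketch because a repeat with a net occurrence must have at least two distinct right extensions, and because every suffix of a right-maximal repeat is itself right-maximal, so left extensions from the empty string reach all of them.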

4. Data Structures for Net Frequency Queries

An $O(r)$-space data structure enables querying the net frequency of any pattern $P$ in $O(|P|)$ time after $O(n)$-time construction. The process is as follows:

  • Collect the set $X$ of all NSMRs and their net frequencies during enumeration.
  • Construct a compacted reversed trie $T'$ of $X$:
    • Nodes correspond to suffixes of NSMRs.
    • Edges are labeled by single characters, represented implicitly.
    • Each node stores $|x|$, a pointer $i_x$, and $\mathrm{NF}(x)$.
  • At each branching node, store a degree-dependent dictionary for $O(1)$ character lookup.
  • Querying for pattern $P$ proceeds right-to-left in $O(|P|)$ time; if $P$ is found, return $\mathrm{NF}(P)$, else $0$.

The trie size is $O(r)$, leveraging the bound that the total number of net occurrences across all NSMRs is $< 2r$.
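The query procedure can be illustrated with an uncompacted reversed trie. This is a simplified sketch: it stores one explicit node per character and uses Python dictionaries at every node, so it does not reproduce the compaction, the implicit edge labels, or the $O(r)$ space bound described above. The class name and its interface are assumptions made for the example.

```python
class ReversedTrie:
    """Uncompacted trie over reversed NSMRs; each node carries a net frequency."""

    def __init__(self, net_freq):
        # A node is a two-element list [children, nf]; nf stays 0 for
        # suffixes of NSMRs that are not NSMRs themselves.
        self.root = [{}, 0]
        for x, nf in net_freq.items():
            node = self.root
            for ch in reversed(x):                 # insert x from right to left
                node = node[0].setdefault(ch, [{}, 0])
            node[1] = nf

    def query(self, pattern):
        """Net frequency of pattern, or 0 if it is not a stored NSMR."""
        node = self.root
        for ch in reversed(pattern):               # read the pattern right to left
            node = node[0].get(ch)
            if node is None:
                return 0
        return node[1]
```

Built from `{"bc": 1, "abc": 2, "bcb": 2}`, `query("bcb")` follows the path b, c, b and returns 2, while `query("c")` stops at an internal node with no stored NSMR and returns 0.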

5. Theoretical Limits and Connections

A key theoretical result is that the total number of net occurrences is strictly less than $2r$, where $r$ is the number of runs in the BWT. Specifically, every net occurrence must correspond to a boundary of a run in $L$, and each run admits at most two such boundaries:

$$\text{Total net occurrences} \le 2r-1 < 2r$$

This property provides a new upper bound not only for net occurrences but, by duality, also for the number of minimal unique substrings (MUS):

$$\#\text{MUS} < 2r$$

This suggests a close structural link between net occurrences in repeats and the landscape of uniquely identifying substrings.
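As a quick sanity check of the net-occurrence bound, the values reported for the example string in Section 6 ($r = 7$, with net frequencies 1, 2 and 2 for the three NSMRs) can be substituted directly:

```latex
\sum_{x \in \{\mathtt{bc},\, \mathtt{abc},\, \mathtt{bcb}\}} \mathrm{NF}(x)
  = 1 + 2 + 2 = 5
  \;\le\; 2r - 1 = 2 \cdot 7 - 1 = 13
  \;<\; 2r = 14
```

The bound is far from tight on this small instance; the arithmetic only confirms that the example is consistent with the stated inequality.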

6. Illustrative Example

Consider the string $T = \mathtt{abcbbcbcabc\$}$, $n = 12$. Its BWT and run decomposition are: $L = \mathtt{cc\$cacabbbbb}$ with $r = 7$ runs ($c^2 \cdot \$ \cdot c \cdot a \cdot c \cdot a \cdot b^5$). Enumeration identifies three NSMRs:

| $x$ | $\mathrm{Occ}(x)$ | $I(x)$ | $\mathrm{NOcc}(x)$ | $\mathrm{NF}(x)$ |
|---|---|---|---|---|
| bc | $\{2,5,7,10\}$ | $[5..8]$ | $\{7\}$ | 1 |
| abc | $\{1,9\}$ | $[2..3]$ | $\{1,9\}$ | 2 |
| bcb | $\{2,5\}$ | $[7..8]$ | $\{2,5\}$ | 2 |

For $x=\mathtt{bc}$, $I(\mathtt{bc})=[5..8]$ and $L[5..8]=\mathtt{acab}$; the singleton right-extension interval $[6..6]$ yields $L[6]=\mathtt{c}$, which is unique in $L[5..8]$, and the substring $\mathtt{bca}$ is unique in $T$. Thus, $\mathrm{SA}[6] = 7$ is a net occurrence, and $\mathrm{NF}(\mathtt{bc})=1$.

The reversed trie constructed on $\{\mathtt{bc}, \mathtt{abc}, \mathtt{bcb}\}$ enables $O(|P|)$ net frequency queries, e.g., for $\mathtt{bcb}$, tracing path 'b'→'c'→'b' yields $\mathrm{NF}=2$.

7. Algorithmic and Practical Significance

The $O(n)$-time, $O(r)$-space algorithm for enumerating NSMRs, along with the $O(r)$-space, $O(|P|)$-query data structure for net frequencies, demonstrates strong scalability for highly repetitive texts, where $r \ll n$. Theoretical bounds such as $\#\mathrm{NSMR}$, total net frequencies, and minimal unique substrings all being $O(r)$ suggest tractability for applications in genomics and versioned document collections. The duality between net occurrences and MUSs, along with the efficient algorithms described, positions NSMRs as a new central object of study in context-sensitive repeat analysis (Kimura et al., 14 Nov 2025).
