Near-Supermaximal Repeats (NSMRs)

Updated 18 November 2025

NSMRs are context-sensitive repeats defined by net occurrences not covered by longer repeats, distinguishing them from traditional repeat measures.
Efficient RLBWT-based algorithms enable O(1) LF-mapping and range-distinct queries, ensuring fast analysis in highly repetitive texts.
Breadth-first enumeration with net occurrence detection achieves O(n) time and O(r) space, establishing novel theoretical bounds in string analysis.

A near-supermaximal repeat (NSMR) is a context-sensitive repeat in a string whose net frequency is positive, where net occurrences are defined as occurrences of a repeat not covered by any strictly longer repeat. NSMRs generalize the supermaximal repeat concept by focusing on the "net frequency" of a repeat, which counts only the occurrences that are not subsumed by other, longer repeats. Efficient enumeration and query frameworks for NSMRs have recently been developed to leverage highly repetitive structure in massive texts, providing new theoretical bounds and practical data structures (Kimura et al., 14 Nov 2025).

1. Formal Definitions and Notation

Let $T[1..n]$ be a string over the alphabet $\Sigma_0 = \Sigma \cup \{\$\}$, where $\$$ is a sentinel appearing only at $T[0]$ and $T[n]$. For any string $x$ of length $m$, define its set of occurrences as:

$\mathrm{Occ}(x) = \{ i \in [1..n - m + 1] \mid T[i..i+m-1] = x \}.$

A string $x$ is a repeat if $|\mathrm{Occ}(x)| \geq 2$.

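Net occurrence: An occurrence $i \in \mathrm{Occ}(x)$ of a repeat $x$ of length $m$ is called a net occurrence if it is not covered by an occurrence of any strictly longer repeat. Equivalently, both one-character extensions of that occurrence, $T[i-1..i+m-1]$ and $T[i..i+m]$, are unique in $T$, since every longer substring covering the occurrence contains one of them.
Net frequency: For a repeat $x$, $\mathrm{NF}(x)$ is the number of its net occurrences: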

$\mathrm{NF}(x) = |\{\,i \in \mathrm{Occ}(x) \mid i \text{ is a net occurrence of } x \}|.$

Near-supermaximal repeat (NSMR): A repeat $x$ is an NSMR if and only if $\mathrm{NF}(x) > 0$; that is, $x$ has at least one net occurrence.

This framework distinguishes between simple multiplicity of substrings and the subset of those occurrences that are not contained within occurrences of longer repeats, thus capturing "context-sensitive" maximality.
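The definitions above can be checked by brute force on a small input. The following is a minimal sketch, not the method of the cited work: the function names `occurrences` and `net_occurrences` are hypothetical, and the left sentinel is assumed to be a `#` character prepended to the text. It uses the string from the illustrative example in Section 6.

```python
def occurrences(text, x):
    """1-indexed starting positions of x in text (overlaps allowed)."""
    return [k + 1 for k in range(len(text) - len(x) + 1) if text.startswith(x, k)]


def net_occurrences(text, x):
    """Net occurrences of x in text, where text ends with the sentinel '$'.

    An occurrence i is net if x is a repeat and both one-character
    extensions of that occurrence are unique.
    """
    padded = "#" + text  # assumed left sentinel; padded[i] is T[i] for 1-indexed i
    m = len(x)
    occ = occurrences(text, x)
    if len(occ) < 2:  # x is not a repeat
        return []

    def unique(s):
        return len(occurrences(padded, s)) == 1

    return [i for i in occ
            if unique(padded[i - 1:i + m])      # left extension T[i-1..i+m-1]
            and unique(padded[i:i + m + 1])]    # right extension T[i..i+m]


def net_frequency(text, x):
    return len(net_occurrences(text, x))


T = "abcbbcbcabc$"
for x in ["bc", "abc", "bcb", "cb"]:
    print(x, occurrences(T, x), net_occurrences(T, x), net_frequency(T, x))
```

The repeat `cb` occurs twice but each occurrence is covered by an occurrence of the longer repeat `bcb`, so its net frequency is zero and it is not an NSMR.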

2. Core Data Structures: RLBWT and Associated Operations

NSMR enumeration and query algorithms operate efficiently via data structures based on the run-length encoded Burrows-Wheeler Transform (RLBWT) of $T$. The BWT, $L[1..n]$, is run-length encoded as

$L = c_1^{d_1} c_2^{d_2} \ldots c_r^{d_r}$

where $c_j \neq c_{j+1}$, with $r$ the number of runs (maximal sequences of the same symbol). The pair $(c[1..r], d[1..r])$ forms the RLBWT.

Key RLBWT operations include:

LF-mapping and FL-inverse in $O(1)$ time and $O(r)$ space, leveraging the "move" data structure.
Range-distinct queries, which enumerate all distinct symbols within $L[p..q]$, in $O(k)$ time, where $k$ is the number of distinct symbols, with $O(r)$ preprocessing.

Maintaining all active structures in $O(r)$ space is essential, as $r \ll n$ for highly repetitive texts.
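To make the objects in this section concrete, here is a minimal sketch that builds the BWT, its run-length encoding, and naive versions of LF-mapping and range-distinct queries. It is an illustration under simplifying assumptions: the suffix array is built by plain sorting, `lf` and `range_distinct` scan the BWT instead of using the move data structure, and none of the stated $O(1)$, $O(k)$ or $O(r)$ bounds hold for it. All function names are hypothetical, and indices are 0-based as usual in Python.

```python
from itertools import groupby


def bwt(text):
    """BWT of text, which must end with a unique smallest sentinel '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    return "".join(text[i - 1] for i in sa)  # text[-1] is '$' when i == 0


def run_length_encode(l):
    """List of (symbol, run length) pairs, i.e. the pairs (c_j, d_j)."""
    return [(ch, len(list(group))) for ch, group in groupby(l)]


def lf(l, i):
    """Naive LF-mapping: row of the suffix obtained by prepending l[i]."""
    ch = l[i]
    smaller = sum(c < ch for c in l)   # characters smaller than ch
    return smaller + l[:i].count(ch)   # plus earlier occurrences of ch


def range_distinct(l, p, q):
    """Distinct symbols in l[p..q] (inclusive, 0-based), by direct scan."""
    return sorted(set(l[p:q + 1]))


L = bwt("abcbbcbcabc$")
runs = run_length_encode(L)
print(L, runs, len(runs))
print(range_distinct(L, 4, 7))
```

For this input the encoding stores 7 pairs instead of 12 characters; on highly repetitive texts the gap between $r$ and $n$ is what the $O(r)$-space structures exploit.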

3. Enumeration Algorithm for NSMRs

The enumeration of NSMRs alongside their net occurrences proceeds as a breadth-first traversal of all right-maximal repeats using Weiner (left-extension) links. Each repeat $x$ is represented as:

$\operatorname{repr}(x) = (I(x), \text{rlist}(x), |x|)$

where $I(x) = [p..q]$ is the SA-interval of $x$, $\text{rlist}(x) = \{ (c, I(xc)) \mid c \in rc(x) \}$ is the list of right extensions, with $rc(x)$ the set of characters $c$ such that $xc$ occurs in $T$, and $|x|$ is the length.

Algorithm steps:

Preprocess RLBWT for $O(1)$ LF/FL and range-distinct queries.
Use a queue initialized with $\operatorname{repr}(\varepsilon) = ([1..n], \text{rlist}(\varepsilon), 0)$.
For each element $\operatorname{repr}(x)$ in the queue, with $I(x) = [p..q]$ and $\ell = |x|$:
- Detect net occurrences: for every singleton right-extension interval $[p_c..q_c] = [i..i]$, if $L[i]$ is unique in $L[p..q]$ and $T[\mathrm{SA}[i]..\mathrm{SA}[i]+\ell]$ is unique, record $\mathrm{SA}[i]$ as a net occurrence of $x$. NSMRs are precisely the repeats with positive net frequency.
- Generate child repeats by extending left with all possible $a \in \Sigma$ found by range-distinct queries.
- Push any child repeat $\operatorname{repr}(ax)$ with $|\text{rlist}(ax)| > 1$ or positive net frequency into the next queue.

Each edge in the suffix tree is traversed once, with each operation per edge in $O(1)$ time, resulting in $O(n)$ overall time and $O(r)$ space complexity.
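The traversal can be sketched in a simplified form as follows. This is an illustrative sketch only and does not achieve the stated $O(n)$ time or $O(r)$ space: it assumes a plain suffix array built by sorting, a naive rank function in place of the RLBWT machinery, and half-open 0-based intervals. The name `enumerate_nsmrs` is hypothetical. It keeps the two ingredients described above: left extension of SA-intervals in breadth-first order, and net-occurrence detection from singleton right extensions whose BWT character is unique in the interval.

```python
from collections import Counter, deque


def enumerate_nsmrs(text):
    """Map each NSMR of text to its sorted 1-indexed net occurrences.

    text must end with a unique sentinel '$' smaller than all other symbols.
    """
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])   # naive suffix array
    bwt = [text[i - 1] for i in sa]                  # text[-1] is '$' when i == 0
    first, total = {}, 0                             # first[c]: symbols smaller than c
    for ch, cnt in sorted(Counter(text).items()):
        first[ch] = total
        total += cnt

    def rank(ch, i):                                 # occurrences of ch in bwt[:i]
        return bwt[:i].count(ch)

    result = {}
    queue = deque([(0, n, 0)])                       # (lo, hi, |x|) for the empty string
    while queue:
        lo, hi, length = queue.popleft()
        right = Counter(text[sa[i] + length] for i in range(lo, hi))
        left = Counter(bwt[lo:hi])
        if length > 0:
            # Net occurrence: unique right extension and unique left extension.
            net = sorted(sa[i] + 1 for i in range(lo, hi)
                         if right[text[sa[i] + length]] == 1 and left[bwt[i]] == 1)
            if net:
                result[text[sa[lo]:sa[lo] + length]] = net
        for a in left:                               # left extensions ax
            if a == "$":
                continue                             # would wrap around the text
            nlo, nhi = first[a] + rank(a, lo), first[a] + rank(a, hi)
            if nhi - nlo < 2:
                continue                             # ax is not a repeat
            # Keep only right-maximal children: at least two right characters.
            if len({text[sa[i] + length + 1] for i in range(nlo, nhi)}) > 1:
                queue.append((nlo, nhi, length + 1))
    return result


print(enumerate_nsmrs("abcbbcbcabc$"))
```

Pruning to right-maximal children is safe in this sketch because a repeat with a net occurrence has a unique right extension at that occurrence and therefore at least two distinct right characters, and every suffix of a right-maximal repeat is itself right-maximal.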

4. Data Structures for Net Frequency Queries

An $O(r)$-space data structure enables querying the net frequency of any pattern $P$ in $O(|P|)$ time after $O(n)$-time construction. The process is as follows:

Collect the set $X$ of all NSMRs and their net frequencies during enumeration.
Construct a compacted reversed trie $T'$ of $X$:
- Nodes correspond to suffixes of NSMRs.
- Edges are labeled by single characters, represented implicitly.
- Each node stores $|x|$, a pointer $i_x$, and $\mathrm{NF}(x)$.
At each branching node, store a degree-dependent dictionary for $O(1)$ character lookup.
Querying for pattern $P$ proceeds right-to-left in $O(|P|)$ time; if $P$ is found, return $\mathrm{NF}(P)$ , else $0$.

The trie size is $O(r)$, which follows from the bound that the total number of net occurrences across all NSMRs is less than $2r$.
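The query procedure can be illustrated with a plain, uncompacted reversed trie. This is a minimal sketch under stated assumptions: nodes are Python dictionaries rather than the compacted $O(r)$-space representation, the class and function names are hypothetical, and the NSMR set with its net frequencies is taken from the illustrative example in Section 6 rather than computed.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> child node
        self.nf = 0         # net frequency if this node spells an NSMR, else 0


def build_reversed_trie(nsmr_to_nf):
    """Insert every NSMR read right-to-left, storing NF at its end node."""
    root = TrieNode()
    for x, nf in nsmr_to_nf.items():
        node = root
        for ch in reversed(x):
            node = node.children.setdefault(ch, TrieNode())
        node.nf = nf
    return root


def query_net_frequency(root, pattern):
    """Walk the pattern right-to-left; one dictionary lookup per character."""
    node = root
    for ch in reversed(pattern):
        node = node.children.get(ch)
        if node is None:
            return 0        # pattern is not a suffix of any stored NSMR
    return node.nf          # 0 for internal nodes that are not NSMRs


root = build_reversed_trie({"bc": 1, "abc": 2, "bcb": 2})
for p in ["bcb", "bc", "abc", "c", "cb"]:
    print(p, query_net_frequency(root, p))
```

Each query touches at most $|P|$ nodes, which mirrors the $O(|P|)$ query time; the space bound of the actual structure depends on compaction and implicit edge labels, which this sketch omits.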

5. Theoretical Limits and Connections

A key theoretical result is that the total number of net occurrences is strictly less than $2r$, where $r$ is the number of runs in the BWT. Specifically, every net occurrence must correspond to a boundary of a run in $L$, and each run admits at most two such boundaries:

$\text{Total net occurrences} \le 2r-1 < 2r$

This property provides a new upper bound not only for net occurrences but, by duality, also for the number of minimal unique substrings (MUS):

$\#\text{MUS} < 2r$

This suggests a close structural link between net occurrences in repeats and the landscape of uniquely identifying substrings.
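The bound can be sanity-checked by brute force on a small input. The sketch below is only an empirical check on one string, not a proof: it assumes a `#` character as the left sentinel, counts net occurrences by testing every substring occurrence directly, and uses hypothetical function names.

```python
def bwt_runs(text):
    """Number of runs r in the BWT of text (text ends with '$')."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    l = "".join(text[i - 1] for i in sa)
    return 1 + sum(l[k] != l[k - 1] for k in range(1, len(l)))


def total_net_occurrences(text):
    """Count net occurrences over all repeats of text by brute force."""
    s = "#" + text  # assumed left sentinel; s[i] is T[i] for 1-indexed i
    n = len(text)

    def count(x):
        return sum(s.startswith(x, k) for k in range(len(s)))

    total = 0
    for i in range(1, n):          # start of the occurrence
        for j in range(i, n):      # inclusive end, never the final '$'
            is_repeat = count(s[i:j + 1]) >= 2
            left_unique = count(s[i - 1:j + 1]) == 1
            right_unique = count(s[i:j + 2]) == 1
            if is_repeat and left_unique and right_unique:
                total += 1
    return total


T = "abcbbcbcabc$"
r = bwt_runs(T)
total = total_net_occurrences(T)
print(total, r, total < 2 * r)
```

For this string the check finds 5 net occurrences against $r = 7$, comfortably within the bound.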

6. Illustrative Example

Consider the string $T = \mathtt{abcbbcbcabc\$}$ with $n = 12$. Its BWT is $L = \mathtt{cc\$cacabbbbb}$, which has $r = 7$ runs ($c^2 \cdot \$ \cdot c \cdot a \cdot c \cdot a \cdot b^5$). Enumeration identifies three NSMRs, where $\mathrm{NOcc}(x)$ denotes the set of net occurrences of $x$:

| $x$ | $\mathrm{Occ}(x)$ | $I(x)$ | $\mathrm{NOcc}(x)$ | $\mathrm{NF}(x)$ |
| --- | --- | --- | --- | --- |
| bc | $\{2,5,7,10\}$ | [5..8] | $\{7\}$ | 1 |
| abc | $\{1,9\}$ | [2..3] | $\{1,9\}$ | 2 |
| bcb | $\{2,5\}$ | [7..8] | $\{2,5\}$ | 2 |

For $x=\mathtt{bc}$, $I(\mathtt{bc})=[5..8]$ and $L[5..8]=\mathtt{acab}$; the singleton right-extension interval $[6..6]$ has $L[6]=\mathtt{c}$, which is unique in $L[5..8]$, and the substring $\mathtt{bca}$ is unique in $T$. Thus $\mathrm{SA}[6] = 7$ is a net occurrence, and $\mathrm{NF}(\mathtt{bc})=1$.

The reversed trie constructed on $\{\mathtt{bc}, \mathtt{abc}, \mathtt{bcb}\}$ enables $O(|P|)$ net frequency queries; for example, for $\mathtt{bcb}$, tracing the path 'b'→'c'→'b' yields $\mathrm{NF}=2$.

7. Algorithmic and Practical Significance

The $O(n)$-time, $O(r)$-space algorithm for enumerating NSMRs, together with the $O(r)$-space data structure that answers net frequency queries in $O(|P|)$ time, scales well to highly repetitive texts, where $r \ll n$. Because the number of NSMRs, the total net frequency, and the number of minimal unique substrings are all $O(r)$, these methods are expected to be tractable for applications in genomics and versioned document collections. The duality between net occurrences and MUSs, along with the efficient algorithms described, positions NSMRs as a central object of study in context-sensitive repeat analysis (Kimura et al., 14 Nov 2025).

References (1)

R-enum Revisited: Speedup and Extension for Context-Sensitive Repeats and Net Frequencies (2025)
