Near-Supermaximal Repeats (NSMRs)
- NSMRs are context-sensitive repeats defined by net occurrences not covered by longer repeats, distinguishing them from traditional repeat measures.
- Efficient RLBWT-based algorithms enable O(1) LF-mapping and range-distinct queries, ensuring fast analysis in highly repetitive texts.
- Breadth-first enumeration with net occurrence detection achieves O(n) time and O(r) space, establishing novel theoretical bounds in string analysis.
A near-supermaximal repeat (NSMR) is a context-sensitive repeat in a string whose net frequency is positive, where net occurrences are defined as occurrences of a repeat not covered by any strictly longer repeat. NSMRs generalize the supermaximal repeat concept by focusing on the "net frequency" of a repeat, which counts only the occurrences that are not subsumed by other, longer repeats. Efficient enumeration and query frameworks for NSMRs have recently been developed to leverage highly repetitive structure in massive texts, providing new theoretical bounds and practical data structures (Kimura et al., 14 Nov 2025).
1. Formal Definitions and Notation
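Let $T$ be a string of length $n$ over the alphabet $\Sigma_0 = \Sigma \cup \{\$\}$, where \$ is a sentinel that appears only at the boundary of $T$. For any string $x$ of length $m$, define its set of occurrences as:
$$\mathrm{Occ}(x) = \{\, i : T[i..i+m-1] = x \,\}$$
A string $x$ is a repeat if $|\mathrm{Occ}(x)| \ge 2$.
- Net occurrence: An occurrence $i \in \mathrm{Occ}(x)$ of a repeat $x$ is called a net occurrence if it is not covered by any strictly longer repeat; equivalently, both one-character extensions $T[i-1..i+m-1]$ and $T[i..i+m]$ are unique in $T$. The set of net occurrences of $x$ is written $\mathrm{NOcc}(x)$.
- Net frequency: For a repeat $x$, $\mathrm{NF}(x)$ is the number of its net occurrences: $\mathrm{NF}(x) = |\mathrm{NOcc}(x)|$.
- Near-supermaximal repeat (NSMR): A repeat $x$ is an NSMR if and only if $\mathrm{NF}(x) > 0$; that is, $x$ has at least one net occurrence.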
This framework distinguishes between simple multiplicity of substrings and the subset of those occurrences that are not contained within occurrences of longer repeats, thus capturing "context-sensitive" maximality.
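The definitions above can be checked directly on small inputs. The following is a brute-force sketch, not the efficient method discussed later: the function name `net_occurrences` is hypothetical, positions are reported 1-indexed, and the code assumes the text ends with a unique sentinel, so a missing left extension at the first position is treated as unique.

```python
from collections import defaultdict

def net_occurrences(T):
    """Map each NSMR of T to its sorted list of net occurrences (1-indexed).

    Brute force, quadratic in the number of substrings: an occurrence
    T[i..j] of a repeat is net when both one-character extensions
    occur exactly once in T.
    """
    n = len(T)
    # Count every substring of T.
    count = defaultdict(int)
    for i in range(n):
        for j in range(i + 1, n + 1):
            count[T[i:j]] += 1

    result = defaultdict(list)
    for i in range(n):
        # j stops before n so that x never contains the final sentinel
        # and the right extension T[i:j+1] always exists.
        for j in range(i + 1, n):
            x = T[i:j]
            if count[x] < 2:          # x is not a repeat
                continue
            # Assumption: no left extension at the first position counts as unique.
            left_unique = i == 0 or count[T[i - 1:j]] == 1
            right_unique = count[T[i:j + 1]] == 1
            if left_unique and right_unique:
                result[x].append(i + 1)
    return dict(result)

nsmrs = net_occurrences("abcbbcbcabc$")
for x, occ in sorted(nsmrs.items()):
    print(x, occ, "NF =", len(occ))
```

On the string used in the illustrative example, this reports the three NSMRs `abc`, `bc`, and `bcb`.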
2. Core Data Structures: RLBWT and Associated Operations
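NSMR enumeration and query algorithms operate efficiently via data structures based on the run-length encoded Burrows-Wheeler Transform (RLBWT) of $T$. The BWT, $L$, is run-length encoded as
$$L = c_1^{\ell_1} c_2^{\ell_2} \cdots c_r^{\ell_r}$$
where $c_i \ne c_{i+1}$ and $\ell_i \ge 1$, with $r$ the number of runs (maximal sequences of the same symbol). The pair of sequences $(c_1 \cdots c_r,\ \ell_1 \cdots \ell_r)$ forms the RLBWT.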
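To make the encoding concrete, here is a small sketch that builds the BWT by sorting suffixes and then run-length encodes it. The function names `bwt` and `run_length_encode` are illustrative, and the suffix sort is a naive quadratic construction meant only for short strings; it assumes the text ends with a unique sentinel that sorts before every other symbol.

```python
from itertools import groupby

def bwt(T):
    """Burrows-Wheeler transform of T via naive suffix sorting.

    Assumes T ends with a unique sentinel smaller than all other symbols
    ('$' sorts before letters in ASCII).
    """
    sa = sorted(range(len(T)), key=lambda i: T[i:])
    # The symbol preceding each suffix; T[-1] (the sentinel) precedes suffix 0.
    return "".join(T[i - 1] for i in sa)

def run_length_encode(L):
    """Return the runs of L as (symbol, length) pairs."""
    return [(c, len(list(group))) for c, group in groupby(L)]

L = bwt("abcbbcbcabc$")
runs = run_length_encode(L)
print(L)          # the BWT string
print(runs)       # the (c_i, l_i) pairs
print(len(runs))  # r, the number of runs
```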
Key RLBWT operations include:
- LF-mapping and FL-inverse in $O(1)$ time and $O(r)$ space, leveraging the "move" data structure.
- Range-distinct queries, which enumerate all distinct symbols within a range $L[i..j]$, in $O(k)$ time, where $k$ is the number of distinct symbols reported, after a one-time preprocessing step.
Maintaining all active structures in $O(r)$ space is essential, as $r \ll n$ for highly repetitive texts.
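The semantics of these two operations can be illustrated on a plain (uncompressed) BWT string. This is a sketch of what the operations compute, not of the move data structure: it uses $O(n)$ space and linear scans, rows are 0-indexed, and the names `lf_table` and `range_distinct` are hypothetical.

```python
def lf_table(L):
    """LF[i] is the row reached from row i by stepping one position left in T.

    A stable sort of the rows by their BWT symbol yields the first column F;
    the rank of row i in that order is LF[i].
    """
    order = sorted(range(len(L)), key=lambda i: L[i])  # Python's sort is stable
    lf = [0] * len(L)
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row
    return lf

def range_distinct(L, b, e):
    """Distinct symbols in L[b..e] (inclusive), each with its first and last row."""
    seen = {}
    for i in range(b, e + 1):
        first, _ = seen.get(L[i], (i, i))
        seen[L[i]] = (first, i)
    return seen

L = "cc$cacabbbbb"
print(lf_table(L))
print(range_distinct(L, 4, 7))
```

A symbol whose first and last rows coincide occurs exactly once in the range, which is the uniqueness test that net occurrence detection relies on.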
3. Enumeration Algorithm for NSMRs
The enumeration of NSMRs alongside their net occurrences proceeds as a breadth-first traversal of all right-maximal repeats using Weiner (left-extension) links. Each repeat $x$ is represented as a triple consisting of the SA-interval $I(x)$ of $x$, the list of its right extensions (each character $c$ together with the SA-interval $I(xc)$), and the length $|x|$.
Algorithm steps:
- Preprocess RLBWT for LF/FL and range-distinct queries.
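- Use a queue initialized with the representation of the empty string $\varepsilon$.
- For each element $x$ in the queue:
  a. Detect net occurrences: for every singleton right-extension interval $[p..p]$ of $x$, the right extension is unique by construction; if $L[p]$ is also unique in $L[I(x)]$, the left extension is unique as well, so record $\mathrm{SA}[p]$ as a net occurrence of $x$. NSMRs are precisely the repeats with positive net frequency.
  b. Generate child repeats $cx$ by extending left with every character $c$ found by a range-distinct query on $L[I(x)]$.
  c. Push every child that is a right-maximal repeat (in particular, every child with positive net frequency) into the next queue.
Each edge in the suffix tree is traversed once, with each operation per edge in $O(1)$ time, resulting in overall $O(n)$ time and $O(r)$ space complexity.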
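The traversal can be sketched with a plain suffix array in place of the RLBWT. The code below follows the same breadth-first structure (singleton right extensions, uniqueness of the preceding BWT symbol, left extension by backward search) but makes no attempt at the $O(n)$ time or $O(r)$ space bounds. The function name `enumerate_nsmrs`, the 0-indexed rows, and the rank tables are assumptions of this sketch.

```python
from collections import Counter, deque

def enumerate_nsmrs(T):
    """Return {NSMR: sorted 1-indexed net occurrences}; T ends with a unique sentinel."""
    n = len(T)
    sa = sorted(range(n), key=lambda i: T[i:])   # naive suffix array
    L = [T[i - 1] for i in sa]                   # BWT symbols
    # C[c]: number of symbols in T that are smaller than c.
    C, total = {}, 0
    for c in sorted(set(T)):
        C[c] = total
        total += T.count(c)
    # rank[c][i]: occurrences of c in L[0:i].
    rank = {c: [0] * (n + 1) for c in C}
    for i, ch in enumerate(L):
        for c in C:
            rank[c][i + 1] = rank[c][i] + (ch == c)

    result = {}
    queue = deque([(0, n - 1, 0)])               # (b, e, length) of the empty string
    while queue:
        b, e, m = queue.popleft()
        rows = range(b, e + 1)
        if m > 0:
            left = Counter(L[k] for k in rows)            # preceding symbols
            right = Counter(T[sa[k] + m] for k in rows)   # following symbols
            hits = sorted(sa[k] + 1 for k in rows
                          if right[T[sa[k] + m]] == 1 and left[L[k]] == 1)
            if hits:
                result[T[sa[b]:sa[b] + m]] = hits
        for c in {L[k] for k in rows}:
            # Backward-search step: SA-interval of the left extension c + x.
            nb = C[c] + rank[c][b]
            ne = C[c] + rank[c][e + 1] - 1
            if ne <= nb:
                continue                                   # c + x is unique
            # Keep only right-maximal children (two or more following symbols).
            if len({T[sa[k] + m + 1] for k in range(nb, ne + 1)}) >= 2:
                queue.append((nb, ne, m + 1))
    return result

print(enumerate_nsmrs("abcbbcbcabc$"))
```

Pruning children that are not right-maximal is safe here because a repeat with a net occurrence is always right-maximal, and every suffix of a right-maximal repeat is right-maximal as well.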
4. Data Structures for Net Frequency Queries
An $O(r)$-space data structure enables querying the net frequency of any pattern $P$ in $O(|P|)$ time after $O(n)$-time construction. The process is as follows:
- Collect the set of all NSMRs and their net frequencies during enumeration.
- Construct a compacted reversed trie of the collected NSMRs:
- Nodes correspond to suffixes of NSMRs.
- Edges are labeled by single characters, represented implicitly.
- Each node stores a constant number of fields, including the net frequency returned by queries and a pointer.
- At each branching node, store a degree-dependent dictionary for character lookup.
- Querying for pattern $P$ proceeds right-to-left in $O(|P|)$ time; if $P$ is found, return $\mathrm{NF}(P)$, else $0$.
The trie size is $O(r)$, leveraging the bound that the total number of net occurrences across all NSMRs is $O(r)$.
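The query logic can be illustrated with an ordinary (uncompacted) reversed trie built from nested dictionaries. This simplification does not achieve the $O(r)$ space of the compacted structure, and the names `build_reversed_trie` and `query_nf` are hypothetical; the net frequencies passed in are those of the string $\mathtt{abcbbcbcabc\$}$.

```python
def build_reversed_trie(nsmr_nf):
    """Insert every NSMR reversed, storing its net frequency at the final node."""
    root = {"children": {}, "nf": 0}
    for x, nf in nsmr_nf.items():
        node = root
        for c in reversed(x):
            node = node["children"].setdefault(c, {"children": {}, "nf": 0})
        node["nf"] = nf
    return root

def query_nf(root, P):
    """Read P right-to-left; return its net frequency, or 0 if P is not an NSMR."""
    node = root
    for c in reversed(P):
        node = node["children"].get(c)
        if node is None:
            return 0
    return node["nf"]

trie = build_reversed_trie({"bc": 1, "abc": 2, "bcb": 2})
for pattern in ("bcb", "bc", "cb", "xyz"):
    print(pattern, query_nf(trie, pattern))
```

Internal nodes that are suffixes of an NSMR but not NSMRs themselves keep a stored value of 0, so a pattern such as `cb` is reported with net frequency 0 even though its path exists.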
5. Theoretical Limits and Connections
A key theoretical result is that the total number of net occurrences is strictly less than $2r$, where $r$ is the number of runs in the BWT. Specifically, every net occurrence must correspond to a boundary of a run in $L$, and each run admits at most two such boundaries:
$$\sum_{x} \mathrm{NF}(x) < 2r$$
This property provides a new upper bound not only for net occurrences but, by duality, also for the number of minimal unique substrings (MUSs), which is likewise bounded in terms of $r$.
This suggests a close structural link between net occurrences in repeats and the landscape of uniquely identifying substrings.
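As a quick sanity check of the bound, take $T = \mathtt{abcbbcbcabc\$}$, whose BWT has $r = 7$ runs and whose NSMRs $\mathtt{bc}$, $\mathtt{abc}$, and $\mathtt{bcb}$ have net frequencies 1, 2, and 2:

$$\sum_{x} \mathrm{NF}(x) = \mathrm{NF}(\mathtt{bc}) + \mathrm{NF}(\mathtt{abc}) + \mathrm{NF}(\mathtt{bcb}) = 1 + 2 + 2 = 5 < 14 = 2r$$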
6. Illustrative Example
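Consider the string $T = \mathtt{abcbbcbcabc\$}$ with $n = 12$. Its BWT is $L = \mathtt{cc\$cacabbbbb}$, which has $r = 7$ runs: $c^2\cdot \$\cdot c\cdot a\cdot c\cdot a\cdot b^5$. The NSMRs of $T$ and their net occurrences are:

| $x$ | $\mathrm{Occ}(x)$ | $I(x)$ | $\mathrm{NOcc}(x)$ | $\mathrm{NF}(x)$ |
|---|---|---|---|---|
| $\mathtt{bc}$ | $\{2,5,7,10\}$ | $[5..8]$ | $\{7\}$ | 1 |
| $\mathtt{abc}$ | $\{1,9\}$ | $[2..3]$ | $\{1,9\}$ | 2 |
| $\mathtt{bcb}$ | $\{2,5\}$ | $[7..8]$ | $\{2,5\}$ | 2 |

For $x=\mathtt{bc}$, $I(\mathtt{bc})=[5..8]$ and $L[5..8]=\mathtt{acab}$. The right extension $\mathtt{bca}$ has the singleton interval $[6..6]$, with $\mathrm{SA}[6] = 7$, and $L[6]=c$ is unique in $L[5..8]$. Both $\mathtt{bca}$ and the left extension $\mathtt{cbc}$ therefore occur once in $T$, so position 7 is a net occurrence. The other singleton, $[5..5]$ for $\mathtt{bc\$}$, has $L[5]=a$, which occurs twice in $L[5..8]$, so it is not net. Hence $\mathrm{NF}(\mathtt{bc})=1$.

The query trie built over $\{\mathtt{bc}, \mathtt{abc}, \mathtt{bcb}\}$ answers net-frequency queries in $O(|P|)$ time: for $P = \mathtt{bcb}$, tracing path 'b'→'c'→'b' yields $\mathrm{NF}=2$.

For highly repetitive texts, where $r\ll n$, the bound on $\#\mathrm{NSMR}$ and the $O(r)$ space usage suggest tractability for applications in genomics and versioned document collections. The duality between net occurrences and MUSs, along with the efficient algorithms described, positions NSMRs as a new central object of study in context-sensitive repeat analysis (Kimura et al., 14 Nov 2025).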