Net Occurrences in String Analysis
- Net occurrences are defined as context-sensitive counts of substring repeats that remain non-extendable on both sides, ensuring maximal contextual distinctness.
- They underpin efficient algorithms—using suffix trees, Weiner methods, and BWT techniques—that achieve linear or optimal time and space complexities for pattern mining.
- Applications include text compression, bioinformatics, and string kernel methods, providing robust insights into repeating patterns and structural properties in data.
Net occurrences are a central notion in combinatorics on words, string processing, and applied pattern-mining, quantifying the number of “significant” repetitions of substrings in texts or sequences. The net occurrence of a repeat is defined through context-sensitive uniqueness of its surrounding substrings, offering a structural refinement to the standard count of (possibly overlapping) substring occurrences. This concept underlies efficient algorithms for maximal repeat discovery, string kernel computation, text compression, and fine-grained combinatorial analyses of infinite words such as Fibonacci and Thue-Morse sequences.
1. Formal Definitions and Characterizations
A substring $S$ of a string $T$ is a repeat if it appears at least twice in $T$. An occurrence $T[i..j]$ of $S$ is a net occurrence if $S$ is a repeat and both its immediate left and right extensions are unique, i.e.,
- $\mathrm{occ}(S) \ge 2$,
- $\mathrm{occ}(aS) = 1$,
- $\mathrm{occ}(Sb) = 1$,
where $a = T[i-1]$, $b = T[j+1]$, and $\mathrm{occ}(X)$ denotes the number of occurrences of $X$ in $T$, assuming sentinels at the boundaries. The net frequency $\mathrm{NF}(S)$ is the number of net occurrences of $S$ in $T$. This definition is equivalent to requiring that the occurrence of $S$ in $T$ is not “covered” by a strictly longer repeat occurring elsewhere, making each net occurrence a witness to $S$’s maximal contextual distinctness (Guo et al., 2024, Inenaga, 2024, Kimura et al., 14 Nov 2025).
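The definition can be checked directly on small inputs. The following is a minimal brute-force sketch, not one of the cited algorithms: the function names and the sentinel characters `#` and `$` are assumptions made for illustration, and the running time is polynomial rather than linear.

```python
from collections import Counter


def count_overlapping(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def net_occurrences(text: str, left: str = "#", right: str = "$"):
    """Return (0-based start, substring) for every net occurrence in text.

    Assumes the sentinels `left` and `right` do not occur in `text`.
    """
    padded = left + text + right
    n = len(text)
    found = []
    for i in range(1, n + 1):          # start index inside `padded`
        for j in range(i, n + 1):      # inclusive end index inside `padded`
            s = padded[i:j + 1]
            if count_overlapping(padded, s) < 2:
                break                  # every longer extension is unique as well
            left_ext = padded[i - 1:j + 1]
            right_ext = padded[i:j + 2]
            if (count_overlapping(padded, left_ext) == 1
                    and count_overlapping(padded, right_ext) == 1):
                found.append((i - 1, s))
    return found


def net_frequencies(text: str) -> Counter:
    """Map each substring with positive net frequency to that frequency."""
    return Counter(s for _, s in net_occurrences(text))


print(net_occurrences("abab"))   # [(0, 'ab'), (2, 'ab')]
print(net_frequencies("abab"))   # Counter({'ab': 2})
```

In `abab`, the repeat `ab` has net frequency 2, while `a` and `b` have net frequency 0 because each of their occurrences can be extended to another repeat.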
Alternate characterizations employ structural properties of suffix trees and the Burrows-Wheeler Transform: for example, in the run-length BWT, all net occurrences of all repeats correspond one-to-one with run boundaries, yielding the global bound of fewer than $2r$ net occurrences for a text whose BWT has $r$ runs (Kimura et al., 14 Nov 2025).
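The relation between net occurrences and BWT runs can be observed empirically on small strings. The sketch below is an illustration only: it builds the BWT by sorting rotations (quadratic, unlike the cited compressed-space methods), assumes a sentinel `$` that is smaller than every text character, and uses hypothetical helper names.

```python
from collections import Counter
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    s = text + sentinel
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    return "".join(s[i - 1] for i in order)   # last character of each rotation


def bwt_runs(text: str) -> int:
    """Number r of maximal equal-letter runs in the BWT of text."""
    return sum(1 for _ in groupby(bwt(text)))


def count_net_occurrences(text: str) -> int:
    """Brute-force count of net occurrences, using '#' and '$' as sentinels."""
    p = "#" + text + "$"
    occ = Counter(p[i:j] for i in range(len(p)) for j in range(i + 1, len(p) + 1))
    n = len(text)
    return sum(
        1
        for i in range(1, n + 1)
        for j in range(i, n + 1)
        if occ[p[i:j + 1]] >= 2           # the substring is a repeat
        and occ[p[i - 1:j + 1]] == 1      # left extension is unique
        and occ[p[i:j + 2]] == 1          # right extension is unique
    )


for t in ("abab", "banana"):
    print(t, bwt(t), bwt_runs(t), count_net_occurrences(t))
```

For `banana` the transform is `annb$aa` with five runs, and the text has two net occurrences, comfortably below twice the run count.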
2. Algorithmic Computation: Offline and Online Methods
Several algorithmic frameworks support efficient enumeration of net occurrences and answer two standard queries: Single-NF (report the net frequency of a given pattern) and All-NF (report all substrings with positive net frequency):
- Suffix Tree Methods: Suffix trees allow $O(n)$-time offline extraction of all positive-NF substrings and their net frequencies via a traversal that exploits branching and extension-uniqueness properties, where $n$ is the text length. Online variants using implicit suffix trees (Ukkonen’s construction with Breslauer–Italiano maintenance) achieve $O(m)$ time for Single-NF on a constant-sized alphabet, where $m$ is the pattern length, and $O(n)$ for All-NF (Guo et al., 2024).
- Weiner-based Algorithms: For large alphabets of size $\sigma$, Weiner’s right-to-left suffix tree construction supplies an online $O(n \log \sigma)$-time algorithm, with $O(m \log \sigma)$ for Single-NF and optimal output-sensitive time for All-NF, eliminating the $\sigma^2$ dependency of previous approaches (Inenaga, 2024).
- Burrows-Wheeler Transform (BWT) and RLBWT: Leveraging the compression of the run-length BWT, r-enumeration algorithms enumerate all net occurrences (context-sensitive repeats/NSMRs) in $O(r)$ working space, where $r$ is the number of BWT runs, and build data structures that answer Single-NF queries in time that depends on the query pattern length $m$ (Kimura et al., 14 Nov 2025).
- Suffix Array and BWT Structures: Offline, suffix arrays combined with the LCP array and colored range listing data structures yield $O(n)$-time algorithms for All-NF and $O(m + \sigma)$ for Single-NF after $O(n)$-time construction, with demonstrated scalability to massive data sets (GB-scale texts) (Guo et al., 2024).
<table>
<thead>
<tr><th>Algorithmic Paradigm</th><th>Single-NF Time</th><th>All-NF Time</th></tr>
</thead>
<tbody>
<tr><td>Suffix Tree (const. alphabet)</td><td>$O(m)$</td><td>$O(n)$</td></tr>
<tr><td>Weiner Tree (general $\sigma$)</td><td>$O(m \log \sigma)$</td><td>$O(|\mathrm{NF}^{+}(T)|)$ (output sensitive)</td></tr>
<tr><td>RLBWT ($r$ runs)</td><td></td><td></td></tr>
<tr><td>Suffix Array/BWT</td><td>$O(m + \sigma)$</td><td>$O(n)$</td></tr>
</tbody>
</table>
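To make the Single-NF query concrete, here is a small suffix-array sketch. It is a simplified stand-in rather than the cited $O(m + \sigma)$ method: it builds the suffix array by plain sorting, locates patterns by binary search, and checks each occurrence separately, so its cost grows with the number of occurrences. All function names are assumptions, and text boundaries play the role of the sentinels.

```python
def build_suffix_array(text: str) -> list[int]:
    """Suffix array by direct sorting (fine for small illustrative inputs)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_range(text: str, sa: list[int], pattern: str) -> tuple[int, int]:
    """Half-open range of suffix-array slots whose suffixes start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                        # first slot with m-prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                        # first slot with m-prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def occurrence_count(text: str, sa: list[int], pattern: str) -> int:
    start, end = sa_range(text, sa, pattern)
    return end - start


def single_nf(text: str, sa: list[int], pattern: str) -> int:
    """Net frequency of pattern: occurrences with unique left and right extensions."""
    start, end = sa_range(text, sa, pattern)
    if end - start < 2:
        return 0                          # not a repeat
    m, nf = len(pattern), 0
    for k in range(start, end):
        i = sa[k]
        # An extension that crosses a text boundary is treated as unique.
        left_unique = i == 0 or occurrence_count(text, sa, text[i - 1:i + m]) == 1
        right_unique = (i + m == len(text)
                        or occurrence_count(text, sa, text[i:i + m + 1]) == 1)
        if left_unique and right_unique:
            nf += 1
    return nf


text = "abab"
sa = build_suffix_array(text)
print(single_nf(text, sa, "ab"))   # 2
print(single_nf(text, sa, "a"))    # 0
```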
3. Structural Properties and Upper Bounds
Net occurrences display rich combinatorial structure:
- Maximality: Any substring with positive net frequency must be a branching substring in the suffix tree and has at least one occurrence that cannot be extended on either side without becoming unique. Only “significant” repeats, those that cannot be further extended while remaining a repeat, contribute nonzero net frequency (Guo et al., 2024, Guo et al., 2024).
- Global Bounds: There are at most $n$ substrings with positive net frequency for a text of length $n$; the sum of their lengths lies between $\Omega(n)$ and $O(n \log \delta)$, where $\delta$ is a repetitiveness measure (Guo et al., 2024, Kimura et al., 14 Nov 2025).
- Compressed Space: The total number of net occurrences over all repeats is less than $2r$, where $r$ is the number of RLBWT runs. Consequently, all context-diverse repeats (near-supermaximal repeats/NSMRs) can be enumerated in $O(r)$ space (Kimura et al., 14 Nov 2025). The number of minimal unique substrings (MUSs), the dual objects to net occurrences, is also $O(r)$ (Kimura et al., 14 Nov 2025).
4. Combinatorial Analysis in Structured Words
Combinatorial investigations of net occurrences have focused on infinite morphic words:
- Fibonacci Words: Each Fibonacci word (above a small index threshold) has exactly three net occurrences: a single net occurrence of one factor, and two net occurrences of a second factor, one of which starts at position 1 (Guo et al., 5 May 2025). These net occurrences form an overlapping net occurrence cover (ONOC), and any potential further occurrence would have to be a super-occurrence of a bridging net sub-occurrence, which is impossible by structural constraints.
- Thue-Morse Words: Each Thue-Morse word above a small index threshold contains exactly nine net occurrences, which arise as occurrences of a few selected derived factors (Guo et al., 5 May 2025). Explicit combinatorial recursions characterize all starting positions and the overlap structure of net occurrences.
These results establish tight lower bounds for the number of net occurrences in highly repetitive infinite sequences, thus providing combinatorial lower-bound instances for algorithmic analysis (Guo et al., 5 May 2025, Guo et al., 2024).
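The Fibonacci case can be reproduced by brute force on small words. This is a sketch under stated assumptions: it uses the convention $F_1 = b$, $F_2 = a$, $F_k = F_{k-1}F_{k-2}$ (indexing conventions vary between papers), sentinels `#` and `$`, and hypothetical function names. Only the listing for $F_7$ shown in the comment was checked by hand.

```python
from collections import Counter


def fibonacci_word(k: int) -> str:
    """F_1 = 'b', F_2 = 'a', F_k = F_{k-1} + F_{k-2} (one common convention)."""
    prev, cur = "b", "a"
    for _ in range(k - 2):
        prev, cur = cur, cur + prev
    return cur if k >= 2 else prev


def net_occurrences(text: str):
    """Brute-force list of (0-based start, substring) for each net occurrence."""
    p = "#" + text + "$"
    occ = Counter(p[i:j] for i in range(len(p)) for j in range(i + 1, len(p) + 1))
    n = len(text)
    return [
        (i - 1, p[i:j + 1])
        for i in range(1, n + 1)
        for j in range(i, n + 1)
        if occ[p[i:j + 1]] >= 2           # repeat
        and occ[p[i - 1:j + 1]] == 1      # unique left extension
        and occ[p[i:j + 2]] == 1          # unique right extension
    ]


print(net_occurrences(fibonacci_word(7)))
# [(0, 'abaaba'), (5, 'abaaba'), (8, 'abaab')]

# Counts for a few further words, for comparison with the stated result.
for k in range(8, 11):
    print(k, len(net_occurrences(fibonacci_word(k))))
```

In $F_7$ = `abaababaabaab` one factor has two net occurrences, the first at the start of the word, and another factor has a single net occurrence, matching the pattern described above.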
5. Relationship to Minimal Unique Substrings and Coverage
A fundamental connection exists between net occurrences and minimal unique substrings (MUSs):
- ENO-MUS Correspondence: The sorted lists of extended net occurrences and MUSs interleave, and one can reconstruct one from the other in output-sensitive time (Mieno et al., 2024).
- Gap Filling: Every gap between consecutive MUSs is a net occurrence of a repeat; conversely, every gap between extended net occurrences is a MUS.
- Characterization: The number of extended net occurrences in $T$ is exactly one less than the number of MUSs, i.e., $|\mathrm{ENO}(T)| = |\mathrm{MUS}(T)| - 1$ (Mieno et al., 2024). This pairing yields efficient algorithms for enumerating both families from succinct representations.
These insights unify net occurrence theory with the broader study of string uniqueness and attractors in repetitive strings.
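The interleaving of MUSs and extended net occurrences can be checked by brute force on short strings. The sketch below rests on assumptions that are not spelled out in the text: the string is padded with sentinels `#` and `$`, each sentinel counts as a MUS of length one, and an extended net occurrence is the net occurrence widened by one position on each side. Function names are hypothetical.

```python
from collections import Counter


def substring_counts(p: str) -> Counter:
    """Occurrence count (overlaps included) of every substring of p."""
    return Counter(p[i:j] for i in range(len(p)) for j in range(i + 1, len(p) + 1))


def minimal_unique_substrings(p: str, occ: Counter):
    """Inclusive (start, end) intervals of MUSs in the padded string p."""
    return [
        (i, j)
        for i in range(len(p))
        for j in range(i, len(p))
        if occ[p[i:j + 1]] == 1
        # Both maximal proper substrings must be repeats (vacuous for length 1).
        and (i == j or (occ[p[i + 1:j + 1]] >= 2 and occ[p[i:j]] >= 2))
    ]


def extended_net_occurrences(p: str, occ: Counter):
    """Inclusive intervals (i - 1, j + 1) around each net occurrence p[i..j]."""
    n = len(p) - 2                        # length of the unpadded text
    return [
        (i - 1, j + 1)
        for i in range(1, n + 1)
        for j in range(i, n + 1)
        if occ[p[i:j + 1]] >= 2 and occ[p[i - 1:j + 1]] == 1 and occ[p[i:j + 2]] == 1
    ]


padded = "#" + "abab" + "$"
occ = substring_counts(padded)
mus = minimal_unique_substrings(padded, occ)
eno = extended_net_occurrences(padded, occ)
print(mus)   # [(0, 0), (2, 3), (5, 5)]
print(eno)   # [(0, 3), (2, 5)]

# Each extended net occurrence spans from the start of one MUS to the end of the next.
print([(s1, e2) for (s1, _), (_, e2) in zip(mus, mus[1:])] == eno)   # True
```

On this example there are three MUSs and two extended net occurrences, one fewer, and each extended net occurrence is delimited by two consecutive MUSs.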
6. Space-Efficient and Online Algorithms
Recent research focuses on maintaining net occurrence sets under streaming or space-restricted models:
- Sliding-Window Algorithms: By tracking active/secondary points in a sliding suffix tree, all extended net occurrences in a window of size $d$ can be maintained in $O(d)$ space and reported on demand (Mieno et al., 2024).
- CDAWG-Based Methods: The implicit CDAWG of the text supports dynamic maintenance and output-optimal reporting of extended net occurrences (ENO) in space proportional to the size of the CDAWG. Constant-time support for all necessary substring queries is achieved through combinatorial extensions (Mieno et al., 2024).
- Compressed Data Structures: Compacted reversed tries over net-positive repeats/NSMRs, constructed from the RLBWT, enable $O(r)$-space query support for net frequencies (Kimura et al., 14 Nov 2025).
These approaches enable deployment in large-scale pipelines (compression, plagiarism detection, document fingerprinting) under stringent memory and latency constraints.
7. Applications and Significance
Net occurrence theory provides a rigorous framework for identifying “context-sensitive” repeats, with applications in:
- Text Compression and Tokenization: Net occurrences partition texts into significant blocks that cannot be subsumed by longer repeats, supporting maximal factorization and grammar-based models (Guo et al., 2024).
- Bioinformatics and Sequence Analysis: Enumeration of net occurrences corresponds to discovering context-diverse repeats, informative for genetic motif identification and DNA repeat masking (Kimura et al., 14 Nov 2025).
- String Kernel Methods: Positive net frequency substrings capture “maximally distinctive” features, enhancing the accuracy and explainability of sequence-based learning methods (Guo et al., 2024).
- Complexity Theory and Lower Bounds: Tight asymptotics for the quantity and structure of net occurrences in morphic words underpin lower-bound constructions for string-processing algorithm runtimes (Guo et al., 5 May 2025).
Advances in both theory and scalable computation position net occurrences as a robust abstraction in symbolic, computational, and applied stringology.