Bidirectional Indexing Scheme
- Bidirectional indexing schemes are advanced data structures that support both leftward and rightward extension and contraction of search patterns.
- They leverage coordinated structures like suffix trees, BWTs, and DAWGs to efficiently handle exact and approximate pattern matching in texts and tries.
- These schemes achieve optimal trade-offs between space and time efficiency, making them ideal for applications such as genome analysis and large-scale text analytics.
A bidirectional indexing scheme is an advanced data structure paradigm that enables both leftward and rightward extension and contraction of search patterns within text collections or tries. Such schemes are foundational in modern string processing, enabling efficient exact or approximate string matching, pattern discovery, and variable-order text analytics. The core idea is to maintain synchronized data structures that permit constant or logarithmic time navigation and update in both directions, supporting applications that require full access to both prefix and suffix information of search patterns.
1. Fundamentals of Bidirectional Indexing
Bidirectional indexing generalizes standard suffix-based text indexes by allowing alternation of left and right search operations. Classically, suffix trees and FM-indexes support either forward (left-to-right) or backward (right-to-left) searches. In bidirectional schemes, two coordinated indexes—typically constructed on the forward and reversed versions of the input—allow tracking and mutation of the search locus on both ends.
Given a text $T$ (or, in the trie case, a labeled tree of $n$ nodes over an alphabet of size $\sigma$), a bidirectional scheme enables, for a current search string $W$, the following primitive operations:
- extendLeft($c$; $W$): Prepend character $c$ to $W$, yielding $cW$.
- extendRight($W$; $c$): Append character $c$ to $W$, yielding $Wc$.
- contractLeft($W$): Remove the leftmost character, returning a suffix of $W$.
- contractRight($W$): Remove the rightmost character, returning a prefix of $W$.
Synchronization of loci in the forward and reverse indexes is critical. It allows extensions and contractions to be issued in any order while guaranteeing that the occurrences reported or traversed are correct and complete. The pioneering approach for labeled tree (trie) indexing combines the explicit suffix tree of the reversed trie with a compact, implicit Directed Acyclic Word Graph (DAWG) representation of the forward trie (Inenaga, 2019). For strings, bidirectional FM-index designs operate over the Burrows-Wheeler transform (BWT) of the text $T$ and of its reverse (Belazzougui et al., 2016).
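To make the semantics of the four primitives concrete, here is a minimal reference sketch in Python. It is not an index: it tracks the occurrence set of the current pattern by scanning the text, so each operation costs linear time. The names `Locus`, `empty_locus`, `extend_left`, `extend_right`, `contract_left`, and `contract_right` are illustrative assumptions rather than an API from the cited papers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    """Occurrences of the current pattern W: its start positions and |W|."""
    text: str
    starts: tuple  # sorted start positions of W in text
    length: int    # |W|

    def pattern(self):
        # Any occurrence spells W; undefined if W does not occur at all.
        if not self.starts:
            raise ValueError("pattern does not occur in the text")
        s = self.starts[0]
        return self.text[s:s + self.length]


def _locate(text, w):
    # Naive scan standing in for a real index lookup.
    return tuple(i for i in range(len(text) - len(w) + 1) if text.startswith(w, i))


def empty_locus(text):
    # The empty pattern occurs at every position 0..len(text).
    return Locus(text, _locate(text, ""), 0)


def extend_left(loc, c):
    # cW occurs at s-1 exactly when W occurs at s and text[s-1] == c.
    starts = tuple(s - 1 for s in loc.starts if s > 0 and loc.text[s - 1] == c)
    return Locus(loc.text, starts, loc.length + 1)


def extend_right(loc, c):
    # Wc occurs at s exactly when W occurs at s and the next character is c.
    n = len(loc.text)
    starts = tuple(s for s in loc.starts
                   if s + loc.length < n and loc.text[s + loc.length] == c)
    return Locus(loc.text, starts, loc.length + 1)


def contract_left(loc):
    # Dropping the first character can only add occurrences, so re-locate.
    if loc.length == 0:
        raise ValueError("cannot contract the empty pattern")
    w = loc.pattern()[1:]
    return Locus(loc.text, _locate(loc.text, w), len(w))


def contract_right(loc):
    # Dropping the last character can only add occurrences, so re-locate.
    if loc.length == 0:
        raise ValueError("cannot contract the empty pattern")
    w = loc.pattern()[:-1]
    return Locus(loc.text, _locate(loc.text, w), len(w))
```

For example, on the text `abracadabra`, calling `extend_right` with `b`, then `extend_left` with `a`, then `extend_right` with `r` builds the pattern `abr`; a following `contract_left` leaves `br`. A real bidirectional index replaces the scans with synchronized loci in two coordinated structures.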
2. Data Structures Supporting Bidirectionality
Efficient bidirectional indexing relies on tightly coupled data structures:
| Structure | Purpose | Space Complexity |
|---|---|---|
| Suffix tree | Leftward (or rightward) traversal via Weiner/suffix links | $O(n)$ words (strings/tries) |
| DAWG (implicit/compact) | Recognizes substrings for extensions in the forward orientation | $O(n)$ (implicit), up to $O(\sigma n)$ explicit |
| BWT (and BWT of reverse) | Succinct representation of suffix intervals for extend/contract | $O(n \log \sigma)$ bits |
| Balanced-parentheses/topology | Succinct tree representation to support parent/ancestor/lca | $O(n)$ bits |
| Run-length BWT (RLBWT) | Space-efficient BWT variant for repetitive texts | $O(r \log n)$ bits for $r$ BWT runs (Cunial et al., 2019) |
Suffix trees and DAWGs are leveraged in the trie case, while for string indexes, dual BWTs and their auxiliary rank/select and Weiner-link data structures are central. Micro-macro decomposition is used to implement DAWG transitions in compact space: the suffix tree is partitioned into micro-trees of bounded size, which allows efficient ancestor and Weiner-link queries (Inenaga, 2019).
The concept of affix trees/arrays—bidirectional combinations of classical suffix and reversed-suffix trees or arrays—is another manifestation of bidirectional indexing. However, these may require quadratic space in the worst case for forward tries, motivating research into compact implicit representations (Inenaga, 2019).
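The BWT and RLBWT rows of the table can be illustrated with a small Python sketch. It builds the BWT by sorting suffixes naively, which takes quadratic time or worse and is only suitable for toy inputs, and it assumes a sentinel `$` that is absent from the text and smaller than every text character. The function names `bwt` and `run_length` are illustrative.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform of text + sentinel via naive suffix sorting."""
    if any(ch <= sentinel for ch in text):
        raise ValueError("sentinel must be absent from and smaller than the text")
    t = text + sentinel
    suffix_array = sorted(range(len(t)), key=lambda i: t[i:])
    # The BWT character of a suffix is the character that precedes it (cyclically).
    return "".join(t[i - 1] for i in suffix_array)


def run_length(s):
    """Run-length encode a string as (character, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


def bidirectional_bwts(text):
    """The pair of transforms a bidirectional BWT index is built on."""
    return bwt(text), bwt(text[::-1])
```

On `banana`, `bwt` returns `annb$aa`, which `run_length` compresses to five runs. Repetitive texts tend to produce long runs, which is what run-length compressed indexes exploit.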
3. Core Algorithms and Efficiency
Bidirectional indexing algorithms are characterized by:
- Construction:
  - Building the suffix tree of the reversed trie and computing the DAWG for the forward trie. In the string case, the BWTs of the text $T$ and of its reverse are constructed in $O(n)$ (deterministic or randomized) time, where $n$ is the text length (Belazzougui et al., 2016).
  - Implicit DAWG representations using micro-macro decomposition can be constructed in $O(n)$ time and space, independent of alphabet size (Inenaga, 2019).
  - Fully-functional bidirectional indexes can be constructed in $O(n)$ (randomized) time and $O(n \log \sigma)$ bits, where $\sigma$ is the alphabet size (Belazzougui et al., 2016, Cunial et al., 2019).
- Query Operations:
  - Extend/Contract: For the trie, each operation (extend-left, extend-right) is implemented in $O(\log \sigma)$ time via edge or Weiner-link simulation in the suffix tree and DAWG. For succinct BWT-based schemes, all four primitive operations run in $O(1)$ time per operation (Belazzougui et al., 2016, Cunial et al., 2019).
  - Enumeration: After locating the desired pattern $P$, its occurrences are output in $O(\mathit{occ})$ time by subtree enumeration, where $\mathit{occ}$ is the number of occurrences.
  - Bidirectional Search Interface: Maintains two search loci: positions in the reverse suffix tree and in the DAWG or BWT/FMI, ensuring that any extension/contraction operation can be mapped to a well-defined state in both structures (Inenaga, 2019, Belazzougui et al., 2016).
In summary, searching for a pattern of length $m$ with $\mathit{occ}$ occurrences costs $O(m \log \sigma + \mathit{occ})$ time (DAWG+STree, trie case (Inenaga, 2019)) or $O(m + \mathit{occ})$ time (bidirectional BWT, string case (Belazzougui et al., 2016, Cunial et al., 2019)), with linear or near-linear preprocessing time and space.
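The synchronized extend operations of the string case can be sketched with a toy bidirectional FM-index in Python. This is a simplified illustration under stated assumptions, not the succinct construction of the cited papers. It builds suffix arrays by naive sorting, stores full rank tables, spends time proportional to the alphabet size per extension, and omits contraction. The class name `BiFMIndex` and its methods are illustrative. A state is a pair of equal-size suffix-array intervals, one for $W$ in the forward index and one for the reverse of $W$ in the reverse index.

```python
class BiFMIndex:
    """Toy bidirectional FM-index over a text and its reverse."""

    def __init__(self, text, sentinel="$"):
        if any(ch <= sentinel for ch in text):
            raise ValueError("sentinel must be absent from and smaller than the text")
        self.sentinel = sentinel
        self.n = len(text) + 1  # length including the sentinel
        self.fwd = self._build(text + sentinel)
        self.rev = self._build(text[::-1] + sentinel)

    @staticmethod
    def _build(t):
        sa = sorted(range(len(t)), key=lambda i: t[i:])  # naive suffix array
        bwt = [t[i - 1] for i in sa]
        alphabet = sorted(set(t))
        C, total = {}, 0  # C[c]: number of characters in t smaller than c
        for c in alphabet:
            C[c] = total
            total += t.count(c)
        # occ[c][i]: number of occurrences of c in bwt[:i]
        occ = {c: [0] * (len(t) + 1) for c in alphabet}
        for i, ch in enumerate(bwt):
            for c in alphabet:
                occ[c][i + 1] = occ[c][i] + (ch == c)
        return {"sa": sa, "C": C, "occ": occ}

    def start(self):
        # Empty pattern: the full interval in both indexes.
        return (0, self.n, 0, self.n)

    def _step(self, index, lo, hi, other_lo, c):
        C, occ = index["C"], index["occ"]
        if c == self.sentinel or c not in C:
            return 0, 0, 0, 0  # empty intervals: no occurrence
        # Backward-search step in the index being extended.
        new_lo = C[c] + occ[c][lo]
        new_hi = C[c] + occ[c][hi]
        # In the other index the new interval is a sub-block; it is preceded
        # by the occurrences whose adjacent character is smaller than c.
        smaller = sum(occ[a][hi] - occ[a][lo] for a in C if a < c)
        new_other_lo = other_lo + smaller
        return new_lo, new_hi, new_other_lo, new_other_lo + (new_hi - new_lo)

    def extend_left(self, state, c):
        lo, hi, rlo, rhi = state
        lo, hi, rlo, rhi = self._step(self.fwd, lo, hi, rlo, c)
        return lo, hi, rlo, rhi

    def extend_right(self, state, c):
        lo, hi, rlo, rhi = state
        rlo, rhi, lo, hi = self._step(self.rev, rlo, rhi, lo, c)
        return lo, hi, rlo, rhi

    def occurrences(self, state):
        lo, hi, _, _ = state
        return sorted(self.fwd["sa"][lo:hi])
```

Starting from `start()` on `abracadabra`, the sequence `extend_right(.., "b")`, `extend_left(.., "a")`, `extend_right(.., "r")` yields a state whose `occurrences` are the start positions of `abr`. Replacing the rank tables with succinct rank/select structures is what brings the per-operation cost down to the bounds quoted above.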
4. Underlying Theoretical Principles
Central to efficient bidirectional indexing are several theoretical constructs:
- Weiner Links: These generalize the notion of extending substrings in the suffix tree by prepending a character. Hard Weiner links correspond to primary transitions in the DAWG; soft Weiner links correspond to secondary transitions. The simulation of arbitrary DAWG traversals in linear space leverages this duality (Inenaga, 2019).
- Suffix Link and Affix Link Mapping: The interplay between suffix links, reverse-suffix links, and affix links allows bidirectional navigation and mapping between corresponding loci in forward and reverse trees (Inenaga, 2019, Cunial et al., 2019).
- Balanced-Parentheses Representation: This succinct data structure enables $O(1)$-time navigation (parent, ancestor, lca, child queries) within suffix trees, further facilitating constant-time contract/extend operations (Belazzougui et al., 2016, Cunial et al., 2019).
- Run-Length Compression and Compact BWTs: For repetitive texts, run-length compressed BWTs markedly reduce space overhead while preserving efficient operation support (Cunial et al., 2019).
- Search Scheme Formalism: In approximate matching, the restructuring of pattern search as a traversal over search schemes with partitioned/ordered blocks enables optimal trade-offs between index operations and search space enumerations (Kucherov et al., 2013).
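The search-scheme idea in the last item can be illustrated for the simplest case, one mismatch under Hamming distance, with a Python sketch. The pattern is split into two halves and covered by two searches: one reads left to right with the left half exact and at most one mismatch in the right half; the other matches the right half exactly and then extends leftward with exactly one mismatch. The second search is what requires bidirectional extension. This is a simplified sketch of the general principle, not the optimized schemes of Kucherov et al.; the extension step scans the text naively in place of an index, and all function names are illustrative.

```python
def extend(text, starts, length, c, side):
    """Start positions of Wc (side 'R') or cW (side 'L') given those of W."""
    if side == "R":
        return [s for s in starts
                if s + length < len(text) and text[s + length] == c]
    return [s - 1 for s in starts if s > 0 and text[s - 1] == c]


def run_search(text, pattern, steps, min_errors, alphabet):
    """Run one search. steps is a list of (pattern index, side, error cap):
    after handling that index the mismatch count must not exceed the cap."""
    results = []

    def rec(j, starts, length, errors):
        if not starts:
            return  # no occurrence left: prune this branch
        if j == len(steps):
            if errors >= min_errors:
                results.extend((s, errors) for s in starts)
            return
        idx, side, cap = steps[j]
        for c in alphabet:
            e = errors + (c != pattern[idx])
            if e <= cap:
                rec(j + 1, extend(text, starts, length, c, side), length + 1, e)

    rec(0, list(range(len(text) + 1)), 0, 0)
    return results


def one_mismatch_occurrences(text, pattern):
    """All (start, mismatches) pairs with at most one mismatch."""
    m = len(pattern)
    mid = m // 2
    alphabet = sorted(set(text))
    # Search 1: left to right; left half exact, up to one mismatch afterwards.
    s1 = [(i, "R", 0 if i < mid else 1) for i in range(m)]
    # Search 2: right half exact, then leftward with exactly one mismatch.
    s2 = ([(i, "R", 0) for i in range(mid, m)]
          + [(i, "L", 1) for i in range(mid - 1, -1, -1)])
    hits = (run_search(text, pattern, s1, 0, alphabet)
            + run_search(text, pattern, s2, 1, alphabet))
    return sorted(hits)
```

The two searches cover disjoint error distributions (no error or one error in the right half, versus one error in the left half), so no occurrence is reported twice. Starting each search with an exact half keeps the branching over alternative characters away from the shortest, least selective prefixes.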
5. Applications and Practical Significance
Bidirectional indexing schemes are essential in:
- Exact and Approximate Pattern Matching: All combinations of prefix and suffix extension/contraction are possible, supporting advanced approximate matching paradigms such as search schemes, which optimize error distribution coverage and minimize enumeration complexity (Kucherov et al., 2013).
- Genome and Sequence Analysis: High-throughput DNA sequencing requires indexing schemes that can manage bidirectional walks on large-scale, repetitive texts under stringent space and query time constraints (Belazzougui et al., 2016, Cunial et al., 2019).
- Variable-Order and de Bruijn Graph Analytics: Fully-functional indexes support frequency-aware, variable-order traversal in de Bruijn graphs, providing node and arc frequency computations and dynamic order changes on-the-fly without pre-set bounds (Cunial et al., 2019).
- Compressed Representation of Repetitive Collections: Space-efficient bidirectional indexes using CDAWG or run-length compressed BWTs enable scalable analysis of large and repetitive string datasets (Cunial et al., 2019).
6. Trade-offs, Limitations, and Variants
Different bidirectional scheme variants exhibit distinct trade-offs:
| Index Type | Space Complexity | Extension/Contraction Time |
|---|---|---|
| DAWG+STree (trie, implicit DAWG) | $O(n)$ words | $O(\log \sigma)$ per extension |
| Affix tree/array | $O(n^2)$ worst case (forward trie) | impractical due to size |
| Bidirectional BWT/FM-index | $O(n \log \sigma)$ bits | $O(1)$ |
| CDAWG-based (repetitive strings) | $O(e)$ words ($e$: left/right extensions of maximal repeats) | $O(\log \log n)$ |
- Large alphabets ($\sigma = \Theta(n)$) in forward tries induce quadratic worst-case space in explicit DAWG constructions, motivating the linear-space implicit representation (Inenaga, 2019).
- Fully-functional bidirectional indexes allow both extension and contraction in $O(1)$ time; earlier constructions achieved $O(1)$ time for extension but supported contraction only from specific substrings (Cunial et al., 2019).
- For highly repetitive texts, CDAWG-based approaches reduce space to $O(e)$ words, where $e$ is the total number of left/right extensions of maximal repeats, with sub-logarithmic per-operation time (Cunial et al., 2019).
7. Illustrative Example
Consider a forward trie together with the corresponding reversed trie used for bidirectional indexing. The explicit suffix tree of the reversed trie has $O(n)$ size, and the implicit DAWG representation for the forward trie is maintained in $O(n)$ space via micro-macro decomposition and simulation of Weiner links. A pattern search such as "ba" is performed by alternately issuing extend-right and extend-left operations, maintaining loci in both the suffix tree and the DAWG, and enumerating occurrences directly from the suffix tree subtree rooted at the final locus (Inenaga, 2019).
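The alternating search for "ba" can be traced on a small trie with the following Python sketch. The trie used here, built from the hypothetical word set `aba`, `bba`, `ab`, is an assumption chosen for illustration and is not the example from the source. The code walks the trie directly and keeps every occurrence as a (top node, bottom node) pair, so it shows the semantics of the operations rather than the linear-space suffix tree and DAWG machinery.

```python
class Node:
    def __init__(self, parent=None, label=None):
        self.parent = parent    # None for the root
        self.label = label      # character on the edge from the parent
        self.children = {}


def build_trie(words):
    """Forward trie of the given words; returns the root and all nodes."""
    root = Node()
    nodes = [root]
    for w in words:
        cur = root
        for ch in w:
            if ch not in cur.children:
                cur.children[ch] = Node(cur, ch)
                nodes.append(cur.children[ch])
            cur = cur.children[ch]
    return root, nodes


def start(nodes):
    # The empty pattern occurs at every node: top and bottom coincide.
    return [(v, v) for v in nodes]


def extend_right(occ, c):
    # Move the bottom end of each occurrence down along an edge labeled c.
    return [(top, bot.children[c]) for top, bot in occ if c in bot.children]


def extend_left(occ, c):
    # Move the top end up when the edge entering it is labeled c.
    return [(top.parent, bot) for top, bot in occ if top.label == c]


def path_label(node):
    """String spelled from the root down to node."""
    chars = []
    while node.parent is not None:
        chars.append(node.label)
        node = node.parent
    return "".join(reversed(chars))


def find_ba(words):
    """Paths on which 'ba' ends, found by extend-right then extend-left."""
    _, nodes = build_trie(words)
    occ = extend_right(start(nodes), "a")   # pattern is now "a"
    occ = extend_left(occ, "b")             # pattern is now "ba"
    return sorted(path_label(bot) for _, bot in occ)
```

With the word set above, `find_ba` reports the root-to-node paths `aba` and `bba`, the two places where "ba" ends in the trie.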
Bidirectional indexing schemes unify efficient bidirectional pattern matching, compact representation, and navigability in text and trie settings, combining linear or succinct space with constant or logarithmic time per operation (Inenaga, 2019, Belazzougui et al., 2016, Cunial et al., 2019, Kucherov et al., 2013).