Lyndon Grammar: Efficient Compression & Indexing

Updated 22 January 2026
  • The Lyndon grammar approach builds on the unique factorization of strings into Lyndon words, enabling concise grammar-based compression.
  • It leverages combinatorial structures like Lyndon trees and straight-line programs to facilitate efficient Burrows-Wheeler Transform construction and self-indexing.
  • By exploiting repetitive patterns in massive datasets, the method achieves significant compression and enhanced performance in text indexing and pattern matching.

The Lyndon grammar approach applies the combinatorial theory of Lyndon words to grammar-based text compression and efficient sequence indexing. By leveraging the unique factorization of strings into Lyndon factors and exploiting repetitiveness, Lyndon grammars enable compact grammar representations, facilitate efficient Burrows-Wheeler Transform (BWT) construction, and support self-indexing of massive datasets, including inputs whose decompressed size may be exponential in their compressed representation. This method provides both theoretical and practical advances in grammar compression, compressed computation, and succinct data structure design.

1. Definitions: Lyndon Words, Factorization, and the Lyndon Tree

Let $\Sigma$ be a totally ordered alphabet. A non-empty word $w \in \Sigma^+$ is a Lyndon word if $w < v$ in the lexicographic order for every nontrivial rotation $v$ of $w$. Equivalently, $w$ is strictly lexicographically smaller than each of its proper non-empty suffixes. The foundation for the Lyndon grammar approach is the Chen–Fox–Lyndon theorem: every string $S \in \Sigma^+$ can be factorized uniquely as

$S = L_1 L_2 \cdots L_k,$

where each $L_i$ is a Lyndon word and $L_1 \geq L_2 \geq \dots \geq L_k$ lexicographically. This is known as the Lyndon factorization of $S$; for explicit inputs it can be computed in linear time via Duval's algorithm (Olbrich, 27 Apr 2025, Tsuruta et al., 2020, Badkobeh et al., 2020).
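The factorization above can be computed by Duval's algorithm; the following is a minimal Python sketch (the function name is ours, not from the cited papers):

```python
def lyndon_factorization(s: str) -> list[str]:
    """Duval's algorithm: split s into non-increasing Lyndon factors in linear time."""
    factors, i, n = [], 0, len(s)
    while i < n:
        # scan the longest prefix of s[i:] that is a power of a Lyndon word
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # emit copies of that Lyndon word; its length is the period j - k
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors
```

For example, `lyndon_factorization("banana")` yields `['b', 'an', 'an', 'a']`: a non-increasing sequence of Lyndon words whose concatenation is the input.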

A pivotal recursive structure induced by the standard factorization (for a Lyndon word $w$ of length at least $2$, there is a unique decomposition $w = uv$ such that $v$ is the longest proper suffix of $w$ that is itself Lyndon) is the Lyndon tree, a full binary tree in which each inner node corresponds to the standard factorization of a Lyndon substring. This tree structure directly corresponds to a context-free grammar (straight-line program, SLP) whose nonterminals expand according to these Lyndon splits (Tsuruta et al., 2020, Badkobeh et al., 2020).
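The standard factorization and the resulting Lyndon tree can be sketched brute-force as follows (quadratic time, for illustration only; the helper names are ours):

```python
def is_lyndon(w: str) -> bool:
    # w is Lyndon iff it is strictly smaller than all of its proper non-empty suffixes
    return all(w < w[i:] for i in range(1, len(w)))

def standard_factorization(w: str) -> tuple[str, str]:
    # split w = u v, where v is the longest proper Lyndon suffix of w
    for i in range(1, len(w)):
        if is_lyndon(w[i:]):
            return w[:i], w[i:]

def lyndon_tree(w: str):
    # full binary tree: every inner node is a standard factorization
    if len(w) == 1:
        return w
    u, v = standard_factorization(w)
    return (lyndon_tree(u), lyndon_tree(v))
```

For instance, `lyndon_tree("aabab")` produces `(('a', ('a', 'b')), ('a', 'b'))`, reflecting the splits aabab = aab·ab and aab = a·ab.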

2. Grammar Construction: The Lyndon SLP and Algorithmic Variants

A straight-line program (SLP) is a context-free grammar in Chomsky normal form that derives exactly one string. In the Lyndon grammar approach, a Lyndon SLP $G = \{X_1, \ldots, X_g\} \cup \Sigma$ is constructed so that:

  • Each nonterminal $X_i$ expands according to the standard factorization of the Lyndon word it generates, i.e., $X_i \rightarrow X_a X_b$ if $val(X_i) = val(X_a)\,val(X_b)$ and this matches the standard factorization.
  • Terminal rules have the form $X_i \rightarrow c$ for $c \in \Sigma$.
  • For multiple input sequences (e.g., in multiset or collection settings), several start symbols $r_1, \ldots, r_k$ can be used, whose derived words are Lyndon and ordered $r_1 \geq r_2 \geq \cdots \geq r_k$ lexicographically (Olbrich, 27 Apr 2025).

The construction can be performed via several algorithmic variants:

Algorithm                    | Time complexity            | Working memory | Notes
Naïve stack + suffix compare | $O(Ng)$                    | $O(g)$         | Simple; uses explicit comparisons for merging
Sorted B-tree variant        | $O(N \log g + g \log^2 g)$ | $O(g)$         | Lexicographic comparison via balanced trees
Randomized hash variant      | $O(N)$ expected            | $O(g)$         | Fastest; merges via hashing of subtree pairs

Here, $N$ is the input text length, $g$ is the number of distinct Lyndon subtrees (grammar size), and the working memory always depends on $g$. On highly repetitive inputs, typically $g \ll N$ (Olbrich, 27 Apr 2025, Tsuruta et al., 2020).
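The naïve stack variant can be sketched as follows: scan the text left to right, push each symbol as a trivial Lyndon factor, and merge the top two factors whenever their concatenation is again Lyndon. The quadratic `is_lyndon` check below stands in for the paper's explicit suffix comparisons; names are ours:

```python
def is_lyndon(w: str) -> bool:
    return all(w < w[i:] for i in range(1, len(w)))

def stack_factorization(s: str) -> list[str]:
    """Naive stack sketch: the final stack holds the Lyndon factorization of s.
    In the real construction, each successful merge would also emit a grammar rule."""
    stack: list[str] = []
    for c in s:
        stack.append(c)
        # two adjacent Lyndon factors u, v merge exactly when u + v is again Lyndon
        while len(stack) >= 2 and is_lyndon(stack[-2] + stack[-1]):
            v = stack.pop()
            stack[-1] += v
    return stack
```

`stack_factorization("banana")` returns `['b', 'an', 'an', 'a']`, matching Duval's factorization; emitting a rule $X_i \to X_a X_b$ at every merge yields the Lyndon tree and its grammar.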

An alternative left-to-right construction based on the left Lyndon suffix table and tree achieves $O(n)$ letter-comparison cost for explicit input and produces a Lyndon-tree grammar with at most $2n-1$ symbols for input of length $n$ (Badkobeh et al., 2020).

3. Properties of the Lyndon Grammar: Rules, Size, and Structure

Lyndon grammars are uniquely determined by their standard factorizations, with the following properties:

  • Nonterminals are introduced via the standard factorization test: whenever two consecutive Lyndon factors can be merged (by checking whether their concatenation remains Lyndon), a rule $X_i \to X_a X_b$ is created.
  • The grammar grows only when such merges succeed; this is highly beneficial on repetitive data, where identical Lyndon subtrees recur, leading to a significant reduction $g \ll N$ (Olbrich, 27 Apr 2025).
  • For a grammar $G$ with $R$ non-terminal rules and $k$ start symbols, $|G| = R + k$.
  • Depth and structure: the depth of the grammar, $depth(G)$, is determined by the height of the underlying Lyndon forest. For random strings, $depth(G) = O(\log N)$; in the worst case $|G| = \Theta(N)$, but the grammar is much smaller on practical, repetitive sequences (Olbrich, 27 Apr 2025, Badkobeh et al., 2020).
  • For collection input, each sequence is reduced to its unique Lyndon root (lexicographically minimal rotation), and concatenation proceeds in decreasing order (Olbrich, 27 Apr 2025).

A crucial distinction is that all identical Lyndon subtrees (i.e., repeated Lyndon factors) are merged into a single nonterminal, enabling compactness analogous to RePair and other grammar-based compressors, yet tailored to the combinatorics of repetitions and periodicities within the Lyndon factor lattice (Tsuruta et al., 2020).
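The subtree merging can be illustrated by hash-consing the Lyndon tree: every distinct expanded string gets exactly one nonterminal, so repeated Lyndon subtrees collapse. This is a quadratic-time sketch under our own naming, not one of the cited constructions:

```python
def build_lyndon_slp(w: str):
    """Build a Lyndon SLP for the Lyndon word w.
    Returns (start nonterminal id, rules: id -> terminal char or pair of child ids)."""
    rules = {}
    memo = {}  # expanded string -> nonterminal id: this map IS the subtree merging

    def build(s: str) -> int:
        if s in memo:
            return memo[s]
        if len(s) == 1:
            rule = s
        else:
            # standard factorization: longest proper Lyndon suffix of s
            for i in range(1, len(s)):
                v = s[i:]
                if all(v < v[j:] for j in range(1, len(v))):
                    rule = (build(s[:i]), build(v))
                    break
        memo[s] = node = len(rules)
        rules[node] = rule
        return node

    return build(w), rules
```

For the Lyndon word `"aabab"` this yields 5 grammar symbols (for a, b, ab, aab, aabab) instead of the 9 nodes of the uncompressed Lyndon tree, because the repeated subtree for `ab` is shared.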

4. Applications: BWT Construction, Self-Indexing, and Pattern Matching

Efficient BWT and eBWT Construction

A primary application of the Lyndon grammar approach is the construction of the Burrows-Wheeler Transform (BWT) and extended-BWT (eBWT) directly from the grammar without full decompression:

  • After sorting the nonterminals lexicographically, the BWT can be induced in a single left-to-right pass over the grammar, emitting run-length encoded (RLE) BWT representations.
  • For each grammar rule $X_i \to X_a X_b$, positions in the text corresponding to $X_b$'s infinite-periodic order yield adjacent BWT runs.
  • The procedure maintains a list $L[0..2g)$ of runs and streams them in grammar order to output the RLE-BWT in $O(N)$ time and $O(g)$ space (plus $O(r)$ for the RLE output, where $r$ is the number of runs) (Olbrich, 27 Apr 2025).
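For intuition about the output format only, here is a naive RLE-BWT computed from sorted rotations with a sentinel; the grammar-based procedure described above produces the same runs without ever materializing the text (the function and sentinel choice are our own):

```python
def rle_bwt(s: str):
    """Naive BWT via sorted rotations of s + '$', then run-length encoding."""
    t = s + "$"  # sentinel, assumed smaller than every other symbol
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    bwt = "".join(r[-1] for r in rotations)  # last column of the sorted rotations
    runs = []
    for c in bwt:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    return bwt, runs
```

`rle_bwt("banana")` gives the BWT `annb$aa` with runs `[('a', 1), ('n', 2), ('b', 1), ('$', 1), ('a', 2)]`; on repetitive texts the number of runs $r$ is far smaller than $N$.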

For the eBWT over a collection $\mathcal{M} = \{S_1, \ldots, S_t\}$, each $S_i$ is first canonicalized to its Lyndon root, then the same procedure applies to the ordered concatenation (Olbrich, 27 Apr 2025).
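Canonicalization to the Lyndon root can be sketched as taking the lexicographically minimal rotation, which for a primitive string is a Lyndon word (naive quadratic version; linear-time alternatives such as Booth's algorithm exist):

```python
def lyndon_root(s: str) -> str:
    """Lexicographically minimal rotation of s (naive O(n^2) sketch)."""
    return min(s[i:] + s[:i] for i in range(len(s)))
```

For example, `lyndon_root("banana")` is `"abanan"`; a collection is canonicalized by replacing each $S_i$ with its root before ordering the roots decreasingly.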

Self-Indexing

Lyndon SLPs support compressed self-indexing in $O(g)$ space:

  • Pattern matching for a pattern $P$ of length $m$ over a text compressed by its Lyndon SLP (size $g$) can be performed in $O(m + \log m \log n + occ \log g)$ time, where $n$ is the original text length and $occ$ is the number of pattern occurrences.
  • The approach identifies "partition pairs," i.e., possible divisions of $P$ induced by grammar productions into suffix/prefix alignments with Lyndon factors. The number of significant suffixes (candidates) for such divisions is $O(\log m)$ (Tsuruta et al., 2020).
  • All structures (SLP, fingerprinting, random access, z-fast tries) can be built in $O(n \log n)$ expected time using appropriate data structures (Tsuruta et al., 2020).

Compressed Lyndon Factorization

Given a string $w$ in grammar-compressed form (SLP of size $n$, height $h$), its Lyndon factorization can be computed in $O(n^4 + mn^3h)$ time and $O(n^2)$ space, with $m$ the factorization size. This is the first polynomial-time result of this kind when $|w|$ may be exponentially large in $n$ (I et al., 2013).

5. Empirical and Theoretical Performance

Experimental results confirm that Lyndon grammar-based algorithms are efficient on large, highly repetitive datasets:

  • Datasets comprising gigabase-scale genomic collections (human haplotypes, SARS-CoV-2 genomes), with up to $6 \times 10^{10}$ symbols, are processed.
  • Compared with SA-IS, Big-BWT, recursive PFP, and other advanced BWT tools, the Lyndon grammar approach computes the BWT/eBWT up to $2$–$3\times$ faster or using $2$–$5\times$ less RAM, especially under multithreading.
  • Grammar construction is the bottleneck (~98% of total runtime in the fastest variants), yet even naïve implementations outperform many specialized parsers on real data.
  • Parallel scaling is favorable: with up to 32 threads, wall time drops to approximately $1/10$ of the baseline; memory increases moderately to accommodate thread-safe data structures (Olbrich, 27 Apr 2025).

Theoretical guarantees ensure that grammar size is minimized by merging identical subtrees, with worst-case size linear in the input but much smaller on repetitive texts. Deep grammars arise only in contrived worst-case patterns (e.g., $a^{n-1}b$) (Olbrich, 27 Apr 2025, Badkobeh et al., 2020).

6. Limitations and Open Directions

Key limitations include:

  • Lyndon grammar construction requires $O(g)$ working memory, which may be $\Theta(N)$ in the worst case, although practical performance is superior on repetitive inputs.
  • Suffix-comparison variants trade off simplicity and speed; efficient hashing or balanced tree variants mitigate worst-case behavior at the cost of algorithmic complexity (Olbrich, 27 Apr 2025).
  • The minimal grammar problem remains NP-hard; Lyndon-based SLPs are not always size-optimal compared to Sequitur, RePair, or other compressors, though they are more canonical and better-suited for certain combinatorial properties (Badkobeh et al., 2020).
  • Future work directions include improved parallelization, support for very large diverse collections with 64-bit symbol tables, and seeking faster algorithms for Lyndon factorization when input is in LZ78 or other compressed forms (Olbrich, 27 Apr 2025, I et al., 2013).

7. Connections to Other Compressed Computation and Stringology

The Lyndon grammar approach unifies combinatorics on words, compressed data structures, and efficient algorithm design. It provides a means to:

  • Relate grammar-based compression to lex order and repetitive structure via the unique properties of Lyndon words;
  • Extend SLP techniques with canonical binary trees reflecting periodicities in input, benefiting pattern matching and suffix sorting tasks;
  • Suggest new optima in run-length encoding for BWT, self-indexing, and compressed computation in massive genomics or text repositories (Olbrich, 27 Apr 2025, Tsuruta et al., 2020, Badkobeh et al., 2020).

A plausible implication is that the Lyndon grammar approach, by canonically capturing repetitive structure, can serve as a foundation for advanced compressed computation (indexing, matching, transform computation) where both theory and practice require handling extremely large, repetitive datasets efficiently.
