Lyndon Grammar: Efficient Compression & Indexing
- The Lyndon Grammar Approach is defined by the unique factorization of strings into Lyndon words, enabling concise grammar-based compression.
- It leverages combinatorial structures like Lyndon trees and straight-line programs to facilitate efficient Burrows-Wheeler Transform construction and self-indexing.
- By exploiting repetitive patterns in massive datasets, the method achieves significant compression and enhanced performance in text indexing and pattern matching.
The Lyndon grammar approach applies the combinatorial theory of Lyndon words to grammar-based text compression and efficient sequence indexing. By leveraging the unique factorization of strings into Lyndon factors and exploiting repetitiveness, Lyndon grammars enable compact grammar representations, facilitate efficient Burrows-Wheeler Transform (BWT) construction, and support self-indexing on massive datasets with potentially exponential input size. This method provides both theoretical and practical advances in grammar compression, compressed computation, and succinct data structure design.
1. Definitions: Lyndon Words, Factorization, and the Lyndon Tree
Let $\Sigma$ be a totally ordered alphabet. A non-empty word $w$ is a Lyndon word if, for every nontrivial rotation $w'$ of $w$, one has $w < w'$ in the lexicographic order. Equivalently, $w$ is strictly lexicographically smaller than every proper non-empty suffix of $w$. The foundation for the Lyndon grammar approach is the Chen–Fox–Lyndon theorem: every string $T$ can be factorized uniquely as
$T = f_1 f_2 \cdots f_k$, where each $f_i$ is a Lyndon word and $f_1 \ge f_2 \ge \cdots \ge f_k$ lexicographically. This is known as the Lyndon factorization of $T$, and it can be computed in linear time for explicit inputs via Duval's algorithm (Olbrich, 27 Apr 2025, Tsuruta et al., 2020, Badkobeh et al., 2020).
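Duval's algorithm itself is compact; the following Python sketch computes the factorization in linear time:

```python
def lyndon_factorization(s: str) -> list[str]:
    """Chen-Fox-Lyndon factorization via Duval's algorithm, O(n) time."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Extend while s[i..j) is a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit the completed Lyndon factors, each of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors
```

For `"banana"` this returns `['b', 'an', 'an', 'a']`, a lexicographically non-increasing sequence of Lyndon words whose concatenation restores the input.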
A pivotal recursive structure is induced by the standard factorization: for a Lyndon word $w$ of length at least $2$, there is a unique decomposition $w = uv$ such that $v$ is the longest proper suffix of $w$ which is itself Lyndon (the prefix $u$ is then also Lyndon, with $u < v$). Iterating this decomposition yields the Lyndon tree, a full binary tree in which each inner node corresponds to the standard factorization of a Lyndon substring. This tree structure directly corresponds to a context-free grammar (straight-line program, SLP) whose nonterminals expand according to these Lyndon splits (Tsuruta et al., 2020, Badkobeh et al., 2020).
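The recursion can be sketched directly; the following is a naive quadratic illustration (the constructions in the cited papers are linear or near-linear), with a plain tuple as the hypothetical tree representation:

```python
def is_lyndon(w: str) -> bool:
    """w is Lyndon iff it is strictly smaller than all its proper suffixes."""
    return all(w < w[i:] for i in range(1, len(w)))

def lyndon_tree(w: str):
    """Lyndon tree of a Lyndon word via the standard factorization w = u v,
    where v is the longest proper Lyndon suffix. Leaves are single characters;
    inner nodes are (left, right) tuples. Naive quadratic sketch."""
    assert is_lyndon(w)
    if len(w) == 1:
        return w                      # leaf: a single terminal
    for i in range(1, len(w)):        # scan suffixes, longest first
        if is_lyndon(w[i:]):          # first hit = longest proper Lyndon suffix
            return (lyndon_tree(w[:i]), lyndon_tree(w[i:]))
```

For `"aabab"` the standard factorization is `aab . ab`, so the tree is `(('a', ('a', 'b')), ('a', 'b'))`; note that the subtree for `ab` occurs twice, which is exactly what a grammar merges into one nonterminal.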
2. Grammar Construction: The Lyndon SLP and Algorithmic Variants
A straight-line program (SLP) is a context-free grammar in Chomsky normal form that derives exactly one string. In the Lyndon grammar approach, a Lyndon SLP is constructed so that:
- Each nonterminal expands according to the standard factorization of the Lyndon word it generates, i.e., if $X \to YZ$, then $\mathrm{val}(X) = \mathrm{val}(Y)\,\mathrm{val}(Z)$ is the standard factorization of $\mathrm{val}(X)$.
- Terminals are introduced via rules of the form $A \to a$ for each $a \in \Sigma$.
- For multiple input sequences (e.g., in multiset or collection settings), several start symbols can be used, whose derived words are Lyndon and ordered lexicographically (Olbrich, 27 Apr 2025).
The construction can be performed via several algorithmic variants:
| Algorithm | Notes |
|---|---|
| Naïve stack + suffix compare | Simple; uses explicit suffix comparisons for merging |
| Sorted B-tree variant | Lexicographic-order comparisons via balanced search trees |
| Randomized hash variant | Fastest; expected-time bounds; merges via hashing of subtree pairs |
Here, $n$ is the input text length and $g$ is the number of distinct Lyndon subtrees (the grammar size); the working memory of all variants depends on $g$. On highly repetitive inputs, typically $g \ll n$ (Olbrich, 27 Apr 2025, Tsuruta et al., 2020).
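A simplified sketch of the naïve stack variant, assuming the standard fact that two adjacent Lyndon words $u < v$ concatenate to a Lyndon word $uv$; identical subtrees are hash-consed so repeated factors share one nonterminal. The `intern` helper and integer rule ids are illustrative choices, not the paper's data structures:

```python
def lyndon_grammar(text: str):
    """Naive sketch: scan left to right, keep a stack of Lyndon factors,
    and merge the top two whenever the lower one is lexicographically
    smaller (u < v with u, v Lyndon implies uv is Lyndon)."""
    rules = {}   # nonterminal id -> (left id, right id) or a terminal char
    memo = {}    # hash-consing: rule body -> existing nonterminal id
    def intern(body):
        if body not in memo:
            memo[body] = len(rules)
            rules[len(rules)] = body
        return memo[body]
    stack = []   # entries: (expanded string, nonterminal id)
    for c in text:
        stack.append((c, intern(c)))
        while len(stack) >= 2 and stack[-2][0] < stack[-1][0]:
            (u, x), (v, y) = stack[-2], stack[-1]
            del stack[-2:]
            stack.append((u + v, intern((x, y))))
    roots = [x for _, x in stack]  # Lyndon factors, non-increasing order
    return rules, roots
```

On `"abab"` both occurrences of `ab` intern to the same nonterminal, so the grammar stays at three rules (two terminals plus one binary rule) however often the factor repeats; this is the source of the $g \ll n$ behavior on repetitive inputs.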
An alternative, left-to-right construction based on the left Lyndon suffix table and tree achieves linear letter-comparison cost for explicit input and produces a Lyndon-tree grammar with at most $2n-1$ symbols for an input of length $n$ (Badkobeh et al., 2020).
3. Properties of the Lyndon Grammar: Rules, Size, and Structure
Lyndon grammars are uniquely determined by their standard factorizations, with the following properties:
- Nonterminals are introduced via the standard factorization test: whenever two consecutive Lyndon factors can be merged (by checking if their concatenation remains Lyndon), a rule is created.
- The grammar grows only when such merges succeed; this is highly beneficial on repetitive data where identical Lyndon subtrees recur, leading to significant reduction (Olbrich, 27 Apr 2025).
- For a grammar with $g$ non-terminal rules and $s$ start symbols, the total size is $g + s$ rules.
- Depth and structure: The depth of the grammar is the height of the underlying Lyndon forest. For random strings the expected depth is $O(\log n)$; in the worst case it is $\Theta(n)$, but it is much smaller on practical, repetitive sequences (Olbrich, 27 Apr 2025, Badkobeh et al., 2020).
- For collection input, each sequence is reduced to its unique Lyndon root (lexicographically minimal rotation), and concatenation proceeds in decreasing order (Olbrich, 27 Apr 2025).
A crucial distinction is that all identical Lyndon subtrees (i.e., repeated Lyndon factors) are merged into a single nonterminal, enabling compactness analogous to RePair and other grammar-based compressors, yet tailored to the combinatorics of repetitions and periodicities within the Lyndon factor lattice (Tsuruta et al., 2020).
4. Applications: BWT Construction, Self-Indexing, and Pattern Matching
Efficient BWT and eBWT Construction
A primary application of the Lyndon grammar approach is the construction of the Burrows-Wheeler Transform (BWT) and extended-BWT (eBWT) directly from the grammar without full decompression:
- After sorting the nonterminals lexicographically, the BWT can be induced in a single left-to-right pass over the grammar, emitting run-length encoded (RLE) BWT representations.
- For each grammar rule $X \to YZ$, the text positions associated with $X$, taken in their infinite-periodic (omega) order, yield adjacent BWT runs.
- The procedure maintains a list of runs and streams them in grammar order to output the RLE-BWT in time and space proportional to the grammar size $g$, plus the cost of the RLE output itself, which is proportional to the number of runs $r$ (Olbrich, 27 Apr 2025).
For the eBWT over a collection $\mathcal{S} = \{S_1, \ldots, S_m\}$, each $S_i$ is first canonicalized to its Lyndon root, and the same procedure then applies to the ordered concatenation (Olbrich, 27 Apr 2025).
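As a tiny end-to-end illustration (naive and quadratic, nothing like the grammar-based streaming procedure above): canonicalize each sequence to its Lyndon root, then read off the last characters of all rotations in omega order. Truncating the periodic comparison key at twice the longest root length is a simplification that suffices to separate distinct periodic words, in the spirit of Fine and Wilf's theorem:

```python
def lyndon_root(s: str) -> str:
    """Lexicographically least rotation of a primitive string -- its Lyndon
    conjugate. Naive O(n^2) sketch; Booth's algorithm achieves O(n)."""
    return min(s[i:] + s[:i] for i in range(len(s)))

def ebwt(collection):
    """Naive extended BWT sketch over a collection of primitive strings."""
    roots = [lyndon_root(s) for s in collection]
    L = 2 * max(len(s) for s in roots)
    rots = []
    for s in roots:
        for i in range(len(s)):
            r = s[i:] + s[:i]
            key = (r * (L // len(r) + 1))[:L]  # truncated omega-order key
            rots.append((key, r[-1]))          # keep the preceding character
    rots.sort()
    return "".join(last for _, last in rots)
```

For the single-string collection `["banana"]`, the Lyndon root is `abanan` and the output is `nnbaaa`, the familiar rotation-based BWT of `banana`.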
Self-Indexing
Lyndon SLPs support compressed self-indexing within space proportional to the grammar size $g$:
- Pattern matching for a pattern of length $m$ over a text compressed by its Lyndon SLP (of size $g$) is supported in time near-linear in $m$, with polylogarithmic (in $n$) overhead per reported occurrence, where $n$ is the original text length and $\mathit{occ}$ is the number of pattern occurrences.
- The approach identifies "partition pairs," i.e., possible divisions of the pattern induced by grammar productions into suffix/prefix alignments with Lyndon factors. The number of significant suffixes (candidates) for such divisions is logarithmic in the string length (Tsuruta et al., 2020).
- All structures (SLP, fingerprinting, random access, z-fast tries) can be built in expected linear time using appropriate randomized data structures (Tsuruta et al., 2020).
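One of these ingredients can be sketched concretely: Karp–Rabin fingerprints compose over SLP rules, $\phi(uv) = \phi(u) \cdot B^{|v|} + \phi(v) \pmod P$, so every nonterminal's fingerprint follows bottom-up in constant time per rule without decompression. The modulus, base, and the assumption that child ids are smaller than parent ids are illustrative choices:

```python
P, B = (1 << 61) - 1, 256  # illustrative Mersenne-prime modulus and base

def slp_fingerprints(rules):
    """Fingerprint and expanded length of every SLP symbol, bottom-up.
    Assumes (illustratively) that children have smaller ids than parents."""
    fp, ln = {}, {}
    for sym in sorted(rules):
        body = rules[sym]
        if isinstance(body, str):              # terminal rule A -> a
            fp[sym], ln[sym] = ord(body) % P, 1
        else:                                  # binary rule X -> Y Z
            y, z = body
            fp[sym] = (fp[y] * pow(B, ln[z], P) + fp[z]) % P
            ln[sym] = ln[y] + ln[z]
    return fp, ln

def direct_fp(s):
    """Reference: the same polynomial hash computed on the plain string."""
    h = 0
    for c in s:
        h = (h * B + ord(c)) % P
    return h
```

For the SLP `{0: "a", 1: "b", 2: (0, 1), 3: (2, 2), 4: (3, 2)}`, symbol 4 derives `ababab`, and `slp_fingerprints` agrees with `direct_fp("ababab")` without ever expanding the text.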
Compressed Lyndon Factorization
Given a string in grammar-compressed form (an SLP of size $g$ and height $h$), its Lyndon factorization can be computed in time and space polynomial in $g$, $h$, and the factorization size $m$, independent of the decompressed length $n$. This is the first polynomial-time result for this problem when $n$ may be exponentially large in $g$ (I et al., 2013).
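Why this matters is easy to demonstrate with a toy doubling SLP (the symbols and rules below are hypothetical): a grammar of $k + 3$ rules derives a string of length $2^{k+1}$, so the naive baseline of decompressing and then running Duval's algorithm costs time exponential in the grammar size, which is exactly what the compressed algorithm avoids:

```python
def expand(rules, sym, memo=None):
    """Expand an SLP symbol to its string (memoized on symbols).
    This is the step the compressed factorization algorithm never performs."""
    memo = {} if memo is None else memo
    if sym not in memo:
        body = rules[sym]
        memo[sym] = body if isinstance(body, str) else \
            expand(rules, body[0], memo) + expand(rules, body[1], memo)
    return memo[sym]

def lyndon_factorization(s):
    """Duval's algorithm, run on the decompressed text as a baseline."""
    out, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            out.append(s[i:i + j - k])
            i += j - k
    return out

# Doubling SLP: X_{2+k} derives (ab)^(2^k), so k + 3 rules encode a
# string whose length doubles with every added rule.
k = 4
rules = {0: "a", 1: "b", 2: (0, 1)}
for i in range(k):
    rules[3 + i] = (2 + i, 2 + i)
text = expand(rules, 2 + k)
```

Here `text` has length 32 and its Lyndon factorization is simply sixteen copies of `ab`; for large $k$ the expansion step alone becomes infeasible while the grammar stays tiny.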
5. Empirical and Theoretical Performance
Experimental results confirm that Lyndon grammar-based algorithms are efficient on large, highly repetitive datasets:
- Datasets comprising gigabase-scale genomic collections (human haplotypes, SARS-CoV-2 genomes), with input sizes in the billions of symbols, are processed.
- Compared with SA-IS, Big-BWT, recursive PFP, and other advanced BWT tools, the Lyndon grammar approach computes the BWT/eBWT a factor of $2$ or more faster, or with a factor of $2$ or more less RAM, especially under multithreading.
- Grammar construction is the bottleneck (~98% of total runtime in fastest variants), yet even naïve implementations outperform many specialized parsers for real data.
- Parallel scaling is favorable: with up to 32 threads, wall-clock time drops to roughly one tenth of the single-threaded baseline; memory increases moderately to accommodate thread-safe data structures (Olbrich, 27 Apr 2025).
Theoretical guarantees ensure that grammar size is minimized by merging identical subtrees, with worst-case size linear in the input but much smaller on repetitive texts. Deep grammars arise only for contrived worst-case patterns (e.g., strings of the form $a^{n-1}b$, whose Lyndon tree is a path of linear depth) (Olbrich, 27 Apr 2025, Badkobeh et al., 2020).
6. Limitations and Open Directions
Key limitations include:
- Lyndon grammar construction requires working memory proportional to the grammar size $g$, which may be $\Theta(n)$ in the worst case, although practical performance is superior on repetitive inputs.
- Suffix-comparison variants trade off simplicity and speed; efficient hashing or balanced tree variants mitigate worst-case behavior at the cost of algorithmic complexity (Olbrich, 27 Apr 2025).
- The minimal grammar problem remains NP-hard; Lyndon-based SLPs are not always size-optimal compared to Sequitur, RePair, or other compressors, though they are more canonical and better suited to certain combinatorial properties (Badkobeh et al., 2020).
- Future work directions include improved parallelization, support for very large diverse collections with 64-bit symbol tables, and seeking faster algorithms for Lyndon factorization when input is in LZ78 or other compressed forms (Olbrich, 27 Apr 2025, I et al., 2013).
7. Connections to Other Compressed Computation and Stringology
The Lyndon grammar approach unifies combinatorics on words, compressed data structures, and efficient algorithm design. It provides a means to:
- Relate grammar-based compression to lex order and repetitive structure via the unique properties of Lyndon words;
- Extend SLP techniques with canonical binary trees reflecting periodicities in input, benefiting pattern matching and suffix sorting tasks;
- Suggest new optima in run-length encoding for BWT, self-indexing, and compressed computation in massive genomics or text repositories (Olbrich, 27 Apr 2025, Tsuruta et al., 2020, Badkobeh et al., 2020).
A plausible implication is that the Lyndon grammar approach, by canonically capturing repetitive structure, can serve as a foundation for advanced compressed computation (indexing, matching, transform computation) where both theory and practice require handling extremely large, repetitive datasets efficiently.