
Generalized Run-Length Encodings

Updated 11 January 2026
  • Generalized run-length encodings (GRLEs) are advanced compression schemes that extend classical RLE by encoding structured, noncontiguous repeated patterns with algebraic and combinatorial frameworks.
  • They leverage diverse methodologies including alphabet partitioning, grammar-based techniques, and geometric positional modeling to enhance data indexing and compression efficiency.
  • GRLE frameworks impact formal language theory and combinatorial enumeration, offering algorithmic decidability and strong statistical properties for real-time applications.

A generalized run-length encoding (GRLE) subsumes the classical run-length encoding paradigm by systematically extending the representation of repeated patterns from contiguous single-symbol runs to broader, structured, or abstracted classes, often exploiting combinatorial, algebraic, or positional symmetries. The development of GRLEs draws heavily on algebraic structures, grammar-theoretic compression, and combinatorial enumeration, and they encompass several key frameworks including alphabet partitioning, grammar-based compression, and geometric/positional generalizations. GRLEs have significant implications for text compression, formal language theory, combinatorics, and efficient data structure design.

1. Algebraic and Combinatorial Foundation

Let $\Sigma$ be a finite alphabet partitioned as $\Sigma = \Sigma_1 \uplus \dots \uplus \Sigma_m$, where the $\Sigma_i$ are nonempty and pairwise disjoint. A GRLE of a word $w \in \Sigma^*$ is a vector

$v = ((C_1, k_1), (C_2, k_2), \dots, (C_r, k_r))$

with each $C_j$ a multiset of letters drawn from exactly one block $\Sigma_{i_j}$ of the partition, $k_j = |C_j|$, and $i_j \ne i_{j+1}$ for all $j$ (i.e., consecutive runs alternate blocks). The length of the encoding is $n = k_1 + \dots + k_r$; $r$ is the number of runs or blocks. The structure enforces that consecutive runs belong to different blocks of the partition, preventing repeated runs from the same partition set (Khormali et al., 4 Jan 2026).
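
The following is a minimal sketch (Python, with a hypothetical two-block partition) of how a word decomposes into its GRLE under this definition: scan the word left to right, open a new run whenever the current letter falls in a different block than the previous one, and record each run as a multiset together with its length and block index.

```python
from collections import Counter

def grle(word, block_of):
    """Compute the GRLE of `word` as a list of (multiset, length, block index)
    triples; `block_of` maps each letter to the index of its partition block."""
    runs = []
    for ch in word:
        b = block_of[ch]
        if runs and runs[-1][2] == b:      # same block: extend the current run
            runs[-1][0][ch] += 1
            runs[-1][1] += 1
        else:                              # block changes: start a new run
            runs.append([Counter({ch: 1}), 1, b])
    return [(dict(c), k, b) for c, k, b in runs]

# Hypothetical partition: Sigma_1 = {a, b}, Sigma_2 = {c, d}.
block_of = {"a": 1, "b": 1, "c": 2, "d": 2}
print(grle("abacdc", block_of))
# [({'a': 2, 'b': 1}, 3, 1), ({'c': 2, 'd': 1}, 3, 2)]
```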

An important special case is the partition-level GRLE, recording only the block-label sequence $(i_1, \dots, i_r)$ and the corresponding lengths $(k_1, \dots, k_r)$, abstracted from the specific underlying symbols. This framework encapsulates alternation constraints and underlies several enumeration results.

The number of surjective partition-level GRLEs of length $n$ over $m$ blocks (those in which every block appears at least once) is $|\{\text{surjective GRLEs}\}| = m!\,S(n, m)$, where $S(n, m)$ is the Stirling number of the second kind, so that $m!\,S(n, m)$ counts set partitions of $n$ elements into $m$ nonempty labeled classes. The generating function counting all partition-level GRLEs of length $n$ (with no surjectivity constraint) is

$\sum_{n \geq 0} |\mathrm{GRLE}_n| \, t^n = \frac{1}{1 - m t}$

(Khormali et al., 4 Jan 2026).
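
Both counts can be checked by brute force for small parameters: every word of length $n$ over the $m$ block labels corresponds to exactly one partition-level GRLE (its maximal-run decomposition), so the total count is $m^n$, consistent with the generating function above, and the surjective GRLEs correspond to the words using all $m$ labels. A short verification sketch in Python (parameter values are illustrative):

```python
from itertools import product
from math import factorial

def stirling2(n, m):
    """Stirling numbers of the second kind via the standard recurrence."""
    S = [[0] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = j * S[i - 1][j] + S[i - 1][j - 1]
    return S[n][m]

n, m = 6, 3
words = list(product(range(m), repeat=n))            # word <-> partition-level GRLE
surjective = sum(1 for w in words if len(set(w)) == m)

print(len(words), m ** n)                            # 729 729
print(surjective, factorial(m) * stirling2(n, m))    # 540 540
```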

2. Grammar-Based Generalizations: RLCFGs and RLSLPs

Run-length context-free grammars (RLCFGs) generalize both classical RLE and standard context-free grammars by permitting nonterminal rules of the form $A \rightarrow B^s$, with $B$ either a single symbol or a nonterminal, and $s \ge 2$. This allows run-length compression of entire substrings or blocks, not just individual symbols. By definition, every standard CFG is also an RLCFG, but minimal RLCFGs may be asymptotically more compact for highly repetitive texts (Navarro et al., 2024).

A key development is the run-length straight-line program (RLSLP), a grammar with productions

  • $A \to BC$ (binary concatenation),
  • $A \to B^t$ (run-length; $t \ge 3$),
  • $A \to a$ (terminal).

Any RLSLP can be balanced to logarithmic height with only a linear increase in grammar size, yielding efficient $O(\log n)$-time random access and composable substring queries (e.g., range minima, Karp–Rabin fingerprinting) in $O(g_{rl})$ space, where $g_{rl}$ is the grammar size (Navarro et al., 2024). This consolidates GRLEs as a central object for both expressiveness and algorithmic efficiency.
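
To make the random-access claim concrete, the sketch below (Python; the grammar and rule encoding are hypothetical, not taken from the cited papers) stores the expansion length of every nonterminal and descends one rule per level to extract the $i$-th character, so the cost is proportional to the grammar height rather than to $n$.

```python
# An RLSLP over {a, b}: terminals ('char', c), concatenations ('cat', B, C),
# and run-length rules ('pow', B, t).  S expands to "aaaaab" repeated 3 times.
rules = {
    "A": ("char", "a"),
    "B": ("char", "b"),
    "R": ("pow", "A", 5),      # R -> A^5
    "C": ("cat", "R", "B"),    # C -> R B
    "S": ("pow", "C", 3),      # S -> C^3
}

def length(sym, memo={}):
    """Expansion length of a symbol, memoized (computable in O(grammar size))."""
    if sym not in memo:
        kind = rules[sym][0]
        if kind == "char":
            memo[sym] = 1
        elif kind == "cat":
            memo[sym] = length(rules[sym][1]) + length(rules[sym][2])
        else:  # pow
            memo[sym] = length(rules[sym][1]) * rules[sym][2]
    return memo[sym]

def access(sym, i):
    """Return the i-th character (0-based) of the expansion of `sym`."""
    while True:
        kind = rules[sym][0]
        if kind == "char":
            return rules[sym][1]
        if kind == "cat":
            left = rules[sym][1]
            if i < length(left):
                sym = left
            else:
                i -= length(left)
                sym = rules[sym][2]
        else:                  # pow: all copies are identical, reduce i mod one copy
            sym = rules[sym][1]
            i %= length(sym)

text = "aaaaab" * 3
assert all(access("S", i) == text[i] for i in range(len(text)))
```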

3. Structural and Positional Generalizations

GRLE schemes extend beyond block-alternation and grammar-based approaches to encompass positional and geometric aspects.

A notable example is the concentric circle model for generalized RLE, where a sequence $S$ is partitioned into substrings $B_1, \dots, B_K$ (each without repeated symbols) and mapped to concentric circles. Symbols are then aligned across radii by rotating substrings, so that runs of identical symbols may consist of noncontiguous positions but are recorded as tuples $(c, r_1, r_2)$, denoting symbol $c$ and its span over radii $r_1$ to $r_2$ (Venkatram, 2021). This framework exploits positional redundancy, facilitating run formation not just on contiguous substrings but also on structurally or geometrically related positions.

The storage cost for this "concentric code" is $L + 3E + \lceil B/8 \rceil$ bytes ($L$ literals, $E$ generalized runs, $B$ one-bit flags, one per position). It can outperform classical RLE (of cost $L + 2E'$) when positional redundancy is strong ($E' \gg E$).
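
A back-of-the-envelope comparison of the two cost formulas (with purely hypothetical counts) shows the trade-off: the concentric code pays an extra byte per run plus the bit-flag overhead, so it only wins when it merges enough noncontiguous repeats to make $E$ substantially smaller than $E'$.

```python
import math

def concentric_cost(L, E, B):
    """Concentric-code bytes: L literals, E generalized runs, B one-bit flags."""
    return L + 3 * E + math.ceil(B / 8)

def classical_rle_cost(L, E_prime):
    """Classical RLE bytes: L literals, E' contiguous runs."""
    return L + 2 * E_prime

# Hypothetical 1000-position sequence where positional redundancy collapses
# 300 contiguous runs into 60 generalized runs.
print(concentric_cost(L=100, E=60, B=1000))      # 100 + 180 + 125 = 405
print(classical_rle_cost(L=100, E_prime=300))    # 100 + 600       = 700
```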

4. Algorithmic Decidability and Formal Language Connections

Given the broader expressiveness of GRLEs, existence and membership problems involving grammar-constrained patterns require new algorithmic frameworks.

For a fixed GRLE pattern $v = ((C_1,k_1), \dots, (C_r,k_r))$, the language of all words having precisely GRLE $v$ is regular, and a DFA with $1+\sum_j k_j$ states can be constructed for it (Khormali et al., 4 Jan 2026). For dictionaries $D$ and context-free grammars $G$ over $D$, the set of spelled-out letter sequences arising from derivations of $G$ forms a context-free language. Deciding, for fixed $v$, whether $G$ generates a word with GRLE $v$ therefore reduces to emptiness of the intersection of a regular language with a context-free language, and is hence decidable in polynomial time. The relevant automata can be constructed in time $O(|D| \cdot |v| + |v|^3 |G|)$.
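
For the partition-level abstraction (block labels and run lengths only; matching the multisets $C_j$ exactly requires bookkeeping beyond this chain of states), the DFA is simply a chain of $1 + \sum_j k_j$ states: state $p$ accepts the next letter only if it lies in the block assigned to position $p$, and because consecutive runs use different blocks, acceptance certifies the prescribed run lengths. A hedged sketch with a hypothetical partition:

```python
def accepts(word, block_of, runs):
    """Simulate the chain DFA for the partition-level GRLE given by
    `runs` = [(block index, run length), ...]; `block_of` maps letters to blocks."""
    required = [b for b, k in runs for _ in range(k)]   # block required at each position
    if len(word) != len(required):                      # wrong length: reject
        return False
    return all(block_of.get(ch) == b for ch, b in zip(word, required))

block_of = {"a": 1, "b": 1, "c": 2, "d": 2}   # hypothetical partition
runs = [(1, 3), (2, 2), (1, 1)]               # 3 letters from block 1, 2 from block 2, 1 from block 1
print(accepts("abacdb", block_of, runs))      # True
print(accepts("abccdb", block_of, runs))      # False: position 2 must lie in block 1
```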

This framework generalizes block-pattern constraints and demonstrates the integration of GRLEs into the algorithmic toolkit of formal languages and symbolic computation.

5. Impact on Compression and Data Structures

The compressibility of a sequence under GRLE is sharply connected to the number and arrangement of runs, particularly in frameworks such as the extended Burrows–Wheeler Transform (eBWT). The choice of decomposition for the eBWT can change the number of runs by an unbounded factor: for certain infinite families of inputs, the ratio between the worst and best decompositions grows without bound, which makes brute-force optimization infeasible (Ingels et al., 5 Jun 2025). Efficient heuristics therefore seek decompositions that reduce the run count while keeping block sizes large enough for high compressibility yet manageable for decoding and indexing.

RLCFGs and RLSLPs can compress beyond classical bounds in highly repetitive texts and allow succinct support for advanced queries (substring extraction, pattern counting, range queries) in sublinear or near-optimal time (Navarro et al., 2024, Navarro et al., 2024).

6. Code Design and Information-Theoretic Considerations

Coding of run-lengths over finite domains (truncated geometric distributions) is another key aspect of GRLE theory. For run-lengths drawn from a finite universe, near-Huffman-optimal prefix codes can be constructed using minimal storage and branch-free implementations, with average code length only about $0.05\%$ above that of the actual Huffman code (Larsson, 2019).

The core construction divides values into Golomb-like "bunches" and a tail, and emits codewords based on unary and balanced binary encoding, adapting efficiently to run bounds. These schemes provide almost-optimal information-theoretic performance with constant-time per-symbol computational cost, critical for real-time and distributed applications.
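
The construction in Larsson (2019) is more refined than plain Golomb–Rice coding, but the flavor is similar: a unary part selects a bunch and a binary part selects a value inside it. The sketch below (Python) is ordinary Golomb–Rice coding with a power-of-two parameter, shown only to illustrate the unary-plus-binary codeword structure; it is not the paper's exact scheme and ignores the truncation at the run-length bound.

```python
def rice_encode(value, k):
    """Golomb-Rice code for a nonnegative run-length: unary quotient, k-bit remainder."""
    q, r = value >> k, value & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

def rice_decode(bits, k):
    """Decode one codeword from the front of `bits`; return (value, remaining bits)."""
    q = 0
    while bits[q] == "1":
        q += 1
    r = int(bits[q + 1 : q + 1 + k], 2)
    return (q << k) | r, bits[q + 1 + k :]

for run in [0, 3, 4, 11]:                # round-trip a few run-lengths with k = 2
    code = rice_encode(run, 2)
    value, _ = rice_decode(code, 2)
    assert value == run
    print(run, code)                     # 0 000 / 3 011 / 4 1000 / 11 11011
```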

7. Connections to Limit Theorems and Statistical Laws

GRLEs provide a fertile setting for the study of run-length limit laws under constraints. For functions measuring the longest constrained run in binary expansions (with constraint sets of sub-exponential or exponential size), strong laws of large numbers generalize the Erdős–Rényi law:

$\lim_{n \to \infty} \frac{\ell_n(x, \mathcal{A})}{\log_2 n} = \frac{1}{1 - \tau}$

Lebesgue almost everywhere, where $\tau$ is the growth exponent of the constraint sets; the exceptional sets have Hausdorff dimension at least $1 - \tau$ (Wu, 2022).
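
In the unconstrained case ($\tau = 0$) this reduces to the classical Erdős–Rényi statement that the longest run of a fixed digit among $n$ random bits grows like $\log_2 n$. A quick simulation (Python, sample sizes chosen arbitrarily) makes the scaling visible:

```python
import math
import random

def longest_run_of_ones(bits):
    """Length of the longest block of consecutive 1s."""
    best = cur = 0
    for b in bits:
        cur = cur + 1 if b else 0
        best = max(best, cur)
    return best

random.seed(0)
for n in [10_000, 100_000, 1_000_000]:
    bits = [random.getrandbits(1) for _ in range(n)]
    print(n, longest_run_of_ones(bits), round(math.log2(n), 1))
```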

These results reveal deep connections between the combinatorial structure of GRLEs, symbolic dynamics, and fractal geometry.


Summary Table: GRLE Variants and Combinatorial Enumerations

GRLE Framework | Structural Constraint | Enumeration Formula
Partition-level GRLE | Alternation of partition blocks | $m!\,S(n, m)$ (surjective)
RLCFG (Grammar-based) | Runs can be nonterminal/substring repeats | See (Navarro et al., 2024); size $g_{rl}$
Positional (Concentric) GRLE | Noncontiguous but structurally aligned | Data-dependent; see (Venkatram, 2021)

Generalized run-length encodings form a unifying abstraction for a rich class of sequence compressions extending RLE. They bridge combinatorics, algorithmic language theory, and statistical inference, and support efficient implementations across various domains including text indexing, symbolic compression, and pattern analysis (Khormali et al., 4 Jan 2026, Ingels et al., 5 Jun 2025, Navarro et al., 2024, Navarro et al., 2024, Wu, 2022, Venkatram, 2021, Larsson, 2019).
