Infini-gram Engine: Scalable n-gram Analysis

Updated 27 November 2025
  • Infini-gram Engine is an advanced computational architecture that scales n-gram analysis and probabilistic language modeling to unprecedented corpus sizes.
  • It employs diverse methodologies—suffix arrays with backoff, FM-index/BWT, and multi-pass intergrams—to compute exact n-gram statistics under bounded time and memory constraints.
  • The engine is applied in language modeling, benchmark decontamination, and high-resolution image synthesis, offering robust theoretical guarantees and production-readiness.

An Infini-gram Engine is an advanced computational architecture and methodology for scaling $n$-gram frequency analysis, probabilistic language modeling, and related exact search and generative tasks to corpora of unprecedented size (trillions of tokens, petabytes of data), and to gram orders $n$ that are functionally unbounded. Multiple technical realizations exist, sharing common objectives: to support efficient computation of arbitrary $n$-gram statistics, probability estimates, or counts with rigorously bounded time/memory requirements, and to deploy these capabilities in production pipelines for LLMs, large-scale text mining, benchmark decontamination, or even high-resolution image synthesis under the "infini-gram" paradigm.

1. Mathematical and Algorithmic Foundations

Three principal algorithmic frameworks exist for Infini-gram Engines:

  • Suffix Array/Backoff $\infty$-gram LM: Models $P_\infty(w_i \mid w_{1:i-1})$ as the empirical frequency ratio for the longest history suffix observed in the data, subject to exact backoff. This generalizes $n$-gram LMs to $n \to \infty$, with inference ("lookup") driven by exact scanning over tokenized suffix arrays. The key estimates are

$$P_{\infty}(w_i \mid w_{1:i-1}) = \frac{\mathrm{cnt}(w_{i-n_i+1:i})}{\mathrm{cnt}(w_{i-n_i+1:i-1})}, \qquad n_i = \max\{1 \le n \le i \mid \mathrm{cnt}(w_{i-(n-1):i-1}) > 0\}$$

where $w_{1:i-1}$ is the context and $\mathrm{cnt}(\cdot)$ counts occurrences in the corpus (Liu et al., 30 Jan 2024). A toy sketch of this longest-suffix backoff lookup appears after this list.

  • FM-index and Burrows–Wheeler Transform (BWT): A compressed self-index data structure supporting fast (sublinear in corpus size) search and retrieval by leveraging BWT, wavelet trees, and sampled suffix/inverse-suffix arrays. It trades minor per-query computational cost for large reductions in index size, enabling Internet-scale deployment for document-anchored substring queries (Xu et al., 13 Jun 2025).
  • Multi-pass Deterministic Intergrams: For efficient extraction of the top-$k$ most frequent $n$-grams (large $n$, large $k$), this strategy sequentially grows prefix sets via multi-stage passes, maintaining only an oversampled set of candidate prefixes at each pass to avoid the $O(A^n)$ explosion. It is provably near-optimal for empirical Zipf-like distributions, with tight recall bounds and order-of-magnitude empirical speedups (Curtin et al., 18 Nov 2025).
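
The backoff lookup of the first framework can be made concrete with a minimal sketch in Python. It scans a tiny in-memory token list instead of a suffix array, so `count`, `infty_gram_prob`, and the toy corpus are illustrative stand-ins under that assumption, not the engine's actual API:

```python
from typing import List, Sequence

def count(corpus: Sequence[int], pattern: Sequence[int]) -> int:
    """Occurrences of `pattern` as a contiguous subsequence of `corpus`.
    A real engine answers this with suffix-array binary search; here we just scan."""
    n, m = len(corpus), len(pattern)
    if m == 0:
        return n  # convention: the empty pattern matches at every position
    return sum(1 for i in range(n - m + 1) if list(corpus[i:i + m]) == list(pattern))

def infty_gram_prob(corpus: Sequence[int], context: List[int], token: int) -> float:
    """P_infty(token | context): back off to the longest context suffix seen in the corpus."""
    for start in range(len(context) + 1):
        suffix = context[start:]           # progressively shorter suffixes of the context
        denom = count(corpus, suffix)
        if denom > 0:                      # longest suffix with nonzero count, i.e. n_i - 1 tokens
            return count(corpus, list(suffix) + [token]) / denom
    return 0.0                             # unreachable: the empty suffix always has denom > 0

# Toy usage with integer token ids: "1 2 3" occurs twice, once followed by 4.
corpus = [1, 2, 3, 4, 1, 2, 3, 5, 1, 2]
print(infty_gram_prob(corpus, context=[1, 2, 3], token=4))  # 0.5
```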

Each formulation is designed to support arbitrarily large $n$ with I/O, memory, and compute resources that remain manageable even as $n$ and the data scale $D$ increase.

2. Core Data Structures and Systems Engineering

Suffix Array-based Infini-gram

  • Token Array: The corpus is represented as a byte array, with two bytes per token for a fixed vocabulary ($|\mathcal V| < 2^{16}$). For $N$ tokens, storage is $2N$ bytes.
  • Suffix Array (SA): Stores, for each suffix rank $k$, the offset where the $k$-th lexicographically smallest suffix begins. Typical storage overhead is $O(N \log N)$ bits. Sharding is required for large $N$ (e.g., RedPajama, Dolma).
  • Supporting Files: Doc-offset tables for reconstructing document boundaries; metadata for alignment with original corpora.
  • Querying: Binary search over the suffix array locates all occurrences of an $n$-gram (see the sketch after this list). Amortized query time for next-token probabilities is $O(\log N)$ per token under streaming.
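
A minimal sketch of that counting primitive, assuming a small in-memory token list and a naively built suffix array (production engines use linear-time construction, disk-backed shards, and byte-packed tokens):

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence

def build_suffix_array(tokens: Sequence[int]) -> List[int]:
    """Naive construction by sorting suffixes; fine for a toy, too slow at corpus scale."""
    return sorted(range(len(tokens)), key=lambda i: list(tokens[i:]))

def count_ngram(tokens: Sequence[int], sa: List[int], pattern: Sequence[int]) -> int:
    """Count occurrences of `pattern` with two binary searches over the suffix array
    (requires Python 3.10+ for bisect's key= argument)."""
    pattern = list(pattern)
    m = len(pattern)
    prefix = lambda i: list(tokens[i:i + m])        # each suffix truncated to |pattern| tokens
    lo = bisect_left(sa, pattern, key=prefix)       # first suffix whose prefix >= pattern
    hi = bisect_right(sa, pattern, key=prefix)      # first suffix whose prefix >  pattern
    return hi - lo

# Toy usage.
tokens = [7, 8, 9, 7, 8, 9, 7, 8]
sa = build_suffix_array(tokens)
print(count_ngram(tokens, sa, [7, 8]))     # 3
print(count_ngram(tokens, sa, [7, 8, 9]))  # 2
```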

FM-index and BWT-based Infini-gram Mini

  • BWT Array: Stores the Burrows–Wheeler permutation of the corpus bytes, enabling efficient backward substring search (a toy backward-search sketch follows this list).
  • Huffman-shaped Wavelet Tree: Over the 256-symbol byte alphabet, supports $\mathrm{rank}$/$\mathrm{select}$ queries in $O(H_0)$ time, where $H_0$ is the empirical (zeroth-order) entropy.
  • SA/ISA Sampling: Stores sparse suffix/inverse-suffix arrays (sampling parameters $a$, $b$) to bound memory and support position reconstruction.
  • Memory-mapped Index: Enables sub-700 GB shards on moderate (2 TiB RAM) hardware, with on-demand loading for queries.
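
A minimal sketch of the backward-search counting that these structures support, using dense Python tables in place of the wavelet tree and sampled arrays (the dense `occ` table and the character-level corpus are simplifying assumptions for illustration):

```python
from collections import Counter
from typing import List, Tuple

def bwt_index(text: str) -> Tuple[str, dict, List[Counter]]:
    """Toy FM-index: BWT string, C table, and a dense Occ table.
    Real engines replace the dense tables with a Huffman-shaped wavelet tree."""
    s = text + "\x00"                           # unique sentinel, lexicographically smallest
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)         # symbol preceding each sorted suffix
    counts = Counter(s)
    C, total = {}, 0                            # C[c] = number of symbols strictly smaller than c
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    occ = [Counter()]                           # occ[i][c] = occurrences of c in bwt[:i]
    for ch in bwt:
        nxt = Counter(occ[-1])
        nxt[ch] += 1
        occ.append(nxt)
    return bwt, C, occ

def backward_count(pattern: str, C: dict, occ: List[Counter]) -> int:
    """Count pattern occurrences by narrowing a suffix-array interval right to left."""
    lo, hi = 0, len(occ) - 1
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[lo][ch]
        hi = C[ch] + occ[hi][ch]
        if lo >= hi:
            return 0
    return hi - lo

bwt, C, occ = bwt_index("abracadabra")
print(backward_count("abra", C, occ))  # 2
print(backward_count("cad", C, occ))   # 1
```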

Intergrams Hardware-Aware Bit-Vector Approach

  • Local Bit-vectors: Cache-resident per-thread arrays ($A \cdot zk$ bits) track the presence of candidate $n$-gram prefixes within L2/L3.
  • Global Count Arrays: Aggregation via sequential SIMD flushes minimizes random accesses.
  • Prefix Trie: For fast prefix-eligibility tests, a compact trie supports $O(1)$ checks. Sharding/distributed aggregation is inherently supported (a toy multi-pass sketch follows this list).
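
A minimal sketch of the multi-pass idea using plain Python dictionaries (the bit-vector, SIMD, and trie machinery is omitted; `z` is the oversampling factor, and the candidate-pruning rule here is a simplified reading of the approach):

```python
from collections import Counter
from typing import List, Sequence, Tuple

def top_k_ngrams(tokens: Sequence[int], n: int, k: int, z: float = 2.0) -> List[Tuple[tuple, int]]:
    """Multi-pass top-k n-gram extraction: one pass per prefix length, keeping only
    the z*k most frequent prefixes as extension candidates for the next pass."""
    keep = int(z * k)
    candidates = None                       # prefixes retained from the previous pass
    counts = Counter()
    for length in range(1, n + 1):
        counts = Counter()
        for i in range(len(tokens) - length + 1):
            gram = tuple(tokens[i:i + length])
            # Extend only grams whose (length-1)-token prefix survived the previous pass.
            if candidates is None or gram[:-1] in candidates:
                counts[gram] += 1
        candidates = {g for g, _ in counts.most_common(keep)}
    return counts.most_common(k)

# Toy usage: the two most frequent trigrams in a repetitive token stream.
tokens = [1, 2, 3, 4] * 50 + [5, 6, 7] * 10
print(top_k_ngrams(tokens, n=3, k=2))  # [((1, 2, 3), 50), ((2, 3, 4), 50)]
```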

These technical designs ensure linear I/O or sublinear per-query compute, and enable operation at scales (trillions of tokens, $n$-gram orders exceeding 20) that were previously impractical.

3. Performance, Scalability and Empirical Results

  • Compression and Throughput: The FM-index achieves a $44\%$ index-to-corpus size ratio (7% of a full text + suffix array baseline) and indexing throughput of $\sim$2 GB/s per 128-core node, $18\times$ faster than the best SDSL implementations (Xu et al., 13 Jun 2025).
  • Query Latencies: On SSDs, counting $n$-grams of length up to 10 takes $<0.4$ s per query, while counts for $n = 1000$ cost up to 25 s (summed across shards). Suffix-array-based engines report 20 ms (counts), 30 ms (fixed-$n$ next-token probability), and 135 ms ($\infty$-gram LM) per query on a single SSD node (Liu et al., 30 Jan 2024).
  • Intergrams Algorithm: For $n = 6$ and $k = 100$k, Intergrams achieves a $5\times$–$33\times$ speedup over tuned hash-gram baselines, with $>99\%$ recall at $z = 1.5$–$2$ oversampling (Curtin et al., 18 Nov 2025).
  • Scalability: Deployment to petascale scales linearly with the number of compute nodes; e.g., 1 PB of Common Crawl can be indexed in 19 hours using 1,500 128-core nodes (Xu et al., 13 Jun 2025).

4. Major Applications and Empirical Findings

  • Language Modeling: $\infty$-gram LMs achieve high next-token "argmax" accuracy ($47\%$), outperforming 5-gram baselines (29%) and exceeding $80\%$ accuracy when long suffixes ($n \ge 14$) are matched (Liu et al., 30 Jan 2024). Hybrid neural/$\infty$-gram models reduce Llama-2 70B perplexity by 18–19%, and GPT-2 perplexity by 42% (a minimal interpolation sketch follows this list).
  • Benchmark Decontamination: Infini-gram mini enables exact substring matching and contamination studies for LLM benchmark test sets. Up to 40% of SQuAD, 27.7% of MMLU, and 32.6% of ARC-Challenge examples are contaminated by verbatim overlap in Internet crawls, raising concerns about benchmark validity (Xu et al., 13 Jun 2025).
  • Attribution, Copyright, and Explainability: Full-document reconstruction from $n$-gram hits enables provenance tracing and intellectual property audits.
  • Novelty Metrics: Quantification of unseen $n$-grams in generative outputs helps distinguish human-written from machine-generated content.
  • Generative Image Synthesis (Resolution-Agnostic): In a distinct research branch, "Infini-gram Engine" refers to InfGen, a one-step, arbitrary-resolution image decoder that supersedes traditional VAE decoders in latent diffusion models. Here, computational complexity grows linearly with the number of output pixels ($O(hw)$), in contrast to the quadratic scaling and multi-step inference of classical diffusion (Han et al., 12 Sep 2025).
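
The hybrid neural/$\infty$-gram gains above come from mixing the two next-token distributions. A minimal sketch of such a combination follows; the fixed weight `lam` is a simplifying assumption, whereas the cited work adapts the weight (e.g., to how long a suffix was matched):

```python
def interpolate(p_neural: dict, p_infty: dict, lam: float = 0.5) -> dict:
    """Linearly mix a neural LM's next-token distribution with an infini-gram estimate."""
    vocab = set(p_neural) | set(p_infty)
    return {w: (1 - lam) * p_neural.get(w, 0.0) + lam * p_infty.get(w, 0.0) for w in vocab}

# Toy usage: the infini-gram estimate sharpens the neural distribution toward
# a continuation that was observed verbatim after the matched suffix.
p_neural = {"dog": 0.4, "cat": 0.35, "car": 0.25}
p_infty = {"dog": 1.0}
print(interpolate(p_neural, p_infty, lam=0.3))  # dog 0.58, cat 0.245, car 0.175 (key order may vary)
```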

5. Theoretical Guarantees and Analysis

  • Intergrams Recall Bounds: Under empirical Zipf's law ($p_i \propto 1/i^a$, $a \approx 1.1$–$1.3$), the retained prefix set in Intergrams captures at least $99\%$ of the top-$k$ $n$-grams for realistic $k$, as shown by explicit mass-transfer lemmas and sampling deviation bounds (Curtin et al., 18 Nov 2025).
  • $\infty$-gram Normalization: The conditional probabilities defined by longest-matching-suffix backoff sum to 1, ensuring statistical soundness without further discounting (Liu et al., 30 Jan 2024); a short derivation sketch follows this list.
  • Latency/Memory/Accuracy Trade-offs: FM-index and Intergrams both analytically bound memory and per-query time to low-order polynomials in data size and $k$, with explicit sampling and oversampling factors controlling recall/precision.
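
For the normalization claim, a brief derivation sketch (ignoring the corner case where the matched suffix occurs at the very end of the corpus): writing $h = w_{i-n_i+1:i-1}$ for the longest context suffix with $\mathrm{cnt}(h) > 0$, and noting that $n_i$ depends only on the context,

$$\sum_{w \in \mathcal V} P_\infty(w \mid w_{1:i-1}) = \sum_{w \in \mathcal V} \frac{\mathrm{cnt}(h \cdot w)}{\mathrm{cnt}(h)} = \frac{\mathrm{cnt}(h)}{\mathrm{cnt}(h)} = 1,$$

since every occurrence of $h$ (except possibly one at the corpus end) is followed by exactly one token.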

6. Practical Considerations and Deployment

  • Hardware: 128 vCPU / 2 TiB RAM nodes suffice for even the largest single-shard indexes; any modern SSD is adequate for query nodes. Memory-mapped on-disk serving is standard (Xu et al., 13 Jun 2025).
  • Parallelism: All leading Infini-gram engines are "embarrassingly parallel" at the sharding or pass level. For Intergrams, per-shard processing is followed by heap merges or AllReduce steps.
  • Cache/TLB Optimization: Algorithm design exploits cache locality: all randomly indexed bit-vectors fit in L2/L3, with flushes sequenced for TLB efficiency.
  • Sampling Parameters: Tuning of $a$, $b$ (FM-index) or $z$ (Intergrams) allows practitioners to trade off index size, query latency, and recall as dictated by deployment constraints.
  • APIs and Web Interfaces: Web demos and RESTful APIs exist for both search/counting and document retrieval (Xu et al., 13 Jun 2025). Latencies are competitive at Internet scale (an illustrative query sketch follows this list).
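
As an illustration of what a count query might look like from client code, a minimal sketch against a hypothetical REST endpoint; the URL, index name, and JSON field names below are placeholders, not the documented API:

```python
import requests

# Hypothetical endpoint and payload shape; consult the project's API documentation
# for the real URL, available indexes, and field names.
API_URL = "https://api.example.org/infini-gram"

payload = {
    "index": "example-corpus",    # which sharded index to query (placeholder)
    "query_type": "count",        # e.g. count, next-token distribution, document lookup
    "query": "natural language processing",
}

response = requests.post(API_URL, json=payload, timeout=30)
response.raise_for_status()
print(response.json())            # e.g. {"count": 12345, "latency_ms": 21}
```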

7. Limitations, Future Directions, and Open Problems

  • Latency vs. In-memory Baselines: Disk-backed FM-index and suffix-array engines are slower than fully in-memory suffix arrays when reconstructing long passages, but require $2\times$ less storage at production scale (Xu et al., 13 Jun 2025).
  • Dynamic Updates and Open Vocabulary: Existing static SAs are not amenable to online updates; suffix automata or dynamic SAs are necessary for real-time or evolving corpora (Liu et al., 30 Jan 2024).
  • Extensions: Integration with compressed suffix arrays, wavelet trees, and hybrid approximate-exact filtering (e.g., Bloom filters on deep passes) is ongoing.
  • Distributed/Streaming: Large-scale, real-time top-$k$ tracking and distributed aggregation in streaming contexts remain open engineering challenges.
  • Image Synthesis: In InfGen, limitations arise at ultra-high resolutions ($>8$K) and for compatibility with non-VAE latent tokenizers, which requires retraining the decoder (Han et al., 12 Sep 2025).

Infini-gram Engines thus represent a class of architectures and methods, unified by the ability to bring exact, high-order $n$-gram computation to the Internet/data-lake scale with tractable hardware and software. Their applications impact the scientific study of machine learning corpora, the deployment and auditing of LLMs, statistical analysis at web scale, and, in the case of InfGen, high-resolution generative visual content. Key limitations are recognized and form the basis of active research in both the theoretical and systems arenas.
