Infini-gram Engine: Scalable n-gram Analysis

Updated 27 November 2025
  • Infini-gram Engine is an advanced computational architecture that scales n-gram analysis and probabilistic language modeling to unprecedented corpus sizes.
  • It employs diverse methodologies—suffix arrays with backoff, FM-index/BWT, and multi-pass intergrams—to compute exact n-gram statistics under bounded time and memory constraints.
  • The engine is applied in language modeling, benchmark decontamination, and high-resolution image synthesis, offering robust theoretical guarantees and production-readiness.

An Infini-gram Engine is an advanced computational architecture and methodology for scaling $n$-gram frequency analysis, probabilistic language modeling, and related exact search and generative tasks to corpora of unprecedented size (trillions of tokens, petabytes of data), and to gram orders $n$ that are functionally unbounded. Multiple technical realizations exist, sharing common objectives: to support efficient computation of arbitrary $n$-gram statistics, probability estimates, or counts with rigorously bounded time/memory requirements, and to deploy these capabilities in production pipelines for LLMs, large-scale text mining, benchmark decontamination, or even high-resolution image synthesis under the "infini-gram" paradigm.

1. Mathematical and Algorithmic Foundations

Three principal algorithmic frameworks exist for Infini-gram Engines:

  • Suffix Array/Backoff $\infty$-gram LM: Models $P_\infty(w_i \mid w_{1:i-1})$ as the empirical frequency ratio for the longest history suffix observed in the data, subject to exact backoff. This generalizes $n$-gram LMs to $n \to \infty$, with inference ("lookup") driven by exact scanning over tokenized suffix arrays. The key estimates are

$$P_{\infty}(w_i \mid w_{1:i-1}) = \frac{\mathrm{cnt}(w_{i-n_i+1:i})}{\mathrm{cnt}(w_{i-n_i+1:i-1})}, \qquad n_i = \max\{1 \le n \le i \mid \mathrm{cnt}(w_{i-(n-1):i-1}) > 0\}$$

where $w_{1:i-1}$ is the context and $\mathrm{cnt}(\cdot)$ counts occurrences in the corpus (Liu et al., 30 Jan 2024). A toy sketch of this longest-suffix backoff lookup appears after this list.

  • FM-index and Burrows–Wheeler Transform (BWT): A compressed self-index data structure supporting fast (sublinear in corpus size) search and retrieval by leveraging BWT, wavelet trees, and sampled suffix/inverse-suffix arrays. It trades minor per-query computational cost for large reductions in index size, enabling Internet-scale deployment for document-anchored substring queries (Xu et al., 13 Jun 2025).
  • Multi-pass Deterministic Intergrams: For efficient extraction of the top-$k$ most frequent $n$-grams (large $n$, large $k$), this strategy sequentially grows prefix sets via multi-stage passes, maintaining only an oversampled set of candidate prefixes at each pass to avoid the $O(A^n)$ explosion. It is provably near-optimal for empirical Zipf-like distributions, with tight recall bounds and order-of-magnitude empirical speedups (Curtin et al., 18 Nov 2025).
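
The backoff lookup of the first framework can be made concrete with a minimal sketch in Python. It scans a tiny in-memory token list instead of a suffix array, so `count`, `infty_gram_prob`, and the toy corpus are illustrative stand-ins under that assumption, not the engine's actual API:

```python
from typing import List, Sequence

def count(corpus: Sequence[int], pattern: Sequence[int]) -> int:
    """Occurrences of `pattern` as a contiguous subsequence of `corpus`.
    A real engine answers this with suffix-array binary search; here we just scan."""
    n, m = len(corpus), len(pattern)
    if m == 0:
        return n  # convention: the empty pattern matches at every position
    return sum(1 for i in range(n - m + 1) if list(corpus[i:i + m]) == list(pattern))

def infty_gram_prob(corpus: Sequence[int], context: List[int], token: int) -> float:
    """P_infty(token | context): back off to the longest context suffix seen in the corpus."""
    for start in range(len(context) + 1):
        suffix = context[start:]           # progressively shorter suffixes of the context
        denom = count(corpus, suffix)
        if denom > 0:                      # longest suffix with nonzero count, i.e. n_i - 1 tokens
            return count(corpus, list(suffix) + [token]) / denom
    return 0.0                             # unreachable: the empty suffix always has denom > 0

# Toy usage with integer token ids: "1 2 3" occurs twice, once followed by 4.
corpus = [1, 2, 3, 4, 1, 2, 3, 5, 1, 2]
print(infty_gram_prob(corpus, context=[1, 2, 3], token=4))  # 0.5
```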

Each formulation is designed to support arbitrarily large $n$ with I/O, memory, and compute resources that remain manageable even as $n$ and the data scale $D$ increase.

2. Core Data Structures and Systems Engineering

Suffix Array-based Infini-gram

  • Token Array: The corpus is represented as a byte array, with two bytes per token for a fixed vocabulary ($|\mathcal V| < 2^{16}$). For $N$ tokens, storage is $2N$ bytes.
  • Suffix Array (SA): Stores, for each suffix rank $k$, the offset where the $k$-th lexicographically smallest suffix begins. Typical storage overhead is $O(N \log N)$ bits. Sharding is required for large $N$ (e.g., RedPajama, Dolma).
  • Supporting Files: Doc-offset tables for reconstructing document boundaries; metadata for alignment with original corpora.
  • Querying: Binary search over the suffix array locates all occurrences of an $n$-gram (see the sketch after this list). Amortized query time for next-token probabilities is $O(\log N)$ per token under streaming.
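
A minimal sketch of that counting primitive, assuming a small in-memory token list and a naively built suffix array (production engines use linear-time construction, disk-backed shards, and byte-packed tokens):

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence

def build_suffix_array(tokens: Sequence[int]) -> List[int]:
    """Naive construction by sorting suffixes; fine for a toy, too slow at corpus scale."""
    return sorted(range(len(tokens)), key=lambda i: list(tokens[i:]))

def count_ngram(tokens: Sequence[int], sa: List[int], pattern: Sequence[int]) -> int:
    """Count occurrences of `pattern` with two binary searches over the suffix array
    (requires Python 3.10+ for bisect's key= argument)."""
    pattern = list(pattern)
    m = len(pattern)
    prefix = lambda i: list(tokens[i:i + m])        # each suffix truncated to |pattern| tokens
    lo = bisect_left(sa, pattern, key=prefix)       # first suffix whose prefix >= pattern
    hi = bisect_right(sa, pattern, key=prefix)      # first suffix whose prefix >  pattern
    return hi - lo

# Toy usage.
tokens = [7, 8, 9, 7, 8, 9, 7, 8]
sa = build_suffix_array(tokens)
print(count_ngram(tokens, sa, [7, 8]))     # 3
print(count_ngram(tokens, sa, [7, 8, 9]))  # 2
```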

FM-index and BWT-based Infini-gram Mini

  • BWT Array: Stores the Burrows–Wheeler permutation of the corpus bytes, enabling efficient backward substring search (a toy backward-search sketch follows this list).
  • Huffman-shaped Wavelet Tree: Over the 256-symbol byte alphabet, supports $\mathrm{rank}$/$\mathrm{select}$ queries in $O(H_0)$ time, where $H_0$ is the empirical (zeroth-order) entropy.
  • SA/ISA Sampling: Stores sparse suffix/inverse-suffix arrays (sampling parameters $a$, $b$) to bound memory and support position reconstruction.
  • Memory-mapped Index: Enables sub-700 GB shards on moderate (2 TiB RAM) hardware, with on-demand loading for queries.
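
A minimal sketch of the backward-search counting that these structures support, using dense Python tables in place of the wavelet tree and sampled arrays (the dense `occ` table and the character-level corpus are simplifying assumptions for illustration):

```python
from collections import Counter
from typing import List, Tuple

def bwt_index(text: str) -> Tuple[str, dict, List[Counter]]:
    """Toy FM-index: BWT string, C table, and a dense Occ table.
    Real engines replace the dense tables with a Huffman-shaped wavelet tree."""
    s = text + "\x00"                           # unique sentinel, lexicographically smallest
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)         # symbol preceding each sorted suffix
    counts = Counter(s)
    C, total = {}, 0                            # C[c] = number of symbols strictly smaller than c
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    occ = [Counter()]                           # occ[i][c] = occurrences of c in bwt[:i]
    for ch in bwt:
        nxt = Counter(occ[-1])
        nxt[ch] += 1
        occ.append(nxt)
    return bwt, C, occ

def backward_count(pattern: str, C: dict, occ: List[Counter]) -> int:
    """Count pattern occurrences by narrowing a suffix-array interval right to left."""
    lo, hi = 0, len(occ) - 1
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[lo][ch]
        hi = C[ch] + occ[hi][ch]
        if lo >= hi:
            return 0
    return hi - lo

bwt, C, occ = bwt_index("abracadabra")
print(backward_count("abra", C, occ))  # 2
print(backward_count("cad", C, occ))   # 1
```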

Intergrams Hardware-Aware Bit-Vector Approach

  • Local Bit-vectors: Cache-resident per-thread arrays ($A \cdot zk$ bits) track the presence of candidate $n$-gram prefixes within L2/L3.
  • Global Count Arrays: Aggregation via sequential SIMD flushes minimizes random accesses.
  • Prefix Trie: For fast prefix-eligibility tests, a compact trie supports $O(1)$ checks. Sharding/distributed aggregation is inherently supported (a toy multi-pass sketch follows this list).
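
A minimal sketch of the multi-pass idea using plain Python dictionaries (the bit-vector, SIMD, and trie machinery is omitted; `z` is the oversampling factor, and the candidate-pruning rule here is a simplified reading of the approach):

```python
from collections import Counter
from typing import List, Sequence, Tuple

def top_k_ngrams(tokens: Sequence[int], n: int, k: int, z: float = 2.0) -> List[Tuple[tuple, int]]:
    """Multi-pass top-k n-gram extraction: one pass per prefix length, keeping only
    the z*k most frequent prefixes as extension candidates for the next pass."""
    keep = int(z * k)
    candidates = None                       # prefixes retained from the previous pass
    counts = Counter()
    for length in range(1, n + 1):
        counts = Counter()
        for i in range(len(tokens) - length + 1):
            gram = tuple(tokens[i:i + length])
            # Extend only grams whose (length-1)-token prefix survived the previous pass.
            if candidates is None or gram[:-1] in candidates:
                counts[gram] += 1
        candidates = {g for g, _ in counts.most_common(keep)}
    return counts.most_common(k)

# Toy usage: the two most frequent trigrams in a repetitive token stream.
tokens = [1, 2, 3, 4] * 50 + [5, 6, 7] * 10
print(top_k_ngrams(tokens, n=3, k=2))  # [((1, 2, 3), 50), ((2, 3, 4), 50)]
```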

These technical designs ensure linear I/O or sublinear per-query compute, and enable operation at scales (trillions of tokens, $n$-gram orders exceeding 20) that were previously impractical.

3. Performance, Scalability and Empirical Results

  • Compression and Throughput: The FM-index achieves a $44\%$ index-to-corpus size ratio (7% of a full text + suffix array baseline) and indexing throughput of $\sim$2 GB/s per 128-core node, $18\times$ faster than the best SDSL implementations (Xu et al., 13 Jun 2025).
  • Query Latencies: On SSDs, counting $n$-grams of length up to 10 takes $<0.4$ s per query, while counts for $n = 1000$ cost up to 25 s (summed across shards). Suffix-array-based engines report 20 ms (counts), 30 ms (fixed-$n$ next-token probability), and 135 ms ($\infty$-gram LM) per query on a single SSD node (Liu et al., 30 Jan 2024).
  • Intergrams Algorithm: For $n = 6$ and $k = 100$k, Intergrams achieves a $5\times$–$33\times$ speedup over tuned hash-gram baselines, with $>99\%$ recall at $z = 1.5$–$2$ oversampling (Curtin et al., 18 Nov 2025).
  • Scalability: Deployment to petascale scales linearly with the number of compute nodes; e.g., 1 PB of Common Crawl can be indexed in 19 hours using 1,500 128-core nodes (Xu et al., 13 Jun 2025).

4. Major Applications and Empirical Findings

  • Language Modeling: $\infty$-gram LMs achieve high next-token "argmax" accuracy ($47\%$), outperforming 5-gram baselines (29%) and exceeding $80\%$ accuracy when long suffixes ($n \ge 14$) are matched (Liu et al., 30 Jan 2024). Hybrid neural/$\infty$-gram models reduce Llama-2 70B perplexity by 18–19%, and GPT-2 perplexity by 42% (a minimal interpolation sketch follows this list).
  • Benchmark Decontamination: Infini-gram mini enables exact substring matching and contamination studies for LLM benchmark test sets. Up to 40% of SQuAD, 27.7% of MMLU, and 32.6% of ARC-Challenge examples are contaminated by verbatim overlap in Internet crawls, raising concerns about benchmark validity (Xu et al., 13 Jun 2025).
  • Attribution, Copyright, and Explainability: Full-document reconstruction from $n$-gram hits enables provenance tracing and intellectual property audits.
  • Novelty Metrics: Quantification of unseen $n$-grams in generative outputs helps distinguish human-written from machine-generated content.
  • Generative Image Synthesis (Resolution-Agnostic): In a distinct research branch, "Infini-gram Engine" refers to InfGen, a one-step, arbitrary-resolution image decoder that supersedes traditional VAE decoders in latent diffusion models. Here, computational complexity grows linearly with the number of output pixels ($O(hw)$), in contrast to the quadratic scaling and multi-step inference of classical diffusion (Han et al., 12 Sep 2025).
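
The hybrid neural/$\infty$-gram gains above come from mixing the two next-token distributions. A minimal sketch of such a combination follows; the fixed weight `lam` is a simplifying assumption, whereas the cited work adapts the weight (e.g., to how long a suffix was matched):

```python
def interpolate(p_neural: dict, p_infty: dict, lam: float = 0.5) -> dict:
    """Linearly mix a neural LM's next-token distribution with an infini-gram estimate."""
    vocab = set(p_neural) | set(p_infty)
    return {w: (1 - lam) * p_neural.get(w, 0.0) + lam * p_infty.get(w, 0.0) for w in vocab}

# Toy usage: the infini-gram estimate sharpens the neural distribution toward
# a continuation that was observed verbatim after the matched suffix.
p_neural = {"dog": 0.4, "cat": 0.35, "car": 0.25}
p_infty = {"dog": 1.0}
print(interpolate(p_neural, p_infty, lam=0.3))  # dog 0.58, cat 0.245, car 0.175 (key order may vary)
```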

5. Theoretical Guarantees and Analysis

  • Intergrams Recall Bounds: Under empirical Zipf's law ($p_i \propto 1/i^a$, $a \approx 1.1$–$1.3$), the retained prefix set in Intergrams captures at least $99\%$ of the top-$k$ $n$-grams for realistic $k$, as shown by explicit mass-transfer lemmas and sampling deviation bounds (Curtin et al., 18 Nov 2025).
  • $\infty$-gram Normalization: The conditional probabilities defined by longest-matching-suffix backoff sum to 1, ensuring statistical soundness without further discounting (Liu et al., 30 Jan 2024); a short derivation sketch follows this list.
  • Latency/Memory/Accuracy Trade-offs: FM-index and Intergrams both analytically bound memory and per-query time to low-order polynomials in data size and $k$, with explicit sampling and oversampling factors controlling recall/precision.
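
For the normalization claim, a brief derivation sketch (ignoring the corner case where the matched suffix occurs at the very end of the corpus): writing $h = w_{i-n_i+1:i-1}$ for the longest context suffix with $\mathrm{cnt}(h) > 0$, and noting that $n_i$ depends only on the context,

$$\sum_{w \in \mathcal V} P_\infty(w \mid w_{1:i-1}) = \sum_{w \in \mathcal V} \frac{\mathrm{cnt}(h \cdot w)}{\mathrm{cnt}(h)} = \frac{\mathrm{cnt}(h)}{\mathrm{cnt}(h)} = 1,$$

since every occurrence of $h$ (except possibly one at the corpus end) is followed by exactly one token.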

6. Practical Considerations and Deployment

  • Hardware: 128 vCPU / 2 TiB RAM nodes suffice for even the largest single-shard indexes; any modern SSD is adequate for query nodes. Memory-mapped on-disk serving is standard (Xu et al., 13 Jun 2025).
  • Parallelism: All leading Infini-gram engines are "embarrassingly parallel" at the sharding or pass level. For Intergrams, per-shard processing is followed by heap merges or AllReduce steps.
  • Cache/TLB Optimization: Algorithm design exploits cache locality: all randomly indexed bit-vectors fit in L2/L3, with flushes sequenced for TLB efficiency.
  • Sampling Parameters: Tuning of $a$, $b$ (FM-index) or $z$ (Intergrams) allows practitioners to trade off index size, query latency, and recall as dictated by deployment constraints.
  • APIs and Web Interfaces: Web demos and RESTful APIs exist for both search/counting and document retrieval (Xu et al., 13 Jun 2025). Latencies are competitive at Internet scale (an illustrative query sketch follows this list).
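
As an illustration of what a count query might look like from client code, a minimal sketch against a hypothetical REST endpoint; the URL, index name, and JSON field names below are placeholders, not the documented API:

```python
import requests

# Hypothetical endpoint and payload shape; consult the project's API documentation
# for the real URL, available indexes, and field names.
API_URL = "https://api.example.org/infini-gram"

payload = {
    "index": "example-corpus",    # which sharded index to query (placeholder)
    "query_type": "count",        # e.g. count, next-token distribution, document lookup
    "query": "natural language processing",
}

response = requests.post(API_URL, json=payload, timeout=30)
response.raise_for_status()
print(response.json())            # e.g. {"count": 12345, "latency_ms": 21}
```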

7. Limitations, Future Directions, and Open Problems

  • Latency vs. In-memory Baselines: Disk-backed FM-index and suffix-array engines are slower than fully in-memory suffix arrays when reconstructing long passages, but require $2\times$ less storage at production scale (Xu et al., 13 Jun 2025).
  • Dynamic Updates and Open Vocabulary: Existing static SAs are not amenable to online updates; suffix automata or dynamic SAs are necessary for real-time or evolving corpora (Liu et al., 30 Jan 2024).
  • Extensions: Integration with compressed suffix arrays, wavelet trees, and hybrid approximate-exact filtering (e.g., Bloom filters on deep passes) is ongoing.
  • Distributed/Streaming: Large-scale, real-time top-$k$ tracking and distributed aggregation in streaming contexts remain open engineering challenges.
  • Image Synthesis: In InfGen, limitations arise at ultra-high resolutions ($>8$K) and for compatibility with non-VAE latent tokenizers, which requires retraining the decoder (Han et al., 12 Sep 2025).

Infini-gram Engines thus represent a class of architectures and methods, unified by the ability to bring exact, high-order $n$-gram computation to the Internet/data-lake scale with tractable hardware and software. Their applications impact the scientific study of machine learning corpora, the deployment and auditing of LLMs, statistical analysis at web scale, and, in the case of InfGen, high-resolution generative visual content. Key limitations are recognized and form the basis of active research in both the theoretical and systems arenas.
