Infini-gram: Scalable N-Gram Models
- The Infini-gram tool is a scalable framework that extends traditional n-gram models to trillion-token corpora using unbounded context length.
- It employs advanced data structures such as suffix arrays and FM-indices to achieve fast query times and efficient storage.
- The tool enhances next-token prediction and neural LLM performance by interpolating statistical and neural outputs to reduce perplexity.
The Infini-gram Tool is a modernized framework for large-scale n-gram statistics, designed to efficiently compute probabilistic n-gram language models with arbitrarily long context over trillion-token corpora. It leverages advanced data structures such as suffix arrays and FM-indices to scale classical n-gram LM methodology into the regime of both internet-scale search and modern neural–statistical hybrid applications. Infini-gram and its close derivative, Infini-gram mini, underpin a suite of research tools for next-token prediction, linguistic analysis, benchmark contamination measurement, and retrieval-augmented LLM workflows.
1. Mathematical Model and Mechanism
Traditional n-gram LMs estimate the next-token probability by counting occurrences of length-$n$ sequences in a training corpus $\mathcal{D}$:

$$P_n(w_i \mid w_{i-n+1:i-1}) = \frac{\mathrm{cnt}(w_{i-n+1:i})}{\mathrm{cnt}(w_{i-n+1:i-1})}$$

Infini-gram generalizes $n$ to be unbounded, introducing the $\infty$-gram model that always backs off to the longest context suffix observed in the dataset:

$$n^\ast = \max\{\, n : \mathrm{cnt}(w_{i-n+1:i-1}) > 0 \,\}$$

and thus

$$P_\infty(w_i \mid w_{1:i-1}) = \frac{\mathrm{cnt}(w_{i-n^\ast+1:i})}{\mathrm{cnt}(w_{i-n^\ast+1:i-1})}.$$
This scheme creates a consistent probability distribution without requiring explicit discounting, provided the context–suffix length selection depends only on observed history.
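To make the backoff rule concrete, the following is a minimal Python sketch of the $\infty$-gram estimator; the naive `count` stands in for the suffix-array query the real engine uses, and all names are illustrative:

```python
from typing import List, Tuple

def count(corpus: List[int], pattern: List[int]) -> int:
    """Naive occurrence count; the production engine answers this via a suffix array."""
    if not pattern:
        return len(corpus)
    m = len(pattern)
    return sum(corpus[i:i + m] == pattern for i in range(len(corpus) - m + 1))

def infty_gram(corpus: List[int], context: List[int], token: int) -> Tuple[float, int]:
    """P_infty(token | context): back off to the longest context suffix with
    nonzero count, then return (probability, effective n)."""
    for start in range(len(context) + 1):
        suffix = context[start:]                   # candidate longest suffix
        denom = count(corpus, suffix)              # cnt(w_{i-n*+1 : i-1})
        if denom > 0:
            num = count(corpus, suffix + [token])  # cnt(w_{i-n*+1 : i})
            return num / denom, len(suffix) + 1
    return 0.0, 0

# Example: the suffix [1, 2, 3] occurs once and is always followed by 4.
corpus = [1, 2, 3, 4, 2, 3, 5]
print(infty_gram(corpus, context=[1, 2, 3], token=4))  # (1.0, 4)
```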
2. System Architecture and Data Structures
The Infini-gram engine is powered by a global suffix array built over the tokenized training corpus, which enables substring count queries for any context length $n$. The index requires only 7 bytes per token (including a 5-byte pointer in the suffix array and 2 bytes for token storage), supporting corpora up to 5 trillion tokens. Query latency for n-gram counts is approximately 20 milliseconds, with full next-token distributions computed in under 200 milliseconds. Supporting methods include parallel shard processing, hinted search, and memory pre-fetching for rapid on-disk operation (Liu et al., 30 Jan 2024).
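As an illustration of the count-query mechanism (a toy in-memory version, not the on-disk implementation), a suffix array answers any-length counts with two binary searches:

```python
import bisect  # requires Python 3.10+ for the `key` argument

def build_suffix_array(tokens):
    """Toy O(n^2 log n) construction; the real index is built out-of-core."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def sa_count(tokens, sa, pattern):
    """Occurrences of `pattern` = size of the suffix-array range matching it.
    Truncating each suffix to len(pattern) tokens preserves sorted order."""
    key = lambda i: tokens[i:i + len(pattern)]
    lo = bisect.bisect_left(sa, pattern, key=key)
    hi = bisect.bisect_right(sa, pattern, key=key)
    return hi - lo

tokens = [2, 3, 1, 2, 3, 2]
sa = build_suffix_array(tokens)
print(sa_count(tokens, sa, [2, 3]))  # 2
```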
The companion Infini-gram mini system implements the FM-index (Burrows–Wheeler Transform with sampled suffix and inverse arrays) for massive, compressed on-disk search. The FM-index configuration achieves a practical index/storage multiplier of 0.44× the corpus size, substantially lower than canonical suffix arrays (Xu et al., 13 Jun 2025).
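For intuition, here is a self-contained character-level sketch of FM-index counting via backward search; Infini-gram mini operates at vastly larger scale and replaces the naive `rank` below with compressed rank structures:

```python
def bwt(text: str) -> str:
    """Burrows–Wheeler transform via the suffix array of text + sentinel."""
    s = text + "\0"                      # sentinel sorts before all symbols
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    return "".join(s[i - 1] for i in sa)

def fm_count(bwt_str: str, pattern: str) -> int:
    """Backward search: count occurrences of `pattern` using only the BWT."""
    # C[c] = number of symbols in the text strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(bwt_str)):
        C[c] = total
        total += bwt_str.count(c)
    rank = lambda c, i: bwt_str[:i].count(c)   # naive; real indices compress this
    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):                # extend the match right-to-left
        if c not in C:
            return 0
        lo, hi = C[c] + rank(c, lo), C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(fm_count(bwt("abracadabra"), "abra"))  # 2
```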
3. Empirical Performance and Complementarity to Neural Models
A pure $\infty$-gram LM achieves 47% next-token prediction accuracy on human-written text, compared to 29% for a 5-gram model, with sparse $\infty$-gram estimates (long matched suffixes) attaining up to 75–80% token-level accuracy. Interpolating Infini-gram statistics with neural LLM outputs reduces model perplexity by up to 73%, including with models as large as 70B parameters (Liu et al., 30 Jan 2024).
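The interpolation itself is a per-position linear mixture of next-token distributions. A minimal sketch, assuming a scalar weight `lam` (the paper derives the weight per position from properties of the $\infty$-gram estimate, such as its sparsity):

```python
import numpy as np

def mix_next_token(p_neural: np.ndarray, p_infty: np.ndarray, lam: float) -> np.ndarray:
    """Linear interpolation of two next-token distributions over the vocabulary."""
    return lam * p_infty + (1.0 - lam) * p_neural

def perplexity(gold_token_probs: np.ndarray) -> float:
    """Perplexity of a sequence from per-position probabilities of the gold tokens."""
    return float(np.exp(-np.log(gold_token_probs).mean()))
```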
Experiments analyzing machine-generated text (greedy, temperature, and nucleus sampling) uncover non-uniformities in neural LM–$\infty$-gram agreement across context lengths, which reveal weaknesses in Transformer positional embeddings and expose memorization or statistical-drift patterns that would be invisible to the neural component alone.
4. Engineering Scalability and Efficiency
Both Infini-gram engines (suffix array and FM-index variants) scale to corpora from terabytes to petabytes. For example, Infini-gram mini indexed 46TB of Internet text in 50 days on a single 128-core CPU node (or 19 hours with 75 nodes) (Xu et al., 13 Jun 2025). Indexing speed is 18× faster than the SDSL baseline, and memory use during both indexing and querying is reduced by a factor of 3.2.
Query time for a count is $O(m \cdot H_0)$, where $m$ is the query length and $H_0$ the corpus zeroth-order entropy. Even on extremely large and compressed indices, most queries complete in seconds.
5. Applications and Use Cases
Infini-gram’s primary application is the efficient construction, querying, and analysis of massive-scale n-gram LMs. Documented application domains include:
- Language modeling: Construction of trillion-token scale $\infty$-gram models for next-token prediction, model evaluation, and data contamination analysis.
- Improving neural LLMs: Interpolation of infini-gram and neural probabilities to reduce model perplexity and diagnose transformer weak spots by comparison of statistical continuation properties.
- Text analysis: Exploration of long-range patterning, memorization, and data contamination in large-scale corpora.
- Benchmark contamination detection: Automated, large-scale search for benchmark overlaps in pretraining data, calculating “contamination rates” as the fraction of benchmark test substrings found verbatim in the corpus (see the sketch after this list). Findings indicate contamination rates up to 40% in SQuAD, with implications for model evaluation reliability (Xu et al., 13 Jun 2025).
- Document retrieval and dataset curation: High-speed, exact-match n-gram document search, enabling data decontamination, dataset construction, and phrase-based information retrieval.
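A minimal sketch of the contamination-rate computation; `count_fn` stands in for an engine count query (e.g., a call to the public API), and its interface is an assumption for illustration:

```python
from typing import Callable, Iterable

def contamination_rate(test_substrings: Iterable[str],
                       count_fn: Callable[[str], int]) -> float:
    """Fraction of benchmark substrings found verbatim in the pretraining corpus."""
    substrings = list(test_substrings)
    hits = sum(1 for s in substrings if count_fn(s) > 0)  # count > 0 => contaminated
    return hits / len(substrings)
```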
A summary table of application domains and system features:
| Application | Engine/Index | Performance Metric |
|---|---|---|
| Next-token prediction | Suffix array | 47% accuracy (human-written text) |
| Benchmark contamination search | FM-index (mini) | Up to 40% contamination detected |
| Retrieval-augmented LLM hybridization | Both | Up to 73% perplexity reduction |
6. User Interfaces and Accessibility
Infini-gram mini is accessible through both a web interface (infini-gram-mini.io/demo) for interactive exploration and an API endpoint (api.infini-gram-mini.io) for programmatic queries. The interface supports n-gram count, substring search, and document retrieval functions, democratizing access to petabyte-scale data resources and enabling reproducible benchmark analysis without reconstructing indices (Xu et al., 13 Jun 2025).
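A programmatic query might look like the following; the request fields shown are assumptions for illustration, so consult the live API documentation for the actual schema:

```python
import requests

# Hypothetical payload -- the field names ("query_type", "query") are assumed
# for illustration and may differ from the real api.infini-gram-mini.io schema.
resp = requests.post(
    "https://api.infini-gram-mini.io/",
    json={"query_type": "count", "query": "retrieval-augmented generation"},
    timeout=30,
)
print(resp.json())
```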
7. Theoretical Context, Limitations, and Future Directions
Theoretical and empirical work indicates that while transformers are provably capable of representing any n-gram LM given sufficient architectural specificity (e.g., number of heads/layers, sparse attention), count-based n-gram estimators with appropriate smoothing outperform neural models when next-symbol probabilities are arbitrary and lack parameter sharing (Svete et al., 3 Oct 2024). When parameter sharing via representations is possible, transformers excel, especially with dense embeddings and sparse attention.
Document retrieval latency in the compressed FM-index remains slower than count queries due to additional random disk I/O. Improvements such as disk page prefetching are suggested to reduce latency. The system roadmap includes expanding indexed corpora, more frequent corpus updates, and extended support for community-submitted benchmarks to continuously monitor dataset contamination (Xu et al., 13 Jun 2025).
8. Significance and Research Implications
Infini-gram modernizes and revitalizes n-gram language modeling in the era of internet-scale text and neural networks. By bridging statistical and neural worlds—in both stand-alone language modeling and as a diagnostic or augmenting tool for neural LLMs—the framework provides new methodological leverage for large-scale text analysis, evaluation robustness, and data-centric research. Its advances in indexing and query efficiency are foundational for practical deployment in both academic and applied settings, encompassing direct linguistic analysis, retrieval-augmented generation, and responsible data curation.