Infini-gram: Scalable Corpus-Statistics Engine
- Infini-gram is a corpus-statistics engine that serves n-gram and entity co-occurrence counts over massive text corpora at low latency, enabling objective knowledge verification in RAG systems.
- It employs a compressed suffix array (an FM-index variant) to achieve millisecond-level query latency over a 4-trillion-token index, fast enough to be queried inside dynamic retrieval loops.
- The engine enhances retrieval-augmented generation by quantifying uncertainty with objective corpus statistics and reducing hallucinations, with empirical gains of up to +14 EM on multi-hop QA benchmarks.
Infini-gram is a corpus-statistics engine designed to provide scalable, low-latency counts of n-grams and entity co-occurrences over massive text corpora, enabling objective knowledge verification and uncertainty quantification in retrieval-augmented generation (RAG) systems. Infini-gram is integral to corpus-grounded uncertainty estimation pipelines such as QuCo-RAG, which deploys millisecond-latency Infini-gram queries on an index of 4 trillion tokens for dynamic retrieval triggering (Min et al., 22 Dec 2025).
1. Formal Definition and System Architecture
Infini-gram implements a suffix array–based infrastructure to support rapid queries for n-gram frequency and entity co-occurrence statistics on large-scale corpora. The core data structure is a compressed suffix array, specifically an FM-index variant, optimized for both memory footprint and query throughput. The system exposes the following APIs:
- count_ngram(ngram): Returns the frequency of the specified n-gram in the corpus.
- count_cooc(entity1, entity2, window_size): Returns the number of occurrences in which both entities appear within a sliding window of the specified size (typically 1,000 tokens).
Query operations over the entire 4T-token index demonstrate millisecond-level latency, suitable for real-time integration during LLM inference (Min et al., 22 Dec 2025).
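To make the counting mechanism concrete, the sketch below implements n-gram counting over a plain (uncompressed) suffix array on a toy token list: all suffixes beginning with the query n-gram form one contiguous run in suffix order, so a count reduces to two binary searches. This is illustrative only; the production engine layers FM-index compression and a 4T-token index on top of this idea. (Requires Python ≥ 3.10 for `bisect`'s `key` argument.)

```python
import bisect

def build_suffix_array(tokens):
    # Sort suffix start positions by the suffix they begin (quadratic
    # toy construction; production systems use linear-time algorithms).
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count_ngram(tokens, sa, ngram):
    # All suffixes starting with `ngram` are contiguous in the suffix
    # array, so the count is simply the width of that run.
    n = len(ngram)
    key = lambda i: tokens[i:i + n]
    lo = bisect.bisect_left(sa, ngram, key=key)
    hi = bisect.bisect_right(sa, ngram, key=key)
    return hi - lo

corpus = "the cat sat on the mat and the cat ran".split()
sa = build_suffix_array(corpus)
print(count_ngram(corpus, sa, ["the", "cat"]))  # -> 2
```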
2. Role in Uncertainty Quantification for RAG
Infini-gram provides corpus-grounded statistics for uncertainty quantification in dynamic RAG. Instead of relying on model-internal signals such as entropy or logit variance—which are unreliable due to LLM calibration failures—pipelines such as QuCo-RAG leverage Infini-gram's statistics to detect knowledge gaps and hallucination risks in two main stages:
- Pre-generation knowledge assessment: For each entity e in the prompt, its corpus frequency f(e) is queried using Infini-gram. If the average entity frequency falls below a threshold τ_entity, retrieval is triggered preemptively.
- Runtime claim verification: During generation, knowledge triplets (h, r, t) are extracted from each sentence. Infini-gram computes the co-occurrence count of h and t within a window ω; if the co-occurrence drops below τ_cooc, retrieval is triggered and the sentence is regenerated with the retrieved evidence (Min et al., 22 Dec 2025).
This approach shifts uncertainty estimation from subjective token-level signals to calibrated, corpus-derived statistics, addressing the problem of confident hallucinations in LLMs.
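As a minimal sketch, the two triggers above can be written as predicates over the query_freq/query_cooc helpers defined in Section 3; the threshold values here are hypothetical placeholders, not values reported in the paper.

```python
TAU_ENTITY = 1_000  # hypothetical frequency threshold
TAU_COOC = 1        # hypothetical co-occurrence threshold
WINDOW = 1_000      # sliding-window size in tokens (see Section 1)

def needs_pregen_retrieval(entities):
    # Pre-generation check: trigger retrieval when the prompt's
    # entities are, on average, rare in the pre-training corpus.
    if not entities:
        return False
    avg_f = sum(query_freq(e) for e in entities) / len(entities)
    return avg_f < TAU_ENTITY

def claim_unsupported(triplets):
    # Runtime check: flag a sentence containing any (head, tail) pair
    # that co-occurs too rarely within the window.
    return any(query_cooc(h, t, WINDOW) < TAU_COOC for (h, _r, t) in triplets)
```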
3. Algorithmic Workflow and Pseudocode
The interaction between QuCo-RAG and Infini-gram can be summarized by the following pseudocode fragments:
```python
def query_freq(e):
    # Corpus frequency of the entity string e
    return InfiniNgram.count_ngram(e)

def query_cooc(h, t, window):
    # Co-occurrence count of head/tail entities within `window` tokens
    return InfiniNgram.count_cooc(h, t, window_size=window)
```
The overall dynamic retrieval workflow is as follows:
```python
# Python rendering of the QuCo-RAG workflow. extract_entities,
# extract_triplets, form_query, Retrieve, and the LLM interface are
# external pipeline components; the pre-training corpus P is accessed
# implicitly through the Infini-gram index.
def quco_rag(Q, C, tau_entity, tau_cooc, N, window=1000):
    # Stage 1: pre-generation knowledge assessment
    E_Q = extract_entities(Q)
    avg_f = sum(query_freq(e) for e in E_Q) / len(E_Q)
    context = [Retrieve(Q, C), Q] if avg_f < tau_entity else [Q]

    # Stage 2: sentence-level generation with runtime claim verification
    y = []
    for _ in range(N):  # up to N sentences
        s_i = LLM.generate_sentence(context)
        y.append(s_i)
        T = extract_triplets(s_i)
        if T and min(query_cooc(h, t, window) for (h, _r, t) in T) < tau_cooc:
            # Re-query on the least-supported triplet, then regenerate
            h, r, _t = min(T, key=lambda x: query_cooc(x[0], x[2], window))
            D_i = Retrieve(form_query(head=h, relation=r), C)
            context = [D_i] + context  # context without the flagged s_i
            s_i = LLM.regenerate_sentence(context)
            y[-1] = s_i
        context = context + [s_i]
    return " ".join(y)
```
All corpus-level counts are supplied by Infini-gram, which is queried online; the integration requires no LLM retraining or internal modification (Min et al., 22 Dec 2025).
4. Integration and Model-Agnostic Application
Infini-gram is strictly external to the LLM; it does not interact with model logits, hidden states, or parameters. When knowledge gaps or hallucination risks are detected, passages retrieved in response to Infini-gram triggers are prepended to the LLM context as plain text. No instruction tuning or fine-tuning is required for integration, enabling seamless deployment across models with transparent or undisclosed pre-training corpora (e.g., OLMo-2, Llama, Qwen, GPT) (Min et al., 22 Dec 2025).
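A minimal sketch of this integration path follows; the prompt template is an illustrative assumption, not a format specified in the paper.

```python
def build_context(retrieved_docs, question):
    # Retrieved evidence enters purely as input text; no logits,
    # hidden states, or fine-tuning are involved.
    evidence = "\n\n".join(f"[Doc {i + 1}] {d}" for i, d in enumerate(retrieved_docs))
    return f"{evidence}\n\nQuestion: {question}\nAnswer:"
```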
5. Experimental Impact in RAG Pipelines
Empirical studies with Infini-gram–enabled QuCo-RAG report substantial improvements in multi-hop QA benchmarks. With OLMo-2 family LLMs, QuCo-RAG achieves 5–12 point EM gains compared to the best dynamic baselines, and even higher (up to +14 EM) when transferring to models with different pre-training data. Biomedical QA tasks demonstrate robust domain generalization, with Infini-gram supporting accurate detection of novel entities (low frequency) and unsupported factual claims (zero co-occurrence), leading to reduced hallucinations and improved answer accuracy (Min et al., 22 Dec 2025).
This performance is achieved with fewer than three retrievals per question and consistently sub-10 ms query latency, attributable to the efficiency of the suffix-array and FM-index–based architecture.
6. Generalization, Limitations, and Extensions
- Generalization: Infini-gram’s support for explicit n-gram/entity queries generalizes across domains, enabling use in both general-knowledge and specialized biomedical QA.
- Limitations: Surface-form matching limits detection of entity aliases, and facts that post-date the index cutoff are invisible, so evolving knowledge requires periodic re-indexing. Infini-gram operates on a static pre-training corpus and cannot compensate for wholly unseen information (Min et al., 22 Dec 2025).
- Extensions: Proposed directions include multilingual indexing for cross-lingual query support, time-stamped indexes to enable temporal reasoning, and expanding to event co-occurrences and quantitative/numeric claims.
7. Infini-gram in the Broader RAG Ecosystem
Infini-gram complements traditional vector search and semantic retrieval methods by providing orthogonal and interpretable corpus statistics for real-time verification. As RAG systems increasingly depend on both retrieval quality and verification fidelity, Infini-gram’s inclusion in objective uncertainty estimation pipelines—such as QuCo-RAG—marks a shift toward corpus-grounded, model-agnostic QA and dynamic evidence integration (Min et al., 22 Dec 2025).
A plausible implication is that the widespread adoption of Infini-gram–like engines will drive the development of corpus-aware generation protocols and more transparent QA pipelines in large-scale systems.