Infini-gram: Scalable Corpus-Statistics Engine
- Infini-gram is a corpus-statistics engine that serves n-gram and entity co-occurrence counts over massive text corpora at low latency, enabling objective knowledge verification in RAG systems.
- It employs a compressed suffix array (an FM-index variant) to achieve millisecond-level query latency over a 4-trillion-token index, fast enough to be queried inside dynamic retrieval loops.
- The engine enhances retrieval-augmented generation by quantifying uncertainty with objective corpus statistics and reducing hallucinations, with empirical gains of up to +14 EM on multi-hop QA benchmarks.
Infini-gram is a corpus-statistics engine designed to provide scalable, low-latency counts of n-grams and entity co-occurrences over massive text corpora, enabling objective knowledge verification and uncertainty quantification in retrieval-augmented generation (RAG) systems. Infini-gram is integral to corpus-grounded uncertainty estimation pipelines such as QuCo-RAG, which deploys millisecond-latency Infini-gram queries on an index of 4 trillion tokens for dynamic retrieval triggering (Min et al., 22 Dec 2025).
1. Formal Definition and System Architecture
Infini-gram implements a suffix array–based infrastructure to support rapid queries for n-gram frequency and entity co-occurrence statistics on large-scale corpora. The core data structure is a compressed suffix array, specifically an FM-index variant, optimized for both memory footprint and query throughput. The system exposes the following APIs:
- count_ngram(ngram): Returns the frequency of the specified n-gram in the corpus.
- count_cooc(entity1, entity2, window_size): Returns the number of occurrences in which both entities appear within a sliding window of the specified size (typically 1,000 tokens).
Query operations over the entire 4T-token index demonstrate millisecond-level latency, suitable for real-time integration during LLM inference (Min et al., 22 Dec 2025).
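To make the counting mechanism concrete, the sketch below implements n-gram counting over a plain (uncompressed) suffix array on a toy token list: all suffixes beginning with the query n-gram form one contiguous run in suffix order, so a count reduces to two binary searches. This is illustrative only; the production engine layers FM-index compression and a 4T-token index on top of this idea. (Requires Python ≥ 3.10 for `bisect`'s `key` argument.)

```python
import bisect

def build_suffix_array(tokens):
    # Sort suffix start positions by the suffix they begin (quadratic
    # toy construction; production systems use linear-time algorithms).
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count_ngram(tokens, sa, ngram):
    # All suffixes starting with `ngram` are contiguous in the suffix
    # array, so the count is simply the width of that run.
    n = len(ngram)
    key = lambda i: tokens[i:i + n]
    lo = bisect.bisect_left(sa, ngram, key=key)
    hi = bisect.bisect_right(sa, ngram, key=key)
    return hi - lo

corpus = "the cat sat on the mat and the cat ran".split()
sa = build_suffix_array(corpus)
print(count_ngram(corpus, sa, ["the", "cat"]))  # -> 2
```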
2. Role in Uncertainty Quantification for RAG
Infini-gram provides corpus-grounded statistics for uncertainty quantification in dynamic RAG. Instead of relying on model-internal signals such as entropy or logit variance—which are unreliable due to LLM calibration failures—pipelines such as QuCo-RAG leverage Infini-gram's statistics to detect knowledge gaps and hallucination risks in two main stages:
- Pre-generation knowledge assessment: For each entity e in the prompt, its corpus frequency f(e) is queried using Infini-gram. If the average entity frequency falls below a threshold τ_entity, retrieval is triggered preemptively.
- Runtime claim verification: During generation, knowledge triplets (h, r, t) are extracted from each sentence. Infini-gram computes the co-occurrence count of h and t within a window ω; if the co-occurrence drops below τ_cooc, retrieval is triggered and the sentence is regenerated with the retrieved evidence (Min et al., 22 Dec 2025).
This approach shifts uncertainty estimation from subjective token-level signals to calibrated, corpus-derived statistics, addressing the problem of confident hallucinations in LLMs.
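As a minimal sketch, the two triggers above can be written as predicates over the query_freq/query_cooc helpers defined in Section 3; the threshold values here are hypothetical placeholders, not values reported in the paper.

```python
TAU_ENTITY = 1_000  # hypothetical frequency threshold
TAU_COOC = 1        # hypothetical co-occurrence threshold
WINDOW = 1_000      # sliding-window size in tokens (see Section 1)

def needs_pregen_retrieval(entities):
    # Pre-generation check: trigger retrieval when the prompt's
    # entities are, on average, rare in the pre-training corpus.
    if not entities:
        return False
    avg_f = sum(query_freq(e) for e in entities) / len(entities)
    return avg_f < TAU_ENTITY

def claim_unsupported(triplets):
    # Runtime check: flag a sentence containing any (head, tail) pair
    # that co-occurs too rarely within the window.
    return any(query_cooc(h, t, WINDOW) < TAU_COOC for (h, _r, t) in triplets)
```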
3. Algorithmic Workflow and Pseudocode
The interaction between QuCo-RAG and Infini-gram can be summarized by the following pseudocode fragments:
```python
def query_freq(e):
    # Corpus frequency of the entity string e
    return InfiniNgram.count_ngram(e)

def query_cooc(h, t, window):
    # Co-occurrence count of head/tail entities within `window` tokens
    return InfiniNgram.count_cooc(h, t, window_size=window)
```
The overall dynamic retrieval workflow is as follows:
```python
# Python rendering of the QuCo-RAG workflow. extract_entities,
# extract_triplets, form_query, Retrieve, and the LLM interface are
# external pipeline components; the pre-training corpus P is accessed
# implicitly through the Infini-gram index.
def quco_rag(Q, C, tau_entity, tau_cooc, N, window=1000):
    # Stage 1: pre-generation knowledge assessment
    E_Q = extract_entities(Q)
    avg_f = sum(query_freq(e) for e in E_Q) / len(E_Q)
    context = [Retrieve(Q, C), Q] if avg_f < tau_entity else [Q]

    # Stage 2: sentence-level generation with runtime claim verification
    y = []
    for _ in range(N):  # up to N sentences
        s_i = LLM.generate_sentence(context)
        y.append(s_i)
        T = extract_triplets(s_i)
        if T and min(query_cooc(h, t, window) for (h, _r, t) in T) < tau_cooc:
            # Re-query on the least-supported triplet, then regenerate
            h, r, _t = min(T, key=lambda x: query_cooc(x[0], x[2], window))
            D_i = Retrieve(form_query(head=h, relation=r), C)
            context = [D_i] + context  # context without the flagged s_i
            s_i = LLM.regenerate_sentence(context)
            y[-1] = s_i
        context = context + [s_i]
    return " ".join(y)
```
All corpus-level counts are supplied by Infini-gram, which is queried online; the integration requires no LLM retraining or internal modification (Min et al., 22 Dec 2025).
4. Integration and Model-Agnostic Application
Infini-gram is strictly external to the LLM; it does not interact with model logits, hidden states, or parameters. When knowledge gaps or hallucination risks are detected, passages retrieved in response to Infini-gram triggers are prepended to the LLM context as plain text. No instruction tuning or fine-tuning is required for integration, enabling seamless deployment across models with transparent or undisclosed pre-training corpora (e.g., OLMo-2, Llama, Qwen, GPT) (Min et al., 22 Dec 2025).
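A minimal sketch of this integration path follows; the prompt template is an illustrative assumption, not a format specified in the paper.

```python
def build_context(retrieved_docs, question):
    # Retrieved evidence enters purely as input text; no logits,
    # hidden states, or fine-tuning are involved.
    evidence = "\n\n".join(f"[Doc {i + 1}] {d}" for i, d in enumerate(retrieved_docs))
    return f"{evidence}\n\nQuestion: {question}\nAnswer:"
```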
5. Experimental Impact in RAG Pipelines
Empirical studies with Infini-gram–enabled QuCo-RAG report substantial improvements in multi-hop QA benchmarks. With OLMo-2 family LLMs, QuCo-RAG achieves 5–12 point EM gains compared to the best dynamic baselines, and even higher (up to +14 EM) when transferring to models with different pre-training data. Biomedical QA tasks demonstrate robust domain generalization, with Infini-gram supporting accurate detection of novel entities (low frequency) and unsupported factual claims (zero co-occurrence), leading to reduced hallucinations and improved answer accuracy (Min et al., 22 Dec 2025).
This performance is achieved with fewer than three retrievals per question and consistently sub-10 ms query latency, attributable to the efficiency of the suffix-array and FM-index–based architecture.
6. Generalization, Limitations, and Extensions
- Generalization: Infini-gram’s support for explicit n-gram/entity queries generalizes across domains, enabling use in both general-knowledge and specialized biomedical QA.
- Limitations: Surface-form matching limits detection of entity aliases, and facts that post-date the index cutoff are invisible, so evolving knowledge requires periodic re-indexing. Infini-gram operates on a static pre-training corpus and cannot compensate for wholly unseen information (Min et al., 22 Dec 2025).
- Extensions: Proposed directions include multilingual indexing for cross-lingual query support, time-stamped indexes to enable temporal reasoning, and expanding to event co-occurrences and quantitative/numeric claims.
7. Infini-gram in the Broader RAG Ecosystem
Infini-gram complements traditional vector search and semantic retrieval methods by providing orthogonal and interpretable corpus statistics for real-time verification. As RAG systems increasingly depend on both retrieval quality and verification fidelity, Infini-gram’s inclusion in objective uncertainty estimation pipelines—such as QuCo-RAG—marks a shift toward corpus-grounded, model-agnostic QA and dynamic evidence integration (Min et al., 22 Dec 2025).
A plausible implication is that the widespread adoption of Infini-gram–like engines will drive the development of corpus-aware generation protocols and more transparent QA pipelines in large-scale systems.