Infini-gram: Scalable Corpus-Statistics Engine

Updated 23 December 2025
  • Infini-gram is a corpus-statistics engine that serves low-latency n-gram and entity co-occurrence counts over massive text corpora, supporting objective knowledge verification in RAG systems.
  • It employs a compressed suffix array (an FM-index variant) to achieve millisecond-level query latency over a 4-trillion-token index, fast enough to drive dynamic retrieval decisions at inference time.
  • The engine enhances retrieval-augmented generation by objectively quantifying uncertainty and reducing hallucinations, with reported gains of up to +14 exact match (EM) points on QA benchmarks.

Infini-gram is a corpus-statistics engine designed to provide scalable, low-latency counts of n-grams and entity co-occurrences over massive text corpora, enabling objective knowledge verification and uncertainty quantification in retrieval-augmented generation (RAG) systems. It is integral to corpus-grounded uncertainty estimation pipelines such as QuCo-RAG, which issues millisecond-latency queries against an index of 4 trillion tokens for dynamic retrieval triggering (Min et al., 22 Dec 2025).

1. Formal Definition and System Architecture

Infini-gram implements a suffix array–based infrastructure to support rapid queries for n-gram frequency and entity co-occurrence statistics on large-scale corpora. The core data structure is a compressed suffix array, specifically an FM-index variant, optimized for both memory footprint and query throughput. The system exposes the following APIs:

  • count_ngram(ngram): Returns the frequency of the specified n-gram in the corpus.
  • count_cooc(entity1, entity2, window_size): Returns the count of occurrences where both entities appear within a sliding window of the specified size (typically 1,000 tokens).

Query operations over the entire 4T-token index demonstrate millisecond-level latency, suitable for real-time integration during LLM inference (Min et al., 22 Dec 2025).
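
To make the query semantics concrete, the following sketch implements both APIs over a plain, uncompressed suffix array in pure Python. It is illustrative only: the class name, the tokenized interface, and the windowed co-occurrence semantics are assumptions, and the production engine instead uses a compressed FM-index to scale to trillions of tokens.

from bisect import bisect_left, bisect_right

class TinyCorpusIndex:
    """Toy in-memory stand-in for a suffix-array corpus index.

    The production engine uses a compressed FM-index over trillions
    of tokens; this sketch reproduces only the query semantics.
    """

    def __init__(self, tokens):
        self.tokens = list(tokens)
        # Suffix array: every suffix start position, sorted lexicographically.
        self.sa = sorted(range(len(self.tokens)), key=lambda i: self.tokens[i:])

    def count_ngram(self, ngram):
        # Frequency of `ngram` (a tuple of tokens): locate the contiguous
        # block of suffixes starting with it via two binary searches.
        n = len(ngram)
        # Materializing prefixes keeps the sketch short; a real index
        # compares suffixes lazily against the pattern instead.
        prefixes = [tuple(self.tokens[i:i + n]) for i in self.sa]
        return bisect_right(prefixes, tuple(ngram)) - bisect_left(prefixes, tuple(ngram))

    def count_cooc(self, entity1, entity2, window_size=1000):
        # One plausible windowed semantics: count occurrences of entity1
        # that have some occurrence of entity2 within window_size tokens.
        pos2 = self._positions(entity2)
        return sum(1 for i in self._positions(entity1)
                   if any(abs(i - j) <= window_size for j in pos2))

    def _positions(self, entity):
        n = len(entity)
        return [i for i in range(len(self.tokens) - n + 1)
                if tuple(self.tokens[i:i + n]) == tuple(entity)]

idx = TinyCorpusIndex("the cat sat on the mat near the cat".split())
print(idx.count_ngram(("the", "cat")))                     # -> 2
print(idx.count_cooc(("cat",), ("mat",), window_size=4))   # -> 2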

2. Role in Uncertainty Quantification for RAG

Infini-gram provides corpus-grounded statistics for uncertainty quantification in dynamic RAG. Instead of relying on model-internal signals such as entropy or logit variance—which are unreliable due to LLM calibration failures—pipelines such as QuCo-RAG leverage Infini-gram's statistics to detect knowledge gaps and hallucination risks in two main stages:

  1. Pre-generation knowledge assessment: For each entity $e$ in the prompt, query $\mathrm{freq}(e; \mathcal{P})$ using Infini-gram, where $\mathcal{P}$ is the pre-training corpus. If the average entity frequency falls below a threshold ($\tau_{\mathrm{entity}} = 10^3$), retrieval is triggered preemptively.
  2. Runtime claim verification: During generation, knowledge triplets $(h, r, t)$ are extracted from each sentence. Infini-gram computes $\mathrm{cooc}(h, t; \mathcal{P})$ within a window $\omega$; if the co-occurrence count drops below $\tau_{\mathrm{cooc}} = 1$, retrieval is triggered and the sentence is regenerated with the retrieved evidence (Min et al., 22 Dec 2025).

This approach shifts uncertainty estimation from subjective token-level signals to calibrated, corpus-derived statistics, addressing the problem of confident hallucinations in LLMs.
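
Both stages reduce to threshold checks over these counts. Below is a minimal sketch, assuming the query_freq/query_cooc wrappers shown in Section 3 and taking entity and triplet extraction as given; the threshold values follow the paper, while the function names are hypothetical.

TAU_ENTITY = 1_000   # τ_entity: minimum average entity frequency in P
TAU_COOC = 1         # τ_cooc: minimum head-tail co-occurrence count

def needs_pregen_retrieval(entities, query_freq):
    # Stage 1: trigger retrieval when the prompt's entities are rare in P.
    freqs = [query_freq(e) for e in entities]
    return bool(freqs) and sum(freqs) / len(freqs) < TAU_ENTITY

def unsupported_triplets(triplets, query_cooc, window=1000):
    # Stage 2: flag (h, r, t) claims whose head and tail never co-occur
    # within the window (a count below τ_cooc = 1 means zero support).
    return [(h, r, t) for (h, r, t) in triplets
            if query_cooc(h, t, window) < TAU_COOC]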

3. Algorithmic Workflow and Pseudocode

The interaction between QuCo-RAG and Infini-gram can be summarized by the following pseudocode fragments:

def query_freq(e):
    # Corpus frequency of entity e in the pre-training corpus P.
    return InfiniNgram.count_ngram(e)

def query_cooc(h, t, ω):
    # Count of windows of size ω in which head h and tail t co-occur.
    return InfiniNgram.count_cooc(h, t, window=ω)

The overall dynamic retrieval workflow is as follows:

Input: question Q, ext. KB C, pre-train corpus P, thresholds τ_entity, τ_cooc
E_Q ← extract_entities(Q)
avg_f ← mean(query_freq(e) for e in E_Q)
if avg_f < τ_entity:                      # Stage 1: pre-generation assessment
    D_0 ← Retrieve(Q, C)
    context ← [D_0; Q]
else:
    context ← [Q]
y ← []
for i = 1..N:
    s_i ← LLM.generate_sentence(context)
    y.append(s_i)
    T ← extract_triplets(s_i)
    (h*, r*, t*) ← argmin over (h,r,t) in T of query_cooc(h, t, ω)
    if query_cooc(h*, t*, ω) < τ_cooc:    # Stage 2: runtime claim verification
        q_i ← form_query(head=h*, relation=r*)
        D_i ← Retrieve(q_i, C)
        context ← [D_i; context_without_s_i]
        s_i ← LLM.regenerate_sentence(context)
        y[-1] ← s_i
    context ← context ⊕ s_i
return join(y)

All corpus-level counts are supplied by Infini-gram, which is queried online and does not require LLM retraining or internal modification (Min et al., 22 Dec 2025).
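
For concreteness, the pseudocode above translates into the following Python sketch. Only the control flow driven by Infini-gram's counts comes from the paper; the retriever, LLM, and extractor callables (and all names below) are injected placeholders.

def quco_rag_answer(question, retrieve, llm, extract_entities, extract_triplets,
                    index, tau_entity=1_000, tau_cooc=1, window=1000, max_sents=8):
    # Stage 1: pre-generation knowledge assessment over the prompt's entities.
    freqs = [index.count_ngram(e) for e in extract_entities(question)]
    context = [question]
    if freqs and sum(freqs) / len(freqs) < tau_entity:
        context = [retrieve(question)] + context   # preemptive retrieval

    answer = []
    for _ in range(max_sents):
        sentence = llm.generate_sentence(context)
        # Stage 2: verify each extracted (h, r, t) claim against the corpus.
        weak = [(h, r, t) for (h, r, t) in extract_triplets(sentence)
                if index.count_cooc(h, t, window) < tau_cooc]
        if weak:
            h, r, _ = weak[0]
            # Retrieve evidence for the unsupported claim, then regenerate
            # the draft sentence with that evidence prepended.
            context = [retrieve(f"{h} {r}")] + context
            sentence = llm.regenerate_sentence(context)
        answer.append(sentence)
        context = context + [sentence]
    return " ".join(answer)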

4. Integration and Model-Agnostic Application

Infini-gram is strictly external to the LLM; it does not interact with model logits, hidden states, or parameters. When a knowledge gap or hallucination risk is detected, the passages retrieved in response to Infini-gram queries are prepended to the LLM context as plain text. No instruction tuning or fine-tuning is required for integration, enabling seamless deployment across models with transparent or undisclosed pre-training corpora (e.g., OLMo-2, Llama, Qwen, GPT) (Min et al., 22 Dec 2025).
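
Because the coupling is plain text, integration reduces to prompt assembly. A minimal sketch follows, with a prompt template that is an assumption for illustration rather than a format specified in the paper:

def build_prompt(question, retrieved_docs=()):
    # Prepend retrieved evidence as plain text; no model internals are touched.
    evidence = "\n\n".join(f"[Doc {i + 1}] {d}" for i, d in enumerate(retrieved_docs))
    question_block = f"Question: {question}\nAnswer:"
    return f"{evidence}\n\n{question_block}" if evidence else question_block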

5. Experimental Impact in RAG Pipelines

Empirical studies with Infini-gram–enabled QuCo-RAG report substantial improvements on multi-hop QA benchmarks. With OLMo-2 family LLMs, QuCo-RAG achieves 5–12 point EM gains over the best dynamic baselines, and even larger gains (up to +14 EM) when transferred to models with different pre-training data. Biomedical QA tasks demonstrate robust domain generalization, with Infini-gram supporting accurate detection of novel entities (low frequency) and unsupported factual claims (zero co-occurrence), leading to reduced hallucinations and improved answer accuracy (Min et al., 22 Dec 2025).

These results are achieved with fewer than three retrievals per question and consistently sub-10 ms query latency, attributable to the efficiency of the suffix-array and FM-index–based architecture.

6. Generalization, Limitations, and Extensions

  • Generalization: Infini-gram’s support for explicit n-gram/entity queries generalizes across domains, enabling use in both general-knowledge and specialized biomedical QA.
  • Limitations: Surface-form matching limits entity alias detection, and the index is static: facts that post-date the corpus cutoff are invisible, so evolving knowledge requires periodic re-indexing. Infini-gram operates on the fixed pre-training corpus and cannot compensate for wholly unseen information (Min et al., 22 Dec 2025).
  • Extensions: Proposed directions include multilingual indexing for cross-lingual query support, time-stamped indexes to enable temporal reasoning, and expanding to event co-occurrences and quantitative/numeric claims.

7. Infini-gram in the Broader RAG Ecosystem

Infini-gram complements traditional vector search and semantic retrieval methods by providing orthogonal and interpretable corpus statistics for real-time verification. As RAG systems increasingly depend on both retrieval quality and verification fidelity, Infini-gram’s inclusion in objective uncertainty estimation pipelines—such as QuCo-RAG—marks a shift toward corpus-grounded, model-agnostic QA and dynamic evidence integration (Min et al., 22 Dec 2025).

A plausible implication is that the widespread adoption of Infini-gram–like engines will drive the development of corpus-aware generation protocols and more transparent QA pipelines in large-scale systems.
