
QuCo-RAG: Corpus-Grounded Uncertainty in RAG

Updated 23 December 2025
  • QuCo-RAG is a retrieval-augmented generation framework that leverages pre-training corpus statistics to quantify uncertainty and trigger evidence retrieval to reduce factual hallucinations.
  • It implements a two-stage process—pre-generation entity frequency assessment and runtime verification via entity-pair co-occurrence—to make informed retrieval decisions during generation.
  • Empirical evaluations on multi-hop and biomedical QA benchmarks demonstrate significant exact match improvements with minimal token overhead, highlighting its efficiency and scalability.

QuCo-RAG (Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation) is a retrieval-augmented generation (RAG) framework that utilizes objective, corpus-derived statistics to trigger external knowledge retrieval during LLM generation. By preemptively grounding uncertainty estimation in the coverage of the pre-training corpus rather than unreliable model-internal signals, QuCo-RAG mitigates hallucination—a phenomenon in which LLMs generate unsupported or incorrect factual statements with unwarranted confidence—without requiring access to the internal parameters or logits of the model (Min et al., 22 Dec 2025).

1. Motivation and Conceptual Framework

Conventional dynamic RAG techniques, such as FLARE, DRAGIN, and SeaKR, rely on internal signals, including token probabilities, entropy, and attention weights, to detect moments of uncertainty and trigger retrieval. A substantial body of empirical and theoretical work has established that modern LLMs are often poorly calibrated: their confidence estimates frequently fail to align with their actual knowledge, producing both "confident hallucinations" and overly conservative uncertainty estimates. Prior work documents misassigned uncertainty, such as overestimating the uncertainty of common question tokens while underestimating it for fabricated facts.

QuCo-RAG shifts from this paradigm by leveraging two corpus-grounded signals:

  • Low-frequency entities in the pre-training corpus correlate with "long-tail" knowledge, which LLMs are unlikely to have reliably memorized.
  • Zero or very low co-occurrence of entity pairs indicates a lack of any evidential basis for a claimed relation, and thus a high risk for hallucination.

These signals, computed via fast queries to the pre-training corpus, offer a semantically meaningful, model-agnostic trigger for retrieval that directly reflects the model's evidential coverage.

2. Two-Stage Corpus-Based Uncertainty Quantification

QuCo-RAG implements a two-stage detection process at each generation step, computing a binary retrieval decision $\delta_i \in \{0,1\}$ for sentence $s_i$ given the question $Q$, prior generations $s_{<i}$, and the corpus $\mathcal{P}$:

Stage 1: Pre-Generation Knowledge Assessment

Before generation, a lightweight entity extractor identifies the set of entities $\mathcal{E}_Q$ in $Q$. For each entity $e$, its frequency $\text{freq}(e;\mathcal{P})$ is queried using Infini-gram over the pre-training corpus. The pre-generation trigger fires when the average entity frequency falls below a threshold $\tau_{\mathrm{entity}} = 10^3$:

$$\delta_{\rm pre} = \mathbb{I}\left( \frac{1}{|\mathcal{E}_Q|} \sum_{e\in\mathcal{E}_Q} \text{freq}(e;\mathcal{P}) < \tau_{\mathrm{entity}} \right)$$

If $\delta_{\rm pre} = 1$, a retrieval query is issued, typically retrieving the top-$k$ documents with BM25, to augment the LLM's context before any tokens are generated.
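The Stage 1 decision reduces to a small predicate over entity frequencies. A minimal sketch, with the entity extractor and the Infini-gram lookup abstracted behind a frequency callable (the callable interface is an assumption, not the paper's API):

```python
from typing import Callable

# Threshold from the paper: tau_entity = 10^3.
TAU_ENTITY = 1_000

def pre_generation_trigger(entities: list[str], freq: Callable[[str], int]) -> bool:
    """Return True (delta_pre = 1) when the mean pre-training-corpus frequency
    of the question's entities falls below TAU_ENTITY, signalling long-tail
    knowledge that warrants retrieval before generation."""
    if not entities:
        return False  # no entities extracted -> no frequency evidence; skip retrieval
    mean_freq = sum(freq(e) for e in entities) / len(entities)
    return mean_freq < TAU_ENTITY
```

With a toy frequency table, a rare biomedical entity trips the trigger while a common one does not; note that averaging means one very frequent entity can mask a rare one.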

Stage 2: Runtime Claim Verification

During generation, after each sentence $s_i$, all factual triplets $\mathcal{T}_i = \{(h,r,t)\}$ are extracted using a 0.5B-parameter triplet extractor. For each $(h,t)$ pair, the co-occurrence $\operatorname{cooc}(h,t;\mathcal{P})$ within a window (e.g., 1,000 tokens) is computed. The runtime trigger fires when the minimal co-occurrence over the sentence's triplets falls below $\tau_{\mathrm{cooc}} = 1$:

$$\delta_i = \mathbb{I}\left( \min_{(h,r,t)\in\mathcal{T}_i} \operatorname{cooc}(h,t;\mathcal{P}) < \tau_{\mathrm{cooc}} \right)$$

If $\delta_i = 1$, an evidence-oriented retrieval using the concatenated query $h \oplus r$ is performed, and $s_i$ is re-generated with the augmented evidence.
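Stage 2 is a minimum over triplet co-occurrences plus a query built from the least-supported triplet. A hedged sketch, with the triplet extractor and corpus co-occurrence abstracted as inputs (names and interfaces here are illustrative assumptions):

```python
# A triplet is (head, relation, tail), as extracted from a generated sentence.

def runtime_trigger(triplets, cooc, tau_cooc: int = 1) -> bool:
    """delta_i = 1 if any (h, r, t) has head-tail co-occurrence below tau_cooc;
    with tau_cooc = 1 this fires exactly when some pair never co-occurs."""
    return any(cooc(h, t) < tau_cooc for h, _, t in triplets)

def evidence_query(triplets, cooc) -> str:
    """Build the retrieval query h + r from the least-supported triplet,
    mirroring the concatenation h ⊕ r used for evidence-oriented retrieval."""
    h, r, _ = min(triplets, key=lambda x: cooc(x[0], x[2]))
    return f"{h} {r}"
```

A claim whose head and tail never co-occur in the corpus (co-occurrence 0) triggers retrieval even when other triplets in the same sentence are well supported.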

This two-stage mechanism directly couples empirical corpus coverage to the model’s retrieval policy.

3. Infini-gram Index: Millisecond-Latency Corpus Querying

Infini-gram is a suffix array–based data structure indexing $\sim 4\times10^{12}$ tokens, supporting sub–10 ms $n$-gram frequency and co-occurrence queries on commodity CPU and SSD infrastructure. The core API exposes methods such as:

  • freq(query_string) → absolute frequency;
  • cooc(string1, string2, window_size) → co-occurrence count in sliding windows.

Queries are resolved via binary search for pattern frequency and by intersecting posting lists for co-occurrence, enabling online integration with LLM generation. This index makes timely, corpus-wide uncertainty quantification feasible and underpins the scalability of QuCo-RAG.
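The lookup pattern can be illustrated with a toy in-memory analogue. This is a character-level sketch of suffix-array frequency search and a naive windowed co-occurrence count; it is not Infini-gram's token-level implementation or its actual API:

```python
def build_suffix_array(text: str) -> list[int]:
    # O(n^2 log n) toy construction; Infini-gram builds its token-level
    # suffix array offline over ~4e12 tokens.
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_freq(text: str, sa: list[int], pattern: str) -> int:
    """Two binary searches bound the contiguous block of suffixes starting
    with `pattern`; the block width is the pattern's frequency."""
    k = len(pattern)
    def bound(upper: bool) -> int:
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + k]
            if prefix < pattern or (upper and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return bound(True) - bound(False)

def window_cooc(text: str, a: str, b: str, window: int = 1000) -> int:
    """Count occurrences of `a` with an occurrence of `b` within `window`
    characters -- a simplified stand-in for posting-list intersection."""
    def positions(sub: str) -> list[int]:
        out, i = [], text.find(sub)
        while i != -1:
            out.append(i)
            i = text.find(sub, i + 1)
        return out
    pb = positions(b)
    return sum(1 for i in positions(a) if any(abs(i - j) <= window for j in pb))
```

The binary-search structure is what makes frequency queries logarithmic in corpus size, which is why millisecond latencies are achievable even at trillions of tokens.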

4. Algorithmic Workflow and Model Integration

QuCo-RAG is model-agnostic and requires no architectural modification to the underlying LLM $\mathcal{M}$. It operates by sequentially interleaving corpus queries and model calls according to the retrieval policy outlined above. The following describes the integration workflow:

  1. Pre-Check:
    • Extract entities $\mathcal{E}_Q$ from $Q$
    • For each $e \in \mathcal{E}_Q$, compute its corpus frequency
    • If $\delta_{\rm pre}=1$: retrieve context documents for $Q$ and prepend before generation
  2. For each sentence $s_i$:
    • Extract factual triplets $\mathcal{T}_i$
    • Compute all pairwise entity co-occurrences
    • If $\delta_i=1$: retrieve supporting documents for $h \oplus r$, prepend, and re-generate $s_i$
    • Append $s_i$ to the context and continue
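The workflow above can be sketched end to end with stubbed components. The entity/triplet extractors, retriever, and sentence generator are assumed interfaces for illustration, not the paper's actual models:

```python
def quco_rag(question, extract_entities, extract_triplets, freq, cooc,
             retrieve, generate_sentence, tau_entity=1_000, tau_cooc=1,
             max_sentences=8):
    """Interleave corpus-statistic checks with sentence-level generation
    (hedged sketch of the two-stage retrieval policy)."""
    context = []

    # 1. Pre-check: average entity frequency over the question's entities.
    entities = extract_entities(question)
    if entities and sum(freq(e) for e in entities) / len(entities) < tau_entity:
        context += retrieve(question)

    # 2. Generate sentence by sentence, verifying each against the corpus.
    answer = []
    for _ in range(max_sentences):
        s = generate_sentence(question, context, answer)
        if s is None:  # generator signals completion
            break
        triplets = extract_triplets(s)
        weak = [t for t in triplets if cooc(t[0], t[2]) < tau_cooc]
        if weak:
            h, r, _ = weak[0]
            context += retrieve(f"{h} {r}")                   # query h ⊕ r
            s = generate_sentence(question, context, answer)  # re-generate s_i
        answer.append(s)
    return " ".join(answer)
```

With toy stubs, a sentence whose head-tail pair has zero co-occurrence is retried once with retrieved evidence prepended, which is exactly the per-sentence loop described in step 2.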

When the actual pre-training corpus is unavailable (as with proprietary models like GPT-4.1/5), large open corpora (e.g., the OLMo-2 corpus) serve as effective proxies due to substantial token overlap among web-scale corpora.

| Stage | Signal | Retrieval trigger |
| --- | --- | --- |
| Pre-generation | Avg. entity frequency | If $< \tau_{\mathrm{entity}}$ |
| During generation (per $s_i$) | Entity-pair co-occurrence | If $< \tau_{\mathrm{cooc}}$ |

5. Empirical Performance and Analysis

Experiments are conducted on multi-hop QA (2WikiMultihopQA, HotpotQA) and biomedical QA (PubMedQA). Evaluations cover both matched-corpus (OLMo-2-Instruct 7B/13B/32B) and proxy-corpus (Llama-3-8B, Qwen2.5-32B, GPT-4.1, GPT-5-chat) settings. Baselines include Wo-RAG (no retrieval), static retrieval variants (SR-RAG, FS-RAG), and dynamic retrievals using internal signals.

Key findings:

  • On OLMo-2-7B: QuCo-RAG achieves 32.7 EM (2Wiki) vs. 25.3 baseline (+7.4), 35.3 EM (Hotpot) vs. 29.7 (+5.6).
  • Larger models (13B/32B) yield even larger absolute gains (+12.0 and +10.8 EM, respectively).
  • Transfer to Qwen2.5-32B: +14.1 EM; to Llama-3-8B: +4.9 EM; to GPT-5-chat: +8.7 EM.
  • On PubMedQA: 66.4% accuracy vs. 55.2% for Wo-RAG (+11.2), with only 0.93 retrievals per question and low token overhead.

Ablation studies show that removing pre-generation assessment or runtime verification degrades EM by 2.5 and 5.1 points, respectively. Efficiency evaluations indicate QuCo-RAG triggers fewer retrievals and requires the lowest average number of tokens and LLM calls among baselines; run-time is dominated by LLM inference, with corpus queries introducing only millisecond-scale overhead.

Entity frequency analysis demonstrates that QuCo-RAG delivers pronounced improvement in low-frequency bins (10–17 EM points over Wo-RAG for 0–10 frequency entities), surpassing internal-signal methods even in high-frequency regimes.

6. Theoretical Properties, Limitations, and Future Directions

By basing retrieval triggers on corpus coverage, QuCo-RAG bypasses calibration errors associated with model probability estimates. The binary retrieval decision $\delta \in \{0,1\}$ transparently indicates the presence or absence of evidential support in the corpus.

No domain-specific threshold tuning is needed; fixed values of $\tau_{\mathrm{entity}} = 10^3$ and $\tau_{\mathrm{cooc}} = 1$ perform robustly across both open-domain and specialized biomedical benchmarks.

Identified limitations include:

  • Lexical Matching: The method relies on exact string co-occurrences and can therefore miss aliased entity mentions (e.g., "NYC" vs. "New York City"). QuCo-RAG adopts a conservative retrieval policy under ambiguity; potential improvements include incorporating entity linking or alias normalization.
  • Static Corpus: The inability to verify emerging or post-cutoff facts can be mitigated by periodic corpus updates or time-stamped indexing.
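The lexical-matching gap could be narrowed with a canonicalization layer in front of the corpus lookup. A minimal sketch assuming a hand-built alias table (real systems would use an entity linker; both the table and the function names here are hypothetical):

```python
# Hypothetical alias table; an entity linker would populate this in practice.
ALIASES = {"NYC": "New York City", "U.S.": "United States"}

def canonical(entity: str) -> str:
    return ALIASES.get(entity, entity)

def cooc_alias_aware(cooc, h: str, t: str) -> int:
    """Take the maximum co-occurrence over surface and canonical forms so that
    an alias mismatch alone cannot force a spurious retrieval trigger."""
    return max(cooc(a, b) for a in {h, canonical(h)} for b in {t, canonical(t)})
```

Taking the maximum keeps the policy conservative in the other direction: retrieval is suppressed only when at least one surface/canonical pairing is actually attested in the corpus.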

Future research directions include extending verification to multilingual and temporally dynamic corpora, supporting complex event and numeric claim verification, developing self-verification agents empowered by QuCo-RAG signals, and providing theoretical analysis of hallucination probabilities as a function of corpus statistics.

In summary, QuCo-RAG introduces a principled, practically efficient paradigm for dynamic retrieval in LLMs by grounding uncertainty estimation in the empirical statistics of the pre-training corpus. It demonstrates substantial and transferable improvements over internal-signal-based dynamic retrieval baselines, independent of model architecture or proprietary pre-training data (Min et al., 22 Dec 2025).
