Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models (2506.17585v1)

Published 21 Jun 2025 in cs.AI, cs.CL, and cs.LG

Abstract: Trustworthy LLMs should provide both correct and verifiable answers. While LLMs can sometimes attribute their outputs to pretraining data, their citations are often unreliable due to hallucination. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during (continual) pretraining--without test-time retrieval--by revising the training process. To evaluate this, we release CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel, unseen documents and probes both short-form (single fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to bind facts to persistent document identifiers, and (2) instruction tuning to elicit citation behavior. We find that simple Passive Indexing, which appends an identifier to each document, helps memorize verbatim text but fails on paraphrased or compositional facts. Instead, we propose Active Indexing, which continually pretrains on synthetic QA pairs that (1) restate each fact in diverse compositional forms, and (2) require bidirectional source-to-fact and fact-to-source generation, jointly teaching the model to generate content from a cited source and to attribute its own answers. Experiments with Qwen2.5-7B and 3B show that Active Indexing consistently outperforms Passive Indexing across all tasks and models, with citation precision gains up to 30.2 percent. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16 times the original token count.

Analysis of Retrieval-Free Knowledge Attribution for LLMs

The paper "Cite Pretrain: Retrieval-Free Knowledge Attribution for LLMs" examines a training process that enables LLMs to cite sources from their pretraining data without relying on retrieval mechanisms at inference time. The method revises pretraining in two stages: continual pretraining that binds facts to persistent document identifiers, followed by instruction tuning that elicits citation behavior. The paper also introduces CitePretrainBench, a benchmark that evaluates this methodology across mixed corpora, including Wikipedia, Common Crawl, arXiv, and novel unseen documents, probing both short-form (single fact) and long-form (multi-fact) citation tasks.

The research identifies the limitations of existing retrieval-augmented generation (RAG) techniques, such as increased latency, reliance on external infrastructure, and vulnerability to noise or irrelevant passages introduced at retrieval time. In their place, the authors propose a two-phase training approach that strengthens the internal knowledge-attribution capabilities of LLMs. Passive Indexing, which simply appends a document identifier to each document's text, serves as the baseline. Experiments show it is inadequate: it helps the model memorize verbatim text but largely fails to associate paraphrased or compositional facts with their original sources.
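
To make the baseline concrete, the following is a minimal sketch of Passive Indexing as described above: each pretraining document is serialized with a persistent identifier appended. The identifier format and the `[SOURCE: ...]` marker are illustrative assumptions, not the paper's exact serialization scheme.

```python
# Passive Indexing sketch: append a persistent document identifier to
# each document before continual pretraining. The ID format and the
# marker syntax are assumptions for illustration.

def passively_index(doc_id: str, text: str) -> str:
    """Serialize a document with its identifier appended."""
    return f"{text}\n[SOURCE: {doc_id}]"

# Toy corpus keyed by hypothetical document identifiers.
corpus = {
    "wiki_001": "The Eiffel Tower was completed in 1889.",
    "arxiv_042": "Transformers rely on self-attention to model dependencies.",
}

training_examples = [passively_index(doc_id, text) for doc_id, text in corpus.items()]
for example in training_examples:
    print(example)
```

Because the identifier only ever co-occurs with the document's verbatim wording, the model can learn to recall the ID alongside memorized text, but has no training signal linking the ID to reworded or recombined versions of the same facts.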

To address these shortcomings, the authors introduce Active Indexing. This method continually pretrains on synthetic question-answer pairs that restate each fact in diverse compositional forms and that require bidirectional generation: source-to-fact (generating content from a cited source) and fact-to-source (attributing an answer to its source). Jointly, these tasks teach the model to internalize document identifiers and to cite them precisely at generation time.
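
A hedged sketch of how such bidirectional training pairs might be constructed is shown below. The prompt templates and field names are assumptions for illustration; the paper generates its augmentations with an LLM to achieve diverse restatements, rather than using fixed templates like these.

```python
# Active Indexing data-construction sketch: each fact yields training
# pairs in both directions. Templates and the [SOURCE: ...] convention
# are illustrative assumptions.

def make_bidirectional_pairs(doc_id: str, question: str, answer: str) -> list[dict]:
    """Build source-to-fact and fact-to-source examples for one fact."""
    return [
        # Source-to-fact: conditioned on the identifier, generate
        # content attributed to that source.
        {
            "prompt": f"According to document [{doc_id}]: {question}",
            "completion": answer,
        },
        # Fact-to-source: answer the question, then attribute the
        # answer to its source identifier.
        {
            "prompt": question,
            "completion": f"{answer} [SOURCE: {doc_id}]",
        },
    ]

pairs = make_bidirectional_pairs(
    "wiki_001",
    "When was the Eiffel Tower completed?",
    "The Eiffel Tower was completed in 1889.",
)
for pair in pairs:
    print(pair["prompt"], "->", pair["completion"])
```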

Experiments using Qwen2.5 models at 7B and 3B parameters show that Active Indexing substantially improves citation precision, outperforming Passive Indexing by up to 30.2%. The paper also indicates that model scale benefits citation accuracy more markedly than answer correctness, suggesting that scaling model capacity could particularly enhance citation fidelity.
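
For intuition about the headline metric, here is a minimal sketch of how citation precision might be computed: the fraction of emitted citations that point to a document actually supporting the claim. The gold-support representation and matching rule are assumptions; the benchmark's exact scoring may differ in detail.

```python
# Citation-precision sketch: precision of predicted document IDs
# against a gold set of supporting documents. Matching on exact IDs
# is an assumption for illustration.

def citation_precision(predicted: list[str], supporting: set[str]) -> float:
    """Fraction of cited IDs that are in the gold supporting set."""
    if not predicted:
        return 0.0
    correct = sum(1 for doc_id in predicted if doc_id in supporting)
    return correct / len(predicted)

# One correct citation out of two emitted -> precision 0.5.
print(citation_precision(["wiki_001", "cc_017"], {"wiki_001"}))
```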

In practical terms, implementing Active Indexing in LLMs offers significant advantages. It reduces computational overhead at inference time compared to RAG-based systems while maintaining high citation precision. Moreover, because citation is learned during training rather than bolted on at inference, attribution becomes an end-to-end capability of the model itself, which aligns well with explainability requirements: the model transparently attributes its answers to internalized sources.
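
The retrieval-free deployment story reduces to a single forward generation, as in the sketch below. The checkpoint path is hypothetical (a model fine-tuned with Active Indexing); the Hugging Face `transformers` calls are standard, and the expected output format follows the illustrative `[SOURCE: ...]` convention used above.

```python
# Inference sketch: no retriever in the loop. The model answers and
# cites from internalized identifiers in one generation call.
# "path/to/actively-indexed-qwen2.5-7b" is a hypothetical checkpoint.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/actively-indexed-qwen2.5-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "When was the Eiffel Tower completed? Cite your source."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected style of output: "It was completed in 1889. [SOURCE: wiki_001]"
```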

Theoretically, this research suggests a shift in how LLMs handle knowledge attribution: from external retrieval dependencies toward robust, internalized citation mechanisms. Future work could extend the methodology to specialized domains beyond general corpora and assess how Active Indexing scales with larger models and expanded augmentation strategies.

The paper contributes a substantial advance in the understanding of knowledge attribution in LLMs, pointing toward more transparent and accountable models in applications where citation and attribution fidelity are paramount.

Authors (5)
  1. Yukun Huang (39 papers)
  2. Sanxing Chen (11 papers)
  3. Jian Pei (104 papers)
  4. Manzil Zaheer (89 papers)
  5. Bhuwan Dhingra (66 papers)