CitePretrainBench: LLM Citation Benchmark

Updated 30 June 2025
  • CitePretrainBench is a benchmarking suite that evaluates LLMs’ ability to generate verifiable citations from pretraining data using retrieval-free methods.
  • It employs continual pretraining and instruction tuning to bind factual spans to unique document identifiers through active and passive indexing strategies.
  • Active indexing with forward and backward augmentations significantly increases citation precision, enabling reliable verifiable attribution in LLM-generated answers.

CitePretrainBench is a systematic benchmarking suite and experimental platform designed to evaluate and advance retrieval-free, internal citation capabilities of LLMs. Its central goal is to enable LLMs to generate correct, verifiable answers to user queries while reliably attributing individual factual statements to the actual documents encountered during pretraining—all without the need for external document retrieval at inference time.

1. Motivation and Task Formulation

CitePretrainBench addresses fundamental shortcomings in previous LLM citation strategies. Standard LLMs, trained without explicit source association, often hallucinate citations or fail to reliably attribute statements to their actual knowledge provenance. Most contemporary citation systems depend on retrieval-augmented generation (RAG), which invokes an external search engine or retriever during inference. This practice introduces additional latency, infrastructure complexity, dependence on retrieval accuracy, and a dissociation between the model’s outputs and its true parametric knowledge.

The benchmarking goal is to assess and improve a model’s ability to attribute generated content directly to actual pretraining data—specifically, to return document identifiers (human-readable titles) precisely tied to the source corpus used during (continual) pretraining. This enables stricter scientific verifiability and lays the technical groundwork for models that can be trusted to explain their provenance.

2. Methodological Framework and Indexing Strategies

CitePretrainBench centers on a two-stage methodology:

  1. Continual Pretraining for Index Learning

Given a corpus $\mathcal{D} = \{(c_i, t_i)\}_{i=1}^N$, where each document $c_i$ is paired with a unique title identifier $t_i$, LLMs are continually pretrained to bind facts within each document to its identifier. The model is explicitly trained such that, for any factual span $s \subset c_i$, it learns to map $s$ to $t_i$.

  2. Instruction Tuning for Citation Generation

Following continual pretraining, LLMs are further tuned with instruction-style data, in which answers are paired with correct citations. The format used is:

$\mathcal{R} = \langle (s_1, C_1), \ldots, (s_m, C_m) \rangle$

where each statement $s_k$ should be attributed to documents $C_k \subseteq \mathcal{T}$, and citation decoding is constrained to the set of available titles $\mathcal{T}$; an illustrative sketch of both stages' data formats follows below.
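
Under the assumption that stage-one data simply concatenates each document with a title marker and stage-two data interleaves statements with bracketed citations, a minimal sketch of both formats might look as follows; the helper names and markup are illustrative, not the paper's exact specification:

```python
# Minimal sketch of the two stages' training data formats. The dict schema,
# the helper names, and the "[SOURCE: ...]" / bracketed-citation markup are
# illustrative assumptions, not the paper's exact specification.

def make_pretraining_example(doc_text: str, title: str) -> str:
    """Stage 1 (index learning): pair a document c_i with its title t_i so
    continual pretraining can bind factual spans of c_i to t_i."""
    return f"{doc_text}\n[SOURCE: {title}]"

def make_instruction_example(question: str,
                             statements: list[tuple[str, list[str]]]) -> dict:
    """Stage 2 (instruction tuning): the response is a sequence of
    (statement s_k, citations C_k) pairs, with C_k drawn from known titles."""
    answer = " ".join(f"{s} [{', '.join(cites)}]" for s, cites in statements)
    return {"instruction": question, "response": answer}

# Toy usage with made-up content:
print(make_pretraining_example("The Eiffel Tower was completed in 1889.",
                               "Eiffel Tower"))
print(make_instruction_example("When was the Eiffel Tower completed?",
                               [("It was completed in 1889.", ["Eiffel Tower"])]))
```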

Two principal strategies for source-document association are compared:

  • Passive Indexing: The document identifier (title) is appended to each document during pretraining. This is highly effective for verbatim memorization but does not generalize to paraphrased or compositional facts, and adding finer granularity (e.g., per-sentence IDs) yields little improvement while potentially harming language modeling.
  • Active Indexing: The model is actively taught to deploy IDs contextually via synthetic QA-based augmentation. This comprises two main augmentations (sketched in code after this list):
    • Forward augmentation (source→fact): Salient entities are extracted from each document, and QA pairs referencing the document ID are auto-generated for them, cementing strong identifier-to-fact bindings across diverse natural-language surface forms.
    • Backward augmentation (fact→source): Synthetic instructions are generated that require integrating and citing multiple documents per answer, teaching the model the general mapping from answers to their sources in multi-fact, compositional settings.
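
The sketch below illustrates how the two augmentations could be generated; the prompt wording, the injected generate() callable, and the entity extraction step are assumptions standing in for whatever pipeline is actually used:

```python
# Illustrative sketch of Active Indexing augmentation. The prompt wording,
# the injected generate() callable, and the entity list are placeholders for
# whatever LLM / extraction pipeline is actually used; all are assumptions.

def forward_augment(doc_text: str, title: str, entities: list[str],
                    generate) -> list[dict]:
    """Source -> fact: for each salient entity, generate a QA pair whose
    answer explicitly cites the source title."""
    examples = []
    for entity in entities:
        prompt = (f"Write a question about '{entity}' answerable from the "
                  f"document below, plus an answer that cites [{title}].\n\n"
                  f"{doc_text}")
        examples.append(generate(prompt))  # expected: {"question": ..., "answer": ...}
    return examples

def backward_augment(docs: list[tuple[str, str]], generate) -> dict:
    """Fact -> source: compose an instruction that requires integrating
    several documents, each statement cited with its own title."""
    joined = "\n\n".join(f"[{title}]\n{text}" for text, title in docs)
    prompt = ("Write a question whose answer must combine the documents "
              "below, citing the matching title after each statement.\n\n"
              + joined)
    return generate(prompt)  # expected: {"question": ..., "answer": ...}
```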

3. Evaluation Benchmark Composition

CitePretrainBench constructs a purpose-driven "miniature pretraining corpus" aligned with both real-world and synthetic QA benchmarks:

  • Corpora: Wikipedia, Common Crawl (web, via CCNet), arXiv (scientific papers), RepliQA (novel LLM-generated synthetic documents), and supplementary synthetic samples.
  • Evaluation Tasks: Both short-form (single-fact) QA benchmarks (SciQAG, RepliQA) and long-form (multi-fact, free-form) QA tasks (ASQA, ELI5).
  • Dataset statistics: For instance, Wikipedia (30,025 docs, 110M tokens), arXiv (22,743 docs, 114M tokens), and others, together covering hundreds of millions of tokens across a wide variety of domains.

In each task, the QA system is required not only to answer the question, but also to attribute each factual segment or statement to its originating document identifier.
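
For concreteness, a hedged sketch of how citation precision (C-Pr) might be computed is given below; the benchmark's actual scoring procedure (e.g., entailment-based support checks) may differ, and the membership test here is only illustrative:

```python
# Hedged sketch of citation precision (C-Pr). The benchmark's exact scoring
# may differ; here "supports" is reduced to gold-set membership purely for
# illustration.

def citation_precision(predictions: list[tuple[str, list[str]]],
                       gold_sources: dict[str, set[str]]) -> float:
    """predictions: (statement, cited_titles) pairs from the model;
    gold_sources: statement -> titles that genuinely support it."""
    correct = total = 0
    for statement, cited in predictions:
        supporting = gold_sources.get(statement, set())
        correct += sum(1 for title in cited if title in supporting)
        total += len(cited)
    return correct / total if total else 0.0

# Toy example: one of the two citations is supported, so C-Pr = 0.5.
preds = [("The Eiffel Tower was completed in 1889.", ["Eiffel Tower", "Paris"])]
gold = {"The Eiffel Tower was completed in 1889.": {"Eiffel Tower"}}
print(citation_precision(preds, gold))
```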

4. Experimental Comparison: Active vs. Passive Indexing

Empirical evaluation across Qwen-2.5-7B and Qwen-2.5-3B LLMs demonstrates pronounced benefits for Active Indexing:

| Method | ASQA C-Pr | ELI5 C-Pr | SciQAG C-Pr | RepliQA C-Pr |
| --- | --- | --- | --- | --- |
| Instruction only | 20.0 | 5.9 | 0.6 | 0.9 |
| Passive | 24.1 | 8.9 | 2.4 | 2.4 |
| Active (Forward) | 26.7 | 18.6 | 23.6 | 12.6 |
| Active (Backward) | 31.4 | 28.0 | 30.8 | 21.6 |
| Active (Both) | 30.9 | 29.3 | 32.6 | 24.4 |

All Active Indexing variants substantially outperform both the passive and no-indexing baselines, with gains of up to 30.2 percentage points in citation precision (C-Pr) over Passive Indexing. Forward and backward augmentations are complementary, with the combined variant achieving the highest precision on three of the four benchmarks and strong results in both short- and long-form citation settings.

Active Indexing is found to continually increase citation precision as the amount of synthetic QA-based augmented data is scaled, with no observed plateau even at 16× the original corpus size. Larger models confer greater citation gains, suggesting the method is amenable to further scale-up.

5. Ablation Studies and Mechanistic Insights

Ablation experiments reveal that simply increasing passive data or applying naive paraphrasing without explicit QA-style identifier association does not yield robust citation behavior. More passive data can even reduce transferability to QA/citation tasks, suggesting overfitting to surface form rather than generalizable meaning. In contrast, Active Indexing's explicit QA construction enables the model to bridge memorization and generalization, excelling at both direct recall and compositional citation in answer generation.

6. Limitations and Future Research Directions

The paper's analysis outlines the following limitations and directions:

  • Scale: The efficacy of Active Indexing is expected to rise further with larger LLMs (14B, 32B, 70B+) and more augmented data (e.g., 32×, 64×) until performance saturates.
  • Multilingual and Domain Transfer: Extension to non-English and high-stakes domains (law, medicine, finance) is anticipated.
  • Hybrid Approaches: Combining retrieval-free citation with retrieval-augmented techniques, potentially allowing models to cite stored knowledge when confident and retrieve when uncertain (a sketch follows this list).
  • Privacy and Copyright: Citing pretraining documents may disclose sensitive or proprietary information, necessitating privacy-preserving identifiers, selective redaction, or other mitigations.
  • Human Evaluation: Although citation precision is measured automatically, human assessment will be essential for real-world trust and deployment.
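
As a rough illustration of the hybrid direction mentioned above, the following sketch gates between retrieval-free and retrieval-augmented citation on a confidence score; the threshold and every callable interface are hypothetical:

```python
# Hypothetical sketch of the hybrid idea: cite parametric knowledge when a
# confidence score is high, otherwise retrieve. Every interface here
# (score_confidence, cite_from_memory, retrieve_and_cite) is an assumed
# placeholder, not an API from the paper or any existing library.

def hybrid_cite(question: str,
                score_confidence,    # question -> float in [0, 1]
                cite_from_memory,    # question -> (answer, cited_titles)
                retrieve_and_cite,   # question -> (answer, cited_titles)
                threshold: float = 0.8):
    """Route between retrieval-free and retrieval-augmented citation
    based on an (assumed) confidence estimate."""
    if score_confidence(question) >= threshold:
        return cite_from_memory(question)   # trust the pretraining-time index
    return retrieve_and_cite(question)      # uncertain: fall back to retrieval
```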

7. Summary Table

| Aspect | Approach & Finding |
| --- | --- |
| Objective | Retrieval-free, verifiable citations; internal provenance; benchmark for short/long-form QA |
| Methodology | Two-stage: continual pretraining (Passive/Active Indexing) and instruction tuning for citation behavior |
| Benchmark | Wikipedia, Common Crawl, arXiv, RepliQA; short- and long-form question coverage |
| Passive vs. Active | Passive: appends IDs (weak for compositional/paraphrased facts); Active: QA-based, identifier-centric augmentation (up to ~30-point precision gain) |
| Results | Active Indexing consistently outperforms passive and baseline; benefits increase with data/model scale |
| Ablations | Explicit QA-style identifier use is necessary for generalization as measured by citation precision |
| Future Work | More data/model scale, more diverse languages/domains, retrieval-citation integration, privacy-aware citation, human-centered studies |

CitePretrainBench thus formalizes and accelerates research on retrieval-free knowledge attribution, providing a rigorous, extensible foundation for both the evaluation and improvement of LLMs capable of verifiable, trustworthy citation behavior.