
Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

Published 11 Jun 2026 in cs.LG | (2606.13104v1)

Abstract: LLMs are increasingly deployed in citation-augmented settings, yet the effect of citation presence on model behavior independent of factual content remains poorly understood. We introduce AuthorityBench, a 220,564-prompt multi-domain benchmark that isolates how citation-based authority signals influence epistemic behavior in LLMs. The benchmark uses a fully balanced 2x2 factorial design crossing claim veracity with citation veracity, the first to do so, across four domains (general knowledge, science, law, and medicine), with controlled variation over 40 prompt templates, four venue prestige tiers, and a country-coded author name dataset. Evaluating seven models on 12 structured research questions, we find that citation presence, whether real or fabricated, consistently increases hallucination rates relative to a no-citation baseline. The effect is strongest when fabricated citations accompany true claims, raising hallucination rates by 3 to 22 percentage points and reaching 35 to 77% in the general knowledge domain, while legal claims are comparatively robust and venue prestige and author demographics show negligible impact. All datasets and evaluation code are available at: https://github.com/floating-reeds/AuthorityBench

Summary

  • The paper introduces AuthorityBench, isolating citation effects via a 2x2 factorial design across multiple domains to measure hallucination rates.
  • It reports that the TC×FC condition (true claim, fabricated citation) significantly increases hallucination, with lifts of up to 22.29 percentage points over true-claim baselines.
  • Findings suggest that model family (architectural or training choices), rather than scale, governs susceptibility to citation signals, challenging the assumption that citations improve reliability.

AuthorityBench: Benchmarking Epistemic Susceptibility to Citation Signals in LLMs

Motivation and Scope

AuthorityBench addresses the epistemic vulnerability of LLMs to citation-based authority signals by systematically isolating the effect of citations, fabricated or real, on model behavior, independent of factual claim content. Prior benchmarks focus on factual correctness, hallucination rates, or citation faithfulness but do not examine how citations influence LLMs' epistemic reasoning. AuthorityBench uses a fully balanced 2×2 factorial design (Figure 1), manipulating claim veracity (true/false) and citation veracity (real/fabricated) across general knowledge, science, law, and medicine, with controlled variation over 40 prompt templates, venue prestige tiers, and author demographics. This design enables causal inference about authority signals and introduces the novel TC×FC (true claim, fabricated citation) condition to probe whether fabricated citations induce denial of correct facts.

Figure 1: The 2×2 factorial design, crossing claim and citation veracity, forms the foundation of AuthorityBench.

Benchmark Construction

Claims are sourced from FEVER (general knowledge), SciQ (science MCQ), CaseHOLD (legal MCQ), and MedMCQA (medical MCQ), yielding 110,282 base claims and 220,564 prompts after crossing citation conditions. Fabricated citations are sampled from curated pools for author, venue (tiered by prestige), and year. Real citations align with source datasets, except for general knowledge, which lacks structured citation records. Venue and author variation enable analysis of institutional authority and demographic effects. The pipeline keeps the conditions balanced and the domain and citation structure consistent to prevent confounds (Figure 2).

Figure 2: Dataset construction workflow, integrating claim extraction, citation fabrication, and prompt templating.
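The crossing step described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual pipeline: the `Claim` type, the placeholder author and venue pools, the year range, and the function names are all assumptions made for the example, and the special handling of general knowledge (which lacks structured real citations) is not modeled.

```python
import random
from dataclasses import dataclass

# Hypothetical placeholder pools; the benchmark's curated pools are not reproduced here.
AUTHORS = ["A. Example", "B. Sample", "C. Placeholder"]
VENUES_BY_TIER = {
    1: ["Tier-1 Placeholder Journal"],
    2: ["Tier-2 Placeholder Journal"],
    3: ["Tier-3 Placeholder Journal"],
    4: ["Tier-4 Placeholder Journal"],
}
YEARS = range(2000, 2024)  # assumed range, for illustration only


@dataclass(frozen=True)
class Claim:
    text: str
    is_true: bool
    domain: str
    real_citation: str


def fabricate_citation(rng: random.Random) -> str:
    """Sample author, venue tier, venue, and year from the placeholder pools."""
    tier = rng.choice(sorted(VENUES_BY_TIER))
    author = rng.choice(AUTHORS)
    venue = rng.choice(VENUES_BY_TIER[tier])
    year = rng.choice(YEARS)
    return f"{author} ({year}), {venue}"


def cross_with_citations(claims, seed=0):
    """Pair every claim with one real and one fabricated citation."""
    rng = random.Random(seed)
    prompts = []
    for claim in claims:
        for citation_is_real in (True, False):
            citation = claim.real_citation if citation_is_real else fabricate_citation(rng)
            prompts.append({
                "claim": claim.text,
                "domain": claim.domain,
                "claim_is_true": claim.is_true,
                "citation_is_real": citation_is_real,
                "citation": citation,
            })
    return prompts
```

Under this sketch each base claim produces two prompts, which is consistent with the reported counts (110,282 × 2 = 220,564), and the pair (`claim_is_true`, `citation_is_real`) identifies the cell of the 2×2 design.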

Experimental Setup

Seven LLMs, spanning open and proprietary models, instruction-tuned and base variants, and a range of parameter scales, are evaluated using Qwen3-8B as a judge model. Open models are tested on the full dataset; proprietary models and models with compute constraints are tested on a stratified 15K-prompt subset. The hallucination rate is measured as the proportion of outputs labeled "hallucinated". All evaluations use binary ground truth labels supplied with citation metadata, validated by high inter-annotator agreement (Cohen's kappa = 0.83).
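The two quantities named above can be computed as in the sketch below. The function names and label strings are assumptions for illustration; the summary does not say which pair of annotators the reported kappa compares, so `cohens_kappa` is simply the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def hallucination_rate(labels):
    """Proportion of judge labels equal to "hallucinated"."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(1 for label in labels if label == "hallucinated") / len(labels)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)) / n**2
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)
```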

Core Findings

Citation-Induced Hallucination

AuthorityBench establishes that citation presence, regardless of factuality, increases hallucination rates relative to no-citation baselines for all tested models. The critical TC×FC condition universally exhibits the highest hallucination rates, with lifts of +3.23 to +22.29 percentage points over true-claim baselines. In the general knowledge domain, hallucination rates reach 35 to 77%, with some models approaching ceiling in this setting (Figure 3).

Figure 3: Hallucination rates across citation conditions reveal TC×FC as the highest-risk scenario across all evaluated LLMs.
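A lift in percentage points is the difference between two hallucination rates, scaled by 100. The sketch below shows one way to compute it from per-prompt judgments; the condition names (`"TC_no_citation"`, `"TCxFC"`) and the toy records are made up for illustration and are not results from the paper.

```python
from collections import defaultdict


def rates_by_condition(records):
    """records: iterable of (condition_name, is_hallucinated) pairs."""
    totals = defaultdict(int)
    hallucinated = defaultdict(int)
    for condition, is_hallucinated in records:
        totals[condition] += 1
        hallucinated[condition] += int(is_hallucinated)
    return {condition: hallucinated[condition] / totals[condition] for condition in totals}


def lift_pp(rates, condition, baseline):
    """Hallucination-rate difference versus the baseline, in percentage points."""
    return 100.0 * (rates[condition] - rates[baseline])


# Toy data only: 2 of 10 baseline outputs and 5 of 10 TCxFC outputs hallucinated.
toy_records = (
    [("TC_no_citation", True)] * 2 + [("TC_no_citation", False)] * 8
    + [("TCxFC", True)] * 5 + [("TCxFC", False)] * 5
)
toy_rates = rates_by_condition(toy_records)
toy_lift = lift_pp(toy_rates, "TCxFC", "TC_no_citation")
```

A positive lift means the cited condition hallucinates more often than the baseline; a negative lift means less often.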

Notably, larger or more capable models do not consistently exhibit greater robustness. Neither instruction tuning nor parameter scale mitigates epistemic susceptibility to authority signals.

False-Claim Responses: Suppression vs. Amplification

For false claims, models bifurcate: some (Llama, Claude, GPT 5.4 mini) exhibit citation-induced suppression, in which fabricated citations reduce hallucination, while others (Gemma variants, Phi-4) show amplification. Family-level properties, not scale, mediate these effects (Figure 4).

Figure 4: False-claim condition lifts highlight model-specific patterns: suppression in some, amplification in others, with DeepSeek V3.2 neutral.
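The suppression/amplification split amounts to reading the sign of the false-claim lift. A minimal sketch, assuming a hypothetical neutral band of ±1 percentage point (the summary does not state what threshold, if any, the authors use to call a model neutral):

```python
def classify_false_claim_response(lift_pp, neutral_band_pp=1.0):
    """Label a model by the sign of its false-claim lift (in percentage points).

    neutral_band_pp is an illustrative assumption, not a value from the paper.
    """
    if lift_pp <= -neutral_band_pp:
        return "suppression"    # fabricated citations reduce hallucination
    if lift_pp >= neutral_band_pp:
        return "amplification"  # fabricated citations increase hallucination
    return "neutral"
```

For instance, `classify_false_claim_response(-4.0)` returns `"suppression"` and `classify_false_claim_response(0.2)` returns `"neutral"`.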

Domain, Structure, and Authority Effects

  1. Domain Sensitivity: General knowledge is the most vulnerable, while legal claims are robust to citation signals, likely due to distinctive citation conventions raising evidentiary standards.
  2. Prestige and Demographics: Venue prestige and author demographic signals (surname region) do not significantly alter hallucination rates. Fictitious citations from elite venues confer no added authority. Temporal citation framing (year) has only minor effects.
  3. Prompt Structure: Citation format impacts hallucination: prefix placement (e.g., "According to [source]") is universally high-risk, while minimal salience structures (author/year only) are often protective.
  4. Domain Alignment: Cross-domain citations generally degrade performance more than same-domain citations, especially in legal and general domains (Figure 5).

Figure 5: Secondary results (domain effects, template structure, venue prestige, author demographics, domain alignment) quantify nuanced drivers of hallucination.
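To make the prompt-structure contrast concrete, the sketch below renders the same claim in a prefix-placement form and in a minimal author/year form. Both template strings are hypothetical illustrations of the two structures named in the list; they are not taken from the benchmark's 40 templates.

```python
def render_prefix(claim, author, venue, year):
    """Prefix placement: the citation leads the sentence (reported as high-risk)."""
    return f"According to {author} ({venue}, {year}), {claim}"


def render_minimal(claim, author, year):
    """Minimal salience: author and year only, placed after the claim."""
    return f"{claim} ({author}, {year})"


# Placeholder claim and citation fields, for illustration only.
prefix_prompt = render_prefix("the claim text goes here.", "A. Example", "Placeholder Journal", 2015)
minimal_prompt = render_minimal("The claim text goes here.", "A. Example", 2015)
```

Holding the claim and citation fields fixed while varying only the rendering function is what lets template structure be compared independently of content.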

Practical and Theoretical Implications

The finding that citation presence degrades model factual accuracy, including for claims models would otherwise handle correctly, has direct implications for RAG and citation-augmented systems in critical domains. Despite models' high factual performance on baseline prompts, citation signals function as epistemic authority rather than evidence, leading to systematic denial of correct information. This exposes a risk in trust calibration: users and system designers should not assume citations improve reliability, and model deployment in high-stakes settings must be accompanied by targeted mitigation.

Theoretically, AuthorityBench demonstrates that epistemic deference heuristics are internalized by LLMs, independent of factual grounding, challenging claims about scale-driven robustness. The suppression/amplification split by model family suggests that architectural or training choices, not scale, govern susceptibility to authority signals.

Future Directions

  1. Mechanistic Analysis: Investigation of attention patterns and hidden state dynamics to elucidate why citation signals disrupt true claim processing and drive model family divergence.
  2. Mitigation Evaluation: Prompt-based, fine-tuning, or architectural interventions can be systematically tested using AuthorityBench to reduce citation-induced hallucination.
  3. Demographic Granularity: Finer-scale demographic signals may expose more subtle biases in epistemic authority processing.
  4. Frontier Model Testing: Extending evaluation to larger, more advanced LLMs.

Conclusion

AuthorityBench quantifies how citation-based authority signals induce epistemic failure modes in LLMs, with citation presence, fabricated or real, systematically increasing hallucination rates. The critical TC×FC condition reveals that models are especially prone to denying correct facts when those facts are accompanied by fabricated citations, reaching high hallucination rates irrespective of model capability. Venue prestige and author demographics are largely inert; legal claims are comparatively resistant, likely due to citation conventions. These results indicate that LLMs treat citation signals as authority rather than evidence, and they underscore the need for model-specific mitigation strategies in citation-sensitive applications. AuthorityBench provides a foundation for future research in mechanistic interpretability, mitigation, and robust epistemic reasoning in AI systems.
