Source Attribution Bias in ML
- Source attribution bias is a phenomenon where source labels and metadata systematically skew judgments and evaluations in machine learning and language models.
- Controlled experiments using label swaps and counterfactual evaluations show sizable effects, with reported shifts in evaluation outcomes of up to 24.43%.
- Mitigation strategies include balanced benchmarking, threshold-free metrics, and neurosymbolic frameworks to reduce disparities in citation and credit allocation.
Source attribution bias denotes systematic distortion introduced when an attributed source, source label, or source-selection mechanism changes how a system judges content, explains outcomes, credits authors, or cites evidence. In recent machine-learning and LLM research, the term spans several related phenomena: source framing in evaluation, demographic asymmetries in causal explanation, metadata-sensitive citation behavior in retrieval-augmented generation, unequal quote attribution and suppression, multimodal grounding failures, and selective disclosure of visited sources in search-enabled systems (Germani et al., 14 May 2025, Raj et al., 28 May 2025, Abolghasemi et al., 2024, Berman et al., 6 Apr 2026, Song et al., 15 Nov 2025, Strauss et al., 27 Jun 2025). A broader survey of attribution methods places these issues within a larger attribution pipeline shaped by “ambiguous knowledge reservoirs,” “inherent biases,” and the drawbacks of excessive attribution (Li et al., 2023).
1. Conceptual scope
The literature does not use a single operational definition of source attribution bias. In evaluation settings, it refers to the fact that the same statement is judged differently depending on who the evaluator is told wrote it; this is called “source framing” or “source attribution” in the evaluation-bias literature (Germani et al., 14 May 2025). In social-psychological benchmarking for LLMs, it refers to whether identical outcomes are explained by internal causes such as ability or effort for some demographic groups, but by external causes such as luck or task difficulty for others (Raj et al., 28 May 2025). In retrieval-augmented generation, it refers to sensitivity and directional bias in citation behavior once source documents carry authorship metadata such as “[Human]” or “[LLM]” (Abolghasemi et al., 2024). In quote attribution, it includes both differential correctness and a distinct failure mode, “suppression,” in which models omit attribution entirely even when authorship information is available (Berman et al., 6 Apr 2026). In search-enabled systems, it appears as an “attribution gap,” defined as the difference between relevant URLs read and URLs actually cited (Strauss et al., 27 Jun 2025).
| Research setting | Attributed object | Bias manifestation |
|---|---|---|
| LLM evaluation | Statement authorship | Source framing changes judgments |
| RAG attribution | Document authorship metadata | Sensitivity and human-authorship preference |
| Causal explanation | Cause of success or failure | Internal vs. external asymmetry by group |
| Quote attribution | Original author | Accuracy disparities and suppression |
| Web-enabled search | Consumed webpages | Relevant pages visited but not cited |
| Multimodal VQA | Text and image evidence | Groundedness gaps and contextual bias |
This diversity suggests that “source attribution bias” is best understood as a family of attribution distortions rather than a single metric. A recurring commonality is that the system’s behavior is influenced by source identity, source metadata, or source exposure in ways that are not reducible to the informational content alone.
2. Source framing in evaluation and retrieval-augmented generation
A prominent line of work studies source framing as evaluative bias. In a two-phase full-factorial design, four LLMs generated 4,800 narrative statements over 24 socially sensitive topics, and each model then re-evaluated the same 4,800 statements under 10 attribution conditions, yielding 192,000 assessments (Germani et al., 14 May 2025). In the blind condition, inter-model and intra-model agreement remained consistently above 90%, with most values at or near 95% or higher. The paper defines attribution bias for each condition relative to blind evaluation, using the mean score under that condition. Once source framing was introduced, this alignment degraded. The clearest reported effect was that attribution to “a person from China” systematically lowered agreement scores across all four evaluator models. On the full dataset, the strongest reported negative bias was for Deepseek Reasoner, and the paper reports the effect of the same attribution separately for Cluster 7 (“Politics and international relations”) (Germani et al., 14 May 2025). Topic-level examples such as Taiwan, Ukraine, Gaza, and media freedom show that the same text can receive sharply different evaluations solely because the putative source identity changes.
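The comparison against blind evaluation can be illustrated with a minimal sketch. It assumes bias is measured as a plain difference between the mean score under an attribution condition and the mean score under the blind condition; the paper's exact normalization may differ, and the function name, condition names, and scores below are hypothetical.

```python
from statistics import mean


def attribution_bias(scores_by_condition, baseline="blind"):
    """Mean-score shift of each condition relative to the baseline condition.

    Assumption: bias is a simple difference of means. The source paper may
    normalize differently (for example, as a percentage of the baseline).
    """
    base = mean(scores_by_condition[baseline])
    return {
        cond: mean(scores) - base
        for cond, scores in scores_by_condition.items()
        if cond != baseline
    }


# Hypothetical agreement scores for the same statements under three framings.
scores = {
    "blind": [0.9, 1.0, 0.8],
    "source_a": [0.9, 0.9, 0.9],
    "source_b": [0.6, 0.7, 0.5],
}
bias = attribution_bias(scores)
# A negative value means the framing lowered the mean score relative to blind.
```

Because the statements are identical across conditions, any nonzero shift in this sketch is attributable to the framing rather than to the content.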
A parallel effect appears in RAG citation behavior. A counterfactual evaluation of GPT-4, Llama 3, and Mistral distinguishes attribution sensitivity from attribution bias by comparing Vanilla RAG, Authorship-Informed RAG, and Counterfactual-Authorship Informed RAG (Abolghasemi et al., 2024). The main finding is that adding authorship information changes attribution quality by 3% to 18%, and all three models exhibit a consistent bias toward explicitly human-labeled documents. The paper’s Mixed RAG setting is especially important: even when all retrieved documents are of the same actual origin, either all human or all synthetic LLM text, the human-authorship preference persists under label swaps. This isolates metadata effects from stylistic differences in the underlying documents (Abolghasemi et al., 2024).
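A label-swap audit of this kind can be sketched as follows. This is an illustrative simplification, not the paper's metric definitions: it assumes each document is shown once with a "human" label and once with an "llm" label while its content is held fixed, and that the only recorded outcome is whether the document was cited. The function and field names are hypothetical.

```python
def label_swap_audit(runs):
    """Summarize how authorship labels change citation of identical documents.

    runs: list of (doc_id, label, cited) tuples, where every doc_id appears
    once with label "human" and once with label "llm" (same content).
    Returns (sensitivity, bias):
      sensitivity = share of documents whose cited/not-cited outcome flips
      bias        = mean of (cited as human) - (cited as llm); a positive
                    value indicates a preference for the "human" label
    """
    by_doc = {}
    for doc_id, label, cited in runs:
        by_doc.setdefault(doc_id, {})[label] = int(cited)
    diffs = [outcome["human"] - outcome["llm"] for outcome in by_doc.values()]
    n = len(diffs)
    sensitivity = sum(abs(d) for d in diffs) / n
    bias = sum(diffs) / n
    return sensitivity, bias


# Hypothetical outcomes for four documents under both labels.
runs = [
    ("d1", "human", True), ("d1", "llm", False),
    ("d2", "human", True), ("d2", "llm", True),
    ("d3", "human", False), ("d3", "llm", True),
    ("d4", "human", True), ("d4", "llm", False),
]
sensitivity, bias = label_swap_audit(runs)  # 0.75 and 0.25 for this data
```

Separating the two numbers matters: a system can be highly sensitive to labels without a consistent direction, whereas a nonzero signed mean indicates a directional preference.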
These studies support a common interpretation: LLMs do not merely retrieve or evaluate content; they also apply source-sensitive trust heuristics. A plausible implication is that apparently stable performance under blind benchmarking can mask brittle or unfair behavior once source identity becomes visible.
3. Attribution of causes to people and groups
A second literature uses “attribution bias” in the classical social-psychological sense of assigning causes to observed outcomes. “Talent or Luck? Evaluating Attribution Bias in LLMs” grounds this setting in Attribution Theory and Weiner’s achievement framework, treating ability and effort as internal attributions and task difficulty and luck as external attributions (Raj et al., 28 May 2025). The benchmark contains 140k prompts over 400 high-quality templates, spans gender, nationality, race, and religion, and evaluates three settings: single-actor, actor–actor, and actor–observer.
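The internal/external split can be made concrete with a small tally. The sketch below assumes each model response has already been mapped to one of the four causes named above; the record layout, group names, and function name are hypothetical and not taken from the benchmark.

```python
from collections import Counter, defaultdict

INTERNAL = {"ability", "effort"}
EXTERNAL = {"task difficulty", "luck"}


def internal_share(records):
    """Share of internal attributions per (group, outcome) pair.

    records: iterable of (group, outcome, cause) tuples, where cause is one
    of the four categories above.
    """
    counts = defaultdict(Counter)
    for group, outcome, cause in records:
        if cause in INTERNAL:
            counts[(group, outcome)]["internal"] += 1
        elif cause in EXTERNAL:
            counts[(group, outcome)]["external"] += 1
        else:
            raise ValueError(f"unknown cause: {cause!r}")
    return {
        key: c["internal"] / (c["internal"] + c["external"])
        for key, c in counts.items()
    }


# Hypothetical model choices for identical success scenarios.
records = [
    ("group_a", "success", "ability"),
    ("group_a", "success", "effort"),
    ("group_b", "success", "luck"),
    ("group_b", "success", "effort"),
]
shares = internal_share(records)
```

Comparing the shares for the same outcome across groups gives a simple view of whether one group receives more internal credit for success than another.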
The central empirical claim is that models generally follow a broad success/internal and failure/external tendency, but meaningful asymmetries appear across identities, model families, and domains. The paper’s summary statement is that “attribution discrepancies are observed across identities, with marginalized groups receiving less credit for success and more blame for failure” (Raj et al., 28 May 2025). Gender differences are described as the most pronounced. Reported examples include Asian, Middle Eastern, and Hispanic women receiving more internal attributions than their male counterparts; White and Black males receiving predominantly external attributions; and failures of Russian, French, German, Japanese, and Korean people often being attributed to internal factors (Raj et al., 28 May 2025).
The benchmark also reveals model-family and scenario dependence. Aya leans strongly toward external attribution overall, often using task difficulty and luck, whereas Qwen and Llama lean more toward internal attribution, especially effort. Actor–actor and actor–observer settings further show that direct comparison and social framing can amplify disparities. The paper reports that models favor dominant or Western identities in some cross-gender comparisons and that observer framing can reverse single-actor trends, especially for failures (Raj et al., 28 May 2025).
This line of work broadens the concept of source attribution bias beyond citation or provenance. Here the “source” is the implied origin of an event outcome—within the person or in the surrounding situation. The common thread is differential assignment of explanatory source under controlled identity manipulations.
4. Credit allocation, omission, and suppression
Source attribution bias also concerns whether systems credit original authors equitably. “Attribution Bias in LLMs” introduces AttriBench, described as the first fame- and demographically-balanced quote attribution benchmark dataset, with two versions: AttriBench Intersectional and AttriBench Multirace (Berman et al., 6 Apr 2026). The benchmark is built by pruning and rebalancing JSTET quote–author pairs, labeling race and gender, and greedily matching authors across groups so that mean fame is nearly identical across subgroups. This design targets a central confound in attribution tasks: attribution accuracy rises with author fame, so demographic disparities are difficult to interpret without fame control.
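One way to picture greedy fame matching is the sketch below. It is not the benchmark's actual procedure: it assumes each author carries a single numeric fame score and pairs authors from two groups whenever their scores fall within a tolerance. The scores, the `max_gap` parameter, and the function name are hypothetical.

```python
def greedy_fame_match(group_a, group_b, max_gap):
    """Greedily pair authors across two groups by closest fame score.

    group_a, group_b: lists of (author, fame) tuples.
    max_gap: largest fame difference allowed within a pair (assumed knob).
    Returns a list of (author_a, author_b) pairs; unmatched authors are dropped.
    """
    remaining = sorted(group_b, key=lambda item: item[1])
    pairs = []
    for author, fame in sorted(group_a, key=lambda item: item[1]):
        if not remaining:
            break
        # Pick the closest still-unmatched author from the other group.
        best = min(remaining, key=lambda item: abs(item[1] - fame))
        if abs(best[1] - fame) <= max_gap:
            pairs.append((author, best[0]))
            remaining.remove(best)
    return pairs


# Hypothetical fame scores; "b3" is far more famous than anyone in group A.
group_a = [("a1", 10), ("a2", 50), ("a3", 90)]
group_b = [("b1", 12), ("b2", 48), ("b3", 300)]
pairs = greedy_fame_match(group_a, group_b, max_gap=5)
# pairs == [("a1", "b1"), ("a2", "b2")]
```

Dropping unmatched authors shrinks the dataset but keeps the matched subgroups comparable in fame, which is the property the balanced design depends on.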
Even with fame controlled, quote attribution remains difficult and uneven. Under direct prompting, GPT-5.1 and Claude 4.6 Sonnet achieve about 25%–27% accuracy on the intersectional dataset and about 21%–23% on the multirace dataset, while several other models remain below 10% (Berman et al., 6 Apr 2026). Across models and prompt settings, White male is the highest-accuracy subgroup in the intersectional dataset, and Black female is consistently the lowest. In the multirace dataset, White is the highest-accuracy subgroup in every model and prompt except GPT-OSS-120B (Berman et al., 6 Apr 2026).
The paper’s most distinctive addition is “suppression,” a failure mode in which models omit attribution entirely. It defines omission suppression under indirect prompting and evidence-conditioned suppression when the true author is present in retrieved context but still not mentioned (Berman et al., 6 Apr 2026). Suppression is widespread and unevenly distributed: White and White male authors have the lowest suppression in every model except the weakest GPT-OSS-120B case, and other subgroups show statistically significant increases. Retrieval largely fixes direct-prompt identification, but under indirect prompting disparities persist even when the correct author is explicitly available (Berman et al., 6 Apr 2026).
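The two suppression notions can be illustrated as per-group rates. The sketch assumes each response has already been annotated with the author it names, if any, and with whether the true author appeared in the retrieved context; the field names and the reading of "not mentioned" as "named author differs from true author" are assumptions rather than the paper's definitions.

```python
from collections import defaultdict


def suppression_by_group(records):
    """Per-group omission and evidence-conditioned suppression rates.

    records: iterable of dicts with keys
      group        - demographic subgroup of the true author
      true_author  - ground-truth author name
      named_author - author named in the answer, or None if no one is named
      in_context   - True if the true author appeared in retrieved context
    """
    totals = defaultdict(lambda: {"n": 0, "omitted": 0, "ctx": 0, "ctx_missing": 0})
    for r in records:
        t = totals[r["group"]]
        t["n"] += 1
        if r["named_author"] is None:
            t["omitted"] += 1
        if r["in_context"]:
            t["ctx"] += 1
            if r["named_author"] != r["true_author"]:
                t["ctx_missing"] += 1
    return {
        group: {
            "omission_rate": t["omitted"] / t["n"],
            # None when no record for this group had the author in context.
            "evidence_conditioned_rate": (
                t["ctx_missing"] / t["ctx"] if t["ctx"] else None
            ),
        }
        for group, t in totals.items()
    }


# Hypothetical annotated responses.
records = [
    {"group": "g1", "true_author": "A", "named_author": "A", "in_context": True},
    {"group": "g1", "true_author": "B", "named_author": None, "in_context": False},
    {"group": "g2", "true_author": "C", "named_author": None, "in_context": True},
    {"group": "g2", "true_author": "D", "named_author": None, "in_context": True},
]
rates = suppression_by_group(records)
```

Reporting these rates next to accuracy keeps the two failure types distinct: a wrong name and no name at all are counted separately.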
This work shifts the discussion from mere correctness to representational fairness. Misattribution credits the wrong source; suppression erases the source from the user-facing answer. The latter is especially consequential for systems that summarize, search, or paraphrase rather than explicitly answer attribution questions.
5. Retrieval, grounding, and multimodal source selection
Another research strand studies source attribution bias as selective disclosure, over- or under-citation, and grounding asymmetry. In web-enabled search systems, the “attribution gap” is the difference between unique relevant URLs visited and unique URLs cited (Strauss et al., 27 Jun 2025). Using approximately 14,000 real-world LMArena conversation logs, the paper identifies three exploitation patterns: “No Search,” “No citation,” and “High-volume, low-credit.” Reported figures include 34% of Google Gemini and 24% of OpenAI GPT-4o responses generated without explicitly fetching online content; Gemini providing no clickable citation source in 92% of answers; and Perplexity’s Sonar visiting approximately 10 relevant pages per query but citing only three to four (Strauss et al., 27 Jun 2025). A negative binomial hurdle model estimates that the average query answered by Gemini or Sonar leaves about 3 relevant websites uncited, and citation efficiency varies from 0.19 to 0.45 on identical queries (Strauss et al., 27 Jun 2025).
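For a single response, the attribution gap defined above reduces to simple set arithmetic. In the sketch below the gap follows the stated definition, while the efficiency ratio (unique cited URLs divided by unique relevant URLs visited) is an assumed reading of "citation efficiency" and may not match the paper's formula. The URLs and function name are hypothetical.

```python
def attribution_gap(visited_relevant_urls, cited_urls):
    """Attribution gap and an assumed citation-efficiency ratio for one response.

    gap        = unique relevant URLs visited - unique URLs cited
    efficiency = unique cited / unique relevant visited (assumed definition),
                 or None when nothing relevant was visited
    """
    visited = set(visited_relevant_urls)
    cited = set(cited_urls)
    gap = len(visited) - len(cited)
    efficiency = len(cited) / len(visited) if visited else None
    return gap, efficiency


# Hypothetical log: ten relevant pages visited, three of them cited.
visited = [f"https://example.org/page{i}" for i in range(10)]
cited = visited[:3]
gap, efficiency = attribution_gap(visited, cited)  # 7 and 0.3
```

Deduplicating with sets mirrors the "unique URLs" wording; counting repeated visits or repeated citations would inflate either side of the comparison.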
At the sentence-attribution level, “Think Before You Attribute” addresses technical distortions in RAG citation workflows by inserting a pre-attribution classifier that predicts whether a sentence requires zero, one, or multiple references (Batista et al., 19 May 2025). The paper explicitly discusses over-referencing, under-referencing, and invalid sentences, and reports a systematic tendency in unclean datasets for zero-reference and multiple-reference cases to be mistaken for single-reference sentences. It also notes that unsupported hallucinated content tends to be assigned the “most similar” quote, producing misleadingly grounded citations (Batista et al., 19 May 2025). These are not framed as fairness biases, but they are clear cases of systematic source-assignment distortion.
Multimodal settings introduce an additional layer. MAVIS is described as the first benchmark for multimodal source attribution systems in long-form visual question answering, with 157K visual QA instances and fact-level citations to multimodal documents (Song et al., 15 Nov 2025). Its key findings are that LVLMs with multimodal RAG generate more informative and fluent answers than unimodal RAG, but exhibit weaker groundedness for image documents than for text documents, and that this gap is amplified in multimodal settings. The abstract further states that mitigating contextual bias in interpreting image documents is a crucial direction for future research (Song et al., 15 Nov 2025).
Taken together, these works show that source attribution bias is not limited to overt source labels. It also appears when systems visit more sources than they disclose, collapse multi-source support into a single citation, or ground text evidence more reliably than image evidence.
6. Measurement, diagnostics, and methodological artifacts
Measurement itself can become a source of attribution distortion. One example is contributive attribution in LLMs: whether an answer is primarily grounded in prompt context or in parametric memory. “Probing for Knowledge Attribution in LLMs” treats this as a binary classification problem over hidden representations and reports that probes trained on AttriWiki achieve up to 0.96 Macro-F1 in-domain and 0.94–0.99 Macro-F1 on SQuAD and WebQuestions without retraining (Brink et al., 26 Feb 2026). The most consequential result is that attribution mismatches—cases where the inferred source is not the source that should have been used—raise error rates by up to 70%. The paper also reports that when both sources suffice, models overwhelmingly default to the contextual channel in 87.1% of cases (Brink et al., 26 Feb 2026). This does not define a demographic bias, but it shows that source confusion is measurable and tightly linked to hallucination and unfaithfulness.
A second example comes from neural attribution benchmarking. “Systematic Evaluation of Attribution Methods” argues that evaluation of attribution maps can be dominated by threshold selection bias, with single-threshold overlap metrics reversing method rankings by over 200 percentage points (Aksoy, 3 Sep 2025). The proposed AUC-IoU integrates performance over thresholds and is presented as a threshold-free remedy. In this setting, the “source” of an apparent performance difference may be the evaluator’s binarization choice rather than the attribution method itself (Aksoy, 3 Sep 2025). This is a methodological rather than semantic use of attribution bias, but it is relevant because it shows how attribution claims can be artifacts of evaluation design.
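The idea of integrating overlap across thresholds can be sketched in a few lines. This is an illustration of the general technique only: it assumes attribution scores normalized to [0, 1], a uniform threshold grid, and trapezoidal integration, none of which is confirmed to match the paper's AUC-IoU definition. The function names are hypothetical.

```python
def iou_at_threshold(attribution, mask, threshold):
    """IoU between the binarized attribution map and a ground-truth mask."""
    pred = [score >= threshold for score in attribution]
    inter = sum(1 for p, m in zip(pred, mask) if p and m)
    union = sum(1 for p, m in zip(pred, mask) if p or m)
    return inter / union if union else 0.0


def auc_iou(attribution, mask, num_thresholds=101):
    """Area under the IoU-vs-threshold curve over [0, 1] (trapezoidal rule).

    Assumption: scores lie in [0, 1] and thresholds are evenly spaced.
    """
    step = 1 / (num_thresholds - 1)
    ious = [
        iou_at_threshold(attribution, mask, i * step)
        for i in range(num_thresholds)
    ]
    return sum((ious[i] + ious[i + 1]) / 2 * step for i in range(num_thresholds - 1))


# Hypothetical flattened maps: one aligned with the mask, one inverted.
mask = [True, True, False, False]
aligned = auc_iou([1.0, 1.0, 0.0, 0.0], mask)
inverted = auc_iou([0.0, 0.0, 1.0, 1.0], mask)
# The aligned map scores close to 1, the inverted map close to 0.
```

Because the score averages over all cutoffs, no single binarization choice can determine which method ranks higher.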
A third diagnostic perspective appears in brain-to-language retrieval. “What Are We Actually Decoding? Source Attribution for Non-Invasive Brain-to-Language Retrieval” argues that reported gains can come from structural shortcuts, window-level stimulus-locked evidence, or cross-window contextual aggregation, and that performance should be “source-attributed, not merely reported” (Zhang et al., 23 May 2026). The paper shows Gaussian noise can reach 66.3% Rank@1 under variable-length decoding, whereas fixed-duration controls collapse this shortcut; it then isolates a contextual contribution via Group Context Bias, shifting Rank@1 from 44% to 52% on Gwilliams and from 22% to 29% on MOUS (Zhang et al., 23 May 2026). The broader lesson is that attribution systems require diagnostics that separate genuine support from artifacts, priors, and structural leakage.
These methodological studies caution against a common misconception: that adding an attribution mechanism is sufficient. The reliability of attribution depends not only on the model but also on the diagnostic regime, the operating threshold, and whether the claimed source is the actual source of the observed performance.
7. Mitigation strategies and open problems
The literature proposes no single remedy, but several directions recur. Counterfactual evaluation with label swaps is a direct audit for metadata-sensitive RAG attribution, and the CAS/CAB framework provides a concrete way to quantify sensitivity and bias under fixed document content (Abolghasemi et al., 2024). Balanced benchmarks such as AttriBench address confounds like fame and add suppression as a necessary metric alongside accuracy (Berman et al., 6 Apr 2026). Threshold-free evaluation such as AUC-IoU is proposed to remove ranking instability induced by arbitrary cutoffs (Aksoy, 3 Sep 2025). Sentence-level pre-attribution seeks to reduce over-attribution, single-reference collapse, and unsupported quote assignment before retrieval is performed (Batista et al., 19 May 2025).
Other proposals are more architectural. A neurosymbolic attribution framework argues for grounding outputs in knowledge graphs, ontologies, and verified databases, with metacognitive control and temporal reasoning used to prioritize trusted, current sources over popular but unreliable ones (Tilwani et al., 2024). In multimodal attribution, MAVIS identifies mitigating contextual bias in image interpretation as a central open problem (Song et al., 15 Nov 2025). For web-enabled search, transparent telemetry and full disclosure of search traces and citation logs are recommended, including standardized retrieval identifiers and scores (Strauss et al., 27 Jun 2025).
Several open questions remain unresolved. The current evidence does not determine whether attribution disparities arise primarily from pretraining distributions, instruction tuning, safety layers, retrieval stacks, or interfaces; several papers explicitly treat these as plausible but unproven contributors (Germani et al., 14 May 2025, Abolghasemi et al., 2024). Cross-domain robustness is also uneven: binary authenticity detection transfers better than fine-grained source attribution in video forensics, and multimodal groundedness remains weaker for image documents than for text (Kundu et al., 16 Nov 2025, Song et al., 15 Nov 2025). More broadly, the literature suggests that source attribution bias should be studied at multiple levels simultaneously: who is credited, which evidence is surfaced, whether disclosure matches consumption, and whether the reported source is truly the source on which the system relied.