LLM hallucinations in the wild: Large-scale evidence from non-existent citations

Published 8 May 2026 in cs.DL, cs.AI, cs.CY, and physics.soc-ph | (2605.07723v1)

Abstract: LLMs are known to generate plausible but false information across a wide range of contexts, yet the real-world magnitude and consequences of this hallucination problem remain poorly understood. Here we leverage a uniquely verifiable object - scientific citations - to audit 111 million references across 2.5 million papers in arXiv, bioRxiv, SSRN, and PubMed Central. We find a sharp rise in non-existent references following widespread LLM adoption, with a conservative estimate of 146,932 hallucinated citations in 2025 alone. These errors are diffusely embedded across many papers but especially pronounced in fields with rapid AI uptake, in manuscripts with linguistic signatures of AI-assisted writing, and among small and early-career author teams. At the same time, hallucinated references disproportionately assign credit to already prominent and male scholars, suggesting that LLM-generated errors may reinforce existing inequities in scientific recognition. Preprint moderation and journal publication processes capture only a fraction of these errors, suggesting that the spread of hallucinated content has outpaced existing safeguards. Together, these findings demonstrate that LLM hallucinations are infiltrating knowledge production at scale, threatening both the reliability and equity of future scientific discovery as human and AI systems draw on the existing literature.

Summary

  • The paper demonstrates a surge in LLM-induced hallucinated citations across millions of articles, establishing a conservative lower bound on non-existent references.
  • The study employs a rigorous audit pipeline that cross-references citation titles with databases like Semantic Scholar and OpenAlex to distinguish hallucinations from normal noise.
  • The findings reveal systemic effects on scholarly communication: hallucinated references skew academic credit toward already prominent and male scholars, and they appear most often in papers by small and early-career author teams.

Large-Scale Quantification of LLM-Induced Hallucinated Citations in Scientific Literature

Introduction

This study conducts a large-scale analysis of hallucinated citations in scientific literature and links their rise to the adoption of LLM-assisted writing. Because a cited work either exists or does not, the authors treat citations as a verifiable ground truth, construct an audit pipeline, and apply it to 111 million references spanning 2.5 million articles from arXiv, bioRxiv, SSRN, and PubMed Central. Their findings quantify the pervasiveness, distribution, and systemic impact of LLM-induced hallucinations, establishing a conservative lower bound and detailing how these errors propagate, whom they benefit, and how well current safeguards detect and remove them.

Methodology

The pipeline extracts and parses all references and verifies each reference title against locally indexed databases (Semantic Scholar, OpenAlex), with a further cross-check against Google Scholar. To separate hallucinations from normal bibliographic noise and parsing errors, the approach uses unmatched-citation rates from before LLM adoption (pre-2023) as a baseline and attributes the excess unmatched citations in the post-LLM era to hallucinations. A multi-stage cleaning process, including LLM-based filtering for non-academic artefacts, reduces false positives, and manual validation is used to check the matching criteria.
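The title-matching step can be illustrated with a minimal sketch. This is not the authors' implementation: the normalization rules, the `difflib` similarity measure, the 0.9 threshold, and the in-memory list standing in for a local Semantic Scholar/OpenAlex index are all assumptions made for illustration.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def best_match(title, index_titles, threshold=0.9):
    """Return (matched_title, score), or (None, best_score) below the threshold.

    `index_titles` is a hypothetical stand-in for a real bibliographic index.
    """
    query = normalize_title(title)
    best_title, best_score = None, 0.0
    for candidate in index_titles:
        score = SequenceMatcher(None, query, normalize_title(candidate)).ratio()
        if score > best_score:
            best_title, best_score = candidate, score
    if best_score >= threshold:
        return best_title, best_score
    return None, best_score


# Tiny illustrative index; a real audit would query millions of records.
index = ["Attention Is All You Need", "Deep Residual Learning for Image Recognition"]
print(best_match("Attention is all you need.", index))
print(best_match("A Survey of Quantum Llama Farming", index))
```

A linear scan like this only works for toy data; at the scale described in the paper, a search engine or blocking strategy would be needed before any pairwise comparison.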

Key Findings

Sharp Increase in Hallucinated Citations

There is a marked surge in hallucinated citations post-LLM uptake, with unmatched citation rates peaking by August 2025 at 0.39% (arXiv), 0.21% (bioRxiv), 1.91% (SSRN), and 0.27% (PMC). The rise is diffuse: most manuscripts exhibit a modest proportion of hallucinated references, rather than a minority of heavily contaminated papers. The extrapolated annual volume in these corpora alone exceeds 146,000 non-existent references for 2025, setting a strong lower bound given the coverage and conservative methodology.
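The baseline logic behind such an estimate can be sketched as simple arithmetic: subtract the pre-LLM unmatched rate from the observed rate and scale by the number of references. The paper's actual estimator may be more elaborate; the function name and all numbers below are hypothetical and are not taken from the study.

```python
def excess_unmatched(unmatched: int, total: int, baseline_rate: float):
    """Return (excess_rate, estimated_hallucinated_count), floored at zero.

    baseline_rate is the unmatched share expected from ordinary
    bibliographic noise and parsing errors (the pre-2023 period).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    observed_rate = unmatched / total
    excess_rate = max(0.0, observed_rate - baseline_rate)
    return excess_rate, excess_rate * total


# Hypothetical inputs for illustration only.
rate, count = excess_unmatched(unmatched=1_500, total=100_000, baseline_rate=0.011)
print(f"excess rate = {rate:.3%}, estimated hallucinated references = {count:.0f}")
```

Flooring at zero keeps months that fall below the baseline from producing negative counts, which is one reason an estimate built this way behaves as a lower bound.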

Correlates and Distribution

Fields with high AI adoption, notably social sciences and computer science, are disproportionately affected. Linguistic markers of AI-generated text, measured via standard LLM-use detection tools, are strongly correlated with hallucination rates at both subfield and individual manuscript levels (Pearson r = 0.441, P < 0.001).
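For readers unfamiliar with the statistic, the Pearson correlation between an AI-writing signal and the unmatched-citation rate can be computed as follows. This is a generic sketch of the formula, and the per-subfield values are invented for illustration; they are not the paper's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-subfield values: share of text flagged as AI-assisted
# and unmatched-citation rate (both invented for illustration).
ai_signal = [0.05, 0.10, 0.20, 0.30, 0.45]
unmatched_rate = [0.001, 0.002, 0.002, 0.004, 0.005]
print(round(pearson_r(ai_signal, unmatched_rate), 3))
```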

Authors who cite hallucinated sources ("hallucination citers") have significantly lower pre-2023 publication counts than controls (e.g., 62% lower in arXiv), but this productivity gap closes in 2025 due to LLM-enabled output acceleration. Smaller teams and early-career authors are especially susceptible, indicating that LLMs lower entry barriers for less experienced contributors, potentially amplifying the impact and diffusion of unreliable citations.

Structural and Equity Implications

Hallucinated citations are not randomly distributed in whom they benefit. They disproportionately allocate credit to already prominent and predominantly male scholars: the cited authors in hallucinated references have significantly greater prior productivity and citation impact compared with control samples, and an overrepresentation of male names (7.6% relative increase). Authorship conventions are also eroded, with hallucinated references favoring smaller teams and deviating from first-last author hierarchy norms. The pattern persists even for valid citations in the same papers, suggesting broader reprioritization of credit by LLM tooling.

Efficacy of Quality Control Mechanisms

Existing moderation and peer review only partially stem the infiltration of hallucinated citations:

  • Only 21.2% of hallucinated references in arXiv submissions are filtered by moderation.
  • 85.3% of hallucinated citations present in preprints persist into journal-published versions (bioRxiv to PMC).
  • The problem is not restricted to lower-tier journals; hallucinated citations permeate the journal impact spectrum, including in high-impact outlets.
  • Hallucinated references are increasingly embedded in bibliometric databases (e.g., Google Scholar), further eroding ground truth and enabling recursive contamination in future LLM training data.
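A persistence figure like the one above can be understood as a pooled ratio over preprint/published pairs: of the unmatched references found in the preprints, what share still appears in the published versions? The sketch below assumes titles have already been normalized to comparable strings; the function name and data are hypothetical.

```python
def persistence_rate(pairs):
    """Share of unmatched preprint references that survive into publication.

    `pairs` is an iterable of (unmatched_titles_in_preprint,
    all_titles_in_published_version), each given as a set of
    normalized title strings.
    """
    total = 0
    survived = 0
    for preprint_unmatched, published_titles in pairs:
        total += len(preprint_unmatched)
        survived += len(preprint_unmatched & published_titles)
    if total == 0:
        raise ValueError("no unmatched preprint references to track")
    return survived / total


# Two hypothetical preprint/published pairs.
pairs = [
    ({"ghost paper one", "ghost paper two"}, {"ghost paper one", "real paper"}),
    ({"ghost paper three"}, {"ghost paper three", "another real paper"}),
]
print(f"{persistence_rate(pairs):.1%} of unmatched references persisted")
```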

Broader Theoretical and Practical Implications

The evidence here refutes "few bad apples" narratives and demonstrates that LLM hallucinations are a diffuse, systemic phenomenon within scientific knowledge production. The cumulative, path-dependent structure of scientific citation ensures that errors, once seeded, compound over time, affecting downstream research, systematic reviews, meta-analyses, and increasingly, the very LLMs trained on such records. This establishes feedback loops where model outputs pollute future model training data, as highlighted in work on model collapse [34].

Furthermore, the study reveals that current agentic verification tools—though promising—are only effective for structured, indexable domains like science, where ground truth is well-defined and infrastructure is robust. In less structured domains (government reports, clinical documentation, legal filings), the problem likely has even greater prevalence and less tractability.

Limitations and Directions for Future Research

The approach may underestimate true hallucination rates due to limitations in title-based detection, coverage gaps for niche or mathematically dense documents, and inability to capture more insidious forms of hallucination (e.g., misattributing valid, real papers to claims they do not support [35–38]). Enhanced claim-level fact-checking, improved parsing for unstructured domains, and further research on feedback loops in recursive LLM-training remain urgent priorities.

Conclusion

This work establishes that LLM-induced hallucinated citations have become a persistent and growing contaminant in scientific literature, with tangible effects on the allocation of academic credit and the reliability of scholarly communication. The diffusion is broad and systematic, with new entrants empowered by LLMs amplifying both the volume and reach of such errors. Quality assurance mechanisms are insufficient to fully mitigate the problem, and bibliometric databases are themselves increasingly contaminated.

Ongoing advances in automated verification, if implemented at scale, may stem further contamination within structured domains. However, as LLMs become ubiquitous as knowledge tools and content generators, the security and reliability of the scientific record—and by extension, the foundations for downstream AI and decision-making systems—will require sustained attention, infrastructure investment, and innovations in fact-checking that scale to less tractable domains.

The infiltration of hallucinated content into foundational knowledge systems demonstrated here is likely a lower bound for broader societal risk, setting a research agenda for mitigating epistemic contamination in the age of ubiquitous LLMs.

Explain it Like I'm 14

Explaining “LLM hallucinations in the wild: Large-scale evidence from non-existent citations”

What this paper is about (big picture)

This paper looks at a growing problem with AI writing tools, like ChatGPT: sometimes they make up things that sound real but aren’t. The authors focus on one very clear kind of mistake—fake scientific references (citations) that don’t actually exist—and show how often these are appearing in real research papers. They find that these fake citations have started spreading through science, which can make the scientific record less reliable and less fair.

What questions the researchers asked

The team wanted to know:

  • How often are fake (non-existent) citations showing up in real scientific papers?
  • When did this problem start growing?
  • Which fields and types of authors are most involved?
  • Who gets “credit” from these fake citations?
  • Are current safety checks (like preprint moderation and journal peer review) catching the problem?

How they studied it (methods made simple)

Think of a citation as a library card for a book: it should point to a real, findable book. If the card points to a book that no one can find anywhere, it’s probably fake.

The researchers:

  • Collected 111 million references from 2.5 million papers across four big sources: arXiv, bioRxiv, SSRN (preprint servers) and PubMed Central (peer-reviewed journals).
  • Focused on the part of each reference that’s easiest to verify: the title. A real title should be findable in major databases.
  • Built a step-by-step checking system:
    • Extracted each reference’s title from papers.
    • Searched large academic databases (Semantic Scholar, OpenAlex) to find a match.
    • Cleaned messy references with a small LLM (GPT-4o-mini) to fix typos or remove non-academic items (like news articles).
    • Tried a final search using Google Scholar.
    • If a title still couldn’t be found anywhere, they marked it as “unmatched” (very likely fake).
  • To avoid blaming normal mistakes on AI, they used years before AI was common (pre-2023) as a baseline error rate. Any extra unmatched citations after AI tools became popular were treated as “hallucinated” (AI-made) at the population level.
  • They also looked for clues of AI-written text in papers and compared patterns across fields and author groups.

In short: they built a careful pipeline to find references that point to nothing, and compared “before AI” vs. “after AI” to estimate the AI effect.
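The staged checking described above can be sketched as a cascade in which each reference stops at the first stage that resolves it. This is only an illustration of the control flow: the three lookup functions are hypothetical stand-ins (simple set membership and a keyword rule) for the database search, the LLM-based cleaning step, and the final Google Scholar check.

```python
from collections import Counter


def audit_references(titles, primary, is_academic, fallback):
    """Classify each title by the first stage that resolves it.

    primary, is_academic and fallback are callables taking a title and
    returning a bool; all three are hypothetical stand-ins here.
    """
    outcome = Counter()
    for title in titles:
        if primary(title):
            outcome["matched_primary"] += 1
        elif not is_academic(title):
            outcome["excluded_non_academic"] += 1
        elif fallback(title):
            outcome["matched_fallback"] += 1
        else:
            outcome["unmatched"] += 1
    return outcome


# Toy stand-ins for the real lookups (assumptions for illustration only).
local_index = {"real paper a"}
fallback_index = {"real paper b"}
titles = ["real paper a", "real paper b", "daily news story", "made up paper"]

result = audit_references(
    titles,
    primary=lambda t: t in local_index,
    is_academic=lambda t: "news" not in t,
    fallback=lambda t: t in fallback_index,
)
print(dict(result))
```

Ordering the cheap local lookup first means the slower or rate-limited stages only see the small residue of references that earlier stages could not resolve.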

What they discovered (main results)

  • Fake citations surged after AI tools became popular:
    • The sharpest rise started in mid-2024.
    • By August 2025, estimated hallucination rates reached about 0.39% (arXiv), 0.21% (bioRxiv), 1.91% (SSRN), and 0.27% (PubMed Central).
    • That adds up to a conservative estimate of around 146,932 fake citations in 2025 across just these four sources.
  • The problem is widespread, not just a few bad papers:
    • Most papers with issues had only a small number of fake citations embedded among real ones.
    • The share of papers with at least one fake reference rose across the board.
  • Strong links to AI use:
    • Fields with more AI-assisted writing—especially social sciences and computer science—had higher fake-citation rates.
    • Papers with stronger “AI writing” language patterns also had more fake citations.
  • Who tends to cite fake references?
    • More often small teams and early-career authors (people with few or no prior publications).
    • These same authors have recently increased their publication rates, which means the problem can spread faster.
  • Who gets (misplaced) credit from fake citations?
    • Sometimes the fake references include entirely made-up author names.
    • When fake references do point to real people, they disproportionately credit already prominent, highly cited, and more often male-named authors.
    • Even the valid references in papers that contain fake ones lean toward citing more famous authors—suggesting the bias isn’t just in the obvious mistakes, but also in the “good-looking” citations AI suggests.
  • Safeguards aren’t catching most of it:
    • arXiv moderation rejects papers with fake citations at a higher rate than average, but most still slip through (only about 21% of fake references are filtered out).
    • In bioRxiv papers that later got published, about 85% of fake citations survived into the final journal version.
    • Across journals, even many respected ones show some level of the problem.
    • Fake references are now appearing as entries in tools like Google Scholar, which risks making the mistakes look real and easier to copy.

Why this matters

  • Reliability: Science builds on previous work. If fake references spread, future research decisions can be based on nothing.
  • Fairness: Fake citations give extra attention to already famous (often male) scholars, making recognition in science less balanced.
  • Self-reinforcing loop: If AI tools are trained on literature that contains fake references, tomorrow’s AI could learn and repeat (or amplify) today’s mistakes.

What this means going forward (implications)

  • Science needs better guardrails:
    • Automated reference checkers and stronger editorial checks can help for citations (because they’re easy to look up).
    • Authors, especially newer researchers, should double-check AI-suggested references before submitting work.
  • But harder problems remain:
    • Checking if a real citation actually supports the claim it’s attached to is much more difficult than checking if a title exists.
    • Outside of science (like government reports or legal documents), there aren’t big databases to verify facts, so the risk may be even higher.
  • Bottom line: AI can boost productivity, especially for newer researchers, but without careful verification, it can also quietly spread errors. The paper argues for a system-level response—better tools, stronger norms, and shared responsibility—to keep the knowledge we rely on accurate and fair.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper’s analysis:

  • Causal attribution of excess unmatched citations to LLMs: The study infers LLM-induced hallucinations from post-2022 “excess” unmatched rates without a causal design. Future work should use causal identification (e.g., instrumenting LLM access, staggered policy changes, institutional bans, or model/service outages) to isolate LLM effects from concurrent changes in indexing or citation practices.
  • Cross-corpus parsing heterogeneity: Differences in input formats and parsers (LaTeX vs PDF/GROBID vs XML vs Crossref-only for SSRN) may differentially inflate unmatched rates. A harmonized, uniform re-parsing benchmark across corpora is needed to quantify and correct measurement artifacts.
  • Dependence on Google Scholar as a validator: Scholar is itself contaminated by “citation-only” ghost entries. The extent to which this feedback loop biases validation (false validations) is unknown. Independent ground truths (publisher APIs/DOI registries, curated bibliographies, librarian-verified catalogs) and stratified manual audits should be used to estimate and correct this bias.
  • Limited coverage in fields/cultures with atypical citation norms: The pipeline under-covers domains that omit titles, cite non-journal objects (datasets, software, code repos, policy docs), or use non-Latin scripts. Field- and language-extended validators and multilingual indexing are needed to assess prevalence outside title-centric, English-dominant venues.
  • Niche venues and heavily mathematical formats: Acknowledged false negatives in niche or math-heavy documents (e.g., formula-heavy titles) remain unquantified. Constructing field-specific gold standards and parsers (math-aware title extraction) would clarify under-detection.
  • Reliance on an LLM-based cleaning stage: Using GPT-4o-mini to exclude non-academic references may introduce model-specific biases. The false positive/negative rates of this cleaning step across disciplines and languages require benchmarking against human-annotated datasets.
  • Title-existence focus misses broader error classes: The study measures only non-existent titles; it does not quantify fabricated-but-plausible metadata (real title with wrong authors/year/venue) or miscitations where real sources do not support the cited claim. Methods for semantic verification against full texts and prevalence estimates of these harder errors are needed.
  • Uncertainty quantification: Large aggregate figures (e.g., 146,932 hallucinated citations) lack explicit confidence intervals per corpus and month. Future work should report uncertainty bands incorporating parsing error, validator noise, and regression model variance.
  • Field and venue confounding: Higher SSRN rates may reflect metadata-only extraction rather than true prevalence. A cross-field, within-format comparison (e.g., only full-text parsed papers) is needed to decouple discipline effects from ingestion/format artifacts.
  • Validity of LLM-use inference: The linguistic-signature proxy for AI-assisted writing may be field-biased or confounded by stylistic norms. External validation with author surveys, disclosure audits, or editorial logs would strengthen the link between LLM use and hallucination rates.
  • Micro-mechanisms of author behavior: How authors integrate LLM-suggested references (copy-paste vs partial verification, use of citation managers) is unknown. User studies, editor surveys, and instrumentation (e.g., plugin logs from literature assistants) could identify failure points in the workflow.
  • Tool-level heterogeneity: The analysis does not distinguish which LLMs, retrieval-augmented systems, or agentic assistants (and settings) most contribute to hallucinated citations. Comparative audits across tools and configurations are needed for targeted mitigations.
  • Geographic and linguistic disparities: The paper does not examine variation by country, institution, language, or access regimes. Stratifying prevalence by these factors could reveal inequities and guide localized interventions.
  • Team-size effect mechanisms: The observed attenuation with larger teams lacks causal explanation (peer checking, division of labor, senior oversight). Studies leveraging exogenous variation in team composition or internal review protocols could identify protective mechanisms.
  • Persistence through peer review: The 85.3% persistence estimate is from 2,241 bioRxiv-to-PMC cases and may not generalize across fields/journals. Larger, multi-field longitudinal tracking and replication with publisher production logs are needed.
  • Journal policy effectiveness: No analysis links hallucination rates to specific AI or reference-check policies. Correlating journal-level policies and tooling with observed rates, or piloting randomized policy/tool adoptions, could quantify efficacy.
  • Credit-allocation biases—mechanisms and mitigation: The finding that hallucinations over-credit prominent and male authors is “suggestive.” Mechanism tests (LLM preference vs author selection vs field composition), culturally robust gender inference, and debiasing interventions (reranking, diversification constraints) require study.
  • Author name disambiguation and gender inference risks: Embedding-based attribution and name-gender inference can misclassify non-Western or ambiguous names. Auditing disambiguation accuracy and adopting identity-sensitive, consent-based approaches are needed.
  • Propagation dynamics and feedback loops: The paper hypothesizes recursive contamination (literature → LLM training → more hallucinations) but does not model it. Network models and simulation (with empirical parameters) should quantify long-term amplification and tipping points.
  • Downstream scientific impact: The consequences of hallucinated citations for meta-analyses, evidence syntheses, and grant/policy decisions are unmeasured. Case studies and citation-context analyses could assess real-world harms.
  • Intervention evaluation at scale: While validators exist, their real-world performance, adoption costs, and side effects (false rejections, author burden) are unknown. Controlled deployments at preprint servers and publishers should report precision/recall, throughput, and human-in-the-loop efficacy.
  • Scope beyond science: The paper infers broader generalizability to unstructured domains (law, policy, medicine) but provides no measurements. Domain-specific audits and adaptation of verification frameworks for non-citation claims are needed.
  • Dataset and code reproducibility: Releasing code, detection thresholds, and a stratified, human-verified benchmark set (including non-English and niche sources) would enable replication and method comparison.
  • Evolution of citation objects: Increasing citation of preprints, software, datasets, and web artifacts could raise unmatched rates independently of LLMs. Better taxonomies and validators for these objects are required to avoid conflating practice shifts with hallucinations.
  • Section-level localization: The distribution of hallucinations across paper sections (e.g., introductions vs methods) is unexplored. Section-aware analyses could target editorial checks where risk concentrates.
  • Model-training contamination measurement: The extent to which hallucinated entries enter major training corpora (e.g., Common Crawl, OA repositories) and degrade model performance remains unquantified. Audits of training data pipelines and controlled retraining experiments are needed.
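On the uncertainty-quantification point, one simple starting place is a binomial confidence interval on an unmatched rate. The sketch below uses the Wilson score interval, a standard choice for small proportions. It captures sampling noise only, not the parsing error, validator noise, or model variance that the list above also calls for, and the counts are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical month: 39 unmatched references out of 10,000 checked.
low, high = wilson_interval(39, 10_000)
print(f"unmatched rate 0.39%, interval [{low:.3%}, {high:.3%}]")
```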

Practical Applications

Immediate Applications

The following applications can be deployed with current tools and infrastructure, directly leveraging the paper’s verification pipeline, findings, and risk signals.

  • Publishing and preprint platforms (science): Integrate automated reference verification into submission pipelines (e.g., arXiv, bioRxiv, SSRN, PMC, journal workflows). Tools/workflows: Elasticsearch/OpenAlex/Semantic Scholar lookup, LLM-based cleaning, Google Scholar cross-check, hard/soft fails on unverifiable references, staff dashboards highlighting papers with excess unmatched citations. Dependencies: API/data access (OpenAlex, Semantic Scholar, Crossref, PubMed, Google Scholar), accurate parsing (GROBID/LaTeX/BibTeX), handling non-English sources, publisher IT integration.
  • Editorial management systems (software for publishers): Add “reference hygiene checks” to ScholarOne/Editorial Manager/Overleaf during submission and revision. Tools/products: ReferenceGuard plugin/API, automated author notifications with fix suggestions, final acceptance contingent on zero unverifiable references or documented exceptions. Dependencies: Vendor integration, performance at scale, policy alignment with journals.
  • Research groups and labs (academia): Adopt “CI for manuscripts” that validates references on every commit or pre-submission. Tools/workflows: GitHub Actions/Overleaf plugin/Zotero add-on that verifies titles, DOIs, and author names; lab checklists for junior authors; team-based cross-checks (larger teams correlate with fewer hallucinations). Dependencies: Local compute or cloud credits, adoption by PIs, configuration for niche outlets.
  • University research integrity offices (academia): Mandate reference verification attestations for theses, dissertations, and internal reports; offer campus-wide validator access and training. Tools/workflows: Central license for validators; short courses on LLM-safe literature practices; templates that enforce DOI/PMID fields. Dependencies: Policy approval, budget, training capacity.
  • Funding agencies and foundations (policy): Require a machine-verification log in grant applications and final reports; include zero-tolerance rules for fabricated citations. Tools/workflows: Submission portals with automatic checks, compliance audits on sampled proposals, remediation pathways. Dependencies: Systems integration, legal review, auditor capacity.
  • Libraries and bibliometric services (information services): Quarantine “citation-only” phantom entries and flag them in discovery systems; establish a report-and-remediate loop. Tools/products: Scholar/Crossref/OpenAlex pipeline to tag unverifiable titles, curator dashboards, feedback APIs for community reports. Dependencies: Vendor cooperation, false-positive management, multilingual coverage.
  • AI writing and literature tools (software): Ship “Verified Cite” modes that only propose references resolved in trusted indices with resolvable DOIs/PMIDs and accessible landing pages. Tools/products: RAG restricted to curated bibliographies, inline DOI validation, “confidence + evidence” badges next to each suggested reference. Dependencies: Indexed corpora coverage, latency budgets, licensing.
  • LLM usage monitoring for triage (publishers, institutions): Use linguistic signatures of AI-assisted writing to route manuscripts to enhanced verification and human review. Tools/workflows: Lightweight classifier on abstracts/intros; risk-based moderation queues. Dependencies: Model calibration, fairness audits, avoidance of overreach (proxy ≠ proof of AI use).
  • Newsrooms and think tanks (media/policy): Add citation verifiers to editorial checklists for reports, white papers, and investigative pieces. Tools/workflows: Browser extensions or CMS plug-ins that validate references before publication. Dependencies: Source coverage beyond academic journals, newsroom adoption.
  • Legal practice (law): Integrate case-law and citation resolvers into drafting tools to block non-existent authorities. Tools/workflows: Bluebook-checker that resolves citations against official reporters and legal databases before filing. Dependencies: Access to PACER/Westlaw/Lexis or open alternatives; jurisdiction coverage.
  • Clinical guideline and CME providers (healthcare): Require DOI/PMID or clinical guideline IDs for every reference; auto-verify during authoring. Tools/workflows: EBM authoring platforms with inline validation; clinical librarians as approvers for edge cases. Dependencies: Medical index access, workflow changes, liability considerations.
  • Corporate compliance and risk (finance, enterprise): Pre-approve external-facing research and policy docs with automated reference audits; implement a “hallucination budget” threshold that triggers escalation. Tools/workflows: DLP-like gates in document management systems; compliance dashboards tracking organization-wide citation hygiene. Dependencies: Integration with M365/Google Workspace, governance policy updates.
  • DEI and research equity teams (academia/policy): Audit LLM-driven citation suggestions for skew toward already-prominent and male-name authors; re-rank suggestions to diversify valid citations. Tools/workflows: Recommenders that balance relevance with diversity metrics; author name disambiguation plus demographic inference where lawful and ethical. Dependencies: Ethical review, bias-measurement validity, opt-in use.
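As one concrete shape a "CI for manuscripts" check might take, the sketch below scans a BibTeX file's text and lists entries that lack a syntactically plausible DOI. It is deliberately limited: the regular expressions are simplifying assumptions (they do not handle `@string`/`@comment` blocks or unusual field formatting), and a DOI that looks valid is not proof that the work exists. A real check would also resolve each identifier against an external index.

```python
import re

# Start of an entry such as "@article{smith2020,"; captures the citation key.
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# A doi field whose value looks like "10.<registrant>/<suffix>".
DOI_RE = re.compile(r"\bdoi\s*=\s*[{\"]\s*(10\.\d{4,9}/[^\s}\"]+)", re.IGNORECASE)


def entries_missing_doi(bibtex: str):
    """Return citation keys of entries with no syntactically valid DOI field."""
    starts = [(m.start(), m.group(2)) for m in ENTRY_RE.finditer(bibtex)]
    missing = []
    for i, (start, key) in enumerate(starts):
        # An entry's text runs until the next entry begins (or end of file).
        end = starts[i + 1][0] if i + 1 < len(starts) else len(bibtex)
        if not DOI_RE.search(bibtex[start:end]):
            missing.append(key)
    return missing


sample = """
@article{smith2020,
  title = {A Real Paper},
  doi = {10.1000/xyz123}
}
@inproceedings{doe2024,
  title = {No Identifier Here}
}
"""
print(entries_missing_doi(sample))
```

In a pipeline, a non-empty result could fail the build or simply post a warning, depending on how strictly a group wants to treat references that legitimately have no DOI.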

Long-Term Applications

These require further research, scaling, standardization, or technical advances beyond current citation-existence checks.

  • Content-consistency verification at scale (science, healthcare, law, policy): Move from “does the citation exist?” to “does the cited source actually support the claim?”. Tools/products: Agentic claim-to-evidence verifiers that retrieve, read, and align specific passages to claims; contradiction detectors; structured argument graphs. Dependencies: Reliable full-text access, robust source parsing across formats/languages, evaluation standards, compute.
  • Cross-industry standards for verifiable citations (policy, standards bodies): Establish minimum citation metadata (e.g., DOI/PMID/court reporter ID), machine-readable reference sections, and audit trails as publication requirements. Tools/workflows: JATS/Docx/PDF-A profiles with mandatory persistent IDs; standards from NISO/ISO; badges for “verified references.” Dependencies: Multi-stakeholder consensus, retrofitting legacy pipelines, global adoption.
  • Bibliometric infrastructure upgrades (information services): Persistent “quarantine” layers in indexes (Crossref, OpenAlex, Scholar-like systems) to suppress propagation of phantom entries; backfill cleansing at scale. Tools/products: Anti-contamination pipelines, provenance tracking for references, versioned corrections. Dependencies: Collaboration across providers, incentives to clean historical data, user notification channels.
  • Model training data governance (AI developers): Filter training corpora to downweight or exclude unverifiable citations and contaminated documents; track “data lineage” to prevent model collapse via recursive errors. Tools/workflows: Data contracts, contamination scores, per-example provenance; differential weighting in pretraining; continual data audits. Dependencies: Access to source metadata, scalable filters, alignment with privacy and TOS.
  • Verified-citation generation as a default capability (software/LLMs): Constrain model outputs to cite only items retrieved and verified from authoritative indices, with links that resolve and pass liveness checks. Tools/products: Toolformer-style API calling with verified indices, real-time DOI resolution checks, refusal policies when evidence is absent. Dependencies: Tool reliability, latency budgets, coverage gaps and fallback behavior.
  • Journal and funder transparency metrics (academia/policy): Public dashboards reporting “hallucination rates” by venue, field, and time; include remediation KPIs (detection-to-correction times). Tools/workflows: Open metrics pipelines; incentives via funder mandates; badges for exemplary performance. Dependencies: Data sharing agreements, field normalization, risk of metric gaming.
  • Career-stage support to reduce error propagation (academia): Targeted tooling and mentoring for early-career and small teams (highest risk in the study) to reduce inadvertent hallucinations. Tools/workflows: Co-author verification bot, departmental peer-review pools, microgrants for editorial assistance. Dependencies: Funding, mentor bandwidth, adoption incentives.
  • Claim provenance in government and corporate documents (policy, enterprise): Require machine-verifiable claim provenance in high-stakes documents (e.g., policy memos, ESG reports). Tools/workflows: Prose-to-source annotations, immutable logs, and verifiable audit trails shared with regulators or the public. Dependencies: Standardized schemas, regulator buy-in, confidentiality handling.
  • Education and assessment redesign (education): Embed verifiable sourcing into curricula and grading rubrics; teach LLM-aware literature practices and citation audits. Tools/workflows: LMS integrations that validate references on submission; exercises on de-hallucinating model outputs. Dependencies: Faculty training, fair use of APIs, student privacy.
  • Insurance and professional liability products (legal/healthcare/enterprise): Risk scoring and coverage tied to documented verification practices (lower premiums for rigorous citation hygiene). Tools/workflows: Verification attestations as part of underwriting; periodic audits. Dependencies: Actuarial evidence linking hygiene to loss reduction, standards for audits.
  • Field-specific knowledge indices for non-academic domains (journalism, policy, finance): Build structured, queryable repositories of vetted sources (e.g., newsroom source registries, policy doc repositories) to enable existence checks beyond academia. Tools/products: Domain RAG indices, curator workflows, shared APIs. Dependencies: Curation costs, licensing, governance.
  • Authorship and credit fairness tools (academia): Counteract LLM-induced skew by re-ranking suggested valid citations for author/team diversity and by surfacing high-relevance work from less-prominent scholars. Tools/products: Recommenders with fairness constraints; author disambiguation with field embeddings (e.g., SPECTER2). Dependencies: Consensus on fairness objectives, risk of introducing new biases.
  • End-to-end “trust layers” for document ecosystems (software): Add cryptographic provenance (e.g., C2PA-like) and machine-verification stamps to references and claims throughout the authoring-to-publication lifecycle. Tools/products: Signed artifacts for reference lists, verifiable build logs for manuscripts, tamper-evident corrections. Dependencies: Standards alignment, key management, widespread tool support.
  • Regulatory frameworks for AI-assisted authorship (policy): Define required disclosures, verification thresholds, and penalties for fabricated citations in regulated filings (e.g., clinical submissions, securities research). Tools/workflows: Compliance checkers integrated into eCTD/EDGAR pipelines; third-party audits. Dependencies: Lawmaking, sector nuance, enforcement capacity.
  • Cross-lingual and low-resource verification (global research): Extend verification to non-English and niche venues to reduce bias and false positives in global scholarship. Tools/workflows: Multilingual OCR/NER, local indices, collaboration with regional repositories. Dependencies: Data partnerships, funding, evaluation datasets.
  • Domain-adapted moderation triage models (publishers/platforms): Jointly predict LLM-use signatures and hallucination risk to prioritize human moderation where it matters most. Tools/workflows: Multi-task classifiers trained on verified labels; active learning with moderator feedback. Dependencies: Labeled data, privacy-preserving training, continuous evaluation.
  • Longitudinal contamination monitoring (ecosystem health): Track the growth and decay of phantom citations across time, venues, and models; early-warning signals for recursive training collapse. Tools/workflows: Periodic corpus scans, contamination indices, shared reports across model labs and publishers. Dependencies: Data sharing, standardized measurement, governance agreements.
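
The verified-citation idea in the list above (cite only items found in an authoritative index, refuse otherwise) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `VERIFIED_INDEX` is an in-memory stand-in for a real bibliographic index, the DOIs are made up, and `gate_citations` is a hypothetical helper, not a method described in the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reference:
    doi: str
    title: str


# Hypothetical stand-in for an authoritative index keyed by DOI.
# A real system would query a bibliographic service instead.
VERIFIED_INDEX = {
    "10.1000/example.1": Reference("10.1000/example.1", "A verified example paper"),
}


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip a leading resolver or 'doi:' prefix, if any."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def verify(candidate_doi: str) -> Optional[Reference]:
    """Return the indexed reference for a DOI, or None if it is not in the index."""
    return VERIFIED_INDEX.get(normalize_doi(candidate_doi))


def gate_citations(candidate_dois):
    """Split candidate DOIs into verified references and refused DOIs."""
    accepted, refused = [], []
    for doi in candidate_dois:
        ref = verify(doi)
        if ref is None:
            refused.append(doi)  # refusal policy: never emit unverified items
        else:
            accepted.append(ref)
    return accepted, refused
```

The design choice in this sketch is that the gate fails closed: a candidate that cannot be found is dropped rather than passed through, which trades coverage (items missing from the index are refused) for the guarantee that every emitted citation exists in the index.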

Glossary

  • Agentic research assistants: Autonomous AI tools that plan and execute research tasks (e.g., finding and inserting citations) with minimal human prompting. "include AI search and agentic research assistants that automate the generation of citations from live web content."
  • Agentic verification tools: Automated systems that proactively check, trace, and validate claims or references in scholarly content. "A growing ecosystem of agentic verification tools - automated reference validators [39], observability platforms, and evaluation frameworks - offers cautious optimism..."
  • arXiv: A large open-access preprint repository for physics, mathematics, computer science, and related fields. "arXiv, covering 1,465,145 preprints across mathematical, physical, and computational sciences (Jan 2020-Aug 2025)."
  • Author-ordering conventions: Field-specific norms governing the order of authors on a paper, often signaling contribution and seniority. "They also deviate from established author-ordering conventions."
  • Automated reference validators: Software that programmatically checks whether citations exist and are correctly formatted. "automated reference validators [39]"
  • Bibliographic databases: Structured collections indexing scholarly works and their metadata. "large-scale bibliographic databases"
  • Bibliometric databases: Databases focused on quantitative analysis of publications and citations for measuring scholarly impact. "The accumulation of hallucinated citations within bibliometric databases may begin to erode the mechanisms we have to detect them."
  • Bibliometric datasets: Large-scale datasets used for quantitative study of publications, authors, and citations. "we collect and examine four bibliometric datasets (SI S1, S3.4)"
  • bioRxiv: A preprint server for the biological and life sciences. "bioRxiv (261,928 preprints), which spans a broad range of biological and life science fields."
  • Citation-only entries: Bibliographic records that appear as references in databases despite not corresponding to real publications. "Growth of citation-only entries."
  • Citation propagation: The process by which references spread and are reused across subsequent papers. "citation propagation is path-dependent:"
  • Crossref: A DOI registration agency and infrastructure provider offering metadata and citation links for scholarly content. "we retrieve metadata totaling 26,815,043 citations via Crossref"
  • DOI (Digital Object Identifier): A persistent alphanumeric identifier for uniquely locating digital scholarly objects. "SSRN DOI"
  • Elasticsearch: A distributed search engine used to index and retrieve documents at scale. "a locally hosted Elasticsearch index"
  • Embeddings: Vector representations of texts used to measure similarity and support tasks like author disambiguation. "we construct text-based, paper-level embeddings [28]"
  • Evaluation frameworks: Structured protocols and metrics to assess the performance and reliability of AI systems or tools. "observability platforms, and evaluation frameworks"
  • First-last author hierarchy: A convention where first author indicates primary contributor and last author often denotes senior leadership. "this first-last author hierarchy is 12 percentage points weaker."
  • Foundation models: Large-scale pretrained models used as general-purpose bases for downstream tasks. "for the technical community that develops foundation models"
  • GROBID: An open-source tool that extracts and structures bibliographic information from PDFs. "apply GROBID to recover citations"
  • Journal impact percentile: A ranking of journals by impact metrics relative to peers, expressed as percentiles. "by journal impact percentile."
  • LaTeX format: Source files written for LaTeX, a typesetting system commonly used for scientific documents and bibliographies. "LaTeX format"
  • Observability platforms: Systems that monitor, log, and trace AI/data pipelines to improve reliability and transparency. "observability platforms"
  • OpenAlex: An open bibliographic database of scholarly works, authors, venues, and citations. "Semantic Scholar and OpenAlex."
  • Path-dependent citation networks: Citation structures where current referencing patterns are influenced by earlier citations, reinforcing existing paths. "path-dependent citation networks and model training loops."
  • Pearson correlation: A statistic measuring linear correlation between two variables. "Pearson correlation r=0.441, P<0.001"
  • Preprint archives: Repositories hosting manuscripts prior to peer review. "on the preprint archives, editorial and peer review might still catch them"
  • Preprint moderation: Screening processes (often a mix of automated and human review) that vet submissions before posting on preprint servers. "Preprint moderation and journal publication processes capture only a fraction of these errors"
  • PubMed Central (PMC): A free full-text archive of biomedical and life sciences journal literature. "PubMed Central (PMC), a leading corpus of peer-reviewed, full-text journal publications."
  • Regression framework: A statistical modeling approach for estimating relationships and trends over time or across variables. "using a regression framework (S2.2)."
  • Retrieval-augmented generation systems: LLM-based systems that integrate external document retrieval to ground generated outputs in source material. "retrieval-augmented generation systems"
  • Semantic Scholar: An AI-powered scholarly search engine and metadata platform. "Semantic Scholar and OpenAlex."
  • SSRN (Social Science Research Network): A preprint repository for the social sciences, law, and humanities. "Social Science Research Network (SSRN, 421,698 preprints)"
  • String similarity-based criteria: Algorithmic methods that compare text strings to determine matches (e.g., titles). "using string similarity-based criteria."
  • Team science: The trend toward collaborative, multi-author research as the dominant mode of knowledge production. "the increasing dominance of team science [31]"
  • Unmatched citations: References that cannot be verified against authoritative databases and may indicate hallucinations. "we track the rate and characteristics of unmatched citations over time"
  • XML-formatted citation records: Citations encoded using the XML markup standard for structured data exchange. "XML-formatted citation records"
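
To make the glossary entries on string similarity-based criteria and unmatched citations concrete, the following is a minimal sketch of title matching against an index. It is an illustration under stated assumptions, not the paper's pipeline: the similarity measure (`difflib.SequenceMatcher`), the 0.9 threshold, the helper names, and the toy titles are all chosen here for the example.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase and replace runs of non-alphanumeric ASCII characters with a space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def best_match(title, index_titles):
    """Return (best_score, best_title) for a cited title against index titles."""
    query = normalize_title(title)
    best = (0.0, None)
    for candidate in index_titles:
        score = SequenceMatcher(None, query, normalize_title(candidate)).ratio()
        if score > best[0]:
            best = (score, candidate)
    return best


def unmatched_rate(cited_titles, index_titles, threshold=0.9):
    """Fraction of cited titles with no index title at or above the threshold.

    The threshold value is an assumption for this sketch.
    """
    if not cited_titles:
        return 0.0
    unmatched = sum(
        1 for t in cited_titles if best_match(t, index_titles)[0] < threshold
    )
    return unmatched / len(cited_titles)


# Toy data, invented for illustration only.
index = ["A Survey of Citation Analysis Methods", "Measuring Preprint Growth Over Time"]
cited = ["A survey of citation analysis methods.", "Quantum Llamas and the Theory of Everything"]
rate = unmatched_rate(cited, index)
```

In this toy run the first cited title matches its index entry after normalization and the second has no close counterpart, so half of the cited titles are unmatched. The linear scan is only suitable for small lists; at the scale described in the paper, a search index would be needed to retrieve candidates before scoring.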
