Medical Fact-Checking Datasets
- Medical fact-checking datasets are structured resources that annotate granular claims with supporting evidence to detect misinformation in health-related contexts.
- They are constructed via expert annotation, automatic claim generation, and LLM-based synthetic data augmentation, with the aim of improving claim verifiability and evidence retrieval.
- Evaluation metrics such as F1 score, precision/recall, and Cohen’s κ are used to assess tasks like claim-evidence alignment and explanation generation.
Medical fact-checking datasets are structured resources curated to facilitate the detection, verification, and analysis of factual accuracy in health-related information. These datasets encompass domain-specific veracity judgments, supporting evidence, granular claim annotations (often at the level of atomic facts), and typically target the detection of misinformation in news, social media, clinical text, public health messaging, and LLM-generated medical content. They form the foundation for data-driven research in automated medical misinformation detection, robust fact verification systems, explainable AI, and evaluation benchmarks for both retrieval and generation tasks in medical NLP.
1. Dataset Types, Scope, and Construction
Medical fact-checking datasets exhibit significant diversity in terms of source material, granularity, and annotation paradigms:
| Dataset | Domain Focus | Claim Types | Evidence | Annotation Granularity |
|---|---|---|---|---|
| FakeCovid (Shahi et al., 2020) | Multilingual COVID-19 news | News articles | Fact-check URLs | Multilabel, 11 categories |
| PUBHEALTH (Kotonya et al., 2020) | Public health claims | General, policy, biomed | Fact-check explanations | Four-class, gold explanations |
| COVID-Fact (Saakyan et al., 2021) | COVID-19, general/science | Reddit/news, auto-generated | Peer-review, lay evidence | Sentence-level, FEVER-style |
| CoVERT (Mohr et al., 2022) | Biomed, COVID-19 tweets | Social media | Web evidence | Entities, relations, verdicts |
| Monant (Srba et al., 2022) | Medical news/blogs | Cross-source, media | News | Claim-article mappings |
| BEAR-Fact (Wührl et al., 2 Feb 2024) | Scientific, social media | Biomedical entity-relation-object | PubMed | Structured triplets, verifiability |
| HealthFC (Vladika et al., 2023, Barone et al., 17 Sep 2025) | Evidence-based medicine | Clinical/consumer | Systematic reviews | Evidence spans, graded scores |
| FActBench (Afzal et al., 2 Sep 2025) | Biomedical LLM evaluation | Generated summaries, answers | Grounding docs, Wikipedia | Decomposed atomic facts |
| MedFact (Chinese) (He et al., 15 Sep 2025, Chen et al., 22 Sep 2025) | Chinese medical texts, LLMs | Human/LLM-generated | Web, medical sources | Error types, error localization |
Approaches to dataset creation include (i) direct collection and expert annotation (e.g., PUBHEALTH, Check-COVID, HealthFC), (ii) automatic claim and counter-claim generation (e.g., COVID-Fact), (iii) mapping between claims and full articles for stance detection (e.g., Monant), and (iv) decomposition of model-generated responses into minimal units of verifiable content, or "atomic facts" (Vladika et al., 30 May 2025, Afzal et al., 2 Sep 2025).
Synthetic data augmentation via LLMs has emerged as a method to alleviate data scarcity: a prominent technique involves LLM-generated summaries, atomic fact decomposition, entailment table creation, and proportional pairing to generate supplementary text–claim pairs with binary veracity labels (Zhang et al., 28 Aug 2025).
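To make this pipeline concrete, the following is a minimal sketch of such an augmentation loop, assuming a generic `generate(prompt)` callable as a stand-in for whatever LLM is used; the prompts, the helper name, and the simplified per-fact entailment step are illustrative assumptions, not the exact procedure of the cited work.

```python
# Minimal sketch of LLM-based synthetic augmentation: summarize a source text,
# decompose the summary into atomic facts, and pair each fact with the text
# under a binary veracity label. `generate` is a placeholder for an LLM call.
from typing import Callable, List, Tuple

def augment(text: str, generate: Callable[[str], str]) -> List[Tuple[str, str, int]]:
    summary = generate(f"Summarize the medical content of this text in 2-3 sentences:\n{text}")
    facts_raw = generate(
        "List every independently verifiable atomic fact in this summary, one per line:\n"
        + summary
    )
    atomic_facts = [line.strip("- ").strip() for line in facts_raw.splitlines() if line.strip()]

    pairs: List[Tuple[str, str, int]] = []
    for fact in atomic_facts:
        answer = generate(
            f"Does the text entail the claim? Answer yes or no.\nText: {text}\nClaim: {fact}"
        )
        label = 1 if answer.strip().lower().startswith("yes") else 0  # binary veracity label
        pairs.append((text, fact, label))
    return pairs
```

The cited approach additionally balances the proportion of supported and unsupported pairs ("proportional pairing"); the sketch omits that step for brevity.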
2. Annotation Frameworks and Fact-Checking Tasks
Annotation protocols are dataset-specific and may include the following (a composite record sketch follows the list):
- Multiclass veracity classification (e.g., true, false, mixture, unproven/NEI)
- Fine-grained error type annotation (conceptual, terminological, temporal, citation) (He et al., 15 Sep 2025)
- Atomic fact extraction, claim–evidence alignment, and entailment labeling (Vladika et al., 30 May 2025, Afzal et al., 2 Sep 2025)
- Fact-checking verdicts supplemented by rationale or supporting explanations (PUBHEALTH, HealthFC)
- Claim–stance and claim–presence mappings for cross-source analysis (Monant)
- Structured subject–predicate–object triplets and verifiability flags (BEAR-Fact)
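Taken together, these protocols suggest a composite record layout; the dataclass below is an illustrative sketch whose field names are assumptions and do not reproduce the schema of any single dataset.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class FactCheckRecord:
    """Composite annotation record; fields mirror the protocols listed above."""
    claim: str
    verdict: str                                                       # e.g. "true" / "false" / "mixture" / "NEI"
    evidence_spans: List[str] = field(default_factory=list)           # supporting sentences
    explanation: Optional[str] = None                                  # gold rationale, if annotated
    spo_triplet: Optional[Tuple[str, str, str]] = None                 # subject-predicate-object
    verifiable: Optional[bool] = None                                  # verifiability flag
    error_types: List[str] = field(default_factory=list)              # e.g. "terminological"
    error_spans: List[Tuple[int, int]] = field(default_factory=list)  # character offsets
```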
Core tasks defined using these datasets include:
- Fact-checking classification (e.g., SUPPORT, REFUTE, NOTENOUGHINFO)
- Evidence retrieval (sentence/document selection maximizing relevance)
- Stance detection and claim presence analysis for document–claim pairs
- Error localization (identifying erroneous spans in text)
- Explanation generation (deriving human-understandable rationales for verdicts)
- Factuality assessment of LLM-generated content, especially in multi-stage QA (Vladika et al., 30 May 2025, Afzal et al., 2 Sep 2025)
Formal task formulations often define fact-checking as a mapping from claim–evidence pairs to a label set, for instance
$$f : \mathcal{C} \times \mathcal{E} \rightarrow \mathcal{Y},$$
where $\mathcal{C}$ is the set of claims, $\mathcal{E}$ is the set of evidence, and $\mathcal{Y}$ is the label set (e.g., {supported, partially supported, refuted, uncertain, not applicable}) (Chen et al., 22 Sep 2025).
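As a deliberately simplified instantiation of such a mapping, the sketch below labels a claim–evidence pair with an off-the-shelf NLI model; the checkpoint and the three-way mapping onto SUPPORT/REFUTE/NOTENOUGHINFO are illustrative choices, not the protocol of any cited dataset.

```python
# Sketch: realize f : C x E -> Y with a general-purpose NLI model.
# The checkpoint and label mapping are illustrative, not taken from the cited papers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "microsoft/deberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

NLI_TO_VERDICT = {"ENTAILMENT": "SUPPORT", "CONTRADICTION": "REFUTE", "NEUTRAL": "NOTENOUGHINFO"}

def verdict(claim: str, evidence: str) -> str:
    # Premise = evidence sentence(s), hypothesis = claim.
    inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    nli_label = model.config.id2label[int(logits.argmax(dim=-1))]
    return NLI_TO_VERDICT[nli_label.upper()]

print(verdict("Vitamin C cures the common cold.",
              "Randomized trials found no effect of vitamin C on cold incidence in the general population."))
```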
3. Methodological Innovations and Evaluation Models
Medical fact-checking datasets have driven advances in retrieval-augmented generation, natural language inference (NLI), and chain-of-thought (CoT) prompting for evaluation. Notable methodologies include:
- Sentence-BERT (S-BERT) encoding of queries and evidence with cosine-similarity scoring for evidence matching (Kotonya et al., 2020, Saakyan et al., 2021, Vladika et al., 2023); a retrieval sketch follows this list
- Multi-stage filtering using lexicons, entity normalization, and query refinement to increase claim verifiability (BEAR-Fact)
- Semantic health knowledge graph construction and graph-based retrieval-augmented generation (GraphRAG) in TrumorGPT (Hang et al., 11 May 2025)
- Cross-modal contrastive regression for fact-checking of vision-LLMs (chest X-ray reports) (Mahmood et al., 3 Dec 2024)
- Systematic use of authoritative knowledge bases (oncological guidelines, systematic reviews) as gold standards for fact validation (Vladika et al., 2023, Vladika et al., 30 May 2025)
- Fine-tuning state-of-the-art architectures (BERT, BioBERT, SCIBERT, RoBERTa, DeBERTa, Meditron3) for both binary and multiclass fact-checking tasks
- Unanimous voting ensembles that admit correctness in atomic facts only if both CoT and NLI agree (Afzal et al., 2 Sep 2025)
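A minimal sketch of the S-BERT evidence-scoring step referenced above, assuming the sentence-transformers library and an illustrative general-purpose checkpoint rather than the exact models used in the cited work:

```python
# Sketch: rank candidate evidence sentences for a claim by cosine similarity
# of Sentence-BERT embeddings. The checkpoint is an illustrative choice.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def rank_evidence(claim, candidates, k=5):
    claim_emb = encoder.encode(claim, convert_to_tensor=True)
    cand_embs = encoder.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(claim_emb, cand_embs)[0]            # one score per candidate
    top = scores.topk(min(k, len(candidates)))
    return [(candidates[int(i)], float(s)) for s, i in zip(top.values, top.indices)]

ranked = rank_evidence(
    "Masks reduce transmission of respiratory viruses.",
    ["A meta-analysis reports lower transmission with mask use.",
     "The paper describes hospital staffing levels.",
     "Mask mandates were introduced in several countries in 2020."],
    k=2,
)
```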
Evaluation metrics span macro F1, accuracy, precision/recall, balanced accuracy, Cohen’s κ, Gwet’s AC1, as well as end-to-end measures that require both correct evidence retrieval and veracity labeling (e.g., COVID-FEVER score (Saakyan et al., 2021)).
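Most of these classification metrics are available in scikit-learn; the toy sketch below illustrates their computation on verdict labels (Gwet's AC1 and end-to-end scores such as COVID-FEVER require task-specific code and are omitted).

```python
# Sketch: common verdict-level metrics with scikit-learn on toy predictions.
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             f1_score, precision_score, recall_score)

gold = ["SUPPORT", "REFUTE", "NEI", "SUPPORT", "NEI", "REFUTE"]
pred = ["SUPPORT", "NEI",    "NEI", "SUPPORT", "REFUTE", "REFUTE"]

print("macro F1:         ", f1_score(gold, pred, average="macro"))
print("macro precision:  ", precision_score(gold, pred, average="macro"))
print("macro recall:     ", recall_score(gold, pred, average="macro"))
print("balanced accuracy:", balanced_accuracy_score(gold, pred))
print("Cohen's kappa:    ", cohen_kappa_score(gold, pred))
```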
4. Domain-Specific Challenges and Observed Limitations
Medical fact-checking presents unique obstacles:
- Multilinguality and cross-domain composition (FakeCovid spans 40 languages and 105 countries; Shahi et al., 2020)
- Domain expertise required for claim understanding, fine-grained entity/relation annotation, and evidence interpretation (e.g., PUBHEALTH, HealthFC)
- Verifiability issues: negated or underspecified claims are notably difficult to support or refute (BEAR-Fact reports F1 = 0.27 on the unverifiable class; Wührl et al., 2 Feb 2024)
- Dataset imbalance: certain verdicts (e.g., NEI, rare errors) are underrepresented, complicating supervised training
- Model and data mismatches: fact-checking systems trained on scientific, short, atomic claims struggle with long-form, context-rich, or ambiguous text typical of clinical notes and social media (Wührl et al., 2022)
- Over-criticism: LLMs using multi-agent or extended-reasoning prompting strategies tend to over-flag correct information as erroneous (recall > 0.95 but reduced precision, with only marginal F1 improvement) (He et al., 15 Sep 2025)
- Temporal drift and knowledge evolution: annotated evidence validity degrades as medical knowledge progresses (CoVERT, chest X-ray fact-checking, LLM-generated medical content)
5. Impact, Benchmark Results, and Applications
Benchmark results across datasets establish important baselines and demonstrate the value of in-domain training. For example:
- FakeCovid's BERT classifier yields an F1 of 0.76 (false class: 0.65, others: 0.80) for COVID-19 fake news (Shahi et al., 2020)
- PUBHEALTH finds in-domain encoder models (SCIBERT, BIOBERT v1.1) outperform generic BERT for public health veracity prediction (Kotonya et al., 2020)
- COVID-Fact’s automated construction and FEVER-style tasks facilitate high-throughput evaluation of information verification in rapidly evolving health crises (Saakyan et al., 2021)
- FActBench shows that atomic fact decomposition with Unanimous Voting (CoT+NLI) correlates best with domain expert ratings (Cohen's κ = 0.75) (Afzal et al., 2 Sep 2025)
- LLMs remain challenged on factual medical knowledge retention and are poorly calibrated with respect to rare conditions (MKJ dataset; Li et al., 20 Feb 2025)
Practical applications include early-stage false claim screening during public health emergencies, fact-checking of LLM-generated clinical summaries against electronic health records (Chung et al., 28 Jan 2025), support for misinformation detection in social media streams, error detection and correction in automated radiology reporting (Mahmood et al., 3 Dec 2024), knowledge graph construction, and quality assurance for medical dialogue systems and QA pipelines.
6. Future Directions and Open Research Questions
Persistent challenges and future research directions include:
- Improving the detection and handling of unverifiable and negated claims, which remain bottlenecks for both dataset annotation and model generalization (Wührl et al., 2 Feb 2024)
- Automatic entity/relation extraction for real-world, noisy text and clinical narratives (Wührl et al., 2022)
- Reducing over-criticism and improving the calibration of LLM-based fact-checkers under advanced multi-agent or inference-time scaling strategies (He et al., 15 Sep 2025)
- Diversifying evidentiary sources (e.g., PubMed, Wikipedia, guideline databases, real-time news, knowledge graphs) to enhance retrieval-reasoning architectures (Hang et al., 11 May 2025, Barone et al., 17 Sep 2025)
- Refinement of atomic fact extraction and multi-hop evidence integration in LLM-generated responses for comprehensive fact-level explainability (Vladika et al., 30 May 2025)
- Scaling fact-checking datasets to more languages and clinical specialties (notably the robust coverage in Chinese: MedFact (He et al., 15 Sep 2025, Chen et al., 22 Sep 2025))
- Integrating synthetic data augmentation methods to remedy annotation scarcity (Zhang et al., 28 Aug 2025) and exploring approaches for dynamic updating as medical knowledge evolves
A plausible implication is that the next phase of medical fact-checking research will rely on continued co-evolution of dataset construction techniques, advanced retrieval-augmented and reasoning-enabled LLM pipelines, and the development of increasingly granular, interpretable, and efficiently updatable evaluation resources.