HealthFC: Verifying Health Claims with Evidence-Based Medical Fact-Checking (2309.08503v2)
Abstract: In the digital age, seeking health advice on the Internet has become common practice. At the same time, determining the trustworthiness of online medical content is increasingly challenging. Fact-checking has emerged as an approach to assess the veracity of factual claims using evidence from credible knowledge sources. To help advance automated NLP solutions for this task, we introduce HealthFC, a novel dataset of 750 health-related claims in German and English, labeled for veracity by medical experts and backed with evidence from systematic reviews and clinical trials. We provide an analysis of the dataset, highlighting its characteristics and challenges. The dataset can be used for NLP tasks related to automated fact-checking, such as evidence retrieval, claim verification, or explanation generation. As a reference point, we provide baseline systems based on different approaches, examine their performance, and discuss the findings. We show that the dataset is a challenging test bed with high potential for future use.
Authors: Juraj Vladika, Phillip Schneider, Florian Matthes