Tell Me Why: Explainable Public Health Fact-Checking with Large Language Models (2405.09454v1)
Abstract: This paper presents a comprehensive analysis of explainable fact-checking through a series of experiments, focusing on the ability of LLMs to verify public health claims and provide explanations or justifications for their veracity assessments. We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models, assessing their performance on both isolated and joint tasks of veracity prediction and explanation generation. Importantly, we employ a dual evaluation approach comprising previously established automatic metrics and a novel set of criteria assessed through human evaluation. Our automatic evaluation indicates that, in the zero-shot scenario, GPT-4 is the standout performer, but in few-shot and parameter-efficient fine-tuning contexts, open-source models can not only bridge the performance gap but, in some instances, surpass GPT-4. Human evaluation reveals further nuance and points to potential problems with the gold explanations.
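To make the zero-shot setup concrete, the sketch below shows how joint veracity prediction and explanation generation might be posed to GPT-4 via the OpenAI chat API. The prompt wording, the PUBHEALTH-style label set, and the example claim are illustrative assumptions, not the paper's exact prompts.

```python
# A minimal sketch of zero-shot joint veracity prediction and explanation
# generation, assuming the OpenAI chat completions API (openai >= 1.0).
# Prompt wording, labels, and the example claim are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["true", "false", "mixture", "unproven"]  # PUBHEALTH-style labels

def check_claim(claim: str, evidence: str) -> str:
    prompt = (
        "You are a public health fact-checker.\n"
        f"Claim: {claim}\n"
        f"Evidence: {evidence}\n"
        f"First classify the claim as one of {LABELS}, "
        "then write a short explanation justifying your verdict."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output eases evaluation
    )
    return response.choices[0].message.content

print(check_claim(
    "Ginger cures cancer.",
    "Clinical reviews find no evidence that ginger cures any cancer.",
))
```

In the paper's few-shot variant, labeled claim/verdict/explanation exemplars would be prepended to the same prompt, and the parameter-efficient fine-tuning experiments instead adapt open-source models on such input/output pairs.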
Authors:
- Majid Zarharan
- Pascal Wullschleger
- Babak Behkam Kia
- Mohammad Taher Pilehvar
- Jennifer Foster