Tell Me Why: Explainable Public Health Fact-Checking with Large Language Models (2405.09454v1)

Published 15 May 2024 in cs.CL

Abstract: This paper presents a comprehensive analysis of explainable fact-checking through a series of experiments, focusing on the ability of LLMs to verify public health claims and provide explanations or justifications for their veracity assessments. We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models, assessing their performance on both the isolated and joint tasks of veracity prediction and explanation generation. Importantly, we employ a dual evaluation approach comprising previously established automatic metrics and a novel set of human-evaluation criteria. Our automatic evaluation indicates that, in the zero-shot scenario, GPT-4 emerges as the standout performer, but in few-shot and parameter-efficient fine-tuning contexts, open-source models demonstrate their capacity to not only bridge the performance gap but, in some instances, surpass GPT-4. Human evaluation reveals further nuance and indicates potential problems with the gold explanations.
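The joint task in the abstract, producing a veracity verdict together with a justification in a single pass, can be pictured with a minimal zero-shot prompting sketch. This is not the authors' pipeline: the model choice, the PUBHEALTH-style label set, and the prompt wording below are all illustrative assumptions.

```python
# Minimal zero-shot sketch of the joint task: veracity prediction plus
# explanation generation in one prompt. Model, labels, and prompt wording
# are assumptions for illustration, not the paper's exact setup.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # any instruction-tuned LLM
)

# Assumed PUBHEALTH-style veracity labels for public health claims
LABELS = ["true", "false", "mixture", "unproven"]

def check_claim(claim: str, evidence: str) -> str:
    """Ask the model for a verdict and a short justification in one pass."""
    prompt = (
        "You are a public health fact-checker.\n"
        f"Claim: {claim}\n"
        f"Evidence: {evidence}\n"
        f"Classify the claim as one of {LABELS}, then explain in a short "
        "paragraph why the evidence supports that verdict.\n"
        "Verdict:"
    )
    out = generator(prompt, max_new_tokens=256, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep the latter.
    return out[0]["generated_text"][len(prompt):].strip()

print(check_claim(
    "Ginger cures cancer.",
    "Clinical studies examine ginger only for easing chemotherapy-induced "
    "nausea; none show any effect on tumours.",
))
```

A few-shot variant would prepend worked claim/verdict/explanation examples to the prompt, while the parameter-efficient fine-tuning experiments instead adapt open models with lightweight adapter methods (e.g., LoRA-style adapters) rather than prompting alone.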

Authors (5)
  1. Majid Zarharan (4 papers)
  2. Pascal Wullschleger (3 papers)
  3. Babak Behkam Kia (1 paper)
  4. Mohammad Taher Pilehvar (43 papers)
  5. Jennifer Foster (24 papers)
Citations (2)