EX-FEVER: A Dataset for Multi-hop Explainable Fact Verification (2310.09754v3)

Published 15 Oct 2023 in cs.AI

Abstract: Fact verification aims to automatically assess the veracity of a claim based on several pieces of evidence. Existing work focuses largely on improving accuracy and pays little attention to explainability, a critical capability of fact verification systems. Building an explainable fact verification system for complex multi-hop scenarios has been consistently impeded by the absence of a relevant, high-quality dataset: previous datasets either suffer from excessive simplification or fail to incorporate essential considerations for explainability. To address this, we present EX-FEVER, a pioneering dataset for multi-hop explainable fact verification. It contains over 60,000 claims involving 2-hop and 3-hop reasoning, each created by summarizing and modifying information from hyperlinked Wikipedia documents. Each instance is accompanied by a veracity label and an explanation that outlines the reasoning path supporting the veracity classification. Additionally, we demonstrate a novel baseline system on EX-FEVER, spanning document retrieval, explanation generation, and claim verification, and validate the significance of our dataset. Furthermore, we highlight the potential of LLMs for the fact verification task. We hope our dataset will make a significant contribution by providing ample opportunities to explore the integration of natural language explanations in the domain of fact verification.

EX-FEVER: Pioneering the Way in Multi-hop Explainable Fact Verification

Introduction

With the proliferation of digital information, the need for reliable fact verification systems has become increasingly evident. The EX-FEVER dataset responds to the critical need for high-quality data to facilitate research in multi-hop explainable fact verification. It introduces over 60,000 claims requiring 2-hop and 3-hop reasoning, each with a veracity label and an explanation delineating the reasoning path. Through the development of a novel baseline system and a demonstration of the potential of large language models (LLMs) in fact verification, EX-FEVER sets the stage for significant advances in the domain.

Dataset Overview

EX-FEVER differentiates itself by focusing on multi-hop reasoning with a strong emphasis on explainability. The dataset includes claims generated by summarizing and modifying information from hyperlinked Wikipedia documents, each accompanied by a veracity label (SUPPORTS, REFUTES, NOT ENOUGH INFO) and a detailed explanation. These explanations are pivotal, providing insights into the reasoning behind the veracity classification. The meticulous construction of EX-FEVER involved crowd workers, ensuring high-quality and varied examples that mirror the complexity and nuances of real-world data.
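
To make the dataset structure concrete, the sketch below shows one way a single EX-FEVER instance might be represented in code. The field names and the example values are illustrative assumptions, not the dataset's actual schema; consult the released data for the real format.

```python
from dataclasses import dataclass


# Hypothetical representation of one EX-FEVER instance.
# Field names are illustrative; the released dataset may differ.
@dataclass
class ExFeverInstance:
    claim: str                 # the claim whose veracity is to be verified
    label: str                 # "SUPPORTS" | "REFUTES" | "NOT ENOUGH INFO"
    hops: int                  # 2 or 3 hyperlinked Wikipedia documents
    gold_documents: list[str]  # titles of the Wikipedia pages behind the claim
    explanation: str           # natural-language summary of the reasoning path


# An invented example, purely for illustration.
example = ExFeverInstance(
    claim="The director of Film X was born in the capital of Country Y.",
    label="SUPPORTS",
    hops=2,
    gold_documents=["Film X", "Director Z"],
    explanation=(
        "Film X was directed by Director Z, who was born in City W, "
        "the capital of Country Y."
    ),
)
```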

Baseline System Evaluation

The baseline system, composed of document retrieval, explanation generation, and claim verification stages, serves as a testament to the robustness and applicability of the EX-FEVER dataset. The performance of the system underscores the challenges in multi-hop fact verification, especially in document retrieval and the integration of explanations into the verification process.
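
As a rough illustration of such a three-stage pipeline (a sketch of the general architecture, not the authors' exact implementation), the following code wires retrieval, explanation generation, and verification together behind placeholder interfaces:

```python
from typing import Protocol


class Retriever(Protocol):
    def retrieve(self, claim: str, k: int) -> list[str]: ...


class Explainer(Protocol):
    def summarize(self, claim: str, documents: list[str]) -> str: ...


class Verifier(Protocol):
    def classify(self, claim: str, explanation: str) -> str: ...


def verify_claim(claim: str, retriever: Retriever, explainer: Explainer,
                 verifier: Verifier, k: int = 5) -> tuple[str, str]:
    # Stage 1: multi-hop document retrieval (the bottleneck noted below).
    documents = retriever.retrieve(claim, k)
    # Stage 2: condense the evidence into a reasoning-path explanation.
    explanation = explainer.summarize(claim, documents)
    # Stage 3: predict SUPPORTS / REFUTES / NOT ENOUGH INFO
    # from the claim and the generated explanation.
    verdict = verifier.classify(claim, explanation)
    return verdict, explanation
```

Each interface can then be instantiated with any concrete model, for example a sparse or dense retriever, a summarization model for the explainer, and an NLI-style classifier for the verifier.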

Notably, the examination revealed a bottleneck in document retrieval, underscoring the need for retrieval models designed specifically for multi-hop settings. Furthermore, the analysis of verdict prediction highlighted the limitations of existing fact-checking models, advocating for more sophisticated approaches that accommodate the intricacies of multi-hop reasoning.

LLMs in Fact Verification

A compelling aspect of the paper is its exploration of LLMs for fact verification. The investigation finds that LLMs are more proficient as planners that generate explanations than as direct predictors of veracity. This nuanced finding points to a future of fact verification in which LLMs augment human efforts through program-guided reasoning, enhancing both the efficiency and reliability of fact-checking systems.
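
A minimal sketch of the planner idea follows, assuming a generic LLM completion callable; the prompt wording and the helper names are invented for illustration and do not correspond to the paper's actual prompts:

```python
PLAN_PROMPT = """You are a fact-checking planner.
Claim: {claim}
List, one per line, the single-hop questions that must be answered to verify
this claim. Do not answer them yourself."""


def plan_verification(claim: str, llm_complete) -> list[str]:
    """Decompose a multi-hop claim into single-hop sub-questions with an LLM.

    `llm_complete` is assumed to be any callable mapping a prompt string to a
    completion string (e.g., a thin wrapper around a chat-completion API).
    """
    completion = llm_complete(PLAN_PROMPT.format(claim=claim))
    return [line.strip() for line in completion.splitlines() if line.strip()]
```

The resulting sub-questions can then be answered by a retrieval-and-verification pipeline like the one sketched earlier, with the LLM supplying the plan rather than the final verdict.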

Final Thoughts

EX-FEVER represents a significant stride forward in the quest for advanced multi-hop explainable fact verification systems. By offering a comprehensive dataset that challenges current methodologies and highlights the potential of LLMs, this work paves the way for future research endeavors. It invites a reevaluation of existing approaches and fuels the development of innovative solutions that can tackle the complexities of multi-hop reasoning and explainability in fact verification. As the landscape of digital information continues to evolve, the contributions of EX-FEVER will undoubtedly influence the trajectory of fact-checking research, steering it towards more accountable, transparent, and reliable systems.

Authors (7)
  1. Huanhuan Ma (10 papers)
  2. Weizhi Xu (13 papers)
  3. Yifan Wei (20 papers)
  4. Liuji Chen (4 papers)
  5. Liang Wang (512 papers)
  6. Qiang Liu (405 papers)
  7. Shu Wu (109 papers)