FEVER: a large-scale dataset for Fact Extraction and VERification (1803.05355v3)

Published 14 Mar 2018 in cs.CL

Abstract: In this paper we introduce a new publicly available dataset for verification against textual sources, FEVER: Fact Extraction and VERification. It consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as Supported, Refuted or NotEnoughInfo by annotators achieving 0.6841 in Fleiss $\kappa$. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment. To characterize the challenge of the dataset presented, we develop a pipeline approach and compare it to suitably designed oracles. The best accuracy we achieve on labeling a claim accompanied by the correct evidence is 31.87%, while if we ignore the evidence we achieve 50.91%. Thus we believe that FEVER is a challenging testbed that will help stimulate progress on claim verification against textual sources.

Authors (4)
  1. James Thorne (48 papers)
  2. Andreas Vlachos (70 papers)
  3. Christos Christodoulopoulos (15 papers)
  4. Arpit Mittal (15 papers)
Citations (1,440)

Summary

An Overview of the FEVER Dataset for Fact Extraction and Verification

In the domain of NLP, verifying textual claims against large-scale textual sources such as Wikipedia has become increasingly vital in various applications — from journalism to scientific publication review. The paper "FEVER: a large-scale dataset for Fact Extraction and VERification," authored by James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal, introduces the FEVER dataset, a significant advancement in the creation of benchmark datasets for the task of claim verification against textual sources.

Dataset Composition and Characteristics

The FEVER dataset comprises 185,445 claims derived from sentences in Wikipedia. Claims are generated in a two-step process: information is first extracted from Wikipedia entries and manually mutated to produce claims, which are then verified and classified as Supported, Refuted, or NotEnoughInfo. A key strength of the dataset is its annotation process: for Supported and Refuted claims, annotators must record the specific Wikipedia sentence(s) that justify their judgment, in contrast to other datasets that often omit this evidence.
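For concreteness, the released data is distributed as JSON Lines, one claim per line. The sketch below assumes the field names of the public release (id, claim, label, evidence); they should be checked against the actual files.

```python
import json
from collections import Counter

def load_fever(path):
    """Load FEVER claims from a JSONL file (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Assumed schema (public FEVER release): "claim" holds the claim text,
# "label" is SUPPORTS / REFUTES / NOT ENOUGH INFO, and "evidence" lists
# [annotation_id, evidence_id, wikipedia_page, sentence_id] tuples.
claims = load_fever("train.jsonl")
print(Counter(example["label"] for example in claims))
```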

Annotation and Validation

Annotation proceeded in two stages: annotators first generated claims by extracting facts from Wikipedia and mutating them (for example through paraphrasing or negation), and then labeled each claim with the appropriate class along with the necessary evidence. Inter-annotator agreement reached a Fleiss κ of 0.6841, indicating substantial agreement. In addition, evidence annotations validated against exhaustive "super-annotator" judgments achieved a precision of 95.42% and a recall of 72.36%, underscoring the quality and consistency of the data.
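Fleiss' κ compares the observed agreement among multiple annotators with the agreement expected by chance. A small, self-contained implementation of the standard formula (not the authors' code) is shown below.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) matrix of rating counts.

    counts[i, j] = number of annotators assigning item i to category j;
    every row must sum to the same number of annotators n.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                    # annotators per item
    p_j = counts.sum(axis=0) / counts.sum()      # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 3 claims, 5 annotators, labels Supported/Refuted/NotEnoughInfo.
print(fleiss_kappa([[5, 0, 0], [2, 3, 0], [1, 1, 3]]))
```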

Pipeline Approach for Verification

The authors propose a pipeline system to address the challenge posed by FEVER. The system retrieves relevant documents, selects pertinent sentences, and performs Recognizing Textual Entailment (RTE) to label claims. The pipeline employs the document retrieval component from DrQA and a decomposable attention model for RTE, achieving 31.87% accuracy on the test set when correct evidence must also be retrieved, and 50.91% when the evidence requirement is ignored. These results show that the task is tractable with existing components but far from solved, leaving substantial headroom for improvement.
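A schematic view of this three-stage pipeline, with placeholder components standing in for the actual DrQA retriever, TF-IDF sentence selector, and decomposable attention model, might look like the following.

```python
from typing import Callable, List, Tuple

def verify_claim(
    claim: str,
    retrieve_docs: Callable[[str], List[str]],               # claim -> k candidate pages
    select_sentences: Callable[[str, List[str]], List[str]],  # -> l evidence sentences
    classify: Callable[[str, List[str]], str],                # RTE over claim + evidence
) -> Tuple[str, List[str]]:
    """Schematic FEVER pipeline: document retrieval -> sentence selection -> RTE.

    The three callables are placeholders; the paper instantiates them with
    DrQA's TF-IDF document retriever, a TF-IDF sentence selector, and a
    decomposable attention entailment model.
    """
    pages = retrieve_docs(claim)
    evidence = select_sentences(claim, pages)
    label = classify(claim, evidence)  # Supported, Refuted, or NotEnoughInfo
    return label, evidence
```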

Document and Sentence Retrieval

The document retrieval module of the pipeline leverages TF-IDF similarity to find the k-nearest documents, showing that retrieval of the top 100 documents yields an oracle accuracy of 91.06%. For sentence selection, TF-IDF similarities were also applied, reducing the sentences to a manageable subset for RTE models. These components illustrate critical phases where improvement in accuracy can significantly impact overall performance.
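As an illustration of TF-IDF ranking for either stage, a minimal sketch using scikit-learn is given below; the paper itself relies on DrQA's TF-IDF retriever over all of Wikipedia rather than this toy setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def top_k_tfidf(claim, texts, k=5):
    """Rank documents (or sentences) by TF-IDF cosine similarity to the claim."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
    text_matrix = vectorizer.fit_transform(texts)   # one row per candidate
    claim_vec = vectorizer.transform([claim])
    scores = linear_kernel(claim_vec, text_matrix).ravel()
    best = scores.argsort()[::-1][:k]
    return [(texts[i], float(scores[i])) for i in best]
```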

Analysis and Future Directions

Through manual error analysis, the authors found that failures in information retrieval accounted for 58.27% of the errors, suggesting substantial room for improvement in this component. On the RTE step, the decomposable attention model performed markedly better than a simpler baseline, suggesting that the dataset is large enough to train more expressive model architectures.

The implications of this work are both practical and theoretical. Practically, FEVER sets a new standard for datasets in claim verification, stimulating further research and development in more sophisticated NLP models. Theoretically, FEVER could lead to improvements in multi-hop reasoning and evidence integration in textual entailment, pushing the boundaries of what current models can achieve.

Conclusion

The FEVER dataset offers a challenging and nuanced testbed for verification against large-scale textual sources, providing a foundational resource for advancing claim verification systems. While the pipeline approach outlined in the paper establishes a solid initial framework, the gap identified in evidence retrieval underscores a critical area for future research. The dataset's rigorous validation and strong annotator agreement metrics ensure that it will remain a seminal benchmark for ongoing and future developments in claim extraction and verification.
