The Fact Extraction and VERification (FEVER) Shared Task (1811.10971v2)

Published 27 Nov 2018 in cs.CL

Abstract: We present the results of the first Fact Extraction and VERification (FEVER) Shared Task. The task challenged participants to classify whether human-written factoid claims could be Supported or Refuted using evidence retrieved from Wikipedia. We received entries from 23 competing teams, 19 of which scored higher than the previously published baseline. The best performing system achieved a FEVER score of 64.21%. In this paper, we present the results of the shared task and a summary of the systems, highlighting commonalities and innovations among participating systems.

Overview of the FEVER Shared Task Paper

The subject of this paper is the Fact Extraction and VERification (FEVER) shared task, an initiative designed to assess how well automatic systems can verify human-written factual claims using evidence retrieved from Wikipedia. The shared task frames fact-checking as a structured problem in which participants must classify each claim into one of three categories: Supported, Refuted, or NotEnoughInfo. Critically, systems were judged not only on correct classification but also on retrieving the necessary supporting or refuting evidence from a large textual corpus.

Task Description and Dataset

The core objective for participants was to develop systems that determine the veracity of textual claims against evidence obtained from Wikipedia. This required both retrieving relevant evidence and reasoning over that evidence with respect to the claim. In this regard, the FEVER task distinguishes itself from other natural language inference and textual entailment tasks by requiring the identification and interpretation of evidence from a large-scale corpus. Also noteworthy is the size of the dataset: 185,445 human-generated claims, hand-validated against Wikipedia and labelled as Supported, Refuted, or NotEnoughInfo.
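
As a rough illustration of what such claim instances look like in practice, the sketch below loads a JSON-lines file of claims and tallies the label distribution. The field names follow the publicly released FEVER data format, but the exact schema and the file name used here should be treated as assumptions for illustration rather than a definitive specification.

```python
import json
from collections import Counter

def load_claims(path):
    """Read FEVER-style claims from a JSON-lines file (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Hypothetical local copy of the training split.
claims = load_claims("train.jsonl")
print(claims[0]["claim"], "->", claims[0]["label"])

# Rough label distribution over Supported / Refuted / NotEnoughInfo.
print(Counter(c["label"] for c in claims))
```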

Results and Observations

Of the 23 participating teams, 19 surpassed the previously published baseline, with the leading system achieving a FEVER score of 64.21%, underscoring the progress in automatic claim verification. A breakdown of the contributions revealed clear methodological commonalities among the top-performing systems: most adopted a multi-stage pipeline comprising document retrieval, sentence selection, and natural language inference. Techniques such as multi-task learning and data augmentation further improved the performance of individual systems.
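
The following minimal sketch captures the shape of that three-stage pipeline. The component implementations (retriever, sentence selector, entailment classifier) are placeholders rather than any particular team's system, and the cap of five evidence sentences is one common design choice, not a requirement of the pipeline itself.

```python
def verify(claim, retrieve_docs, select_sentences, classify):
    """Return (predicted_label, evidence) for a single claim.

    retrieve_docs(claim)            -> list of candidate Wikipedia pages
    select_sentences(claim, pages)  -> ranked list of (page, sentence_id, text)
    classify(claim, evidence)       -> "SUPPORTS" | "REFUTES" | "NOT ENOUGH INFO"
    """
    pages = retrieve_docs(claim)                   # stage 1: document retrieval
    evidence = select_sentences(claim, pages)[:5]  # stage 2: sentence selection (top-5)
    label = classify(claim, evidence)              # stage 3: natural language inference
    return label, evidence
```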

Technical Contributions

Common technical approaches included Named Entity Recognition (NER) and similarity matching for document and sentence selection, while classification was typically handled by neural inference models such as the Enhanced Sequential Inference Model (ESIM, built on an enhanced LSTM architecture) or the Decomposable Attention model. The organizers also released open-source scoring software, enabling standardized evaluation through evidence precision and recall as well as the FEVER score.
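
To make the FEVER score concrete, the sketch below implements its core logic: a claim counts as correct only if the predicted label matches the gold label and, for Supported or Refuted claims, the predicted evidence fully covers at least one gold evidence set. This is a simplified sketch; the official scorer released by the organizers enforces additional details (such as a cap on the number of predicted evidence sentences) that are omitted here.

```python
def instance_correct(pred_label, pred_evidence, gold_label, gold_evidence_sets):
    """pred_evidence: iterable of (page, sentence_id) pairs;
    gold_evidence_sets: list of iterables of (page, sentence_id) pairs."""
    if pred_label != gold_label:
        return False
    if gold_label == "NOT ENOUGH INFO":
        return True  # no evidence required for NotEnoughInfo claims
    predicted = set(pred_evidence)
    # Correct if any single gold evidence set is fully retrieved.
    return any(set(gold_set) <= predicted for gold_set in gold_evidence_sets)

def fever_score(predictions, gold):
    """Fraction of claims with a correct label and sufficient evidence."""
    correct = sum(
        instance_correct(p["label"], p["evidence"], g["label"], g["evidence_sets"])
        for p, g in zip(predictions, gold)
    )
    return correct / len(gold)
```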

Implications and Future Directions

The FEVER shared task has significant theoretical and practical implications. It not only pushes the boundaries of fact-checking capabilities in natural language processing but also prompts further research into evaluating claim veracity when evidence is incomplete or sparse. For future iterations, expanding the dataset to enhance evidence coverage and integrating models that generalize better across domains present promising directions.

Overall, the FEVER shared task has catalyzed interest and innovation in the domain of automated fact verification. By providing an accessible challenge with robust evaluation metrics, it sets a foundation for future advancements that could fundamentally enhance information verification processes across various applications.

Authors (5)
  1. James Thorne (48 papers)
  2. Andreas Vlachos (70 papers)
  3. Oana Cocarascu (14 papers)
  4. Christos Christodoulopoulos (15 papers)
  5. Arpit Mittal (15 papers)
Citations (225)