FEVER Dataset: Fact Verification Benchmark
- FEVER is a large-scale benchmark for fact verification, featuring 185,445 human-generated claims paired with Wikipedia evidence.
- Each claim is categorized as Supported, Refuted, or NotEnoughInfo and paired with minimal evidence sets, some of which require multi-hop reasoning.
- Standardized evaluation metrics and baseline systems in FEVER drive advancements in natural language inference and explainable AI.
The FEVER (Fact Extraction and VERification) dataset is a large-scale, publicly available benchmark designed to advance the development and evaluation of automated systems that verify factual claims against textual evidence drawn from Wikipedia. FEVER establishes a precise task formulation: determine whether a given natural-language claim is supported or refuted by explicit evidence, or whether the available evidence is insufficient to decide. By providing a rigorously annotated corpus of 185,445 human-generated claims, extensive evidence annotations, and standardized evaluation metrics, FEVER serves as a foundational resource for research in natural language inference, fact verification, interpretable entailment, and multi-hop retrieval.
1. Dataset Construction and Annotation Protocol
Claim Generation
Claims in FEVER are derived through a multi-stage, human-centered protocol. Annotators begin with facts extracted from the introductory sections of sampled Wikipedia pages (∼50,000 from a June 2017 dump, processed via CoreNLP) (Thorne et al., 2018). Each source sentence is either paraphrased or mutated according to defined operators: negation, entity swapping with semantically similar or dissimilar mentions, and adjustment of specificity (either entailment-preserving or reversing). These controlled mutations yield diverse claims whose veracity is ambiguous a priori and may require complex reasoning to adjudicate.
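The mutations themselves are authored by human annotators, but their effect can be illustrated schematically. The examples below are hypothetical illustrations of each operator, not output of the annotation tooling:

```python
# Illustrative (hypothetical) examples of the FEVER mutation operators
# applied to one source fact; in the dataset these edits are made by humans.
source = "Barack Obama was born in Hawaii."

mutations = {
    # Negation: flip the polarity of the claim.
    "negation": "Barack Obama was not born in Hawaii.",
    # Entity swap with a semantically similar mention.
    "entity_swap_similar": "Barack Obama was born in Alaska.",
    # Entity swap with a dissimilar mention.
    "entity_swap_dissimilar": "Barack Obama was born in a submarine.",
    # Make more general (entailment-preserving specificity change).
    "generalize": "Barack Obama was born in the United States.",
    # Make more specific (entailment-reversing specificity change).
    "specify": "Barack Obama was born in Honolulu, Hawaii.",
}

for op, claim in mutations.items():
    print(f"{op:>22}: {claim}")
```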
Claim Categorization and Evidence Annotation
Each claim is independently labeled as one of Supported, Refuted, or NotEnoughInfo (NEI), based on whether the claim can be substantiated, contradicted, or remains unresolved given the totality of Wikipedia evidence (Thorne et al., 2018). Annotators, blinded to the original source sentence, identify one or more minimal sets of sentences in Wikipedia such that their union is sufficient (and no proper subset is sufficient) to justify their judgment. These minimal evidence sets may draw from multiple Wikipedia pages, and annotators are instructed to provide alternative evidence groupings if available.
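In the publicly released JSONL files, each line encodes one claim with its alternative evidence sets; the sketch below parses a record in that layout (field names follow the public release, which also carries additional fields such as "verifiable"; NEI claims have null page and sentence entries):

```python
import json

# Parse one record in the public FEVER JSONL layout.  Each element of
# "evidence" is one alternative minimal evidence set; each set is a list of
# [annotation_id, evidence_id, wikipedia_page, sentence_index] entries.
record = json.loads(
    '{"id": 75397, "label": "SUPPORTS", '
    '"claim": "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.", '
    '"evidence": [[[92206, 104971, "Nikolaj_Coster-Waldau", 7]]]}'
)

# Reduce each evidence set to the (page, sentence) pairs a system must retrieve.
evidence_sets = [
    {(page, sent) for _, _, page, sent in ev_set}
    for ev_set in record["evidence"]
]
print(record["label"], evidence_sets)
```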
Annotation Guidelines and Quality Control
Annotation guidelines stipulate that only those sentences directly necessary to establish the label be included, and that each evidence set be both minimal and complete. Inter-annotator agreement is measured via Fleiss’ κ, achieving 0.6841 on a 5-way sample (4% of data), indicating substantial agreement (Thorne et al., 2018). Super-annotator validation yields a retrieval precision of 95.42% and recall of 72.36% compared to exhaustive expert annotation, with 91.2% correct label+evidence combinations in manual evaluation. Early trade-offs favoring annotation throughput were later mitigated by supplementary annotation rounds to augment the gold standard for challenging claims.
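Fleiss' κ is a standard computation over an items × categories count matrix; the sketch below implements it from scratch and runs on toy counts (4 claims, 5 annotators, 3 labels), not the actual annotation data:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a list of per-item category counts.

    counts[i][j] = number of annotators assigning category j to item i;
    every item must be rated by the same number of annotators.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item agreement: fraction of agreeing annotator pairs.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Chance agreement from the marginal category proportions.
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    p_j = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Toy example: columns are (Supported, Refuted, NEI) votes per claim.
toy = [[5, 0, 0], [4, 1, 0], [3, 2, 0], [0, 0, 5]]
print(round(fleiss_kappa(toy), 4))  # → 0.5495
```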
2. Dataset Composition and Splitting Strategy
FEVER comprises 185,445 claims, rigorously stratified into training, development, and test sets. The splits are constructed to ensure disjointness with respect to source Wikipedia pages, preventing potential overlap in background knowledge (Thorne et al., 2018).
| Split | Supported | Refuted | NotEnoughInfo | Total |
|---|---|---|---|---|
| Training | 80,035 | 29,775 | 35,639 | 145,449 |
| Development | 6,666 | 6,666 | 6,666 | 19,998 |
| Test | 6,666 | 6,666 | 6,666 | 19,998 |
- Training set is naturally imbalanced (∼55% Supported, ∼20% Refuted, ∼25% NEI).
- Dev/test sets are fully balanced (each class exactly one-third).
Claims average 9.4 tokens, comparable to other entailment corpora. On the evidence side, 16.82% of claims require combining more than one sentence, with an average of 1.17 evidence sentences per claim, and 12.15% of claims require evidence drawn from multiple Wikipedia pages (Thorne et al., 2018).
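The training-split class proportions follow directly from the table counts:

```python
# Class proportions in the FEVER training split, from the counts above.
counts = {"Supported": 80_035, "Refuted": 29_775, "NotEnoughInfo": 35_639}
total = sum(counts.values())  # 145,449

for label, n in counts.items():
    print(f"{label:>14}: {n / total:.1%}")
# Supported ≈ 55.0%, Refuted ≈ 20.5%, NotEnoughInfo ≈ 24.5%
```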
3. Evidence Retrieval, Reasoning, and Annotation Challenges
Evidence selection for each claim is performed at annotation time, with annotators permitted to search or browse Wikipedia and select sentences from any requisite page. The requirement that evidence sets be minimal and complete introduces significant complexity—multi-hop reasoning (across discontiguous sentences and pages) and entity disambiguation abound. Overall, 31.75% of claims (within Supported/Refuted) require more than one sentence for complete evidential justification, and ∼17% of all claims necessitate true multi-hop aggregation (Thorne et al., 2018). The annotation pipeline was designed to favor throughput, sacrificing some recall of all possible valid evidence sets, with later augmentation steps to increase gold standard coverage for evaluation.
Key annotation challenges include:
- Multi-hop inference over non-contiguous or cross-page facts
- Semantic ambiguity and pronoun/ellipsis resolution
- Incompleteness of evidence enumeration due to the open-domain and evolving nature of Wikipedia
4. Evaluation Metrics and Scoring Formulations
FEVER introduces a rigorous set of evaluation metrics capturing both prediction correctness and evidential completeness:
Label Accuracy
Fraction of claims with correctly predicted labels.
Evidence-based Precision, Recall, and F₁
Let Ê denote the sentences a system predicts as evidence for a claim and E* the union of all gold sentence-level evidence annotations for that claim:
- Precision: P = |Ê ∩ E*| / |Ê|
- Recall: R = |Ê ∩ E*| / |E*|
- F₁: F₁ = 2PR / (P + R)
FEVER Score
FEVER = (1/N) · Σᵢ 1[ŷᵢ = yᵢ ∧ (yᵢ = NEI ∨ ∃E ∈ 𝓔ᵢ : E ⊆ Êᵢ)]
where ŷᵢ and yᵢ are the predicted and gold labels, Êᵢ the predicted evidence, and 𝓔ᵢ the collection of gold minimal evidence sets for claim i.
This metric strictly requires both the correct label and at least one complete minimal gold evidence set; NEI predictions require only correct labeling. The FEVER score penalizes systems that fail to justify Supported/Refuted labels with sufficient, minimal evidence, enforcing interpretable, explainable predictions (Thorne et al., 2018).
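Under these definitions, per-instance scoring can be sketched as follows (a simplified re-implementation for illustration, not the official scorer, which additionally caps predicted evidence at five sentences):

```python
def evidence_prf(predicted, gold_sets):
    """Sentence-level precision/recall/F1 against the union of gold evidence.

    predicted: set of (page, sentence_index) pairs returned by the system.
    gold_sets: list of alternative gold evidence sets (same pair type).
    """
    gold_union = set().union(*gold_sets) if gold_sets else set()
    tp = len(predicted & gold_union)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold_union) if gold_union else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def fever_score_instance(pred_label, pred_evidence, gold_label, gold_sets):
    """1 if the instance counts toward the FEVER score, else 0."""
    if pred_label != gold_label:
        return 0
    if gold_label == "NotEnoughInfo":
        return 1  # NEI requires only the correct label
    # Supported/Refuted: at least one complete gold set must be covered.
    return int(any(g <= pred_evidence for g in gold_sets))

pred = {("Page_A", 3), ("Page_B", 0)}
gold = [{("Page_A", 3)}, {("Page_C", 1), ("Page_C", 2)}]
print(evidence_prf(pred, gold))                              # P=0.5, R=1/3, F1=0.4
print(fever_score_instance("Supported", pred, "Supported", gold))  # 1
```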
5. Baseline Systems and Shared Task Outcomes
A pipeline architecture is specified as the canonical baseline: (1) document retrieval (DrQA’s TF–IDF, cosine similarity), (2) sentence selection (ranking sentences by TF–IDF similarity), and (3) recognizing textual entailment (RTE) using a single-layer MLP or Decomposable Attention (Parikh et al.). NEI labels are generated by pairing claims with random or nearest sentences (Thorne et al., 2018). Upper-bound oracle experiments show strong performance ceilings: 70.20% accuracy with perfect RTE and gold document retrieval (k=5). However, the best joint accuracy with correct evidence is 31.87%, while label-only accuracy reaches 50.91%.
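The first two pipeline stages can be approximated with scikit-learn's TF–IDF vectorizer; the sketch below is a minimal stand-in for DrQA's retriever, with an invented toy corpus and claim:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "Wikipedia" corpus: one string per page (illustrative stand-ins).
pages = {
    "Paris": "Paris is the capital of France. It lies on the Seine.",
    "Berlin": "Berlin is the capital of Germany.",
    "Seine": "The Seine is a river in northern France.",
}
claim = "Paris is the capital city of France."

# Stage 1: document retrieval by TF-IDF cosine similarity.
vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(pages.values())
scores = cosine_similarity(vec.transform([claim]), doc_matrix).ravel()
best_page = list(pages)[scores.argmax()]

# Stage 2: sentence selection inside the retrieved page, same similarity.
sentences = [s.strip() + "." for s in pages[best_page].split(".") if s.strip()]
sent_scores = cosine_similarity(vec.transform([claim]),
                                vec.transform(sentences)).ravel()
print(best_page, "->", sentences[sent_scores.argmax()])
```

Stage 3 would then feed the claim and selected sentences to the RTE classifier; the naive sentence splitting on periods is for brevity only.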
The inaugural FEVER Shared Task (Thorne et al., 2018) attracted 23 competitive entries. Nineteen surpassed the published baseline; the best-performing system achieved a FEVER score of 64.21%. The remaining gap highlights substantial modeling bottlenecks, chiefly in evidence retrieval and multi-hop reasoning. Pipeline ablation studies indicate that omitting sentence selection reduces performance by approximately 10 absolute points.
6. Extensions, Applications, and Influence
FEVER is the progenitor of an entire paradigm for fact verification and explainable entailment in open-domain settings. Its core design—intertwining minimal supporting/refuting evidence with claim-level labels—forms the archetype for numerous follow-on datasets in domain-specialized (e.g., CLIMATE-FEVER (Diggelmann et al., 2020), COVID-Fact (Saakyan et al., 2021)), multilingual (e.g., CFEVER (Lin et al., 2024), Poly-FEVER (Zhang et al., 2025), XFEVER (Chang et al., 2023)), and real-world claim contexts. All preserve or adapt the dual requirement for prediction and explicit citation of evidence.
Practical applications include:
- Open-domain claim verification and misinformation detection
- Multi-hop reading comprehension and retrieval model benchmarking
- Interpretable NLI/RTE system development
- Exploration of neural theorem-proving and natural logic
Notable limitations stem from Wikipedia's domain coverage (restricted primarily to introductory sections), incomplete evidence annotation, and the artificiality of some mutated claims. Nevertheless, FEVER remains a critical benchmark for developing robust, citation-aware automated fact-verification systems (Thorne et al., 2018).
7. Legacy and Comparative Context
The FEVER dataset’s methodology—claim mutation, minimal evidence annotation, and joint correctness metrics—has become canonical in the field. Subsequent datasets extending this paradigm routinely emulate the FEVER protocol, varying only domain (scientific, climate, COVID-19), language (Chinese, multilingual), or annotation scalings. Comparative works highlight the challenges in transferring FEVER-honed models to real-world or specialized domains, revealing significant drops in accuracy and evidential sufficiency (e.g., CLIMATE-FEVER reports zero-shot FEVER model accuracy at 38.78%, F₁=32.85%, far below in-domain FEVER results) (Diggelmann et al., 2020).
The tri-label schema (Supported, Refuted, NotEnoughInfo), stringent evidence annotation, and public release of the annotation guidelines, code, and interface tools underpin its widespread adoption as a fact-verification standard for NLI, fact-checking, and explainable AI research (Thorne et al., 2018).