ME-FEVER Dataset for Multi-Evidence Verification
- ME-FEVER is a dataset that benchmarks multi-evidence hallucination detection in NLP using adversarial evidence pools.
- It augments FEVER by incorporating synthetic passages, including completely irrelevant, partially irrelevant, and highly related yet misleading evidence.
- Results are reported on label accuracy, evidence-matching rate, and critique quality; HALU-J posts the highest label accuracy and evidence-matching rate.
ME-FEVER is a dataset specifically constructed for benchmarking multiple-evidence hallucination detection in natural language processing. Developed as part of the “HALU-J: Critique-Based Hallucination Judge” framework, ME-FEVER operationalizes multi-passage factuality assessment using complex, distractor-rich evidence pools and detailed annotation protocols. It builds on the FEVER corpus but fundamentally expands the evidence structure to support real-world conditions where candidate evidence is heterogeneous in relevance and intent (Wang et al., 2024).
1. Dataset Construction and Evidence Augmentation
ME-FEVER is derived from the original FEVER dataset, which consists of crowd-annotated claims from Wikipedia, each paired with a single gold-standard evidence passage. ME-FEVER introduces a synthetic multi-evidence pool for each claim using a systematic LLM augmentation pipeline. For each FEVER claim and its gold evidence, the pipeline adds three kinds of distractor passages:
- Completely Irrelevant ($E^{0}$): Two passages sampled from unrelated FEVER articles, manually verified as truly irrelevant.
- Partially Irrelevant ($E^{P}$): Four paragraphs generated on the same nominal topic but providing no direct support or contradiction; paragraphs are lengthened to approximately 150 words using GPT-3.5-Turbo.
- Highly Related but Misleading ($E^{T}$): One to three passages crafted to be on-topic and non-contradictory with the gold evidence, lacking direct support/refutation, but intentionally containing confusing details designed to induce misclassification.
No external retrieval system is involved; all passages are created and filtered in the loop with LLMs (mainly GPT-4-Turbo), and outputs violating the augmentation constraints are manually removed. Each of the 3,901 final instances preserves its FEVER label: True, False, or Neutral.
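The pool composition described above can be sketched in Python. This is a minimal illustration, not the authors' pipeline: the function `build_evidence_pool`, the category strings, the `"gold"` category, and the shuffling step are all assumptions introduced here, and the LLM generation and manual filtering steps are not reproduced.

```python
import random

# Hypothetical category names; the source only describes the three distractor types.
GOLD = "gold"
IRRELEVANT = "completely_irrelevant"
PARTIAL = "partially_irrelevant"
MISLEADING = "highly_related_misleading"


def build_evidence_pool(gold, irrelevant, partial, misleading, seed=None):
    """Combine the gold passage with distractors and check the stated counts.

    Counts follow the text: 2 completely irrelevant, 4 partially irrelevant,
    and 1-3 highly related but misleading passages per claim.
    """
    if len(irrelevant) != 2:
        raise ValueError("expected 2 completely irrelevant passages")
    if len(partial) != 4:
        raise ValueError("expected 4 partially irrelevant passages")
    if not 1 <= len(misleading) <= 3:
        raise ValueError("expected 1-3 misleading passages")

    pool = [{"text": gold, "category": GOLD}]
    pool += [{"text": t, "category": IRRELEVANT} for t in irrelevant]
    pool += [{"text": t, "category": PARTIAL} for t in partial]
    pool += [{"text": t, "category": MISLEADING} for t in misleading]

    # Shuffle so that position does not reveal the category
    # (an assumption; the source does not describe passage ordering).
    random.Random(seed).shuffle(pool)
    return pool


pool = build_evidence_pool(
    gold="gold passage",
    irrelevant=["i1", "i2"],
    partial=["p1", "p2", "p3", "p4"],
    misleading=["m1", "m2"],
    seed=0,
)
print(len(pool))  # 1 gold + 2 + 4 + 2 = 9
```

With one to three misleading passages, a pool built this way holds between eight and ten passages including the gold evidence.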
2. Annotation Schema and Human Filtering
Evidence categorization is strictly defined:
- $E^{0}$ (Completely Irrelevant): No topical connection to the claim.
- $E^{P}$ (Partially Irrelevant): Passages that match the topic or style but provide no verification.
- $E^{T}$ (Highly Related/Misleading): Designed to interfere with detector inference by remaining on-topic and ambiguous.
Claim veracity labels are inherited from FEVER and presented as True, False, or Neutral. Human filtering consists of author spot-checks for each synthetic passage to exclude contradiction artifacts and duplicates. No broad crowdsourced relabeling is performed and no explicit inter-annotator agreement (e.g., Cohen’s $\kappa$) is reported, as labels are either inherited or machine-generated and then author-filtered.
3. Dataset Statistics and Structure
Key dataset attributes are:
| Metric | Value | Notes |
|---|---|---|
| Instances (total) | 3,901 | All contain multi-evidence setup |
| Training split | 2,663 | |
| Test split | 1,238 | |
| Evidence per instance | 2 $E^{0}$, 4 $E^{P}$, 1–3 $E^{T}$ | 7–9 distractor passages per claim, plus the gold evidence |
| Veracity label distribution | Approx. balanced | (True / False / Neutral; precise counts not given) |
| Disk size | ~10 MB (compressed) | ~30 MB uncompressed JSONL |
Instances are formatted as .jsonl files, with each instance containing an "id", "claim", FEVER-inherited "label", and an "evidence" array (each entry with "text" and "category").
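A record in this layout might look like the sketch below. The field names ("id", "claim", "label", "evidence", "text", "category") come from the description above; every value, including the id format, the label spelling, and the category strings, is hypothetical, and `parse_instance` is an illustrative helper rather than part of any released code.

```python
import json

# One illustrative JSONL line. Field names follow the description above;
# all values are placeholders, not taken from the dataset.
line = json.dumps({
    "id": "example-0001",
    "claim": "An example claim.",
    "label": "True",
    "evidence": [
        {"text": "A gold passage.", "category": "gold"},
        {"text": "An off-topic passage.", "category": "completely_irrelevant"},
    ],
})

REQUIRED_KEYS = {"id", "claim", "label", "evidence"}
LABELS = {"True", "False", "Neutral"}


def parse_instance(raw):
    """Parse one JSONL line and check the fields named in the text."""
    record = json.loads(raw)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if record["label"] not in LABELS:
        raise ValueError(f"unexpected label: {record['label']!r}")
    for item in record["evidence"]:
        if not {"text", "category"} <= item.keys():
            raise ValueError("evidence entries need 'text' and 'category'")
    return record


record = parse_instance(line)
print(record["label"], len(record["evidence"]))
```

Reading a full split would then amount to calling `parse_instance` on each non-empty line of the file.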
4. Evaluation Tasks and Metrics
Evaluation on ME-FEVER targets several dimensions:
- Multi-evidence hallucination detection: Given a claim and its evidence set, predict the claim veracity (True/False/Neutral) and generate a stepwise natural-language critique. Metric: label accuracy.
- Evidence matching: Assign each evidence passage to its proper evidence category ($E^{0}$, $E^{P}$, or $E^{T}$). Metric: evidence-matching rate.
- Critique generation quality: Generate natural-language critiques, scored by GPT-4-Turbo (1–100), using a rubric that weights relevance, faithfulness, completeness, and logic.
- Optional standard verification: Apply retrieval-augmented verification protocols (as in FEVER, ANLI, WANLI, HaluEval, KBQA) in a single-evidence regime with output dictionaries.
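Common metric definitions apply: label accuracy is the fraction of claims whose predicted veracity label matches the inherited FEVER label, and the evidence-matching rate is the fraction of evidence passages assigned to their correct category.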
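The two automatic metrics can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are introduced here, and the evidence-matching rate is pooled over all passages (a micro-average), which the source does not confirm.

```python
def label_accuracy(gold_labels, predicted_labels):
    """Fraction of claims whose predicted label equals the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not gold_labels:
        raise ValueError("no instances to score")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


def evidence_matching_rate(gold_categories, predicted_categories):
    """Fraction of evidence passages assigned to their gold category.

    Each argument is a list of per-instance category lists.
    Assumption: passages are pooled across instances (micro-average).
    """
    total = 0
    correct = 0
    for gold, pred in zip(gold_categories, predicted_categories):
        if len(gold) != len(pred):
            raise ValueError("each instance needs one prediction per passage")
        total += len(gold)
        correct += sum(g == p for g, p in zip(gold, pred))
    if total == 0:
        raise ValueError("no passages to score")
    return correct / total


# Toy inputs: 3 of 4 labels correct, 3 of 5 passages correctly categorized.
acc = label_accuracy(
    ["True", "False", "Neutral", "True"],
    ["True", "False", "True", "True"],
)
emr = evidence_matching_rate(
    [["irrelevant", "partial", "misleading"], ["irrelevant", "partial"]],
    [["irrelevant", "partial", "partial"], ["irrelevant", "misleading"]],
)
print(acc, emr)
```

A per-instance (macro) average would be an equally plausible reading; the two coincide only when every instance has the same number of passages.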
5. Baseline Model Performance
Performance on the ME-FEVER test set is reported for multiple model architectures and configurations:
| System | Label Accuracy | Evidence-Matching Rate | Critique Quality (1–100) |
|---|---|---|---|
| GPT-4o | 0.83 | 61.4% | 85.9 |
| GPT-3.5-Turbo | 0.81 | 59.3% | 72.4 |
| Mistral-7B | 0.78 | 51.2% | 61.3 |
| HALU-J (w/o DPO) | 0.90 | 66.9% | 82.6 |
| HALU-J | 0.91 | 68.1% | 83.9 |
HALU-J achieves the highest label accuracy (0.91) and evidence-matching rate (68.1%), outperforming GPT-4o on both, while GPT-4o retains the highest critique-quality score (85.9 versus 83.9 for HALU-J). This suggests that ME-FEVER rewards targeted evidence selection combined with critique generation (Wang et al., 2024).
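The per-metric leaders can also be read off the table programmatically. The sketch below copies the reported values into a dictionary; the key names `accuracy`, `matching`, and `critique` are shorthand introduced here.

```python
# Values copied from the results table above.
results = {
    "GPT-4o": {"accuracy": 0.83, "matching": 61.4, "critique": 85.9},
    "GPT-3.5-Turbo": {"accuracy": 0.81, "matching": 59.3, "critique": 72.4},
    "Mistral-7B": {"accuracy": 0.78, "matching": 51.2, "critique": 61.3},
    "HALU-J (w/o DPO)": {"accuracy": 0.90, "matching": 66.9, "critique": 82.6},
    "HALU-J": {"accuracy": 0.91, "matching": 68.1, "critique": 83.9},
}


def best_system(metric):
    """Return the system with the highest value for the given metric."""
    return max(results, key=lambda name: results[name][metric])


leaders = {m: best_system(m) for m in ("accuracy", "matching", "critique")}
print(leaders)
```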
6. Relation to Prior Work and Benchmarking Significance
ME-FEVER's multi-evidence structure fundamentally extends the verification regime of FEVER by introducing high-density, adversarial evidence pools per claim. Unlike prior single-evidence datasets and hallucination detection tasks, ME-FEVER requires solvers to discriminate not just veracity, but evidence relevance under adversarially designed distractors. No retrieval augmentation is performed, eliminating potential confounds from noisy search ranks.
This approach enables more realistic simulation of downstream use cases in fact verification, QA hallucination detection, and critique-based evaluation pipelines. Wang et al. (2024) present it as the first publicly released benchmark to support hallucination detection over multi-source, confounding evidence pools.
7. Access, Licensing, and Usage
ME-FEVER is distributed under the Apache 2.0 License, inherited from FEVER and Factool. Dataset and baseline code are available at https://github.com/GAIR-NLP/factool. The canonical citation is:
Wang B., Chern S., Chern E., Liu P. (2024). HALU-J: Critique-Based Hallucination Judge.
A plausible implication is that ME-FEVER can serve as a common testbed for developing and evaluating critique-based hallucination judges and multi-evidence fact verification models.