- The paper found that annotation errors in benchmarks can mislead evaluations, with roughly 16% of the collected examples mislabeled or ambiguous.
- It demonstrates that few-shot prompted frontier LLMs are strong, often-overlooked baselines, substantially improving performance on the corrected data.
- It reveals that smaller fact verifiers struggle with complex multi-hop reasoning, but synthetic data generation can bridge this performance gap.
Fact verification models are critical for ensuring the reliability of LLMs by checking if generated information is supported by a given source. This paper evaluates 12 pre-trained LLMs and one specialized fact-verifier (MiniCheck 7B) using a collection of examples from 14 fact-checking benchmarks to provide insights for building more robust fact verifiers. The findings highlight issues in current evaluation practices and models, proposing concrete steps and data generation methods for improvement.
The paper's first key finding emphasizes the significant impact of annotation errors and ambiguity in benchmark datasets on model evaluations. Using a systematic pipeline that combines LLM-as-a-judge screening with targeted human annotation, the researchers found that approximately 16% of examples in their initial data collection were either ambiguous or mislabeled. Neglecting these issues can lead to misleading conclusions about model performance and rankings. In practice, this suggests that developers should incorporate data cleaning and validation steps into their evaluation pipelines. The paper's pipeline, in which different LLM judges assess the completeness, logical coherence, and faithfulness of model outputs, provides a scalable way to surface potentially problematic instances for human review, drastically reducing the manual annotation load (see the sketch below). The refined data was used to create ClearFacts (corrected and unambiguous examples) and GrayFacts (ambiguous examples); models were shown to perform poorly and inconsistently on the ambiguous data.
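A rough sketch of how such an LLM-judge filtering pass might be wired up is shown below; `call_llm` is a stand-in for whatever chat-completion API is available, and the prompt wording and flagging rule are illustrative assumptions rather than the paper's exact setup.

```python
from typing import Callable

# Illustrative judge prompt; the three criteria (completeness, logical
# coherence, faithfulness) follow the paper, but the wording is an assumption.
JUDGE_PROMPT = """You are auditing a fact-verification example.
Document: {document}
Claim: {claim}
Gold label: {label}

Answer YES or NO, one per line:
1. Is the example complete (no missing context is needed to decide the label)?
2. Is the gold label logically coherent given the document?
3. Is the claim unambiguous and faithful in what it asserts?
"""

def judge_flags_example(example: dict, call_llm: Callable[[str], str]) -> bool:
    """Return True if this judge thinks the example needs human review."""
    reply = call_llm(JUDGE_PROMPT.format(**example))
    answers = [ln.strip().upper() for ln in reply.splitlines() if ln.strip()]
    # Flag the example whenever any of the three checks does not come back YES.
    return not all(a.endswith("YES") for a in answers[:3])

def select_for_human_review(examples: list[dict],
                            judges: list[Callable[[str], str]]) -> list[dict]:
    """Only examples flagged by at least one LLM judge go to human annotators."""
    return [ex for ex in examples if any(judge_flags_example(ex, j) for j in judges)]
```

The key design point is that human effort is spent only on the subset the judges disagree about or flag, which is what makes the cleaning step scale to 14 benchmarks.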
The second finding points to few-shot prompted frontier LLMs as strong, often-overlooked baselines. Evaluating models on ClearFacts, the researchers found that providing a small set of in-context examples significantly improved performance for most LLMs, with few-shot o1 achieving the highest overall performance. This highlights a simple yet effective strategy for maximizing the performance of large pre-trained models on fact verification. Practitioners evaluating new fact verification methods or models should include these few-shot LLMs in their comparisons to set a competitive benchmark. Implementing few-shot prompting typically means concatenating a few example input-output pairs into the model's prompt before the test instance, as in the sketch below.
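A minimal sketch of that prompt construction, assuming a generic `call_llm(prompt) -> str` helper and made-up demonstrations (not the paper's prompts):

```python
# Two toy in-context demonstrations; real few-shot prompts would draw these
# from the target benchmark's training or development split.
FEW_SHOT_EXAMPLES = [
    {"document": "The Eiffel Tower was completed in 1889.",
     "claim": "The Eiffel Tower opened in the 19th century.",
     "label": "SUPPORTED"},
    {"document": "The Eiffel Tower was completed in 1889.",
     "claim": "The Eiffel Tower is located in Lyon.",
     "label": "NOT SUPPORTED"},
]

def build_few_shot_prompt(document: str, claim: str) -> str:
    parts = ["Decide whether the claim is supported by the document. "
             "Answer SUPPORTED or NOT SUPPORTED.\n"]
    # In-context examples are simply concatenated ahead of the test instance.
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Document: {ex['document']}\n"
                     f"Claim: {ex['claim']}\nAnswer: {ex['label']}\n")
    parts.append(f"Document: {document}\nClaim: {claim}\nAnswer:")
    return "\n".join(parts)

def verify(document: str, claim: str, call_llm) -> bool:
    """True if the model judges the claim supported by the document."""
    return "NOT SUPPORTED" not in call_llm(build_few_shot_prompt(document, claim)).upper()
```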
The third finding reveals a limitation of smaller, fine-tuned fact verifiers: they struggle with instances requiring complex reasoning, such as those from Hover (multi-hop reasoning) and CoverBench (complex document formats). While specialized models like MiniCheck 7B are efficient, a notable performance gap remains relative to the top-performing large models on these challenging examples. The paper argues that small yet powerful fact verifiers are nonetheless necessary, because such models are widely used not only for evaluation but also as reward models in factuality-tuning frameworks for LLMs (e.g., with methods like DPO), where running large, costly frontier LLMs for reward computation across numerous training instances is often impractical.
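To make the reward-model use case concrete, here is a hypothetical sketch of turning a small verifier's scores into DPO preference pairs; `sample_responses` and `verifier_score` are assumed helpers, and the margin filter is an illustrative design choice, not something prescribed by the paper.

```python
def build_dpo_pairs(prompts_with_docs, sample_responses, verifier_score, margin=0.2):
    """Construct (chosen, rejected) pairs for DPO from factuality scores.

    prompts_with_docs: iterable of (prompt, source_document) tuples
    sample_responses:  fn(prompt) -> list[str] of candidate LLM responses
    verifier_score:    fn(document, response) -> float in [0, 1], e.g. the
                       probability a small fact verifier assigns to "supported"
    """
    pairs = []
    for prompt, document in prompts_with_docs:
        candidates = sample_responses(prompt)
        scored = sorted(candidates, key=lambda r: verifier_score(document, r))
        rejected, chosen = scored[0], scored[-1]
        # Keep only pairs with a meaningful factuality gap between responses.
        if verifier_score(document, chosen) - verifier_score(document, rejected) >= margin:
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

Because the verifier is called once per sampled response across the whole training set, its inference cost dominates, which is why a small but accurate verifier matters here.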
To address this gap, the paper introduces a method for generating synthetic multi-hop fact verification data and demonstrates its effectiveness. The process uses LLMs to extract facts and generate related questions from documents (e.g., Wikipedia), then answers them with retrieval-augmented generation, building chains of facts and supporting documents. Statements are constructed from these chains, with negative examples created by removing documents or introducing contradictions. Training a small model (Llama 3.1 8B Instruct) on this synthetic data, in addition to existing datasets such as ANLI, significantly improved its performance on benchmarks requiring complex reasoning (CoverBench, Hover), showing potential for closing the gap with larger models. In practice, generating such data involves orchestrating calls to LLMs and, where needed, retrieval systems, then fine-tuning a smaller base model on the generated (document, statement, label/reasoning) tuples; a simplified sketch of the generation step appears below. Multi-task training (predicting either a direct answer or a chain-of-thought rationale) was also explored and contributed to the improved performance.
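The sketch below outlines one way the chain-building and negative-example steps could be orchestrated; `call_llm` and `retrieve` are stand-ins for an LLM API and a document retriever, and the prompts and two-hop chain length are illustrative assumptions rather than the paper's exact recipe.

```python
def build_chain(seed_doc: str, call_llm, retrieve, hops: int = 2) -> list[dict]:
    """Extract a fact, ask a follow-up question, retrieve the next document,
    and repeat, yielding a chain of (document, fact) pairs."""
    chain, doc = [], seed_doc
    for _ in range(hops):
        fact = call_llm(f"Extract one salient fact from this document:\n{doc}")
        chain.append({"document": doc, "fact": fact})
        question = call_llm(f"Write a question that extends this fact:\n{fact}")
        doc = retrieve(question)  # retrieval-augmented step: fetch the next supporting document
    return chain

def make_examples(chain: list[dict], call_llm) -> list[dict]:
    """Turn one chain into a positive example and two kinds of negatives."""
    docs = [step["document"] for step in chain]
    statement = call_llm("Combine these facts into a single statement:\n"
                         + "\n".join(step["fact"] for step in chain))
    positive = {"documents": docs, "statement": statement, "label": "supported"}
    # Negative 1: drop a document so part of the statement loses its support.
    dropped = {"documents": docs[:-1], "statement": statement, "label": "unsupported"}
    # Negative 2: rewrite the statement to contradict the documents.
    corrupted = call_llm(f"Rewrite this statement with one factual contradiction:\n{statement}")
    contradicted = {"documents": docs, "statement": corrupted, "label": "unsupported"}
    return [positive, dropped, contradicted]
```

Each resulting tuple can then be formatted as a (documents, statement, label) training instance, optionally paired with a reasoning trace, for fine-tuning the small verifier.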
In summary, the paper provides crucial practical guidance: prioritize data quality through systematic cleaning, benchmark against strong few-shot LLM baselines, and leverage synthetic data generation to improve the complex reasoning capabilities of smaller, more efficient fact verifier models needed for wide deployment. The code, model, and dataset from this paper are released to support these efforts.