The paper "Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts" addresses a critical area of concern in the utilization of LLMs for generating long-form responses—the evaluation of factuality. This topic is of significant importance because, although LLMs have shown remarkable proficiency in generating coherent text, the verification of factual information remains a challenging endeavor due to the intricacies involved in inter-sentence dependencies and the need for broader contextual comprehension.
Challenges in Current Evaluation Methodologies
Existing methodologies predominantly follow a decompose-decontextualize-verify pipeline: a response is broken into simpler, verifiable units, each unit is rewritten to stand alone, and the resulting facts are checked against evidence. While effective at producing atomic claims, these approaches often fail to capture essential contextual dependencies and relational facts. Ignoring these inter-sentence relationships can lead to incomplete fact extraction and inaccurate verification, as analyses of existing pipelines such as SAFE and FactCheck-GPT have shown.
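To make the pipeline concrete, here is a minimal sketch of a decompose-decontextualize-verify loop. The prompts, function names, and the generic `call_llm` callable are illustrative assumptions, not the actual implementations used by SAFE, FactCheck-GPT, or the paper.

```python
# Minimal sketch of a decompose-decontextualize-verify pipeline.
# `call_llm` stands in for any chat-completion client; prompts are illustrative only.
from typing import Callable, List

def decompose(response: str, call_llm: Callable[[str], str]) -> List[str]:
    """Split a long-form response into atomic claims, one per line."""
    prompt = f"Break the following response into independent atomic facts, one per line:\n{response}"
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

def decontextualize(fact: str, response: str, call_llm: Callable[[str], str]) -> str:
    """Rewrite a claim so it is understandable without the surrounding response."""
    prompt = (f"Rewrite this fact so it stands alone, resolving pronouns and references "
              f"using the original response.\nResponse: {response}\nFact: {fact}")
    return call_llm(prompt).strip()

def verify(fact: str, call_llm: Callable[[str], str]) -> bool:
    """Judge whether a self-contained fact is supported.
    Real verifiers typically retrieve external evidence first; this sketch only asks the judge."""
    prompt = f"Is the following statement factually supported? Answer SUPPORTED or NOT_SUPPORTED.\n{fact}"
    return call_llm(prompt).strip().upper().startswith("SUPPORTED")

def evaluate(response: str, call_llm: Callable[[str], str]) -> dict:
    """Run the full pipeline and report the fraction of extracted facts judged supported."""
    facts = [decontextualize(f, response, call_llm) for f in decompose(response, call_llm)]
    verdicts = [verify(f, call_llm) for f in facts]
    return {"num_facts": len(facts), "precision": sum(verdicts) / max(len(verdicts), 1)}
```

The weakness discussed above shows up in `decompose` and `decontextualize`: if a claim's meaning depends on a relation stated in another sentence, the atomic fact may come out incomplete and the downstream verdict becomes unreliable.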
Refining Fact Extraction with \method
To strengthen this framework, the authors introduce \method, an approach that refines the fact extraction step. \method identifies incomplete and missing facts in long-form responses: multiple LLM judges flag problematic extractions, and a refinement step rewrites the flagged facts into self-contained form. Because the refined facts preserve critical contextual and relational information, the subsequent verification is more robust.
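The sketch below illustrates the judge-then-refine idea under stated assumptions: a pool of LLM judges votes on whether an extracted fact is incomplete, and flagged facts are rewritten to be self-contained. The prompts, the majority-vote threshold, and all function names are hypothetical and not taken from the paper.

```python
# Hedged sketch of judge-then-refine fact extraction: majority voting over several
# LLM judges flags incomplete facts, which are then rewritten to be self-contained.
from typing import Callable, List

def is_incomplete(fact: str, source: str, judges: List[Callable[[str], str]]) -> bool:
    """Return True if a majority of judges consider the fact incomplete in isolation."""
    prompt = (f"Given the source text, does this extracted fact omit context needed to "
              f"verify it (e.g., unresolved references or missing relations)? "
              f"Answer YES or NO.\nSource: {source}\nFact: {fact}")
    votes = [judge(prompt).strip().upper().startswith("YES") for judge in judges]
    return sum(votes) > len(votes) // 2

def refine(fact: str, source: str, call_llm: Callable[[str], str]) -> str:
    """Rewrite a flagged fact so it is self-contained, restoring entities, relations, and qualifiers."""
    prompt = (f"Rewrite the fact so it can be verified on its own, restoring any entities, "
              f"relations, or qualifiers from the source.\nSource: {source}\nFact: {fact}")
    return call_llm(prompt).strip()

def refine_extraction(facts: List[str], source: str,
                      judges: List[Callable[[str], str]],
                      call_llm: Callable[[str], str]) -> List[str]:
    """Keep complete facts as-is; rewrite the ones flagged as incomplete."""
    return [refine(f, source, call_llm) if is_incomplete(f, source, judges) else f
            for f in facts]
```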
Implementation of \bench for Comprehensive Evaluation
In addition to \method, the paper introduces \bench, a benchmark crafted to evaluate both precision and recall in factual assessment. Unlike previous benchmarks that prioritize precision, \bench emphasizes recall by providing reference fact sets sourced from advanced LLMs and human-written answers. This dual focus enables a comprehensive assessment of factuality that also captures the often-overlooked question of coverage in long-form responses.
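As a rough illustration of how precision and recall can be scored against such a reference fact set, consider the sketch below. The `supported` and `covered` verifier callables are placeholders for whatever judging procedure the benchmark actually uses; they and the F1 aggregation are assumptions for illustration, not the paper's exact metric.

```python
# Illustrative precision/recall scoring of a response against a reference fact set.
from typing import Callable, List

def score(extracted_facts: List[str], reference_facts: List[str], response: str,
          supported: Callable[[str], bool],
          covered: Callable[[str, str], bool]) -> dict:
    # Precision: share of facts extracted from the response that are verified as supported.
    verified = sum(supported(f) for f in extracted_facts)
    precision = verified / max(len(extracted_facts), 1)
    # Recall: share of reference facts that the response actually covers.
    hits = sum(covered(ref, response) for ref in reference_facts)
    recall = hits / max(len(reference_facts), 1)
    # Harmonic mean, if a single balanced number is desired.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```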
Empirical Findings
The empirical evaluations demonstrate that \method substantially improves fact completeness: incomplete facts are reduced by 19.2% compared to SAFE, and missing facts decrease by 37% after \method's refinement. Benchmarking a range of LLMs with \bench also yields nuanced insights into model performance:
- Within the same model family, larger models tend to improve both precision and recall.
- High precision does not always equate to high recall, underscoring the necessity for a balanced evaluation of factuality.
- Closed-weight models achieve higher recall, but the strongest open-weight models reach competitive precision, indicating substantial progress in open-weight capabilities.
Implications and Future Directions
The improvements \method brings to handling complex factual relationships advance the theoretical understanding of factuality evaluation and carry practical weight for deploying LLMs in settings where the accuracy and comprehensiveness of information are paramount. \bench, in turn, gives researchers a benchmark that reports both precision and recall, offering a more complete picture of a model's factuality performance.
For future work, the paper suggests exploring more efficient fact-checking mechanisms to reduce computational cost and extend usability to real-time applications, and expanding human-curated reference fact sets to strengthen recall measurement and support even more rigorous factuality evaluation.
This paper constitutes a significant contribution to enhancing the reliability and accuracy of factuality evaluation in LLM-generated content, paving the way for advancements in this critical aspect of AI model assessment.