
Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers (2311.09000v3)

Published 15 Nov 2023 in cs.CL

Abstract: The increased use of LLMs across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels concerning the verifiability and factual inconsistencies found in LLM outputs. We further construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document, aiming to facilitate the evaluation of automatic fact-checking systems. Preliminary experiments show that FacTool, FactScore and Perplexity.ai are struggling to identify false claims, with the best F1=0.63 by this annotation solution based on GPT-4. Annotation tool, benchmark and code are available at https://github.com/yuxiaw/Factcheck-GPT.

Analysis of "Factcheck-GPT: End-to-End Fine-Grained Document-Level Fact-Checking and Correction of LLM Output"

The paper "Factcheck-GPT: End-to-End Fine-Grained Document-Level Fact-Checking and Correction of LLM Output" addresses the critical need for verifying the factual accuracy of outputs generated by LLMs like ChatGPT. This is particularly pertinent given the prevalence of factual errors and hallucinations in LLM outputs, which undermine their real-world applicability. The authors propose an end-to-end framework designed to detect and correct such inaccuracies, operating at a fine granularity level—claims, sentences, and document—enabling a nuanced and precise fact-checking process.

Core Contributions

The authors present several notable contributions:

  1. Pipeline Framework: They introduce a comprehensive fact-checking pipeline for LLM outputs consisting of multiple stages: decomposition, decontextualization, check-worthiness identification, evidence retrieval, stance detection, correction determination, and claim correction (a minimal sketch of these stages appears after this list). This pipeline allows for systematic processing of text to identify and correct factual errors at a granular level.
  2. Benchmark Dataset: The authors develop a document-level factuality benchmark comprising 94 ChatGPT-generated response pairs. This benchmark is structured to support different verification levels, providing a robust basis for evaluating and enhancing fact-checking methods used with LLMs.
  3. Annotation Tool: An annotation tool was designed to efficiently construct the factuality benchmark by supporting flexible customization of annotations with semi-automated assistance.
  4. Evaluation of Existing Tools: The paper evaluates existing fact-checking tools such as FacTool and FactScore on their dataset, highlighting significant gaps, especially in detecting false claims with the best F1 score at only 0.53.
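To make the seven stages of the pipeline concrete, the following minimal Python sketch wires them together around a generic LLM call. The `call_llm` helper, the prompts, and the naive sentence splitting are illustrative placeholders rather than the authors' implementation; the released Factcheck-GPT code should be consulted for the actual pipeline.

```python
# Minimal sketch of the seven-stage pipeline described above.
# The prompts and the `call_llm` helper are illustrative placeholders,
# not the authors' actual implementation.

from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str                      # stages 1-2: decomposed, decontextualized claim
    checkworthy: bool = True       # stage 3: check-worthiness identification
    evidence: list = field(default_factory=list)   # stage 4: retrieved evidence
    stance: str = "unverified"     # stage 5: supports / refutes / not enough info
    correction: str | None = None  # stages 6-7: revised claim, if one is needed


def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (e.g. GPT-4); returns raw text."""
    raise NotImplementedError


def fact_check_document(document: str) -> list[Claim]:
    claims = []
    for sentence in document.split(". "):                      # naive sentence split
        # Stages 1-2: decompose into atomic claims and decontextualize them.
        raw = call_llm(f"List the atomic, self-contained claims in: {sentence}")
        for claim_text in raw.splitlines():
            claim = Claim(text=claim_text.strip())
            # Stage 3: skip opinions, questions, and other unverifiable content.
            claim.checkworthy = "yes" in call_llm(
                f"Is this a verifiable factual claim? Answer yes/no: {claim.text}").lower()
            if not claim.checkworthy:
                claims.append(claim)
                continue
            # Stage 4: retrieve evidence (web search or a knowledge base in practice).
            claim.evidence = [call_llm(f"Give one passage of evidence about: {claim.text}")]
            # Stage 5: stance detection of the evidence with respect to the claim.
            claim.stance = call_llm(
                f"Does the evidence support or refute the claim?\n"
                f"Claim: {claim.text}\nEvidence: {claim.evidence[0]}").strip().lower()
            # Stages 6-7: decide whether a correction is needed and produce it.
            if "refute" in claim.stance:
                claim.correction = call_llm(
                    f"Rewrite the claim so it is consistent with the evidence.\n"
                    f"Claim: {claim.text}\nEvidence: {claim.evidence[0]}")
            claims.append(claim)
    return claims
```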

Results and Discussion

The initial evaluation demonstrates considerable room for improvement, as current systems struggle to precisely identify and correct false claims. The systematic benchmark and methodical approach underscore the complexities involved in automating fact-checking for large-scale, multi-layered model outputs such as those from LLMs.
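For context on how such scores are computed, the sketch below illustrates claim-level evaluation with "false" as the positive class, the setting in which the reported F1 figures are most informative. The label strings and data layout are assumptions for illustration, not the benchmark's actual schema.

```python
# Illustrative claim-level evaluation; the real Factcheck-Bench format and
# label set may differ.

def false_claim_f1(gold: list[str], pred: list[str]) -> float:
    """F1 for detecting false claims, with 'false' as the positive class."""
    tp = sum(g == "false" and p == "false" for g, p in zip(gold, pred))
    fp = sum(g != "false" and p == "false" for g, p in zip(gold, pred))
    fn = sum(g == "false" and p != "false" for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Example: a system that misses half of the false claims.
gold = ["false", "true", "false", "true"]
pred = ["false", "true", "true", "true"]
print(false_claim_f1(gold, pred))  # 0.666...
```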

The authors argue for precise corrections, claiming that their fine-grained approach surpasses previous efforts that either do not correct errors or do so imprecisely with respect to claim-level fidelity. The proposed framework could, theoretically, produce clean corrected outputs while preserving the text's original intent and stylistic continuity.
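As a rough illustration of that claim-level splicing, the following sketch replaces each refuted claim's wording with its revision while leaving verified text untouched. The paper's correction stage is LLM-based and attends to fluency and context; this naive string substitution is only meant to convey the idea.

```python
# Naive sketch of producing a corrected document: substitute each refuted
# claim's original wording with its revision and leave everything else as-is.

def apply_corrections(document: str, corrections: dict[str, str]) -> str:
    revised = document
    for original, fixed in corrections.items():
        revised = revised.replace(original, fixed)
    return revised


doc = "Marie Curie won one Nobel Prize. She was born in Warsaw."
fixes = {"won one Nobel Prize": "won two Nobel Prizes"}
print(apply_corrections(doc, fixes))
# Marie Curie won two Nobel Prizes. She was born in Warsaw.
```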

Implications and Future Directions

Practically, a successful implementation of this framework could be pivotal in applications dependent on LLMs where high factual accuracy is required. Theoretically, the research refines the methodology for automating fact-checking by emphasizing a decomposed approach, in which the varied sub-problems can each be addressed with specialized techniques.

The paper points toward future work on refining evidence-retrieval techniques and on enhancing the granularity with which LLMs’ claims are verified against diverse, authoritative datasets. An intriguing direction would be integrating this framework with real-time data sources to further bolster accuracy and relevance.

Furthermore, the performance limitations highlighted indicate an opportunity for future models to integrate metadata processing, multi-document verification, and enhanced natural language understanding to overcome current challenges demonstrated by existing systems.

Overall, this paper ventures into critical territory, aiming to refine and advance the factual robustness of LLM outputs through meticulous, layered verification processes. As the field continues to grow, such research addresses the foundational need for automated credibility and reliability checks on AI-generated text.

Authors (13)
  1. Yuxia Wang (41 papers)
  2. Revanth Gangi Reddy (25 papers)
  3. Zain Muhammad Mujahid (7 papers)
  4. Arnav Arora (24 papers)
  5. Aleksandr Rubashevskii (7 papers)
  6. Jiahui Geng (24 papers)
  7. Osama Mohammed Afzal (9 papers)
  8. Liangming Pan (59 papers)
  9. Nadav Borenstein (13 papers)
  10. Aditya Pillai (4 papers)
  11. Isabelle Augenstein (131 papers)
  12. Iryna Gurevych (264 papers)
  13. Preslav Nakov (253 papers)
Citations (22)