VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation (2406.19276v1)

Published 27 Jun 2024 in cs.CL

Abstract: Existing metrics for evaluating the factuality of long-form text, such as FACTSCORE (Min et al., 2023) and SAFE (Wei et al., 2024), decompose an input text into "atomic claims" and verify each against a knowledge base like Wikipedia. These metrics are not suitable for most generation tasks because they assume that every claim is verifiable (i.e., can plausibly be proven true or false). We address this issue with VERISCORE, a metric for diverse long-form generation tasks that contain both verifiable and unverifiable content. VERISCORE can be effectively implemented with either closed or fine-tuned open-weight LLMs, and human evaluation confirms that VERISCORE's extracted claims are more sensible than those from competing methods across eight different long-form tasks. We use VERISCORE to evaluate generations from 16 different models across multiple long-form tasks and find that while GPT-4o is the best-performing model overall, open-weight models such as Mixtral-8x22 are closing the gap. We show that an LM's VERISCORE on one task (e.g., biography generation) does not necessarily correlate to its VERISCORE on a different task (e.g., long-form QA), highlighting the need for expanding factuality evaluation across tasks with varying fact density.

Authors (3)
  1. Yixiao Song
  2. Yekyung Kim
  3. Mohit Iyyer

Summary

  • The paper's main contribution is VeriScore, a metric that focuses exclusively on verifiable claims, overcoming limitations of earlier methods like FActScore and SAFE.
  • It implements claim extraction and verification with few-shot prompted GPT-4 or fine-tuned open-weight models, checking extracted claims against Google search results retrieved via the Serper API; human evaluation confirms its effectiveness.
  • Benchmarking across 16 different LLMs, including GPT-4o and open models, demonstrates its capability to differentiate factual performance across diverse long-form tasks.

An Expert Overview of "VeriScore: Evaluating the factuality of verifiable claims in long-form text generation"

The paper "VeriScore: Evaluating the factuality of verifiable claims in long-form text generation" presents an authoritative metric for assessing the factuality of long-form text outputs from LLMs. This metric, called VeriScore, introduces a nuanced approach to factuality evaluation that competently differentiates between verifiable and unverifiable content, addressing a crucial shortcoming in previous metrics like FActScore and SAFE.

Key Contributions

  1. Introduction of VeriScore: The paper introduces VeriScore, a metric designed to evaluate the factuality of diverse long-form text generation tasks by focusing exclusively on verifiable claims. Existing metrics such as FActScore and SAFE are limited by their assumption that every claim in the input text is verifiable, an assumption that does not hold true for many complex generation tasks like long-form question answering (LFQA). VeriScore overcomes this limitation by extracting only verifiable claims for evaluation.
  2. Claim Extraction and Verification: VeriScore refines the approach to decomposing text into verifiable claims. Unlike previous methods, it performs claim extraction with few-shot prompted closed models such as GPT-4 or with fine-tuned open-weight models (a minimal pipeline sketch follows this list). The extracted claims are verified against Google search results retrieved via the Serper API, ensuring that only verifiable content is assessed. Both the extraction and verification steps are validated through extensive human evaluation.
  3. Benchmarking Across Diverse Models and Domains: The paper benchmarks VeriScore across 16 different LLMs, including closed models like GPT-4 and Claude 3 and open-weight models such as Mixtral-8x22B. These models are evaluated across fact-seeking and creative domains, ranging from biography generation to responding to open-ended prompts from the ShareGPT dataset. The comprehensive evaluation highlights the superior factuality of the closed model GPT-4o and identifies a narrowing gap with advanced open-weight models like Mixtral-8x22B.
  4. Human Evaluation Studies: The paper includes thorough human evaluation studies for claim extraction and verification, revealing substantial improvements over existing methods. The studies confirm that the claims extracted by VeriScore are more sensible and contextually accurate compared to those from SAFE, especially in domains that involve complex, long-form tasks.
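
The extraction-then-verification loop described in contribution 2 can be illustrated with a short sketch. The Python snippet below is a minimal illustration under stated assumptions, not the authors' released code: the `extract_claims` and `judge` callables are hypothetical stand-ins for the LLM prompts (few-shot GPT-4 or a fine-tuned open-weight model), and only the endpoint and request format reflect the public Serper API.

```python
import requests

SERPER_URL = "https://google.serper.dev/search"
SERPER_KEY = "YOUR_SERPER_API_KEY"  # placeholder credential

def search_evidence(claim: str, num_results: int = 10) -> list[str]:
    """Retrieve Google search snippets for one claim via the Serper API."""
    resp = requests.post(
        SERPER_URL,
        headers={"X-API-KEY": SERPER_KEY, "Content-Type": "application/json"},
        json={"q": claim, "num": num_results},
    )
    resp.raise_for_status()
    return [hit.get("snippet", "") for hit in resp.json().get("organic", [])]

def verify_response(text: str, extract_claims, judge) -> tuple[int, int]:
    """Return (num_supported, num_extracted) for one model response.

    extract_claims: hypothetical LLM wrapper, text -> list of verifiable claims
    judge:          hypothetical LLM wrapper, (claim, evidence) -> bool
    """
    claims = extract_claims(text)  # unverifiable content is never extracted
    supported = sum(judge(c, search_evidence(c)) for c in claims)
    return supported, len(claims)
```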

Numerical Results and Bold Claims

  • Performance Metrics: Among the 16 evaluated models, GPT-4o emerged as the best performer overall with an average VeriScore of 66.5, demonstrating its capability to generate highly factual long-form text (a sketch of how such a score can be computed follows this list).
  • Inter-Task Variability: The paper highlights that an LLM's factuality performance can vary significantly across different tasks. For instance, GPT-4o's performance on biography generation does not necessarily correlate with its performance on long-form QA, emphasizing the need for multi-task evaluation frameworks.
  • Open-Weight Models Closing the Gap: The analyses also reveal that while GPT-4o leads in factuality, open-weight models like Mixtral-8x22B are increasingly competitive, showing robust improvements in VeriScore across tasks.
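
For concreteness, the sketch below shows how a score like the one reported for GPT-4o can be computed. It assumes VeriScore follows an F1@K formulation similar to SAFE's: precision is the fraction of extracted claims supported by search evidence, recall is capped at K supported claims (with K set per task), and the score is their harmonic mean. This is an assumed reconstruction, not a formula quoted from the overview above.

```python
def f1_at_k(num_supported: int, num_extracted: int, k: int) -> float:
    """F1@K-style factuality score (assumed formulation).

    precision: supported claims / extracted claims
    recall@K:  supported claims / K, capped at 1.0
    """
    if num_extracted == 0:
        return 0.0
    precision = num_supported / num_extracted
    recall = min(num_supported / k, 1.0)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 20 supported out of 28 extracted claims with K = 25
# precision ~= 0.714, recall = 0.800, F1@K ~= 0.755
assert abs(f1_at_k(20, 28, 25) - 0.7547) < 1e-3
```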

Implications and Future Developments

  1. Practical Implications: VeriScore has substantial practical implications for the deployment of LLMs in applications where factual accuracy is critical, such as educational tools, automated reporting, and scientific content generation. By focusing on verifiable claims, it ensures a more reliable assessment of model outputs, thereby enhancing user trust in LLM-generated content.
  2. Theoretical Implications and Model Training: From a theoretical standpoint, the paper underscores the importance of training models on diverse tasks and domains to improve their generalization in factuality. This insight could guide future model training and evaluation, where multi-domain datasets and nuanced metrics like VeriScore become standard.
  3. Future Research Directions: The paper sets the stage for future developments in AI, particularly in advancing retrieval-based verification methods and improving the interpretability of claim extraction and verification processes. Future research could explore the integration of more sophisticated search and reasoning capabilities to address the limitations identified by VeriScore, such as handling highly complex or context-dependent claims.

In conclusion, VeriScore represents a significant advancement in the evaluation of long-form text generation, providing a robust, scalable, and contextually nuanced approach to assessing the factuality of LLM outputs. The rigorous methodology and comprehensive benchmarking presented in the paper underscore its value to the research community, setting a new standard for factuality evaluation in AI-generated content.