In the paper "Speeding Up Long-Form Factuality Evaluation," the authors present vfsblue, a system designed to evaluate the factuality of long-form text generation efficiently. The work addresses the limitations of existing pipelines such as vsgray, which sequentially decompose a generated text into individual claims and verify each one. While these pipelines deliver precise factual evaluations, their computational cost makes them a major bottleneck in large-scale applications.
Methodology
vfsblue departs from traditional multi-step pipelines by integrating claim decomposition and verification into a single model pass. The authors fine-tune Llama3.1 8B on synthetic datasets so that it can process an entire response alongside evidence retrieved through Google Search. The model receives a large, potentially noisy evidence context, averaging around 4,000 tokens, and decomposes and verifies claims against it concurrently. Importantly, vfsblue does not extract claims before retrieval; instead, it issues sentence-level queries to Google Search and consolidates the returned evidence to validate the response, as sketched below.
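The following minimal sketch illustrates this single-pass flow. It is not the authors' code: `google_search` and `generate` are hypothetical placeholders for the retrieval API and the fine-tuned Llama3.1 8B inference call, and the sentence splitting and prompt wording are assumptions for illustration only.

```python
# Hypothetical sketch of a vfsblue-style single-pass evaluation.
# `google_search` and `generate` are placeholder callables, not the paper's code.

def evaluate_factuality(response: str, google_search, generate) -> dict:
    """Score a long-form response in one model pass over consolidated evidence."""
    # 1. Issue one search query per sentence of the response
    #    (no per-claim extraction before retrieval); naive split for illustration.
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    evidence_chunks = [google_search(sentence) for sentence in sentences]

    # 2. Consolidate all retrieved snippets into a single evidence context
    #    (around 4,000 tokens on average, per the paper).
    evidence_context = "\n\n".join(evidence_chunks)

    # 3. One pass of the fine-tuned model decomposes the response into claims
    #    and labels each claim against the (possibly noisy) evidence.
    prompt = (
        "Evidence:\n" + evidence_context + "\n\n"
        "Response:\n" + response + "\n\n"
        "List each factual claim in the response and label it "
        "'supported' or 'unsupported' given the evidence."
    )
    labeled_claims = generate(prompt)  # assumed to return [(claim, label), ...]

    supported = sum(1 for _, label in labeled_claims if label == "supported")
    return {"claims": labeled_claims,
            "factuality_score": supported / max(len(labeled_claims), 1)}
```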
Results
The approach achieves strong agreement with the vsgray pipeline, with a correlation of r=0.80 at the example level and r=0.94 at the system level. It is also dramatically faster: 6.6 times faster end to end and 9.9 times faster in model processing alone (excluding evidence retrieval). This speedup comes without sacrificing the precision or interpretability of the factuality scores and claim-level verification it produces.
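To make the distinction between example-level and system-level agreement concrete, here is an illustrative computation with made-up placeholder scores (not the paper's data): example-level correlation compares the two evaluators on individual responses, while system-level correlation compares their per-system averages.

```python
# Illustrative only; the scores below are invented placeholders, not the paper's data.
import numpy as np
from scipy.stats import pearsonr

# Per-example factuality scores from each evaluator, grouped by system.
vsgray_scores = {"sys_a": [0.7, 0.9, 0.6], "sys_b": [0.4, 0.5, 0.3], "sys_c": [0.8, 0.6, 0.7]}
vfsblue_scores = {"sys_a": [0.8, 0.9, 0.5], "sys_b": [0.4, 0.6, 0.3], "sys_c": [0.7, 0.6, 0.8]}

# Example-level: correlate the two evaluators over all individual responses.
x = np.concatenate([vsgray_scores[s] for s in vsgray_scores])
y = np.concatenate([vfsblue_scores[s] for s in vsgray_scores])
example_r, _ = pearsonr(x, y)

# System-level: correlate the per-system mean scores.
sys_x = [np.mean(vsgray_scores[s]) for s in vsgray_scores]
sys_y = [np.mean(vfsblue_scores[s]) for s in vsgray_scores]
system_r, _ = pearsonr(sys_x, sys_y)

print(f"example-level r = {example_r:.2f}, system-level r = {system_r:.2f}")
```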
Implications and Future Directions
The practical implications of this research are significant, offering a scalable solution for factuality evaluation in training paradigms like RLHF and in large-scale data analysis. Using a consolidated evidence context instead of per-claim retrieval also simplifies API usage and reduces its cost. Challenges remain, however, including the potential for omitted claims and the need to further validate label robustness in real-world scenarios.
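A back-of-the-envelope illustration of the cost argument, using hypothetical counts that are assumptions rather than figures from the paper:

```python
# Hypothetical numbers for illustration: a per-claim pipeline issues one search
# per extracted claim, while sentence-level retrieval issues one search per sentence.
sentences_per_response = 20   # assumed
claims_per_response = 60      # assumed; long-form text often yields several claims per sentence

per_claim_calls = claims_per_response        # vsgray-style retrieval
per_sentence_calls = sentences_per_response  # vfsblue-style consolidated retrieval

print(f"search calls: {per_claim_calls} vs {per_sentence_calls} "
      f"({per_claim_calls / per_sentence_calls:.1f}x fewer)")
```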
Moving forward, the research community could explore enhancements such as adaptive retrieval strategies that query for evidence dynamically based on preliminary evaluations, and refinements to claim extraction that mitigate the positional biases observed. Generating rationales alongside factuality labels could also improve the interpretability of, and user trust in, the outputs, especially in domains demanding high accuracy.
Conclusion
The development of vfsblue represents a significant advance in AI-driven factuality evaluation, addressing the time and resource constraints of prior models while maintaining high correlation with established pipelines. By releasing the model and datasets publicly, the authors enable further exploration and optimization of this approach, fostering more efficient and scalable factual verification in AI systems.