FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation (2305.14251v2)

Published 23 May 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Evaluating the factuality of long-form text generated by LLMs (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FACTSCOREs of people biographies generated by several state-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI -- and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FACTSCORE using retrieval and a strong LLM, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs that would have cost $26K if evaluated by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is available for public use via pip install factscore.

An Overview of FActScore: Evaluating Factual Precision in Text Generation

The paper "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation" addresses a significant challenge in the field of NLP: the evaluation of the factual correctness of texts generated by LLMs. It proposes a novel metric, FActScore, to address the inadequacies found in traditional binary evaluation methods and the constraints associated with human evaluations.

Problem Statement and Motivation

In text generation tasks, especially with long-form content, LLMs often produce a mix of correct and incorrect information. Binary measures, which judge the entire output as either factually correct or incorrect, fail to capture this nuance. Moreover, relying on human evaluators can be both time-consuming and financially prohibitive. The paper presents FActScore as a metric designed to quantify factual precision with finer granularity, thus providing more informative feedback about LLM outputs.

Methodology: FActScore Framework

FActScore decomposes a generated text into atomic facts, short statements that each convey a single piece of information, and checks each one against a reliable knowledge source, in this paper the English Wikipedia. The score is the percentage of atomic facts supported by that knowledge source, which gives a much clearer picture of the factual reliability of a generation than an all-or-nothing judgment. The authors conducted extensive human evaluations using this metric on people biographies generated by InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI, and found substantial differences in factual precision; ChatGPT, for instance, achieved a FActScore of only 58%.
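In notation close to the paper's (paraphrased here), with M a language model, x a prompt, A_y the set of atomic facts extracted from a response y, and C the knowledge source, the metric is

f(y) = \frac{1}{|A_y|} \sum_{a \in A_y} \mathbb{1}\left[a \text{ is supported by } \mathcal{C}\right]

\mathrm{FActScore}(M) = \mathbb{E}_{x \sim X}\left[\, f(M_x) \mid M_x \text{ responds} \,\right]

where M_x is the model's response to x and the expectation is taken over prompts for which the model does not abstain.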

Recognizing the need for scalable evaluation, the authors also developed an automated estimator of FActScore that combines passage retrieval with a strong LLM verifier, achieving an error rate of less than 2% relative to human judgments. They applied it to 6,500 generations from 13 recent LMs, an evaluation that would have cost roughly $26K with human annotators, and found that GPT-4 and ChatGPT are more factual than public models, with Vicuna and Alpaca among the strongest of the public models.
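The following Python sketch illustrates the general shape of such an estimator; it is not the released factscore package's API. The retrieve_passages and is_supported arguments are hypothetical stand-ins for the paper's Wikipedia retriever and LLM verifier, and the toy callables in the usage example exist only to make the sketch runnable.

# Illustrative sketch of a FActScore-style estimator (not the official factscore API).
from typing import Callable, List

def estimate_factscore(
    atomic_facts: List[str],                          # atomic facts split from one generation
    retrieve_passages: Callable[[str], List[str]],    # returns top-k knowledge-source passages for a fact
    is_supported: Callable[[str, List[str]], bool],   # judge (e.g. an LLM) deciding if the fact is supported
) -> float:
    """Return the fraction of atomic facts judged supported by the knowledge source."""
    if not atomic_facts:
        return 0.0
    supported = 0
    for fact in atomic_facts:
        passages = retrieve_passages(fact)
        if is_supported(fact, passages):
            supported += 1
    return supported / len(atomic_facts)

# Toy usage with trivial stand-ins; a real system would retrieve Wikipedia passages
# and prompt a strong LLM to answer True/False given the fact and the passages.
facts = ["Sewon Min is a researcher.", "FActScore was introduced in 2023."]
score = estimate_factscore(
    facts,
    retrieve_passages=lambda fact: ["(retrieved passage placeholder)"],
    is_supported=lambda fact, passages: True,
)
print(f"FActScore estimate: {score:.2f}")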

Experimental Results and Implications

A key finding of this research is the wide variability in factual precision across models. Proprietary models such as GPT-4 and ChatGPT exceed the factual precision of the publicly available models evaluated, providing a reference point for future work in text generation. These insights matter both to developers of LLMs and to users who rely on these models for accurate information synthesis.

Implications and Future Directions

The introduction of FActScore has significant implications for the field of AI and NLP. Practically, it provides a tool that can be instrumental in assessing and improving the reliability of text generated by LLMs. Theoretically, it opens avenues for further research into fine-grained evaluation metrics and the development of models that can autonomously understand and enhance their factual accuracy.

The paper also highlights the potential for similar metrics in assessing other qualitative attributes of text generation, such as coherence and relevance. Future research could explore the adaptation of FActScore to other domains beyond biographies or experiment with different knowledge bases to accommodate various languages and cultural contexts.

In conclusion, by addressing the critical challenge of factual evaluation in text generation, the paper provides a framework that both practitioners and researchers can use to measure and improve the factual accuracy of AI-generated content. The proposed methodology and findings are immediately useful and also lay a foundation for future work on factuality in LLMs.

Authors (9)
  1. Sewon Min (45 papers)
  2. Kalpesh Krishna (30 papers)
  3. Xinxi Lyu (5 papers)
  4. Mike Lewis (78 papers)
  5. Wen-tau Yih (84 papers)
  6. Pang Wei Koh (64 papers)
  7. Mohit Iyyer (87 papers)
  8. Luke Zettlemoyer (225 papers)
  9. Hannaneh Hajishirzi (176 papers)
Citations (447)