FaaF: Facts as a Function for the evaluation of generated text (2403.03888v3)
Abstract: The demand for accurate and efficient verification of information in texts generated by large language models (LMs) is at an all-time high, but remains unresolved. Recent efforts have focused on extracting and verifying atomic facts from these texts by prompting LM evaluators. However, we demonstrate that this method of prompting is unreliable when faced with incomplete or inaccurate reference information. We introduce Facts as a Function (FaaF), a new approach to the fact verification task that leverages the function-calling capabilities of LMs. FaaF significantly enhances the ability of LMs to identify unsupported facts in texts, while improving efficiency and substantially lowering cost compared to prompt-based methods. Additionally, we propose a framework for evaluating factual recall in Retrieval Augmented Generation (RAG) systems, which we employ to compare prompt-based and FaaF methods using various LMs under challenging conditions.
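The core idea, formulating a set of atomic facts as a single callable function whose arguments the LM must fill in, can be sketched as follows. This is a minimal illustration assuming an OpenAI-style function-calling API; the function name, field names, prompts, and model are illustrative assumptions, and the paper's exact schema may differ.

```python
# Minimal sketch of "facts as a function": each atomic fact becomes a
# boolean field of one function schema that the LM is forced to call.
# Assumes the OpenAI Python SDK (v1); names and prompts are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

def build_fact_function(facts: list[str]) -> dict:
    """Turn a list of atomic facts into a single function schema in which
    each fact is a boolean argument the LM must fill in."""
    properties = {
        f"fact_{i}": {
            "type": "boolean",
            "description": f"True only if the reference text supports: {fact}",
        }
        for i, fact in enumerate(facts)
    }
    return {
        "type": "function",
        "function": {
            "name": "verify_facts",  # hypothetical name
            "description": "Mark each fact true or false given the reference text.",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(properties),
            },
        },
    }

def verify(facts: list[str], reference: str) -> dict[str, bool]:
    """Verify all facts against a reference text in one structured call."""
    tool = build_fact_function(facts)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed; any function-calling model works
        messages=[
            {"role": "system",
             "content": "Verify each fact strictly against the reference text."},
            {"role": "user", "content": f"Reference text:\n{reference}"},
        ],
        tools=[tool],
        # Forcing the tool call makes the LM return structured booleans
        # for every fact at once, instead of free-form prose per fact.
        tool_choice={"type": "function", "function": {"name": "verify_facts"}},
    )
    args = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
    return {facts[i]: args[f"fact_{i}"] for i in range(len(facts))}
```

Because all facts in a text are verified in one structured call rather than one prompt per fact, this framing is consistent with the efficiency and cost gains over prompt-based verification described in the abstract.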