FaaF: Facts as a Function for the evaluation of generated text (2403.03888v3)

Published 6 Mar 2024 in cs.CL

Abstract: The demand for accurate and efficient verification of information in texts generated by large language models (LMs) is at an all-time high, but remains unresolved. Recent efforts have focused on extracting and verifying atomic facts from these texts via prompting LM evaluators. However, we demonstrate that this method of prompting is unreliable when faced with incomplete or inaccurate reference information. We introduce Facts as a Function (FaaF), a new approach to the fact verification task that leverages the function-calling capabilities of LMs. FaaF significantly enhances the ability of LMs to identify unsupported facts in texts, while also improving efficiency and significantly lowering costs compared to prompt-based methods. Additionally, we propose a framework for evaluating factual recall in Retrieval Augmented Generation (RAG) systems, which we employ to compare prompt-based and FaaF methods using various LMs under challenging conditions.
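The core idea in the abstract, casting fact verification as a function call rather than a free-form prompt, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the schema layout, field names (`fact_0`, `verify_facts`), and the mock response are all assumptions; real usage would pass the schema to an LM API that supports function calling.

```python
import json

def build_faaf_schema(facts):
    # Pack all atomic facts into a single function-calling schema:
    # each fact becomes one boolean field the LM must fill in,
    # so every fact is verified in one call instead of one prompt each.
    properties = {
        f"fact_{i}": {
            "type": "boolean",
            "description": f"True only if the reference text supports: {fact}",
        }
        for i, fact in enumerate(facts)
    }
    return {
        "name": "verify_facts",  # illustrative name, not from the paper
        "description": "Mark each fact as supported or unsupported.",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": list(properties),
        },
    }

def parse_verification(arguments_json, facts):
    # Map the LM's structured function-call arguments back onto the facts.
    # Missing fields default to False (unsupported).
    args = json.loads(arguments_json)
    return {fact: bool(args.get(f"fact_{i}", False))
            for i, fact in enumerate(facts)}

facts = ["FaaF uses function calling", "FaaF increases evaluation cost"]
schema = build_faaf_schema(facts)
# Hypothetical function-call arguments returned by an LM:
response = '{"fact_0": true, "fact_1": false}'
print(parse_verification(response, facts))
```

Because the LM is constrained to return structured arguments, the verdicts need no free-text parsing, which is part of why the abstract claims efficiency and cost gains over prompt-based extraction.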

