
Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators (2310.07289v1)

Published 11 Oct 2023 in cs.CL

Abstract: LLMs outperform information retrieval techniques on downstream knowledge-intensive tasks when prompted to generate world knowledge. However, community concerns abound regarding the factuality and potential implications of using this uncensored knowledge. In light of this, we introduce CONNER, a COmpreheNsive kNowledge Evaluation fRamework, designed to systematically and automatically evaluate generated knowledge from six important perspectives: Factuality, Relevance, Coherence, Informativeness, Helpfulness, and Validity. We conduct an extensive empirical analysis of the knowledge generated by three different types of LLMs on two widely studied knowledge-intensive tasks, i.e., open-domain question answering and knowledge-grounded dialogue. Surprisingly, our study reveals that lower factuality of the generated knowledge does not significantly hinder downstream tasks. Instead, the relevance and coherence of the outputs matter more than small factual mistakes. Further, we show how to use CONNER to improve knowledge-intensive tasks by designing two strategies: Prompt Engineering and Knowledge Selection. Our evaluation code and LLM-generated knowledge with human annotations will be released to facilitate future research.
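The Knowledge Selection strategy the abstract mentions suggests one concrete pattern: sample several knowledge candidates from an LLM, score each along the perspectives the paper finds most predictive of downstream quality (relevance and coherence), and keep the best. Below is a minimal sketch of that idea in Python. The callables `generate_knowledge`, `score_relevance`, and `score_coherence` are hypothetical placeholders, not the paper's actual API, and the equal weighting of the two scores is an illustrative assumption; the authors' released code defines the real interface.

```python
from typing import Callable, List


def select_knowledge(
    question: str,
    generate_knowledge: Callable[[str], str],    # placeholder: prompts an LLM for a knowledge passage
    score_relevance: Callable[[str, str], float],  # placeholder: relevance of passage to the question
    score_coherence: Callable[[str], float],       # placeholder: internal coherence of the passage
    num_candidates: int = 5,
) -> str:
    """Sample candidate knowledge passages and return the highest-scoring one."""
    candidates: List[str] = [
        generate_knowledge(question) for _ in range(num_candidates)
    ]

    def combined_score(passage: str) -> float:
        # Equal weights are an assumption made for this sketch only.
        return score_relevance(question, passage) + score_coherence(passage)

    return max(candidates, key=combined_score)
```

The selected passage would then be fed to the downstream QA or dialogue model in place of a retrieved one, mirroring how the paper uses its framework to filter generated knowledge before the downstream task consumes it.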
