
Chain of Natural Language Inference for Reducing Large Language Model Ungrounded Hallucinations (2310.03951v2)

Published 6 Oct 2023 in cs.CL and cs.AI

Abstract: Large language models (LLMs) can generate fluent natural language text when given relevant documents as background context. This ability has attracted considerable interest in developing industry applications of LLMs. However, LLMs are prone to generating hallucinations that are not supported by the provided sources. In this paper, we propose a hierarchical framework to detect and mitigate such ungrounded hallucinations. Our framework uses Chain of Natural Language Inference (CoNLI) for hallucination detection and reduces hallucinations via post-editing. Our approach achieves state-of-the-art performance on hallucination detection and enhances text quality through rewriting, using LLMs without any fine-tuning or domain-specific prompt engineering. We show that this simple plug-and-play framework can serve as an effective choice for hallucination detection and reduction, achieving competitive performance across various contexts.
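The abstract describes a detect-then-rewrite pipeline: sentences in an LLM response are checked against the source documents via natural language inference, and unsupported content is removed by post-editing. The sketch below illustrates one way such a plug-and-play loop could be wired up. It is a minimal sketch, not the paper's implementation: the `llm` callable, the prompts, and the naive sentence splitter are all illustrative assumptions, and the paper's hierarchical detection (which refines the sentence-level pass) is only noted in a comment.

```python
"""Hypothetical sketch of a CoNLI-style detect-then-rewrite loop.

Only the high-level flow follows the abstract; prompts, the `llm`
callable, and the sentence splitter are illustrative assumptions.
"""
from typing import Callable

# `llm` is any text-in/text-out completion function (e.g., a chat API wrapper).
LLM = Callable[[str], str]


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter; a real system would use a proper tokenizer."""
    return [s.strip() for s in text.replace("\n", " ").split(". ") if s.strip()]


def is_entailed(source: str, hypothesis: str, llm: LLM) -> bool:
    """Use the LLM as a zero-shot NLI judge: is `hypothesis` supported by `source`?"""
    prompt = (
        "Premise:\n" + source + "\n\n"
        "Hypothesis:\n" + hypothesis + "\n\n"
        "Is the hypothesis fully supported by the premise? Answer yes or no."
    )
    return llm(prompt).strip().lower().startswith("yes")


def detect_hallucinations(source: str, response: str, llm: LLM) -> list[str]:
    """Sentence-level detection pass. (The paper's framework is hierarchical;
    a finer-grained second pass over flagged sentences could refine this.)"""
    return [s for s in split_sentences(response) if not is_entailed(source, s, llm)]


def reduce_hallucinations(source: str, response: str, llm: LLM) -> str:
    """Detect unsupported sentences, then ask the LLM to post-edit them away."""
    flagged = detect_hallucinations(source, response, llm)
    if not flagged:
        return response  # everything is grounded; leave the text untouched
    prompt = (
        "Source documents:\n" + source + "\n\n"
        "Draft response:\n" + response + "\n\n"
        "These sentences are not supported by the source:\n- "
        + "\n- ".join(flagged)
        + "\n\nRewrite the draft, removing or correcting the unsupported "
        "content while keeping everything that is grounded."
    )
    return llm(prompt)
```

Because both detection and rewriting go through a generic `llm` callable, the loop is plug-and-play in the sense the abstract claims: it needs no fine-tuning, only prompting, and can wrap any response-generation system.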

Authors (7)
  1. Deren Lei (10 papers)
  2. Yaxi Li (4 papers)
  3. Mengya Hu (5 papers)
  4. Mingyu Wang (17 papers)
  5. Vincent Yun (2 papers)
  6. Emily Ching (4 papers)
  7. Eslam Kamal (5 papers)
Citations (34)