KnowHalu: Hallucination Detection via Multi-Form Knowledge Based Factual Checking (2404.02935v1)

Published 3 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: This paper introduces KnowHalu, a novel approach for detecting hallucinations in text generated by LLMs, utilizing step-wise reasoning, multi-formulation queries, multi-form knowledge for factual checking, and a fusion-based detection mechanism. As LLMs are increasingly applied across various domains, ensuring that their outputs are not hallucinated is critical. Recognizing the limitations of existing approaches that either rely on the self-consistency check of LLMs or perform post-hoc fact-checking without considering the complexity of queries or the form of knowledge, KnowHalu proposes a two-phase process for hallucination detection. In the first phase, it identifies non-fabrication hallucinations: responses that, while factually correct, are irrelevant or non-specific to the query. The second phase, multi-form knowledge-based factual checking, contains five key steps: reasoning and query decomposition, knowledge retrieval, knowledge optimization, judgment generation, and judgment aggregation. Extensive evaluations demonstrate that KnowHalu significantly outperforms SOTA baselines in detecting hallucinations across diverse tasks, e.g., improving by 15.65% in QA tasks and 5.50% in summarization tasks, highlighting its effectiveness and versatility in detecting hallucinations in LLM-generated content.
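
The two-phase process described in the abstract can be sketched as a toy pipeline. This is a minimal illustration only: all function names, the keyword-overlap heuristics, and the key-value "knowledge base" are hypothetical stand-ins for the paper's LLM-driven prompting, retrieval, and judgment components.

```python
# Hypothetical sketch of KnowHalu's two-phase hallucination detection.
# The real system uses LLM prompting and a retrieval stack at each step;
# here every step is a toy heuristic so the control flow is visible.

GENERIC_ANSWERS = {"i don't know", "it depends", "that's a good question"}

def phase1_non_fabrication_check(query: str, answer: str) -> bool:
    """Phase 1: flag non-fabrication hallucinations, i.e. answers that are
    non-specific or evade the query. Toy heuristic: empty or generic reply."""
    a = answer.strip().lower()
    return a == "" or a in GENERIC_ANSWERS

def decompose(query: str) -> list[str]:
    # Step 1: reasoning and query decomposition (stubbed: one sub-query).
    return [query]

def retrieve(sub_query: str, kb: dict[str, str]) -> str:
    # Step 2: knowledge retrieval from a toy key-value knowledge base.
    return kb.get(sub_query, "")

def optimize(knowledge: str) -> dict[str, str]:
    # Step 3: knowledge optimization into multiple forms
    # (here: the full passage plus a crude first-sentence summary).
    return {"unstructured": knowledge, "summary": knowledge.split(".")[0]}

def judge(answer: str, forms: dict[str, str]) -> list[bool]:
    # Step 4: judgment generation: one support verdict per knowledge form
    # (toy check: does any answer token appear in the knowledge form?).
    tokens = answer.lower().split()
    return [any(t in form.lower() for t in tokens) for form in forms.values()]

def aggregate(judgments: list[bool]) -> bool:
    # Step 5: judgment aggregation (here: at least half the forms agree).
    return sum(judgments) >= len(judgments) / 2

def detect_hallucination(query: str, answer: str, kb: dict[str, str]) -> bool:
    """Returns True if the answer is judged hallucinated."""
    if phase1_non_fabrication_check(query, answer):
        return True  # non-specific / irrelevant response
    supported = []
    for sub_query in decompose(query):
        forms = optimize(retrieve(sub_query, kb))
        supported.append(aggregate(judge(answer, forms)))
    return not all(supported)

if __name__ == "__main__":
    kb = {"Who wrote Hamlet?": "William Shakespeare wrote Hamlet. "
                               "It was first performed around 1600."}
    print(detect_hallucination("Who wrote Hamlet?", "William Shakespeare", kb))
    print(detect_hallucination("Who wrote Hamlet?", "Christopher Marlowe", kb))
    print(detect_hallucination("Who wrote Hamlet?", "I don't know", kb))
```

The point of the sketch is the structure, not the heuristics: each of the five steps is an independent, swappable stage, and Phase 1 short-circuits before any retrieval happens, which is how the paper separates relevance failures from factual failures.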

Authors (6)
  1. Jiawei Zhang (529 papers)
  2. Chejian Xu (18 papers)
  3. Yu Gai (9 papers)
  4. Freddy Lecue (36 papers)
  5. Dawn Song (229 papers)
  6. Bo Li (1107 papers)
Citations (5)