The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models (2401.03205v1)

Published 6 Jan 2024 in cs.CL

Abstract: In the era of LLMs, hallucination (i.e., the tendency to generate factually incorrect content) poses a great challenge to the trustworthy and reliable deployment of LLMs in real-world applications. To tackle LLM hallucination, three key questions should be well studied: how to detect hallucinations (detection), why do LLMs hallucinate (source), and what can be done to mitigate them (mitigation). To address these challenges, this work presents a systematic empirical study on LLM hallucination, focused on the three aspects of hallucination detection, source, and mitigation. Specifically, we construct a new hallucination benchmark, HaluEval 2.0, and design a simple yet effective detection method for LLM hallucination. Furthermore, we zoom into the different training and utilization stages of LLMs and extensively analyze the potential factors that lead to LLM hallucination. Finally, we implement and examine a series of widely used techniques to mitigate hallucinations in LLMs. Our work has led to several important findings for understanding the origin of hallucinations and mitigating them in LLMs. Our code and data can be accessed at https://github.com/RUCAIBox/HaluEval-2.0.

An Empirical Study on Factuality Hallucination in LLMs

This paper, "The Dawn After the Dark: An Empirical Study on Factuality Hallucination in LLMs," examines the pervasive issue of hallucination in LLMs, that is, the generation of factually incorrect content. While these models produce remarkably coherent text, they often generate information that is not grounded in reality, which poses significant challenges for their use in critical areas such as clinical diagnosis.

The paper targets three pivotal questions concerning hallucinations in LLMs: detection, source, and mitigation. The authors introduce HaluEval 2.0, a benchmark designed specifically to evaluate hallucination in these models. Comprising 8,770 questions spanning biomedicine, finance, science, education, and open-domain topics, the benchmark enables a comprehensive assessment of LLMs' propensity to hallucinate.
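As a concrete illustration of how such a benchmark might be consumed for evaluation, the sketch below groups questions by domain before prompting a model. The file name and the "domain"/"question" field names are assumptions for illustration, not the released dataset's actual schema.

```python
import json
from collections import defaultdict

# Hypothetical loader for a HaluEval 2.0-style question file. The file
# name and the "domain"/"question" keys are illustrative assumptions,
# not the repository's actual schema.
def load_benchmark(path="halueval2_questions.json"):
    with open(path, encoding="utf-8") as f:
        questions = json.load(f)  # expected: list of {"domain": ..., "question": ...}
    by_domain = defaultdict(list)
    for item in questions:
        by_domain[item["domain"]].append(item["question"])
    return by_domain

if __name__ == "__main__":
    for domain, qs in load_benchmark().items():
        print(f"{domain}: {len(qs)} questions")
```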

A novel detection method is proposed that uses a two-step approach: extracting factual statements from LLM outputs and evaluating them against world knowledge using an LLM. The method proved highly reliable, with a matching rate exceeding 90% against human-annotated benchmarks, demonstrating its effectiveness in identifying hallucinations.
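A rough sketch of this two-step detection pipeline is shown below. Here `call_llm` is a hypothetical placeholder for any chat-completion API, and the prompts are illustrative rather than the paper's exact templates.

```python
from typing import List

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to an evaluator LLM."""
    raise NotImplementedError

def extract_statements(response: str) -> List[str]:
    """Step 1: pull out the factual claims made in a model response."""
    prompt = (
        "List every factual claim in the following text, one per line:\n\n"
        + response
    )
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

def is_hallucinated(statement: str) -> bool:
    """Step 2: ask the evaluator LLM to judge a claim against world knowledge."""
    prompt = (
        "Based on world knowledge, is the following statement factually correct? "
        f"Answer yes or no.\n\nStatement: {statement}"
    )
    return call_llm(prompt).strip().lower().startswith("no")

def hallucination_rate(responses: List[str]) -> float:
    """Fraction of responses containing at least one hallucinated claim."""
    flagged = sum(
        any(is_hallucinated(s) for s in extract_statements(r)) for r in responses
    )
    return flagged / len(responses) if responses else 0.0
```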

Hallucination Sources

The paper explores multiple sources of hallucinations:

  1. Pre-training: The amount and type of data used in pre-training significantly influence hallucination rates. Models pre-trained with specialized datasets exhibit reduced hallucination in corresponding domains, confirming that domain-specific pre-training can mitigate these errors.
  2. Supervised Fine-Tuning: Fine-tuning on task-specific instructions increases the likelihood of hallucination, whereas fine-tuning on daily-chat instructions reduces it. Keeping instruction complexity balanced also helps minimize hallucination.
  3. Inference Methods: Different decoding strategies affect hallucination rates. Diversity-oriented decoding methods increase hallucinations in professional domains, while greedy search exacerbates hallucinations in open-ended domains (a decoding sketch follows this list).
  4. Prompt Design: Rich, detailed prompts reduce hallucination, especially in professional domains. Incorporating in-context examples and well-crafted task descriptions leads to lower hallucination rates.
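As referenced in point 3, the contrast between greedy search and diversity-oriented sampling can be reproduced with standard generation settings. The sketch below uses the Hugging Face transformers API; the model name and sampling hyperparameters are illustrative choices, not the paper's experimental configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any causal LLM, not the models studied in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The primary treatment for type 2 diabetes is"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy search: deterministic, picks the highest-probability token each step.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=40)

# Diversity-oriented sampling: top-p / temperature sampling trades determinism
# for variety, which the study links to more hallucination in professional domains.
sampled = model.generate(
    **inputs, do_sample=True, top_p=0.9, temperature=1.0, max_new_tokens=40
)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```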

Mitigation Strategies

Several strategies were evaluated for their efficacy in mitigating hallucinations:

  • RLHF (Reinforcement Learning from Human Feedback) aligns model outputs with human values, significantly lowering hallucination rates, especially in open domains.
  • Retrieval Augmentation dramatically reduces hallucinations by giving models access to accurate knowledge during generation, and is particularly effective for smaller models (a prompting sketch follows this list).
  • Self-Reflexion helps models rectify their mistakes in subsequent iterations, although its effectiveness hinges on model scale, showing significant impact only in larger models.
  • Advanced Decoding techniques that balance diversity and accuracy can effectively diminish hallucination rates.
  • Prompt Improvement via detailed task information and role definition, combined with Chain-of-Thought (CoT) prompting, can aid models with robust reasoning abilities in reducing hallucination.
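To make the retrieval-augmentation idea concrete, the sketch below prepends retrieved evidence to the question before querying the model. Here `retrieve` and `call_llm` are hypothetical placeholders for a search backend and a chat-completion API, and the prompt wording is illustrative rather than the paper's template.

```python
from typing import List

def retrieve(question: str, k: int = 3) -> List[str]:
    """Placeholder: return the top-k passages from a knowledge source."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call."""
    raise NotImplementedError

def answer_with_retrieval(question: str) -> str:
    """Ground the answer in retrieved evidence rather than parametric memory alone."""
    passages = retrieve(question)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the evidence below. "
        "If the evidence is insufficient, say so.\n\n"
        f"Evidence:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(prompt)
```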

Implications and Future Prospects

This empirical study provides crucial insights into the nature of hallucinations in LLMs and potential avenues for mitigating the issue. The findings have significant implications for deploying LLMs in settings that require factual correctness and reliability. As LLMs continue to evolve, understanding and controlling their tendency to hallucinate will be essential, and the strategies explored in this paper may serve as groundwork for future development. Domain-specific pre-training, sophisticated decoding strategies, and retrieval augmentation are critical considerations moving forward, especially as these models are integrated into more sensitive and high-stakes applications.

Authors (7)
  1. Junyi Li (92 papers)
  2. Jie Chen (602 papers)
  3. Ruiyang Ren (18 papers)
  4. Xiaoxue Cheng (12 papers)
  5. Wayne Xin Zhao (196 papers)
  6. Jian-Yun Nie (70 papers)
  7. Ji-Rong Wen (299 papers)
Citations (21)