A Survey of Automatic Hallucination Evaluation on Natural Language Generation (2404.12041v3)
Abstract: The proliferation of LLMs has introduced a critical challenge: accurately evaluating hallucinations to ensure model reliability. While Automatic Hallucination Evaluation (AHE) has become essential, the field suffers from methodological fragmentation, hindering both theoretical understanding and practical advancement. This survey addresses that gap through a comprehensive analysis of 74 evaluation methods, revealing that 74% specifically target LLMs, a paradigm shift that demands new evaluation frameworks. We formulate a unified evaluation pipeline encompassing datasets and benchmarks, evidence collection strategies, and comparison mechanisms, systematically documenting the evolution from pre-LLM to post-LLM methodologies. Beyond this taxonomy, we identify fundamental limitations of current approaches and their implications for real-world deployment. To guide future research, we delineate key challenges and propose strategic directions, including enhanced interpretability mechanisms and the integration of application-specific evaluation criteria, ultimately providing a roadmap toward more robust and practical hallucination evaluation systems.