DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models (2403.00896v3)
Abstract: Although large language models (LLMs) have achieved significant success in recent years, hallucination remains a challenge, and numerous benchmarks have been proposed to detect it. Nevertheless, some of these benchmarks are not naturally generated by LLMs but are intentionally induced. Also, many focus only on factuality hallucination while ignoring faithfulness hallucination. Additionally, although the dialogue pattern is more widely used in the era of LLMs, current benchmarks concentrate only on sentence-level and passage-level hallucination. In this study, we propose DiaHalu, to our knowledge the first dialogue-level hallucination evaluation benchmark. First, we integrate the collected topics into system prompts and facilitate a dialogue between two ChatGPT3.5 instances. Next, we manually modify the contents that do not adhere to human language conventions and then have the LLMs re-generate, simulating authentic human-machine interaction scenarios. Finally, professional scholars annotate all the samples in the dataset. DiaHalu covers four common multi-turn dialogue domains and five hallucination subtypes, extended from factuality and faithfulness hallucination. Experiments with several well-known LLMs and detection methods on the dataset show that DiaHalu is a challenging benchmark, holding significant value for further research.