On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models? (2204.07931v1)

Published 17 Apr 2022 in cs.CL

Abstract: Knowledge-grounded conversational models are known to suffer from producing factually invalid statements, a phenomenon commonly called hallucination. In this work, we investigate the underlying causes of this phenomenon: is hallucination due to the training data, or to the models? We conduct a comprehensive human study on both existing knowledge-grounded conversational benchmarks and several state-of-the-art models. Our study reveals that the standard benchmarks consist of >60% hallucinated responses, leading to models that not only hallucinate but even amplify hallucinations. Our findings raise important questions on the quality of existing datasets and models trained using them. We make our annotations publicly available for future research.

Analysis of Hallucinations in Knowledge-Grounded Conversational Models

The phenomenon of hallucination in conversational models, where systems generate factually incorrect content, presents a significant challenge in the field of AI. Despite efforts to mitigate this issue by enhancing model robustness, the underlying cause remains poorly understood. This paper by Dziri et al. investigates whether hallucinations originate from the datasets used for training or from the models themselves. Through a detailed human study of existing knowledge-grounded conversational benchmarks and state-of-the-art models, the research provides insight into the prevalent hallucination behavior observed in conversational AI systems.

The paper identifies that standard benchmarks such as Wizard of Wikipedia, CMU-DoG, and TopicalChat consist of more than 60% hallucinated responses. The authors used a classification taxonomy to distinguish responses supported by the accompanying knowledge snippets from those not verifiable against the provided evidence. The findings from both expert and non-expert annotations reveal a substantial amount of hallucination in dialogues, comprising subjective content such as personal opinions as well as unsupported factual information. This revelation calls into question the quality and suitability of the existing datasets for training knowledge-grounded conversational systems.
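
To make the auditing step concrete, the sketch below tallies per-dataset hallucination rates from a list of human annotation records. It is a minimal illustration only: the record structure and label names ("hallucination", "partial hallucination", "entailment") are assumptions for this example, not the paper's released annotation format.

```python
from collections import Counter

# Hypothetical annotation records: each benchmark response is labeled against the
# knowledge snippet it should be grounded in. Label names are illustrative
# placeholders, not the paper's exact taxonomy.
annotations = [
    {"dataset": "Wizard of Wikipedia", "label": "hallucination"},
    {"dataset": "Wizard of Wikipedia", "label": "entailment"},
    {"dataset": "CMU-DoG", "label": "hallucination"},
    {"dataset": "TopicalChat", "label": "partial hallucination"},
    # ... in practice these would be loaded from the released annotation files
]

def hallucination_rate(records, flagged=("hallucination", "partial hallucination")):
    """Per-dataset fraction of responses whose label marks unsupported content."""
    totals, hallucinated = Counter(), Counter()
    for r in records:
        totals[r["dataset"]] += 1
        if r["label"] in flagged:
            hallucinated[r["dataset"]] += 1
    return {d: hallucinated[d] / totals[d] for d in totals}

print(hallucination_rate(annotations))
# e.g. {'Wizard of Wikipedia': 0.5, 'CMU-DoG': 1.0, 'TopicalChat': 1.0}
```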

Moreover, the paper extends beyond dataset quality by evaluating several conversational models trained on these benchmarks, including GPT-2, DoHA, and CTRL. The results demonstrate that these models not only reflect the hallucination tendencies present in the training data but also amplify them during generation. In particular, GPT-2 exhibits more severe hallucination in its generated responses than is present in the training data, while CTRL, although it hallucinates less, tends to generate uncooperative responses that lack coherence with the conversational history. This indicates that both data quality and model design contribute to the hallucination problem.
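
The notion of amplification can be made concrete as a simple ratio: how much more often a model's generations are hallucinated than the gold responses it was trained on. The sketch below uses made-up rates purely for illustration; the actual per-model figures are those reported in the paper.

```python
def amplification_factor(train_rate: float, generated_rate: float) -> float:
    """Ratio of hallucination prevalence in model outputs to prevalence in the
    training responses; a value above 1.0 means the model amplifies hallucination
    rather than merely reproducing it."""
    if train_rate <= 0:
        raise ValueError("training hallucination rate must be positive")
    return generated_rate / train_rate

# Illustrative numbers only, not results from the paper: a benchmark whose gold
# responses are 60% hallucinated and a model whose generations are 75% hallucinated.
print(amplification_factor(0.60, 0.75))  # 1.25, i.e. 25% relative amplification
```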

The implications of these findings are profound. Practically, they highlight the necessity for improved data curation processes and model training methodologies to advance the reliability of dialogue systems in applications ranging from customer service to healthcare. Theoretically, this work underscores the need for in-depth exploration of algorithmic biases and training protocols that exacerbate hallucination. It further suggests potential research directions in refining evaluation metrics, developing faithful conversational AI models, and understanding the robustness of various decoding strategies.

Future research in artificial intelligence could benefit from these insights by re-evaluating the benchmarks traditionally used in the domain and by fostering novel approaches that address both data and modeling deficiencies. Recognizing hallucination as a multifaceted problem arising from the data, model learning dynamics, and pre-training biases could guide more effective strategies for mitigating its impact.

In conclusion, the paper offers a comprehensive audit of hallucination in dialogue systems, urging the AI community to prioritize dataset integrity and model robustness to achieve trustworthy conversational AI. As the conversational AI landscape continues to evolve, this paper serves as a critical reminder of the ongoing challenges and the importance of foundational work in ensuring high-quality AI deployments.

Authors (5)
  1. Nouha Dziri (39 papers)
  2. Sivan Milton (3 papers)
  3. Mo Yu (117 papers)
  4. Osmar Zaiane (43 papers)
  5. Siva Reddy (82 papers)
Citations (173)