RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models (2401.00396v2)

Published 31 Dec 2023 in cs.CL

Abstract: Retrieval-augmented generation (RAG) has become a main technique for alleviating hallucinations in LLMs. Despite the integration of RAG, LLMs may still present unsupported or contradictory claims to the retrieved contents. In order to develop effective hallucination prevention strategies under RAG, it is important to create benchmark datasets that can measure the extent of hallucination. This paper presents RAGTruth, a corpus tailored for analyzing word-level hallucinations in various domains and tasks within the standard RAG frameworks for LLM applications. RAGTruth comprises nearly 18,000 naturally generated responses from diverse LLMs using RAG. These responses have undergone meticulous manual annotations at both the individual cases and word levels, incorporating evaluations of hallucination intensity. We not only benchmark hallucination frequencies across different LLMs, but also critically assess the effectiveness of several existing hallucination detection methodologies. Furthermore, we show that using a high-quality dataset such as RAGTruth, it is possible to finetune a relatively small LLM and achieve a competitive level of performance in hallucination detection when compared to the existing prompt-based approaches using state-of-the-art LLMs such as GPT-4.

Analysis of Hallucination in Retrieval-Augmented LLMs with RAGTruth

The paper "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented LLMs" provides a comprehensive examination of hallucinations in retrieval-augmented generation (RAG) frameworks. RAG is an instrumental technique in minimizing hallucination phenomena inherent in LLMs by integrating retrieval mechanisms into generation processes, thereby enriching the context with relevant information. Despite advancements in RAG, LLMs still occasionally make unsupported claims or contradict retrieved contents, thus necessitating focused research and development in hallucination detection strategies.

Objectives and Contributions

The paper presents RAGTruth, a corpus tailored for studying word-level hallucinations across diverse domains and tasks standard in RAG frameworks. The dataset comprises nearly 18,000 naturally generated responses, meticulously annotated at both the case and word levels, with an additional assessment of hallucination intensity. Using RAGTruth, the paper makes several key contributions (an illustrative annotated record is sketched after this list):

  1. Dataset Introduction: RAGTruth places a spotlight on word-level hallucination evaluations, distinguishing itself by focusing on naturally generated responses within the RAG context. Compared to previous datasets, RAGTruth represents a significant leap in scale and scope, featuring detailed annotations necessary for in-depth hallucination analysis.
  2. Comparative Benchmarks: An extensive comparison of existing hallucination detection methodologies is performed. The analyses focus on both passage-level and word-level indicators, revealing the strengths and limitations of current approaches.
  3. Fine-Tuning Efficiency: Fine-tuning a relatively small model such as Llama-2-13B on RAGTruth yields hallucination detection performance competitive with prompt-based methods built on state-of-the-art models like GPT-4.
  4. Mitigation Success: The paper shows that the fine-tuned detector can be used to reduce hallucinations in generated responses, benefiting even models with already-low hallucination rates, such as GPT-4.
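
As referenced above, the following is a hypothetical sketch of what a single span-annotated record in a RAGTruth-style corpus could look like. The field names, the example response, and the label values are assumptions made for illustration; the released corpus defines its own schema.

```python
# Hypothetical RAGTruth-style record; field names and values are illustrative
# assumptions, not the corpus's actual schema.
record = {
    "task": "Summary",                        # e.g. QA, data-to-text, or news summarization
    "model": "llama-2-13b-chat",              # which LLM produced the response
    "reference": "Retrieved source text the response must stay faithful to ...",
    "response": "The hotel opened in 1999 and offers free parking.",
    "annotations": [
        {
            "span": "opened in 1999",         # word-level span judged against the reference
            "start": 10,                      # character offsets within the response
            "end": 24,
            "type": "Evident Conflict",       # one of the four hallucination categories
            "intensity": "high",              # annotated hallucination intensity
        }
    ],
}

# A response-level (case-level) label can be derived from the span annotations.
is_hallucinated = len(record["annotations"]) > 0
assert record["response"][10:24] == "opened in 1999"
print(is_hallucinated)
```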

Hallucination Categories and Analysis

Hallucinations are categorized into four types: evident conflict, subtle conflict, evident introduction of baseless information, and subtle introduction of baseless information. The paper reports how hallucination frequency varies across tasks such as question answering, data-to-text writing, and news summarization; data-to-text writing showed the highest rate, largely due to inconsistencies in handling structured data formats.
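
For clarity, the four-way taxonomy can be written out explicitly. The enum below simply names the categories described above; the examples in the comments are invented for illustration, not drawn from the corpus.

```python
from collections import Counter
from enum import Enum

class HallucinationType(Enum):
    """The four hallucination categories described above (comment examples are illustrative)."""
    EVIDENT_CONFLICT = "evident conflict"      # directly contradicts the source, e.g. a wrong date
    SUBTLE_CONFLICT = "subtle conflict"        # alters nuance, e.g. "improved" where the source says "changed"
    EVIDENT_BASELESS = "evident introduction of baseless information"  # adds facts absent from the source
    SUBTLE_BASELESS = "subtle introduction of baseless information"    # adds plausible but unsupported inferences

# Per-task frequencies can then be tallied to compare, e.g., QA vs. data-to-text.
labels = [HallucinationType.EVIDENT_CONFLICT, HallucinationType.SUBTLE_BASELESS]
print(Counter(labels))
```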

Discussion of Detection Methods

The experimental comparison covers prompt-based hallucination detection, SelfCheckGPT, LMvLM, and task-specific fine-tuning. Results indicate that fine-tuning on RAGTruth markedly improves detection performance, yet span-level detection remains challenging: current methods show limited precision and recall, underscoring the difficulty of pinpointing hallucinated spans exactly.
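
Because span-level evaluation is where current detectors struggle most, the sketch below shows one straightforward way to score predicted hallucination spans against annotated ones using character-offset overlap. This matching rule is an assumption for illustration; the paper's exact span-matching protocol may differ.

```python
# Minimal sketch of span-level precision/recall for hallucination detection,
# using character-offset overlap as the (assumed) matching criterion.

def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if two (start, end) character spans intersect."""
    return a[0] < b[1] and b[0] < a[1]

def span_precision_recall(predicted: list[tuple[int, int]],
                          gold: list[tuple[int, int]]) -> tuple[float, float]:
    # A predicted span counts as correct if it overlaps any gold span;
    # a gold span counts as found if any prediction overlaps it.
    tp_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    return precision, recall

# Example: one prediction partially overlaps the gold span, another is spurious.
print(span_precision_recall(predicted=[(10, 24), (40, 50)], gold=[(12, 30)]))
# -> (0.5, 1.0)
```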

Implications and Future Directions

The paper underscores the persistent challenge of identifying hallucinations in RAG settings, especially at finer granularity. It calls for continued work on hallucination detection, with RAGTruth serving as a benchmark for developing trustworthy LLM applications. The results also demonstrate the effectiveness of fine-tuned detectors, pointing to model specialization as a promising direction for hallucination detection.

RAGTruth marks a notable step in corpus construction aimed at the hallucination problem. By enabling both empirical benchmarking and methodological advances, it provides a foundation for subsequent work toward more reliable and accurate deployment of RAG-based systems in real-world scenarios.

References (48)
  1. Do language models know when they’re hallucinating references?
  2. Amos Azaria and Tom Mitchell. 2023. The Internal State of an LLM Knows When its Lying.
  3. Adversarial nli for factual correctness in text summarisation models. arXiv preprint arXiv:2005.11739.
  4. Language models are few-shot learners.
  5. Autohall: Automated hallucination dataset generation for large language models. ArXiv, abs/2310.00259.
  6. Felm: Benchmarking factuality evaluation of large language models.
  7. Factool: Factuality detection in generative ai – a tool augmented framework for multi-task and multi-domain scenarios.
  8. Lm vs lm: Detecting factual errors via cross examination.
  9. Chain-of-verification reduces hallucination in large language models.
  10. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220, Florence, Italy. Association for Computational Linguistics.
  11. RARR: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16477–16508, Toronto, Canada. Association for Computational Linguistics.
  12. On calibration of modern neural networks.
  13. A survey on automated fact-checking.
  14. Lora: Low-rank adaptation of large language models.
  15. Bschecker for fine-grained hallucination detection.
  16. Mistral 7b.
  17. Challenges and applications of large language models.
  18. Wice: Real-world entailment for claims in wikipedia. In Conference on Empirical Methods in Natural Language Processing.
  19. Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality. arXiv preprint arXiv:2302.14520.
  20. Retrieval-augmented generation for knowledge-intensive nlp tasks.
  21. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models.
  22. Pretrained language models for text generation: A survey. arXiv preprint arXiv:2201.05273.
  23. Truthfulqa: Measuring how models mimic human falsehoods.
  24. Alisa Liu and Jiacheng Liu. 2023. The memotrap dataset. https://github.com/liujch1998/memo-trap.
  25. Entity-based knowledge conflicts in question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7052–7063, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  26. Andrey Malinin and Mark Gales. 2021. Uncertainty estimation in autoregressive structured prediction.
  27. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories.
  28. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models.
  29. MS MARCO: A human generated machine reading comprehension dataset. CoRR, abs/1611.09268.
  30. OpenAI. 2023. Gpt-4 technical report.
  31. Training language models to follow instructions with human feedback.
  32. Med-halt: Medical domain hallucination test for large language models.
  33. A Survey of Hallucination in Large Foundation Models.
  34. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.
  35. " why is this misleading?": Detecting news headline hallucinations with explanations. arXiv preprint arXiv:2302.05852.
  36. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567.
  37. Label Studio: Data labeling software. Open source software available from https://github.com/heartexlabs/label-studio.
  38. Llama 2: Open foundation and fine-tuned chat models.
  39. Freshllms: Refreshing large language models with search engine augmentation.
  40. Asking and answering questions to evaluate the factual consistency of summaries.
  41. Yijun Xiao and William Yang Wang. 2021. On hallucination and predictive uncertainty in conditional language generation.
  42. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms.
  43. Yelp. 2021. Yelp open dataset. https://www.yelp.com/dataset. Accessed: 2023-11-03.
  44. R-tuning: Teaching large language models to refuse unknown questions. arXiv preprint arXiv:2311.09677.
  45. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models.
  46. A survey of large language models.
  47. Judging llm-as-a-judge with mt-bench and chatbot arena.
  48. Lima: Less is more for alignment.
Authors (8)
  1. Yuanhao Wu (4 papers)
  2. Juno Zhu (3 papers)
  3. Siliang Xu (3 papers)
  4. Kashun Shum (7 papers)
  5. Cheng Niu (15 papers)
  6. Randy Zhong (3 papers)
  7. Juntong Song (5 papers)
  8. Tong Zhang (569 papers)
Citations (42)