- The paper presents a novel dataset that leverages recency to train classifiers on detecting closed-domain hallucinations in LLM outputs.
- It distinguishes between parametric and contextual knowledge, using LLM internal states for enhanced hallucination detection.
- Experimental results show up to 75% test accuracy, with significant gains for unanswerable queries in models like Mistral-7B.
Detecting Closed-Domain Hallucinations in Retrieval-Augmented Generation Systems
The paper "The HalluRAG Dataset: Detecting Closed-Domain Hallucinations in RAG Applications Using an LLM's Internal States" by Fabian Ridder and Malte Schilling presents a detailed paper on the identification of hallucinations in LLMs. These hallucinations refer to instances where LLMs produce outputs that are ungrounded in any training or contextual data, posing significant challenges to the reliability of these models in practical applications.
Overview and Objectives
LLMs have become increasingly capable of generating coherent and contextually relevant text, particularly when paired with retrieval-augmented generation (RAG). However, these systems still frequently hallucinate, producing misleading or erroneous outputs even when built on extensive training data. The paper aims to improve hallucination detection by presenting HalluRAG, a dataset developed specifically to train classifiers that recognize closed-domain hallucinations, focusing on cases where the relevant knowledge cannot be present in the training data because it postdates the model's training cutoff.
Methodology
The authors distinguish between two types of knowledge within LLMs: parametric knowledge (information encoded in the weights during training) and contextual knowledge (information provided as input at inference time). The HalluRAG dataset focuses on closed-domain hallucinations and leverages recency to ensure that queries involve information the model could not have encountered during training.
The process of creating HalluRAG involves:
- Extracting recent data points from Wikipedia by checking article and reference timestamps to guarantee recency.
- Formulating questions and answers from this data using GPT-4o.
- Creating RAG prompts that are either answerable (relevant context provided) or unanswerable (the answer-bearing context withheld), ensuring a realistic test of RAG systems; a minimal sketch of this pipeline follows the list.
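The sketch below illustrates the two mechanical pieces of this pipeline: recency filtering and prompt assembly. It is an illustration under stated assumptions, not the authors' actual implementation (which additionally uses GPT-4o to generate the question-answer pairs); the cutoff date, the `is_recent` and `build_rag_prompt` helpers, and the passage layout are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical training cutoff for the target LLM (an assumption for this sketch,
# not a date taken from the paper).
TRAINING_CUTOFF = datetime(2023, 9, 1, tzinfo=timezone.utc)

def is_recent(article_timestamp: datetime, reference_timestamps: list[datetime]) -> bool:
    """Keep only passages whose article *and* cited references postdate the cutoff,
    so the information cannot be part of the model's parametric knowledge."""
    return article_timestamp > TRAINING_CUTOFF and all(
        ts > TRAINING_CUTOFF for ts in reference_timestamps
    )

def build_rag_prompt(question: str, passages: list[str], answerable: bool) -> str:
    """Assemble a RAG-style prompt. For the unanswerable variant, the passage that
    actually contains the answer is withheld, leaving only distractor context."""
    context = passages if answerable else passages[1:]  # passages[0] holds the answer (assumed layout)
    context_block = "\n\n".join(f"Passage {i + 1}: {p}" for i, p in enumerate(context))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"{context_block}\n\nQuestion: {question}\nAnswer:"
    )

# Toy usage: an unanswerable prompt, because the answer-bearing passage is dropped.
prompt = build_rag_prompt(
    question="(recent question generated from a new Wikipedia article)",
    passages=["(answer-bearing passage)", "(distractor passage A)", "(distractor passage B)"],
    answerable=False,
)
print(prompt)
```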
To train classifiers, the authors use internal states captured from the LLMs during generation, such as contextualized embedding vectors and intermediate activation values. These states, taken from several LLM configurations including LLaMA-2-7B and Mistral-7B, serve as inputs to multilayer perceptrons (MLPs) trained for hallucination detection.
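To make the probing setup concrete, here is a minimal sketch assuming a Hugging Face transformers workflow. The choice of layer, the single-token feature, and the MLP sizes are assumptions for illustration rather than the paper's exact configuration, and `extract_state`, `HallucinationMLP`, and `train_probe` are hypothetical names.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.1"  # one Mistral-7B variant; the paper's exact checkpoints may differ

def extract_state(prompt: str, model, tokenizer, layer: int = -1) -> torch.Tensor:
    """Return one feature vector: the hidden state of the last prompt token at `layer`.
    (The paper probes several internal signals; this picks a single one for illustration.)"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1, :].float().cpu()

class HallucinationMLP(nn.Module):
    """Small MLP probe; hidden size is illustrative, not the paper's configuration."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)  # raw logit; paired with BCEWithLogitsLoss below

def train_probe(features: torch.Tensor, labels: torch.Tensor, epochs: int = 20) -> HallucinationMLP:
    """Fit the probe on a (N, dim) tensor of extracted states and (N, 1) labels
    (1 = hallucinated, 0 = grounded)."""
    probe = HallucinationMLP(features.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(features), labels)
        loss.backward()
        opt.step()
    return probe

# Usage (downloads the full model weights):
# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map="auto")
# vec = extract_state("...RAG prompt...", model, tokenizer)
```

In practice the probe would be fit on states collected across many HalluRAG prompts, with held-out prompts providing the reported test accuracies.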
Results
The findings indicate that classifiers trained on HalluRAG achieve test accuracies of up to 75%, with notable performance variation across LLMs and quantization levels. For instance, internal states from Mistral-7B yielded higher detection accuracies, suggesting a potential advantage in its internal representation of language or its training methodology.
Significant test accuracy improvements were observed when training separate classifiers for answerable and unanswerable questions, with Mistral-7B demonstrating near-perfect accuracy on unanswerable prompts. These results highlight how distinguishing between types of queries can enhance the detection of hallucinations.
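Splitting along this axis is mechanically simple. A hedged sketch, reusing the hypothetical `train_probe` helper from the previous sketch and assuming each example carries an `answerable` flag:

```python
import torch

def split_and_train(features: torch.Tensor, labels: torch.Tensor, answerable_mask: torch.Tensor) -> dict:
    """Train one probe per question type; `answerable_mask` is a boolean tensor
    marking examples whose prompts contained the answer-bearing context."""
    probes = {}
    for name, mask in (("answerable", answerable_mask), ("unanswerable", ~answerable_mask)):
        probes[name] = train_probe(features[mask], labels[mask])
    return probes
```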
Implications and Future Work
The paper's insights stress the importance of diverse and representative training sets for effective hallucination detection. They also underline the recency criterion, building the dataset from information too new to appear in training data, as a methodological innovation for discerning hallucinations in LLM outputs.
Moving forward, there is ample room to enhance these detection methods by expanding the dataset to cover a wider variety of prompts and by continuing to refine LLM architectures toward better factual and contextual coherence. Moreover, the generalizability of these detectors across datasets and domains remains an essential focus for future research, as the current classifiers show limited cross-dataset applicability.
In summary, this paper contributes a structured approach to hallucination detection in LLMs, providing both a valuable dataset and methodological insights that could significantly improve the reliability of LLM applications in diverse fields.