- The paper introduces a novel prompting framework that triggers, detects, and mitigates self-contradictory hallucinations in LLMs.
- It employs a three-step methodology using constrained prompts, secondary LLM analysis, and iterative revision for contradiction resolution.
- Experimental results demonstrate robust detection (about 80% F1) and mitigation that removes up to 89.5% of self-contradictions across models such as GPT-4 and ChatGPT.
Analysis of Self-Contradictory Hallucinations in LLMs
The paper "Self-Contradictory Hallucinations of LLMs: Evaluation, Detection and Mitigation" presents an in-depth investigation into the phenomenon of self-contradictory hallucinations produced by LLMs. The authors, affiliated with ETH Zurich, explore the susceptibility of LLMs, like ChatGPT and GPT-4, to generate text containing such hallucinations, specifically focusing on instances where contradicting sentences occur within a single context. The key contribution of this work lies in a novel prompting-based framework designed to trigger, detect, and mitigate these discrepancies.
Self-contradictions turn out to be common in LLM outputs: for example, ChatGPT was found to produce self-contradictions in 17.7% of the sentences it generated during open-domain text generation. This highlights a critical reliability issue for LLM applications. Beyond evaluation, the paper contributes a publicly available tool for detecting and mitigating these hallucinations.
Framework and Methodology
The methodology is divided into three steps (a minimal sketch of the full loop follows the list):
- Triggering: Using contextually constrained prompts to elicit pairs of sentences from the LLM that may contradict each other.
- Detection: Applying a second model (an analyzer LLM) to judge whether each generated sentence pair is contradictory.
- Mitigation: Iteratively prompting for revisions that resolve the contradiction while preserving fluency and informativeness.
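The sketch below illustrates this loop under stated assumptions: it assumes an OpenAI-style chat client, and the helper `query_llm`, the default model name, and the prompt wording are placeholders rather than the paper's exact prompts.

```python
# Illustrative sketch of the trigger -> detect -> mitigate loop (not the
# paper's exact prompts or pipeline).
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def query_llm(prompt: str, model: str = "gpt-4") -> str:
    # Hypothetical wrapper around whichever chat model is being called.
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()

def trigger(context: str, sentence: str, generator: str) -> str:
    # Constrain the generator to restate the same point in the same context,
    # so any latent inconsistency surfaces as a second, comparable sentence.
    prompt = (f"Context:\n{context}\n\n"
              f"Write one sentence conveying the same information as:\n{sentence}")
    return query_llm(prompt, generator)

def detect(context: str, s1: str, s2: str, analyzer: str) -> bool:
    # A second (analyzer) LLM judges whether the sentence pair is contradictory.
    prompt = (f"Context:\n{context}\n\nSentence A: {s1}\nSentence B: {s2}\n"
              "Do these two sentences contradict each other? Answer Yes or No.")
    return query_llm(prompt, analyzer).lower().startswith("yes")

def mitigate(context: str, s1: str, s2: str, analyzer: str, max_iters: int = 3) -> str:
    # Iteratively revise sentence A until the analyzer no longer flags a
    # contradiction, removing the conflicting claim but keeping the rest.
    revised = s1
    for _ in range(max_iters):
        if not detect(context, revised, s2, analyzer):
            break
        prompt = (f"Context:\n{context}\n\nThese sentences contradict each other:\n"
                  f"A: {revised}\nB: {s2}\n"
                  "Rewrite sentence A so it no longer contradicts B; remove only the "
                  "conflicting claim and keep the rest of the information.")
        revised = query_llm(prompt, analyzer)
    return revised
```

In this sketch the same analyzer model both flags and rewrites the contradictory sentence; the paper's exact prompt chaining may differ.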
Notably, the proposed framework operates without external knowledge retrieval, a common yet cumbersome component of hallucination handling. Instead, it leverages the logical reasoning capabilities of contemporary LLMs, positing that a self-contradiction inherently signals non-factuality and can therefore be detected and resolved within the model's own reasoning.
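The underlying logic can be stated compactly (our notation, not the paper's): a contradictory pair is jointly unsatisfiable, so at least one of the two sentences must be false, and no external fact check is needed to conclude that.

```latex
% If sentences s1 and s2 contradict, their conjunction is unsatisfiable,
% so at least one of them is false -- non-factuality follows without retrieval.
(s_1 \wedge s_2) \vdash \bot \;\Longrightarrow\; \neg s_1 \vee \neg s_2
```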
Experimental Evaluation and Results
The authors evaluated four LLMs: ChatGPT, GPT-4, Llama2-70B-Chat, and Vicuna-13B. The evaluation involved generating and analyzing open-domain text descriptions of 30 diverse entities from Wikipedia. Results showed notable self-contradiction rates across all models, with the highest prevalence in the less capable Vicuna-13B. Detection was robust, reaching approximately 80% F1, and the mitigation step removed up to 89.5% of self-contradictions.
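For reference, the reported F1 is the usual harmonic mean of precision and recall over labeled sentence pairs. A minimal sketch, with hypothetical boolean predictions and gold labels:

```python
# Minimal sketch: precision/recall/F1 for contradiction detection over
# hypothetical boolean predictions and gold labels (one entry per sentence pair).
from typing import Sequence

def detection_f1(preds: Sequence[bool], gold: Sequence[bool]) -> float:
    tp = sum(p and g for p, g in zip(preds, gold))
    fp = sum(p and not g for p, g in zip(preds, gold))
    fn = sum(g and not p for p, g in zip(preds, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(detection_f1([True, False, True, True], [True, False, False, True]))  # 0.8
```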
In practice, mitigation preserved the informativeness and fluency of the revised text, producing only a small increase in perplexity, a standard proxy for fluency and naturalness. This suggests the mitigation process does not detract from overall coherence or information content, an encouraging result for the framework's applicability.
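Perplexity is the exponential of the mean token-level negative log-likelihood under a language model. A minimal sketch, assuming GPT-2 via Hugging Face transformers as an illustrative scorer (not necessarily the paper's choice of fluency model):

```python
# Minimal sketch: perplexity of a text under GPT-2, used purely as an
# illustrative fluency scorer (an assumption, not the paper's exact setup).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    # exp of the mean token-level negative log-likelihood
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return torch.exp(loss).item()

print(perplexity("The framework removes the conflicting claim and keeps the rest."))
```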
Broader Implications and Future Work
The research underscores the importance of addressing self-contradictions as a class of hallucinations that impair LLM reliability. Given that a significant portion of these contradictions cannot be verified against external knowledge, the proposed internal resolution approach represents a substantial advance in making LLMs more trustworthy.
Future work could extend the framework to handle contradictions across broader contexts within generated outputs. The authors also suggest fine-tuning open-source models to improve the accuracy of contradiction detection and mitigation, and potentially training models to avoid such inconsistencies during generation in the first place. The work has broader implications for deploying LLMs in knowledge-sensitive domains, paving the way for more reliable AI systems in settings that demand a high degree of factual integrity.