
GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework (2407.10793v1)

Published 15 Jul 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Methods to evaluate LLM responses and detect inconsistencies, also known as hallucinations, with respect to the provided knowledge, are becoming increasingly important for LLM applications. Current metrics fall short in their ability to provide explainable decisions, systematically check all pieces of information in the response, and are often too computationally expensive to be used in practice. We present GraphEval: a hallucination evaluation framework based on representing information in Knowledge Graph (KG) structures. Our method identifies the specific triples in the KG that are prone to hallucinations and hence provides more insight into where in the response a hallucination has occurred, if at all, than previous methods. Furthermore, using our approach in conjunction with state-of-the-art natural language inference (NLI) models leads to an improvement in balanced accuracy on various hallucination benchmarks, compared to using the raw NLI models. Lastly, we explore the use of GraphEval for hallucination correction by leveraging the structure of the KG, a method we name GraphCorrect, and demonstrate that the majority of hallucinations can indeed be rectified.

Citations (4)

Summary

  • The paper introduces GraphEval, a framework that decomposes LLM outputs into knowledge-graph triples for precise hallucination detection.
  • It integrates advanced NLI models with triple verification, enhancing accuracy and providing explainable insights into text inconsistencies.
  • GraphCorrect iteratively refines outputs by correcting detected inconsistencies, achieving higher ROUGE similarity scores and reduced hallucination rates.

Insightful Overview of "GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework"

Authors: Hannah Sansford, Nicholas Richardson, Hermina Petric Maretic, Juba Nait Saada

Introduction and Problem Context

The paper "GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework" addresses a critical challenge in the deployment of LLMs: the detection and correction of hallucinations. Hallucinations occur when LLMs generate outputs that appear plausible but are factually incorrect, even when given correct and constrained context. The issue is especially pressing in domains demanding high factual accuracy, such as medical diagnosis, which makes reliable evaluation methods essential.

Proposed Framework: GraphEval

The authors introduce GraphEval, a framework that leverages Knowledge Graphs (KGs) for hallucination detection in LLM-generated text. KGs represent information as (subject, predicate, object) triples, enabling a structured and systematic analysis of the text. GraphEval decomposes an LLM output into these triples and then checks each one for consistency against the given context using state-of-the-art Natural Language Inference (NLI) models.

Key Stages of GraphEval (a code sketch follows the list):

  1. KG Construction: The target LLM output is transformed into a KG, where entities and their relationships are clearly identified.
  2. Triple Verification: Each triple in the KG is then independently validated for consistency with the provided context using NLI models.
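
To make the two stages concrete, here is a minimal sketch of the pipeline. It is written under stated assumptions: `extract_triples` stands in for the paper's LLM-driven KG construction (the extraction prompt is not reproduced here), and `nli_scorer` is any premise/hypothesis entailment judge. The names and signatures are illustrative, not the authors' implementation.

```python
# Minimal sketch of the two-stage GraphEval pipeline (KG construction,
# then per-triple verification). Helper names are hypothetical.
from dataclasses import dataclass


@dataclass
class Triple:
    subject: str
    predicate: str
    object: str

    def as_sentence(self) -> str:
        # Linearize the triple so an off-the-shelf NLI model can score it.
        return f"{self.subject} {self.predicate} {self.object}."


def extract_triples(llm_output: str) -> list[Triple]:
    """Stage 1 (KG construction): prompt an LLM to emit (subject,
    predicate, object) triples covering the output. Stubbed here."""
    raise NotImplementedError("call an LLM with a triple-extraction prompt")


def verify_triples(context: str, triples: list[Triple], nli_scorer) -> list[tuple[Triple, bool]]:
    """Stage 2 (triple verification): each triple is checked against the
    context independently; triples failing entailment are flagged as
    likely hallucinations, which is what makes the decision explainable."""
    return [(t, nli_scorer(premise=context, hypothesis=t.as_sentence())) for t in triples]
```

Because every triple is scored independently, a failure points at a specific atomic claim rather than at the response as a whole.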

Methodological Insights

The innovation in GraphEval lies in the systematic breakdown of text for granular evaluation, thus providing specific insights into where hallucinations occur. The combined use of KGs and NLI enables a more explainable and potentially more accurate hallucination detection framework than existing approaches.

NLI Model Integration

To assess and validate GraphEval, three prominent NLI-based hallucination detection models are employed (a usage sketch follows the list):

  • HHEM: A model fine-tuned on a diverse set of datasets for factual consistency.
  • TRUE: Based on the T5-XXL architecture, trained on multiple NLI datasets.
  • TrueTeacher: Incorporates synthetic data to enhance its ability to detect inconsistencies.
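
For illustration, verifying a triple against a context reduces to scoring a premise/hypothesis pair with an NLI model. The sketch below uses a generic Hugging Face cross-encoder checkpoint as an assumed stand-in for HHEM, TRUE, or TrueTeacher; any model exposing an "entailment" label works the same way.

```python
# Illustrative premise/hypothesis scoring with a generic NLI model.
# The checkpoint name is an assumption, not the paper's exact model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cross-encoder/nli-deberta-v3-base"  # assumed; swap for HHEM/TRUE/TrueTeacher
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)


def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis`."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1).squeeze()
    # Label order differs between checkpoints; look it up, don't hard-code it.
    entail_idx = model.config.label2id.get("entailment", 1)
    return probs[entail_idx].item()


context = "Marie Curie won two Nobel Prizes, in Physics and in Chemistry."
claim = "Marie Curie won the Nobel Prize in Literature."
print(entailment_prob(context, claim))  # low score -> triple flagged as hallucinated
```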

Performance of these models, when integrated with GraphEval, is compared against their standalone use across three benchmarks: SummEval, QAGS-C, and QAGS-X. Results indicate notable improvements in balanced accuracy, with GraphEval’s integration particularly beneficial for longer, more complex outputs.

Hallucination Correction: GraphCorrect

In addition to detection, the paper explores hallucination correction through GraphCorrect. This method uses detected inconsistent triples to iteratively correct the LLM output, thereby significantly reducing hallucinations while maintaining high similarity to the original text. This approach is benchmarked using ROUGE metrics, showing higher similarity scores and effective correction rates compared to simpler prompting strategies.
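
A minimal sketch of how such a correction loop might look, assuming correction is implemented by re-prompting an LLM once per flagged triple; `call_llm` and the prompt wording are illustrative stand-ins, not the paper's actual prompts.

```python
# GraphCorrect-style loop (sketch): rewrite only the spans expressing
# flagged claims, leaving the rest of the output untouched.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion client here")


def graph_correct(context: str, output: str, flagged_claims: list[str]) -> str:
    """`flagged_claims` are linearized triples the verification stage
    judged inconsistent with the context."""
    corrected = output
    for claim in flagged_claims:
        prompt = (
            "The following claim is inconsistent with the context.\n"
            f"Context: {context}\n"
            f"Claim: {claim}\n"
            f"Text: {corrected}\n"
            "Rewrite only the part of the text expressing this claim so it "
            "agrees with the context; leave everything else unchanged."
        )
        corrected = call_llm(prompt)
    return corrected
```

Targeting only the flagged triples is what keeps the corrected text close to the original, consistent with the high ROUGE similarity the paper reports.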

Implications and Future Directions

GraphEval offers a scalable and explainable framework for improving the reliability of LLM-generated content across a range of applications. Further gains may come from better KG-construction methods and from integration with larger, more context-aware LLMs, both promising avenues for future research.

Conclusion

GraphEval represents a significant advance in hallucination detection and correction for LLMs. By combining the structured representation of KGs with the analytical rigor of NLI models, it provides a robust framework for improving the factual consistency of machine-generated text. Future work can extend the approach to open-domain settings and refine the correction mechanisms to build on these findings.

References

The paper draws on several prominent resources, including BLEU, ROUGE, BERTScore, and the NLI datasets FEVER, SNLI, and MNLI, to contextualize and validate the proposed framework. For detailed methodology and the full reference list, readers should consult the original paper.
