
Evaluating LLMs' Assessment of Mixed-Context Hallucination Through the Lens of Summarization (2503.01670v1)

Published 3 Mar 2025 in cs.CL, cs.AI, cs.CY, cs.IR, and cs.LG

Abstract: With the rapid development of LLMs, LLM-as-a-judge has emerged as a widely adopted approach for text quality evaluation, including hallucination evaluation. While previous studies have focused exclusively on single-context evaluation (e.g., discourse faithfulness or world factuality), real-world hallucinations typically involve mixed contexts, a setting that remains inadequately evaluated. In this study, we use summarization as a representative task to comprehensively evaluate LLMs' capability to detect mixed-context hallucinations, specifically distinguishing between factual and non-factual hallucinations. Through extensive experiments across direct-generation and retrieval-based models of varying scales, our main observations are: (1) LLMs' intrinsic knowledge introduces inherent biases into hallucination evaluation; (2) these biases particularly impact the detection of factual hallucinations, yielding a significant performance bottleneck; (3) the fundamental challenge lies in effective knowledge utilization, balancing LLMs' intrinsic knowledge against external context for accurate mixed-context hallucination evaluation.
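
The paper's key distinction is a three-way taxonomy: a summary claim can be faithful to the source, a factual hallucination (unsupported by the source yet true in the world), or a non-factual hallucination (unsupported and untrue). The sketch below shows what an LLM-as-a-judge classifier for this setup might look like. It is a minimal illustration only, assuming the OpenAI Python client and a `gpt-4o` judge; the prompt wording and label strings are assumptions, not the paper's actual evaluation protocol.

```python
# Minimal sketch of a mixed-context hallucination judge. Assumptions:
# OpenAI Python client, OPENAI_API_KEY set in the environment, and an
# illustrative prompt that is NOT the paper's actual evaluation protocol.
from openai import OpenAI

client = OpenAI()

LABELS = {"faithful", "factual hallucination", "non-factual hallucination"}

PROMPT = """Source document:
{source}

Summary claim:
{claim}

Label the claim with exactly one of:
- faithful: supported by the source document.
- factual hallucination: not supported by the source, but true in the real world.
- non-factual hallucination: not supported by the source and false in the real world.

Answer with the label only."""


def judge_claim(source: str, claim: str, model: str = "gpt-4o") -> str:
    """Ask the judge model for a mixed-context hallucination label on one claim."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judging
        messages=[{"role": "user",
                   "content": PROMPT.format(source=source, claim=claim)}],
    )
    label = response.choices[0].message.content.strip().lower()
    if label not in LABELS:
        raise ValueError(f"Unexpected judge output: {label!r}")
    return label
```

A judge that leans too heavily on its intrinsic knowledge will tend to label factual hallucinations as "faithful" simply because the claim sounds true, which is precisely the bottleneck the abstract identifies: the model must ground faithfulness in the source while reserving world knowledge for the factual/non-factual split.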

Authors (4)
  1. Siya Qi (6 papers)
  2. Rui Cao (65 papers)
  3. Yulan He (113 papers)
  4. Zheng Yuan (117 papers)