
Evaluating LLMs' Assessment of Mixed-Context Hallucination Through the Lens of Summarization (2503.01670v1)

Published 3 Mar 2025 in cs.CL, cs.AI, cs.CY, cs.IR, and cs.LG

Abstract: With the rapid development of LLMs, LLM-as-a-judge has emerged as a widely adopted approach for text quality evaluation, including hallucination evaluation. While previous studies have focused exclusively on single-context evaluation (e.g., discourse faithfulness or world factuality), real-world hallucinations typically involve mixed contexts, a setting that remains inadequately evaluated. In this study, we use summarization as a representative task to comprehensively evaluate LLMs' capability to detect mixed-context hallucinations, specifically distinguishing between factual and non-factual hallucinations. Through extensive experiments across direct-generation and retrieval-based models of varying scales, our main observations are: (1) LLMs' intrinsic knowledge introduces inherent biases into hallucination evaluation; (2) these biases particularly impact the detection of factual hallucinations, yielding a significant performance bottleneck; (3) the fundamental challenge lies in effective knowledge utilization, balancing LLMs' intrinsic knowledge against external context for accurate mixed-context hallucination evaluation.
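
The paper's key distinction is a three-way taxonomy: a summary claim can be faithful to the source, a factual hallucination (unsupported by the source yet true in the world), or a non-factual hallucination (unsupported and untrue). The sketch below shows what an LLM-as-a-judge classifier for this setup might look like. It is a minimal illustration only, assuming the OpenAI Python client and a `gpt-4o` judge; the prompt wording and label strings are assumptions, not the paper's actual evaluation protocol.

```python
# Minimal sketch of a mixed-context hallucination judge. Assumptions:
# OpenAI Python client, OPENAI_API_KEY set in the environment, and an
# illustrative prompt that is NOT the paper's actual evaluation protocol.
from openai import OpenAI

client = OpenAI()

LABELS = {"faithful", "factual hallucination", "non-factual hallucination"}

PROMPT = """Source document:
{source}

Summary claim:
{claim}

Label the claim with exactly one of:
- faithful: supported by the source document.
- factual hallucination: not supported by the source, but true in the real world.
- non-factual hallucination: not supported by the source and false in the real world.

Answer with the label only."""


def judge_claim(source: str, claim: str, model: str = "gpt-4o") -> str:
    """Ask the judge model for a mixed-context hallucination label on one claim."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judging
        messages=[{"role": "user",
                   "content": PROMPT.format(source=source, claim=claim)}],
    )
    label = response.choices[0].message.content.strip().lower()
    if label not in LABELS:
        raise ValueError(f"Unexpected judge output: {label!r}")
    return label
```

A judge that leans too heavily on its intrinsic knowledge will tend to label factual hallucinations as "faithful" simply because the claim sounds true, which is precisely the bottleneck the abstract identifies: the model must ground faithfulness in the source while reserving world knowledge for the factual/non-factual split.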

Authors (4)
  1. Siya Qi (6 papers)
  2. Rui Cao (65 papers)
  3. Yulan He (113 papers)
  4. Zheng Yuan (117 papers)