Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
81 tokens/sec
Gemini 2.5 Pro Premium
47 tokens/sec
GPT-5 Medium
22 tokens/sec
GPT-5 High Premium
20 tokens/sec
GPT-4o
88 tokens/sec
DeepSeek R1 via Azure Premium
79 tokens/sec
GPT OSS 120B via Groq Premium
459 tokens/sec
Kimi K2 via Groq Premium
192 tokens/sec
2000 character limit reached

SHALE: A Scalable Benchmark for Fine-grained Hallucination Evaluation in LVLMs (2508.09584v1)

Published 13 Aug 2025 in cs.CV

Abstract: Despite rapid advances, Large Vision-LLMs (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge, which correspond to faithfulness and factuality hallucinations, respectively. Prior studies primarily evaluate faithfulness hallucination at a coarse level (e.g., object-level) and lack fine-grained analysis. Additionally, existing benchmarks rely on costly manual curation or reused public datasets, raising concerns about scalability and data leakage. To address these limitations, we propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. We also design a hierarchical hallucination induction framework with input perturbations to simulate realistic noisy scenarios. Integrating these designs, we construct SHALE, a Scalable HALlucination Evaluation benchmark designed to assess both faithfulness and factuality hallucinations via a fine-grained hallucination categorization scheme. SHALE comprises over 30K image-instruction pairs spanning 12 representative visual perception aspects for faithfulness and 6 knowledge domains for factuality, considering both clean and noisy scenarios. Extensive experiments on over 20 mainstream LVLMs reveal significant factuality hallucinations and high sensitivity to semantic perturbations.

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-up Questions

We haven't generated follow-up questions for this paper yet.