Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges (2311.03287v2)

Published 6 Nov 2023 in cs.LG, cs.CL, and cs.CV

Abstract: While GPT-4V(ision) impressively models both visual and textual information simultaneously, it's hallucination behavior has not been systematically assessed. To bridge this gap, we introduce a new benchmark, namely, the Bias and Interference Challenges in Visual LLMs (Bingo). This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual LLMs: bias and interference. Here, bias refers to the model's tendency to hallucinate certain types of responses, possibly due to imbalance in its training data. Interference pertains to scenarios where the judgment of GPT-4V(ision) can be disrupted due to how the text prompt is phrased or how the input image is presented. We identify a notable regional bias, whereby GPT-4V(ision) is better at interpreting Western images or images with English writing compared to images from other countries or containing text in other languages. Moreover, GPT-4V(ision) is vulnerable to leading questions and is often confused when interpreting multiple images together. Popular mitigation approaches, such as self-correction and chain-of-thought reasoning, are not effective in resolving these challenges. We also identified similar biases and interference vulnerabilities with LLaVA and Bard. Our results characterize the hallucination challenges in GPT-4V(ision) and state-of-the-art visual-LLMs, and highlight the need for new solutions. The Bingo benchmark is available at https://github.com/gzcch/Bingo.

PDF Abstract

Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges

In the paper titled "Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges," the authors present a comprehensive evaluation of hallucination phenomena observed in large vision-LLMs (VLMs), with a particular focus on GPT-4V(ision). The paper introduces a specialized benchmark named Bingo, designed to systematically assess two prevalent types of hallucinations—bias and interference—within these models.

The paper identifies bias as a tendency for the model to hallucinate certain types of responses due to imbalances in the training data. Notably, GPT-4V(ision) exhibits regional bias by performing better at interpreting Western images in English compared to those from other regions or languages. This is quantified further where the model shows enhanced performance on Western-origin images but considerably lower accuracy on images from regions such as East Asia and Africa, highlighting a clear sociocultural skew.

Moreover, OCR bias emerges when dealing with multilingual text in images, where GPT-4V(ision) struggles with languages other than English or French, likely due to limitations in OCR detectors that introduce linguistic biases. For factual bias, the model sometimes relies excessively on learned factual knowledge, disregarding counterfactual elements present in images, which significantly deteriorates performance when processing input images that contradict this prior knowledge.

Interference refers to instances where the model's judgment is disrupted, leading to hallucination. Two types of interference are emphasized: image-to-image and text-to-image interference. Image-to-image interference manifests when GPT-4V(ision) interprets multiple similar images presented together, often resulting in confusion or incorrect object detection. Text-to-image interference occurs when human user's claims within text prompts override image content comprehension, an issue analogous to 'sycophancy' seen in traditional LLMs, where models' responses align with user statements regardless of visual evidence.

The authors conducted empirical analyses comparing GPT-4V(ision) with other VLMs like LLaVA-1.5 and Bard. LLaVA-1.5 showed considerably lower performance, especially in region and OCR biases, whereas Bard demonstrated better handling of OCR challenges, although it still fell short in interference scenarios compared to GPT-4V(ision).

Attempts to mitigate hallucinations include mechanisms like self-correction and Chain-of-Thought (CoT) reasoning. Although self-correction aids in reducing error instances, it does not fully address intrinsic bias and interference issues. CoT reasoning fails to significantly enhance image processing, possibly due to its design for improving language reasoning rather than visual analyses.

This paper calls to attention the necessity for developing novel solutions aimed specifically at addressing biases and interference inherent in VLMs. It advances the understanding of reliability within GPT-4V(ision) and other models, suggesting future directions for research include broader training datasets encompassing diverse regions and languages, along with algorithms capable of distinguishing actual image content from deliberate or unintended textual interferences.

PDF Markdown Bookmark Chat (Pro)

Authors (7)

Chenhang Cui (14 papers)
Yiyang Zhou (33 papers)
Xinyu Yang (109 papers)
Shirley Wu (12 papers)
Linjun Zhang (70 papers)
James Zou (232 papers)
Huaxiu Yao (103 papers)

Citations (73)

View on Semantic Scholar

Related Papers

Find Related Papers

GitHub

GitHub - gzcch/Bingo (53 stars)