Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
In the paper titled "Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges," the authors present a comprehensive evaluation of hallucination phenomena observed in large vision-LLMs (VLMs), with a particular focus on GPT-4V(ision). The paper introduces a specialized benchmark named Bingo, designed to systematically assess two prevalent types of hallucinations—bias and interference—within these models.
The paper identifies bias as a tendency for the model to hallucinate certain types of responses due to imbalances in the training data. Notably, GPT-4V(ision) exhibits regional bias by performing better at interpreting Western images in English compared to those from other regions or languages. This is quantified further where the model shows enhanced performance on Western-origin images but considerably lower accuracy on images from regions such as East Asia and Africa, highlighting a clear sociocultural skew.
Moreover, OCR bias emerges when dealing with multilingual text in images, where GPT-4V(ision) struggles with languages other than English or French, likely due to limitations in OCR detectors that introduce linguistic biases. For factual bias, the model sometimes relies excessively on learned factual knowledge, disregarding counterfactual elements present in images, which significantly deteriorates performance when processing input images that contradict this prior knowledge.
Interference refers to instances where the model's judgment is disrupted, leading to hallucination. Two types of interference are emphasized: image-to-image and text-to-image interference. Image-to-image interference manifests when GPT-4V(ision) interprets multiple similar images presented together, often resulting in confusion or incorrect object detection. Text-to-image interference occurs when human user's claims within text prompts override image content comprehension, an issue analogous to 'sycophancy' seen in traditional LLMs, where models' responses align with user statements regardless of visual evidence.
The authors conducted empirical analyses comparing GPT-4V(ision) with other VLMs like LLaVA-1.5 and Bard. LLaVA-1.5 showed considerably lower performance, especially in region and OCR biases, whereas Bard demonstrated better handling of OCR challenges, although it still fell short in interference scenarios compared to GPT-4V(ision).
Attempts to mitigate hallucinations include mechanisms like self-correction and Chain-of-Thought (CoT) reasoning. Although self-correction aids in reducing error instances, it does not fully address intrinsic bias and interference issues. CoT reasoning fails to significantly enhance image processing, possibly due to its design for improving language reasoning rather than visual analyses.
This paper calls to attention the necessity for developing novel solutions aimed specifically at addressing biases and interference inherent in VLMs. It advances the understanding of reliability within GPT-4V(ision) and other models, suggesting future directions for research include broader training datasets encompassing diverse regions and languages, along with algorithms capable of distinguishing actual image content from deliberate or unintended textual interferences.