Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
The research paper "Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding" addresses a critical challenge facing Large Vision-Language Models (LVLMs): their tendency to generate object hallucinations. These hallucinations, in which the model describes objects that do not exist in the image, undermine the reliability and trustworthiness of LVLM outputs. The paper proposes Visual Contrastive Decoding (VCD), a simple, training-free method that mitigates this issue effectively.
LVLMs have made significant progress in merging visual recognition with linguistic understanding, enabling them to produce coherent and contextually relevant content. Despite these advances, object hallucination remains a persistent problem. The paper identifies two primary causes: over-reliance on statistical biases in the training data and the influence of unimodal priors embedded in the underlying language models.
The proposed technique, VCD, contrasts the output distributions derived from original and distorted visual inputs. Introducing controlled visual uncertainty, in this case a Gaussian noise mask applied to the image, amplifies the model's reliance on statistical biases and language priors, so the distorted branch surfaces exactly the tokens most prone to hallucination. VCD then subtracts a scaled version of the logits obtained from the distorted input from those obtained from the original input. This contrastive adjustment counteracts the model's output bias and keeps generation more closely grounded in the visual evidence.
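To make the contrastive step concrete, the sketch below shows a single greedy VCD decoding step under simplifying assumptions: `model` is a hypothetical callable that returns next-token logits given the token history and an image, the noise schedule is an illustrative diffusion-style choice, and `alpha` (contrast weight) and `beta` (plausibility threshold) are placeholder hyperparameter values. It is a minimal sketch, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F


def add_gaussian_noise(image: torch.Tensor, noise_step: int = 500,
                       total_steps: int = 1000) -> torch.Tensor:
    """Distort the image with Gaussian noise, loosely following a
    diffusion-style forward process (the linear schedule is an assumption)."""
    betas = torch.linspace(1e-4, 0.02, total_steps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    a_bar = alphas_cumprod[noise_step]
    noise = torch.randn_like(image)
    return a_bar.sqrt() * image + (1.0 - a_bar).sqrt() * noise


@torch.no_grad()
def vcd_next_token(model, input_ids, image,
                   alpha: float = 1.0, beta: float = 0.1) -> torch.Tensor:
    """One greedy VCD decoding step.

    `model(input_ids, image)` is assumed to return next-token logits of
    shape (vocab_size,) conditioned on the text so far and the image.
    """
    logits_clean = model(input_ids, image)                      # original visual input
    logits_noisy = model(input_ids, add_gaussian_noise(image))  # distorted visual input

    # Contrastive adjustment: (1 + alpha) * clean - alpha * noisy
    contrastive = (1 + alpha) * logits_clean - alpha * logits_noisy

    # Plausibility constraint: keep only tokens whose probability under the
    # clean input is at least beta times the maximum clean-input probability.
    probs_clean = F.softmax(logits_clean, dim=-1)
    plausible = probs_clean >= beta * probs_clean.max()
    contrastive = contrastive.masked_fill(~plausible, float("-inf"))

    return contrastive.argmax(dim=-1)
```

In practice this step would simply replace the ordinary next-token logits inside a standard autoregressive decoding loop, which is why VCD composes with greedy search, sampling, or beam search without retraining the model.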
Experiments showcase the efficacy of VCD across several LVLM families, including LLaVA-1.5, InstructBLIP, and Qwen-VL. The method markedly reduces object hallucination without requiring additional training data or external tools; for instance, it yields up to a 7.4-point F1 improvement on specific benchmarks, underscoring its substantial impact. Beyond hallucination reduction, VCD also contributes positively to general LVLM benchmarks, indicating its broad applicability in enhancing visual-linguistic coherence.
The implications of this research are manifold. Practically, it supports the deployment of LVLMs in fields demanding high precision and reliability, such as healthcare, autonomous vehicles, and robotics. Theoretically, it advances our understanding of hallucination phenomena in multimodal models and suggests pathways for future research into the integration of cross-modal learning paradigms.
In conclusion, while LVLMs are at the frontier of AI models for visual and language data integration, addressing their hallucination-related limitations is crucial. The VCD approach constitutes a promising development in this regard. Future research may investigate varied distortion techniques and extend the approach to encompass a wider range of multimodal models or incorporate dynamic contextual cues to further enhance the robustness and reliability of LVLM outputs.