Aligning Modalities in Vision LLMs via Preference Fine-tuning
The paper "Aligning Modalities in Vision LLMs via Preference Fine-tuning" addresses a critical challenge in the function of Vision LLMs (VLLMs): the tendency to generate hallucinatory content not grounded in the provided images. Given the significant advancements in VLLMs, which integrate pre-trained vision and LLMs to execute tasks like image captioning and vision-question answering, this misalignment in representations often presents complications, particularly in high-stakes domains like medical diagnostics or autonomous driving.
Problem Context
VLLMs combine the capabilities of pre-trained vision encoders and large language models (LLMs). Because these components are trained independently, additional joint training on image-text pairs is needed to align them. Even so, current alignment methods still suffer from hallucinations, where the model produces descriptions that are inconsistent with the actual image, despite the vision encoder supplying high-quality visual features and the LLM generating otherwise fluent, factual language.
The hallucination issue is primarily attributed to misalignment between the image and text modalities. This misalignment prompts the model to rely excessively on prior commonsense knowledge or stereotypes embedded in training data, rather than the actual image data.
Proposed Method: POVID
To address this, the authors introduce a method termed Preference Optimization in VLLMs with AI-Generated Dispreferences (POVID). The method leverages a structured preference-tuning framework that eschews human-labeled data in favor of AI-generated feedback. POVID involves two primary stages:
- Hallucination Induction via Textual Response Manipulation: Ground-truth answers are taken as the preferred responses. GPT-4V is then used to generate dispreferred responses by introducing plausible, minor hallucinations into these correct answers, such as altered object co-occurrences, logical relationships, and entity attributes, yielding a direct contrast between factual and hallucinatory outputs (see the sketch after this list).
- Hallucination Induction via Image Distortion: The researchers provoke the model's inherent hallucinations by adding noise to the input images, which pushes the VLLM to rely on its textual priors rather than the visual input; the resulting responses expose the model's default errors so they can be penalized as dispreferred outputs (also sketched below).
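The two strategies can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the prompt wording, the `query_gpt4v` helper, and the noise schedule are placeholders chosen for clarity.

```python
import torch

# --- Stage 1: textual hallucination injection ---
# Hypothetical prompt; the actual POVID prompt is not reproduced here.
HALLUCINATION_PROMPT = (
    "Rewrite the following image caption so that it stays fluent and plausible "
    "but introduces subtle errors: swap in commonly co-occurring objects, "
    "alter entity attributes, or change logical relationships.\n\nCaption: {caption}"
)

def make_dispreferred_text(caption: str, query_gpt4v) -> str:
    """Return a hallucinated variant of a ground-truth caption.

    `query_gpt4v` is assumed to be a user-supplied callable that sends a
    prompt to GPT-4V (or any capable model) and returns its text response.
    """
    return query_gpt4v(HALLUCINATION_PROMPT.format(caption=caption))

# --- Stage 2: image distortion to trigger the model's own hallucinations ---
def distort_image(pixel_values: torch.Tensor, noise_scale: float = 0.5) -> torch.Tensor:
    """Blend an image tensor with Gaussian noise (diffusion-style corruption).

    With enough noise the VLLM falls back on its textual priors, so its
    response to the distorted image can serve as a dispreferred sample.
    """
    noise = torch.randn_like(pixel_values)
    alpha = 1.0 - noise_scale  # fraction of the original signal kept
    return alpha * pixel_values + noise_scale * noise
```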
By integrating these generated dispreferences into a Direct Preference Optimization (DPO) framework, the method allows for systematic correction and alignment of vision and language modalities.
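For concreteness, below is a minimal sketch of the standard DPO objective into which such (preferred, dispreferred) pairs can be plugged; it is not POVID's exact training loop. Sequence log-probabilities under the trainable policy and a frozen reference model are assumed to be precomputed, and `beta` is the usual DPO temperature hyperparameter.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over batches of preferred/dispreferred responses.

    Each argument is a batch of summed per-token log-probabilities for a
    response under either the trainable policy or the frozen reference model.
    """
    # Implicit rewards: log-ratio of policy vs. reference for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In POVID, the rejected log-probabilities would come from the AI-generated hallucinated answers and from responses conditioned on the distorted images, while the chosen ones come from the ground-truth answers.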
Findings and Implications
The paper reports substantial performance improvements across both hallucination benchmarks and broader comprehensive evaluation benchmarks. Results indicate a significant reduction in hallucinatory outputs compared to baseline VLLMs and other preference-tuning methods, with a reported average improvement of 12.4% in hallucination reduction. Gains were also observed on tasks such as detailed image description and visual question answering, illustrating applicability beyond hallucination reduction alone.
From a theoretical perspective, this approach underscores the potential of augmenting multimodal learning architectures with AI-generated dispreferences for modality alignment, which could be a promising direction for other multimodal AI systems experiencing similar integration challenges.
Future Directions
Although the POVID framework shows promise in aligning modalities and reducing hallucination, future exploration could focus on the adaptability of this preference-tuning technique to other multimodal contexts beyond vision-language interactions. It also prompts further investigation into the scalability of using AI-generated dispreferences across various model architectures and domains, potentially unlocking new efficiencies in AI model training paradigms.
Overall, this paper marks a significant step toward enhancing the fidelity and reliability of multimodal AI systems by addressing the persistent problem of hallucination through methodical preference fine-tuning. Such advances have important implications for deploying VLLMs in practical applications, enabling more dependable user interactions and interpretations.