Aligning Modalities in Vision LLMs via Preference Fine-tuning
The paper "Aligning Modalities in Vision LLMs via Preference Fine-tuning" addresses a critical challenge in the function of Vision LLMs (VLLMs): the tendency to generate hallucinatory content not grounded in the provided images. Given the significant advancements in VLLMs, which integrate pre-trained vision and LLMs to execute tasks like image captioning and vision-question answering, this misalignment in representations often presents complications, particularly in high-stakes domains like medical diagnostics or autonomous driving.
Problem Context
VLLMs combine the capabilities of pre-trained vision encoders and large language models (LLMs). Because these components are trained independently, additional joint training on image-text pairs is needed to align them. Even so, current alignment methods still suffer from hallucinations, where the model produces descriptions that are inconsistent with the actual image, despite the vision encoder supplying high-quality visual features and the LLM generating otherwise fluent, factual language.
The hallucination issue is primarily attributed to misalignment between the image and text modalities. This misalignment prompts the model to rely excessively on prior commonsense knowledge or stereotypes embedded in training data, rather than the actual image data.
Proposed Method: POVID
To address this, the authors introduce a method termed Preference Optimization in VLLMs with AI-Generated Dispreferences (POVID). The method leverages a structured preference-tuning framework that eschews human-labeled data in favor of AI-generated feedback. POVID involves two primary stages:
- Hallucination Induction via Textual Response Manipulation: Ground-truth answers are taken as the preferred responses. GPT-4V is then used to generate dispreferred responses by introducing plausible, minor hallucinations into these correct answers, such as altered object co-occurrences, logical relationships, and entity attributes, yielding a direct contrast between factual and hallucinatory outputs (see the sketch after this list).
- Hallucination Induction via Image Distortion: The researchers provoke the model's inherent hallucinations by adding noise to the input images, which pushes the VLLM to rely on its textual priors rather than the visual input; the resulting responses expose the model's default errors so they can be penalized as dispreferred outputs (also sketched below).
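The two strategies can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the prompt wording, the `query_gpt4v` helper, and the noise schedule are placeholders chosen for clarity.

```python
import torch

# --- Stage 1: textual hallucination injection ---
# Hypothetical prompt; the actual POVID prompt is not reproduced here.
HALLUCINATION_PROMPT = (
    "Rewrite the following image caption so that it stays fluent and plausible "
    "but introduces subtle errors: swap in commonly co-occurring objects, "
    "alter entity attributes, or change logical relationships.\n\nCaption: {caption}"
)

def make_dispreferred_text(caption: str, query_gpt4v) -> str:
    """Return a hallucinated variant of a ground-truth caption.

    `query_gpt4v` is assumed to be a user-supplied callable that sends a
    prompt to GPT-4V (or any capable model) and returns its text response.
    """
    return query_gpt4v(HALLUCINATION_PROMPT.format(caption=caption))

# --- Stage 2: image distortion to trigger the model's own hallucinations ---
def distort_image(pixel_values: torch.Tensor, noise_scale: float = 0.5) -> torch.Tensor:
    """Blend an image tensor with Gaussian noise (diffusion-style corruption).

    With enough noise the VLLM falls back on its textual priors, so its
    response to the distorted image can serve as a dispreferred sample.
    """
    noise = torch.randn_like(pixel_values)
    alpha = 1.0 - noise_scale  # fraction of the original signal kept
    return alpha * pixel_values + noise_scale * noise
```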
By integrating these generated dispreferences into a Direct Preference Optimization (DPO) framework, the method allows for systematic correction and alignment of vision and language modalities.
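For concreteness, below is a minimal sketch of the standard DPO objective into which such (preferred, dispreferred) pairs can be plugged; it is not POVID's exact training loop. Sequence log-probabilities under the trainable policy and a frozen reference model are assumed to be precomputed, and `beta` is the usual DPO temperature hyperparameter.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over batches of preferred/dispreferred responses.

    Each argument is a batch of summed per-token log-probabilities for a
    response under either the trainable policy or the frozen reference model.
    """
    # Implicit rewards: log-ratio of policy vs. reference for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In POVID, the rejected log-probabilities would come from the AI-generated hallucinated answers and from responses conditioned on the distorted images, while the chosen ones come from the ground-truth answers.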
Findings and Implications
The paper reports substantial performance improvements across both hallucination benchmarks and broader comprehensive evaluation benchmarks. Results indicate a significant reduction in hallucinatory outputs compared to baseline VLLMs and other preference-tuning methods, with a reported average improvement of 12.4% in hallucination reduction. Gains were also observed on tasks such as detailed image description and visual question answering, illustrating applicability beyond hallucination reduction alone.
From a theoretical perspective, this approach underscores the potential of augmenting multimodal learning architectures with AI-generated dispreferences for modality alignment, which could be a promising direction for other multimodal AI systems experiencing similar integration challenges.
Future Directions
Although the POVID framework shows promise in aligning modalities and reducing hallucination, future exploration could focus on the adaptability of this preference-tuning technique to other multimodal contexts beyond vision-language interactions. It also prompts further investigation into the scalability of using AI-generated dispreferences across various model architectures and domains, potentially unlocking new efficiencies in AI model training paradigms.
Overall, this paper marks a significant step toward enhancing the fidelity and reliability of multimodal AI systems by addressing the persistent problem of hallucination through methodical preference fine-tuning. Such advances have important implications for deploying VLLMs in practical applications, enabling more dependable user interactions and interpretations.