Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization (2411.10442v2)

Published 15 Nov 2024 in cs.CL and cs.CV

Abstract: Existing open-source multimodal LLMs (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in the Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset; and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach enhances the multimodal reasoning abilities of both InternVL2-8B and InternVL2-76B. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10$\times$ larger InternVL2-76B. We hope this study could inspire further advancements in MLLMs. Code, data, and model are released.


Summary

  • The paper introduces Mixed Preference Optimization (MPO), a method that enhances multimodal reasoning in MLLMs by addressing the distribution shift between training and inference.
  • It details a dual approach: an automated pipeline for constructing the MMPR preference dataset, and a training objective that blends supervised fine-tuning with preference optimization to improve reasoning and reduce hallucinations.
  • Experimental results show an 8.7-point accuracy improvement on MathVista and competitive performance with larger models, highlighting significant efficiency gains.

Enhancing the Reasoning Ability of Multimodal LLMs via Mixed Preference Optimization

The paper presents a methodological enhancement for Multimodal LLMs (MLLMs), focusing on improving multimodal reasoning through a training process termed Mixed Preference Optimization (MPO). The work addresses an inherent limitation of existing models: the distribution shift between training and inference that degrades Chain-of-Thought (CoT) reasoning performance.

Methodology Overview

The authors propose a dual approach to the challenges of multimodal reasoning. On the data side, they introduce an automated pipeline for constructing MMPR, a high-quality, large-scale multimodal reasoning preference dataset. The pipeline covers both samples with clear ground-truth answers, for which sampled responses can be ranked by correctness, and samples without them, for which Dropout Next-Token Prediction (Dropout NTP) generates rejected responses by having the model complete a truncated answer without access to the image; a minimal sketch of the latter is given below.
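As a concrete illustration, the following is a minimal, hypothetical sketch of the Dropout NTP idea. The `model.generate` interface, the truncation ratio, and all names are assumptions made for exposition, not the authors' actual implementation.

```python
# Hypothetical sketch of Dropout-NTP-style preference-pair construction.
# Assumes an MLLM wrapper exposing `generate(image, prompt, prefix)`;
# the API and the 0.5 truncation ratio are illustrative only.

def build_dropout_ntp_pair(model, image, question, truncate_ratio=0.5):
    """Build a (chosen, rejected) pair for a sample without a clear ground truth."""
    # Chosen: a full chain-of-thought response generated with the image.
    chosen = model.generate(image=image, prompt=question, prefix="")

    # Rejected: truncate the chosen response and let the model complete it
    # without seeing the image; the completion tends to drift or hallucinate.
    cut = int(len(chosen) * truncate_ratio)
    prefix = chosen[:cut]
    completion = model.generate(image=None, prompt=question, prefix=prefix)
    rejected = prefix + completion

    return {"question": question, "chosen": chosen, "rejected": rejected}
```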

On the model side, the development of MPO is the key contribution. MPO combines preference optimization with conventional generative training, blending a supervised fine-tuning (SFT) loss with a preference loss and a quality loss. The preference component is inspired by recent advances in NLP, particularly Direct Preference Optimization (DPO). By optimizing these objectives jointly, MPO narrows the distribution shift between training and inference and improves the model's ability to produce coherent, contextually appropriate CoT responses; a minimal sketch of the combined objective follows.
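The sketch below shows one plausible way to combine the three terms. The DPO formulation is standard; the specific quality term, the loss weights, `beta`, and the margin `delta` are assumptions chosen for illustration rather than the paper's exact settings.

```python
import torch.nn.functional as F

def mpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             chosen_token_nll,
             beta=0.1, w_pref=0.8, w_qual=0.2, w_sft=1.0, delta=0.0):
    """Weighted sum of preference (DPO-style), quality, and generation (SFT) losses.

    The *_logp tensors are summed log-probabilities of each full response under the
    policy and the frozen reference model; chosen_token_nll is the per-sequence
    negative log-likelihood of the chosen response. Weights and margins are
    illustrative defaults, not the authors' reported values.
    """
    # Log-ratios between policy and reference model for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp

    # Preference loss: DPO objective on the relative ranking of chosen vs. rejected.
    pref_loss = -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

    # Quality loss: an absolute term that rewards chosen responses and penalizes
    # rejected ones independently of the pairwise comparison.
    qual_loss = (-F.logsigmoid(beta * (chosen_ratio - delta))
                 - F.logsigmoid(-beta * (rejected_ratio - delta))).mean()

    # Generation loss: standard SFT negative log-likelihood on the chosen response,
    # keeping the policy close to the supervised training distribution.
    sft_loss = chosen_token_nll.mean()

    return w_pref * pref_loss + w_qual * qual_loss + w_sft * sft_loss
```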

Experimental Results

The effectiveness of MPO is demonstrated through extensive evaluations across multiple benchmarks, with the most pronounced gains in multimodal reasoning. The authors' model InternVL2-8B-MPO reaches 67.0 accuracy on MathVista, an 8.7-point improvement over its InternVL2-8B baseline, and performs comparably to InternVL2-76B, a model roughly ten times its size.

Additionally, the application of MPO demonstrates a reduction in hallucinations, a common issue in multimodal models, as evidenced by improvements in the POPE and MMHalBench benchmarks. These results underscore the algorithm’s effectiveness not only in reasoning tasks but also in improving the overall reliability and robustness of multimodal systems.

Implications and Future Directions

The theoretical implications of this research are profound, suggesting that systematically integrating preference optimization techniques with standard model training processes can substantially mitigate issues around response distribution shifts. The practical impact is equally significant; it enables more efficient resource utilization by achieving competitive performance with relatively smaller models, thus lowering the computational and environmental costs of deploying powerful AI solutions.

Looking forward, this work opens avenues for refining preference optimization methods, potentially exploring other dimensions of model training and inference discrepancies. Given the dataset's release and the open-source nature of the research, it serves as a foundation for future explorations into enhancing MLLM capabilities and addressing more comprehensive multimodal challenges.

In summary, this paper presents a significant step toward strengthening the reasoning capacity of MLLMs by effectively leveraging preference-based training, delivering improvements that bring model behavior closer to human-like reasoning and understanding in multimodal contexts. This contributes to the broader goal of building more capable and versatile AI systems able to reason in a nuanced, contextually aware way over complex multimodal inputs.