- The paper introduces Mixed Preference Optimization (MPO), a method that enhances reasoning in multimodal LLMs by addressing the distribution shift between training and inference.
- It details a dual approach: an automated pipeline for constructing the MMPR preference dataset on the data side, and a training objective that blends supervised fine-tuning with preference optimization on the model side, which also helps mitigate hallucinations.
- Experimental results show an 8.7-point accuracy improvement on MathVista and performance competitive with models roughly ten times larger, highlighting significant efficiency gains.
Enhancing the Reasoning Ability of Multimodal LLMs via Mixed Preference Optimization
The paper presents a methodological enhancement for Multimodal LLMs (MLLMs), focusing on improving multimodal reasoning through a training scheme termed Mixed Preference Optimization (MPO). The work targets an inherent limitation of existing models: the distribution shift between training and inference that degrades Chain-of-Thought (CoT) reasoning performance.
Methodology Overview
The authors propose a dual approach to the challenges of multimodal reasoning. On the data side, they introduce an automated pipeline for constructing a large-scale multimodal preference dataset, MMPR, which supplies the preference annotations needed to train models on multimodal reasoning tasks. The pipeline covers both samples where a clear ground truth is available and those where it is not, in the latter case using Dropout Next-Token Prediction (Dropout NTP) to generate less direct, yet still valuable, preference pairs (a sketch of both cases is given below).
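The following is a minimal sketch of how such preference pairs might be assembled under the two scenarios described above. The helpers `sample_responses`, `is_correct`, and `complete_without_image` are hypothetical placeholders, and the exact pairing rules are assumptions for illustration rather than the authors' actual implementation.

```python
# Hypothetical sketch of MMPR-style preference-pair construction.
# `sample_responses`, `is_correct`, and `complete_without_image` are
# illustrative placeholders, not the paper's actual code.

def build_pairs_with_ground_truth(image, question, answer,
                                  sample_responses, is_correct, k=8):
    """Correctness-based pairing (assumed): sample k CoT responses and split
    them by whether they reach the known ground-truth answer."""
    responses = sample_responses(image, question, n=k)
    chosen = [r for r in responses if is_correct(r, answer)]
    rejected = [r for r in responses if not is_correct(r, answer)]
    # Pair each correct response with an incorrect one (if both exist).
    return [
        {"image": image, "question": question, "chosen": c, "rejected": r}
        for c in chosen for r in rejected
    ]

def build_pairs_dropout_ntp(image, question, response,
                            complete_without_image, keep_ratio=0.5):
    """Dropout NTP (assumed form): truncate an existing response and let the
    model finish it without seeing the image; the blind completion serves as
    the rejected sample, the original full response as the chosen one."""
    cut = int(len(response) * keep_ratio)
    prefix = response[:cut]
    blind_completion = prefix + complete_without_image(question, prefix)
    return {"image": image, "question": question,
            "chosen": response, "rejected": blind_completion}
```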
On the model side, the development of MPO is the key contribution. MPO combines preference optimization with conventional generative training, blending a supervised fine-tuning (SFT) loss with a preference loss and a quality loss. The preference component draws on recent advances in NLP, particularly Direct Preference Optimization (DPO). By combining these terms, MPO targets the distribution shift between training and inference, thereby improving the model's ability to produce coherent and contextually appropriate CoT responses.
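To make the blended objective concrete, below is a minimal PyTorch-style sketch that combines a DPO preference term, an absolute-quality term, and the SFT negative log-likelihood. The loss weights, the reward-shift `delta`, and the exact form of the quality term are assumptions for illustration, not the paper's precise formulation.

```python
# Illustrative sketch of an MPO-style objective (assumed forms and weights,
# not the authors' exact implementation). Inputs are summed log-probabilities
# of the chosen/rejected responses under the policy and a frozen reference model.
import torch.nn.functional as F

def mpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             sft_nll, beta=0.1, delta=0.0,
             w_pref=0.8, w_quality=0.2, w_sft=1.0):
    # Implicit rewards: scaled log-ratios between policy and reference model.
    r_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    r_rejected = beta * (policy_rejected_logp - ref_rejected_logp)

    # Preference loss (DPO): relative quality of chosen vs. rejected.
    pref_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Quality loss (assumed BCO-style): absolute quality of each response,
    # with `delta` acting as a reward-shift baseline.
    quality_loss = (-F.logsigmoid(r_chosen - delta)
                    - F.logsigmoid(-(r_rejected - delta))).mean()

    # Generation loss: standard SFT negative log-likelihood on the chosen response.
    return w_pref * pref_loss + w_quality * quality_loss + w_sft * sft_nll
```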
Experimental Results
The effectiveness of the MPO approach is showcased through extensive evaluations across various benchmarks, with the most notable gains in multimodal reasoning. For instance, their model InternVL2-8B-MPO achieves an 8.7-point accuracy improvement on the MathVista benchmark over its baseline, InternVL2-8B, and rivals models roughly ten times its size, such as InternVL2-76B.
Additionally, the application of MPO reduces hallucinations, a common issue in multimodal models, as evidenced by improvements on the POPE and MMHalBench benchmarks. These results underscore the method's effectiveness not only in reasoning tasks but also in improving the overall reliability and robustness of multimodal systems.
Implications and Future Directions
On the theoretical side, the results suggest that systematically integrating preference optimization with standard generative training can substantially mitigate response-distribution shift. The practical impact is equally notable: achieving competitive performance with relatively small models enables more efficient resource utilization and lowers the computational and environmental costs of deploying powerful AI solutions.
Looking forward, this work opens avenues for refining preference optimization methods and for exploring other sources of discrepancy between training and inference. Given the dataset's release and the open-source nature of the research, it serves as a foundation for future work on enhancing MLLM capabilities and addressing broader multimodal challenges.
In summary, this paper takes a significant step toward strengthening the reasoning capacity of MLLMs by effectively leveraging preference-based training, delivering improvements that align more closely with human-like reasoning and understanding in multimodal contexts. It thereby contributes to the broader goal of building more intelligent and versatile AI systems capable of nuanced, contextually aware reasoning over complex multimodal inputs.