- The paper introduces MoDoMoDo, a systematic RLVR framework using multi-domain data mixtures to post-train MLLMs for improved generalization.
- Their framework employs an optimized data mixture strategy that improves out-of-distribution accuracy by 5.24% on average compared to uniform mixing.
- The findings suggest strategic data mixing in RLVR is crucial for enhancing MLLM reasoning and generalization across varied multimodal domains.
Multimodal LLM Reinforcement Learning with Multi-Domain Data Mixtures
The paper "MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning" introduces a systematic approach to post-training Multimodal LLMs (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR). The authors address the inherent complexity of vision-language tasks by proposing a framework that combines multiple datasets to strengthen the general reasoning capability of MLLMs across diverse domains. The work is notable for formulating data mixture selection as an explicit optimization objective within RLVR, aimed at improving generalization and reasoning performance.
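To make the RLVR setting concrete, below is a minimal sketch of a verifiable reward: a rule-based check that scores a model response against a known label, with no learned reward model. The `\boxed{}` answer delimiter and exact-match rule are illustrative assumptions, not the paper's exact reward definitions.

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the extracted final answer matches
    the ground-truth label, else 0.0. The \\boxed{} convention is assumed
    for illustration; real tasks may use other answer formats."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0
    answer = match.group(1).strip().lower()
    return 1.0 if answer == ground_truth.strip().lower() else 0.0

# Example: a VQA-style response whose final answer is delimited by \boxed{}.
print(verifiable_reward(r"The region is a lake, so \boxed{lake}", "lake"))  # 1.0
```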
Key Contributions
The authors present several contributions. First, they develop a multimodal RLVR framework for multi-dataset post-training, curating five distinct image-text datasets that contain verifiable vision-language problems and enabling multi-domain online RL with different verifiable reward functions. Second, the framework employs a data mixture strategy that optimizes the dataset combination by predicting the outcome of RL fine-tuning from the mixture distribution. Experimental results demonstrate the framework's effectiveness: the best mixture strategy improves the post-trained model's accuracy on out-of-distribution benchmarks by an average of 5.24% over uniformly mixed data, and by 20.74% overall over the pre-finetuning baseline.
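A mixture distribution over domains can be realized at training time by sampling each example's source dataset according to the mixture weights. The sketch below shows this in Python; the five placeholder domains and the specific weight values are hypothetical stand-ins, not the paper's curated datasets or learned mixture.

```python
import random

# Hypothetical five image-text domains with placeholder examples; in the
# paper these would be the five curated verifiable vision-language datasets.
datasets = [[f"domain{i}_example{j}" for j in range(100)] for i in range(5)]

def sample_batch(datasets, weights, batch_size, rng=None):
    """Draw a batch in which each example's source dataset is selected
    according to the mixture weights (a distribution over domains)."""
    rng = rng or random.Random(0)
    picks = rng.choices(range(len(datasets)), weights=weights, k=batch_size)
    return [rng.choice(datasets[d]) for d in picks]

uniform = [0.2] * 5                         # uniform-mixing baseline
optimized = [0.35, 0.05, 0.25, 0.15, 0.20]  # illustrative weights only
print(sample_batch(datasets, optimized, batch_size=8))
```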
Methodology
The framework casts data mixture selection for RLVR as a bi-level optimization problem: the outer level chooses the mixture distribution over datasets, while the inner level runs RL fine-tuning on data sampled from that mixture. Because each inner RL run is expensive, the authors fit surrogate models that predict downstream performance from a given mixture distribution, allowing efficient identification of mixture strategies expected to yield the best training results.
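The following sketch illustrates one way such a surrogate loop could work: fit a quadratic model mapping mixture weights to a benchmark score from a handful of pilot runs, then search the probability simplex for the mixture the surrogate scores highest. The quadratic feature map and Dirichlet random search are illustrative assumptions, not the paper's specific surrogate family or optimizer.

```python
import numpy as np

def quadratic_features(w):
    """Map mixture weights to linear plus pairwise-interaction features."""
    w = np.asarray(w, dtype=float)
    pairs = [w[i] * w[j] for i in range(len(w)) for j in range(i, len(w))]
    return np.concatenate([w, np.array(pairs)])

def fit_surrogate(mixtures, scores):
    """Least-squares fit of a quadratic surrogate that maps a mixture
    distribution to a predicted benchmark score, using a small number
    of pilot RLVR runs as training points."""
    X = np.stack([quadratic_features(w) for w in mixtures])
    coef, *_ = np.linalg.lstsq(X, np.asarray(scores, dtype=float), rcond=None)
    return coef

def best_mixture(coef, n_domains, n_candidates=10_000, seed=0):
    """Outer level of the bi-level problem: random search over the
    probability simplex for the mixture the surrogate scores highest."""
    rng = np.random.default_rng(seed)
    candidates = rng.dirichlet(np.ones(n_domains), size=n_candidates)
    preds = np.stack([quadratic_features(w) for w in candidates]) @ coef
    return candidates[int(np.argmax(preds))]
```

The selected mixture would then seed the next round of RL fine-tuning, with its measured result optionally fed back to refine the surrogate.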
Numerical Results and Insights
Comprehensive experiments by the authors show that multi-domain RLVR training, combined with optimized mixture prediction, significantly boosts MLLM reasoning capacity. Notably, comparisons across varied mixture strategies show that weighted or optimized mixing of datasets yields performance gains over uniform mixing or single-dataset training. These nuanced dataset interactions underscore the importance of principled data mixture strategies when training models on complex multimodal tasks.
Practical and Theoretical Implications
The practical implications of this research are significant: adopting systematic data mixture strategies in RLVR can improve the generalization and performance of MLLMs across diverse tasks. Theoretically, the paper contributes an analytical framework for understanding how data mixture distributions influence learning outcomes, offering a foundation for exploring scalable methods in AI training.
Future Perspectives
Looking forward, the methodology could extend beyond image-text modalities to audio, video, and 3D data domains. Further research could also refine the surrogate models by incorporating dataset similarity or curriculum schedules, which may reduce training costs and improve model performance. This exploration of mixture-optimized RLVR holds substantial promise for enhancing the reasoning abilities of multimodal AI systems across versatile applications.
In summary, the paper provides valuable insights into advancing MLLMs through strategic RLVR and data mixture optimization, paving the way for more capable and adaptable artificial intelligence systems across various multimodal tasks.