MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning (2503.07365v2)

Published 10 Mar 2025 in cs.CV

Abstract: DeepSeek R1 and o1 have demonstrated powerful reasoning capabilities in the text domain through stable large-scale reinforcement learning. To enable broader applications, some works have attempted to transfer these capabilities to multimodal reasoning. However, these efforts have been constrained by the limited difficulty of selected tasks and relatively small training scales, making it challenging to demonstrate strong multimodal reasoning abilities. To address this gap, we introduce the MMK12 dataset and MM-EUREKA with 7B and 32B parameters. The former is a high-quality multimodal mathematics reasoning dataset featuring diverse knowledge domains with human-verified answers and solution processes. The latter is a multimodal model employing rule-based reinforcement learning on MMK12, utilizing online filtering and a two-stage training strategy to enhance training stability. MM-EUREKA demonstrates remarkable performance gains in multimodal mathematical reasoning, outperforming previous powerful models like InternVL2.5-78B or InternVL2.5-38B-MPO. In particular, MM-EUREKA achieves competitive or superior performance compared to both open-source and closed-source models, and trails slightly behind o1 in multidisciplinary reasoning tasks. We open-source our complete pipeline to foster further research in this area. We release all our codes, models, data, etc. at https://github.com/ModalMinds/MM-EUREKA

Summary

  • The paper introduces MM-Eureka, a rule-based reinforcement learning framework that applies large-scale RL techniques from LLMs to achieve robust multimodal reasoning without supervised fine-tuning.
  • Experiments show that MM-Eureka models, such as MM-Eureka-Zero-38B, achieve significant performance gains (e.g., an 8.2% accuracy improvement on the K12 benchmark) and exhibit visual "aha moments," demonstrating effectiveness with minimal data (e.g., 8K samples for the 38B model).
  • The MM-Eureka framework and findings suggest that simple rule-based RL combined with careful data filtering trains multimodal reasoning skills with far less data than methods like SFT and MPO.

The paper introduces MM-Eureka, a multimodal reasoning model that extends rule-based reinforcement learning (RL) to multimodal domains, mirroring key characteristics of text-based RL systems like DeepSeek-R1. The core idea is to reproduce, in multimodal scenarios, the gains that large-scale RL has delivered for LLMs, which has so far proven challenging. The authors demonstrate that both instruction-tuned and pre-trained models can acquire robust multimodal reasoning skills via rule-based RL, without relying on supervised fine-tuning (SFT). They open-source the complete pipeline, including code, models, and data.

The paper emphasizes the challenges in transferring large-scale RL techniques from LLMs to multimodal contexts. Prior attempts, such as R1-V and R1-Multimodal-Journey, failed to reproduce key aspects of DeepSeek-R1, like consistent increases in response length and accuracy reward. While LMM-R1 showed some success, it was not verified on large-scale image-text training. The authors aim to address these gaps by investigating the effectiveness of large-scale RL in multimodal reasoning.

The authors present MM-Eureka-8B and MM-Eureka-Zero-38B, trained from InternVL2.5-Instruct-8B and InternVL2.5-Pretrained-38B, respectively. The models are evaluated on MathVista, MathVerse, MathVision, OlympiadBench, and a manually collected K12 math test set. MM-Eureka uses 54K image-text samples for rule-based RL and surpasses models trained with 1M samples using MPO. MM-Eureka-Zero applies rule-based RL with only 8K image-text math reasoning samples, outperforming the instruct model trained with 16.3M samples on OlympiadBench and demonstrating comparable performance on MathVerse.

Key contributions include:

  • A multimodal large-scale reinforcement learning framework based on OpenRLHF, supporting models like InternVL and various RL algorithms.
  • The MM-Eureka-8B and MM-Eureka-Zero-38B models, which exhibit visual "aha moments" and achieve steady gains in accuracy reward and response length.
  • Experiments showing that simple rule-based RL is more data-efficient than post-training approaches like MPO and SFT. MM-Eureka-Zero-38B achieves an 8.2% accuracy improvement on the K12 benchmark while using only 0.05% of the training data required by the instruct model.

The methodology section details the setup, using InternVL2.5 as the base model. The RL algorithm mirrors DeepSeek-R1, employing a rule-based format reward $r_\text{format} \in \{0, 1\}$ and accuracy reward $r_\text{accuracy} \in \{0, 1\}$. The multimodal input RL framework is built on OpenRLHF, supporting different model sizes and RL algorithms.

The dataset construction and cleaning process are described. The dataset includes open-source data and manually collected visual questions and answers at the K-12 level. The data is filtered to remove low-quality samples and problems without clear answers, ensuring stable rule-based RL training.
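
The cleaning code itself is not reproduced in this summary, so the following is only a minimal sketch of the kind of answer-verifiability filtering described; the field names (`image`, `question`, `answer`) and the heuristics are illustrative assumptions, not the authors' implementation.

```python
import re

def has_verifiable_answer(sample: dict) -> bool:
    """Keep only samples whose ground-truth answer a rule-based checker can score."""
    answer = str(sample.get("answer", "")).strip()
    if not answer:
        return False
    # Illustrative heuristic: prefer short, closed-form answers over free-form
    # ones that an exact-match rule cannot verify.
    return len(answer) <= 64 and not re.search(r"\b(prove|explain|describe)\b", answer, re.I)

def clean_dataset(samples: list[dict]) -> list[dict]:
    """Drop items missing an image or question, or lacking a clear answer."""
    return [
        s for s in samples
        if s.get("image") and s.get("question") and has_verifiable_answer(s)
    ]
```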

The reward function uses accuracy and format rewards. The final reward is defined as:

$r = r_{\text{accuracy}} + \lambda r_{\text{format}}$

where:

  • $r_{\text{accuracy}}$ is the accuracy reward.
  • $r_{\text{format}}$ is the format reward.
  • $\lambda$ is a scaling coefficient.
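
A minimal sketch of such a combined rule-based reward is shown below. The `<think>`/`<answer>` tags, the plain exact-match check, and the default $\lambda = 0.5$ are assumptions for illustration, not the paper's exact implementation.

```python
def rule_based_reward(response: str, ground_truth: str, lam: float = 0.5) -> float:
    """Return r = r_accuracy + lambda * r_format, with both terms in {0, 1}."""
    # Format reward: the response must follow the expected output template
    # (the <think>/<answer> tags used here are an assumption).
    r_format = 1.0 if ("<think>" in response and "<answer>" in response) else 0.0

    # Accuracy reward: compare the extracted final answer to the verified
    # ground truth; a real verifier would add math-equivalence checks.
    start = response.rfind("<answer>")
    end = response.rfind("</answer>")
    predicted = response[start + len("<answer>"):end].strip() if 0 <= start < end else ""
    r_accuracy = 1.0 if predicted == ground_truth.strip() else 0.0

    return r_accuracy + lam * r_format  # lam is an illustrative default
```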

The REINFORCE Leave-One-Out (RLOO) algorithm is used for advantage estimation and policy updates. For each query $\bold{x}$, the model generates $K$ responses $\{\bold{y}^{(1)}, \bold{y}^{(2)}, \cdots, \bold{y}^{(K)}\}$. Each query-response pair $\{\bold{x}, \bold{y}^{(i)}\}$ receives a score $r^{(i)}$ determined by the rule-based reward function. The advantage estimator is computed as:

$A^{(i)} = r^{(i)} - \frac{1}{K-1}\sum_{j\neq i} r^{(j)}, \quad i=1,\cdots,K$
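
The following toy sketch computes this leave-one-out advantage for one query's $K$ sampled responses (an illustration of the estimator, not the OpenRLHF implementation).

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """RLOO advantages for the K responses sampled for one query.

    rewards: shape (K,), the rule-based scores r^(i). Each advantage is the
    response's own reward minus the mean reward of the other K - 1 responses
    (the leave-one-out baseline).
    """
    K = rewards.numel()
    baseline = (rewards.sum() - rewards) / (K - 1)
    return rewards - baseline

# Example with K = 4 responses and rewards of the form r_accuracy + 0.5 * r_format.
print(rloo_advantages(torch.tensor([1.5, 0.5, 0.0, 1.5])))
```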

The actor loss adopts a PPO-clip loss:

$J_\text{PPO}(\theta) = -\mathbb{E}_{t}\Big[ \min\Big( \frac{\pi_{\theta}(y_t^{(i)}|\bold{x}, \bold{y}^{(i)}_{<t})}{\pi_{\theta_\text{old}}(y_t^{(i)}|\bold{x}, \bold{y}^{(i)}_{<t})}A^{(i)},\ \text{clip}\big(\frac{\pi_{\theta}(y_t^{(i)}|\bold{x}, \bold{y}^{(i)}_{<t})}{\pi_{\theta_\text{old}}(y_t^{(i)}|\bold{x}, \bold{y}^{(i)}_{<t})},\ 1-\epsilon,\ 1+\epsilon\big) A^{(i)}\Big) \Big]$

where:

  • $\epsilon$ is the clipping parameter.
  • $\pi_\theta$ is the current policy.
  • $\pi_{\theta_\text{old}}$ is the old policy before the update.

The KL divergence loss between the policy $\pi_\theta$ and the reference policy $\pi_\text{ref}$ is added as a regularization term:

$J(\theta) = J_\text{PPO}(\theta) + \alpha_\text{KL} \mathcal{D}_\text{KL}(\pi_\theta, \pi_\text{ref})$

where:

  • $\alpha_\text{KL}$ is the weight parameter.

In practice, the weight $\alpha_\text{KL}$ is often set to 0.
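
Putting the PPO-clip objective and the optional KL penalty together, a token-level sketch could look like the following; the clipping value $\epsilon = 0.2$ and the simple sample-based KL estimate are illustrative choices, not settings reported in the paper.

```python
from typing import Optional

import torch

def actor_loss(logp_new: torch.Tensor,
               logp_old: torch.Tensor,
               advantage: torch.Tensor,
               logp_ref: Optional[torch.Tensor] = None,
               eps: float = 0.2,
               alpha_kl: float = 0.0) -> torch.Tensor:
    """PPO-clip loss over the response tokens of one sampled answer, with an
    optional KL penalty toward the reference policy.

    logp_new, logp_old, logp_ref: per-token log-probabilities, shape (T,).
    advantage: the sequence-level advantage A^(i) broadcast to shape (T,).
    eps and alpha_kl are illustrative defaults; alpha_kl is often 0 in practice.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    loss = -torch.min(unclipped, clipped).mean()

    if logp_ref is not None and alpha_kl > 0:
        # One common sample-based estimate of KL(pi_theta || pi_ref).
        loss = loss + alpha_kl * (logp_new - logp_ref).mean()
    return loss
```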

Key findings include:

  • Data filtering is crucial for stable RL training.
  • The simplest RL training setups are sufficient.
  • Visual "aha moments" can be observed, where the model rechecks intermediate steps for clues from the image.

The experiments section details the training process, configurations, and performance analysis. The models, based on InternVL2.5-Instruct and InternVL2.5-Pretrain, are trained at 8B and 38B sizes. The authors evaluate the models on the benchmarks above, adopting greedy decoding with a temperature of 0.

For the instruct model, the system prompt is retained, and format-related information is included in the user prompt. For the base model, format information is provided within the system prompt. Different weights are assigned to the format reward for the two models.
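
A hypothetical sketch of this prompt and reward-weight wiring is given below; the instruction wording, tag format, and weight values are assumptions rather than the authors' released configuration.

```python
# Illustrative configuration only: wording, tags, and weights are assumptions.
FORMAT_INSTRUCTION = (
    "Think step by step inside <think>...</think> and put the final answer "
    "inside <answer>...</answer>."
)
DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."  # placeholder

def build_prompts(question: str, model_type: str) -> tuple[str, str, float]:
    """Return (system_prompt, user_prompt, format_reward_weight)."""
    if model_type == "instruct":
        # Instruct model: keep its usual system prompt and append the format
        # requirement to the user prompt.
        return DEFAULT_SYSTEM_PROMPT, f"{question}\n\n{FORMAT_INSTRUCTION}", 0.5
    # Base (pretrained) model: the format requirement goes into the system
    # prompt, and the format reward is weighted differently.
    return FORMAT_INSTRUCTION, question, 1.0
```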

The baselines include SFT, CoT SFT, and MPO. The results show that, with only 54K training samples, the RL-trained model improves over the instruct model across all benchmarks.

For the pre-trained model, rule-based RL is conducted on InternVL-Pretrain models at 8B and 38B scales. The 38B model exhibits a clear training trend, with increasing response length and reasoning depth.

The discussion section covers methods that were expected to be effective but failed in the experiments, including curriculum learning and online data filtering. The authors also discuss the impact of model size, noting that small models (e.g., 8B) struggle to maintain stable rule-based RL training compared to larger models (e.g., 38B) in multimodal mathematical reasoning scenarios.