R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization (2503.12937v1)

Published 17 Mar 2025 in cs.AI, cs.CL, cs.CV, and cs.LG

Abstract: Recent studies generally enhance MLLMs' reasoning capabilities via supervised fine-tuning on high-quality chain-of-thought reasoning data, which often leads models to merely imitate successful reasoning paths without understanding what the wrong reasoning paths are. In this work, we aim to enhance the MLLMs' reasoning ability beyond passively imitating positive reasoning paths. To this end, we design Step-wise Group Relative Policy Optimization (StepGRPO), a new online reinforcement learning framework that enables MLLMs to self-improve reasoning ability via simple, effective and dense step-wise rewarding. Specifically, StepGRPO introduces two novel rule-based reasoning rewards: Step-wise Reasoning Accuracy Reward (StepRAR) and Step-wise Reasoning Validity Reward (StepRVR). StepRAR rewards the reasoning paths that contain necessary intermediate reasoning steps via a soft key-step matching technique, while StepRVR rewards reasoning paths that follow a well-structured and logically consistent reasoning process through a reasoning completeness and logic evaluation strategy. With the proposed StepGRPO, we introduce R1-VL, a series of MLLMs with outstanding capabilities in step-by-step reasoning. Extensive experiments over 8 benchmarks demonstrate the superiority of our methods.

Summary

  • The paper introduces Step-wise Group Relative Policy Optimization (StepGRPO), a novel online reinforcement learning framework with two rule-based step-wise rewards (StepRAR and StepRVR) that pushes multimodal large language model reasoning beyond supervised fine-tuning imitation, and uses it to train the R1-VL series of models.
  • Experiments conducted on eight benchmarks demonstrate that the R1-VL method significantly improves step-by-step reasoning accuracy and consistency compared to methods relying solely on supervised Chain-of-Thought fine-tuning.
  • The StepGRPO framework promotes incremental self-improvement and helps decouple imitation from true reasoning by providing dense, rule-based feedback on intermediate steps, offering broad applicability in multimodal tasks.

Overview

The paper “R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization” (2503.12937) presents a novel reinforcement learning (RL) framework aimed at enhancing the reasoning abilities of multimodal LLMs (MLLMs). Rather than relying solely on supervised fine-tuning with chain-of-thought (CoT) data, which often results in mere imitation of successful reasoning paths, the proposed approach leverages online RL to enable models to iteratively self-improve their reasoning performance.

Methodology

The core of the paper is the introduction of Step-wise Group Relative Policy Optimization (StepGRPO). This framework incorporates two novel rule-based reasoning rewards designed for dense and step-wise feedback:

  • Step-wise Reasoning Accuracy Reward (StepRAR): This component employs a soft key-step matching technique to reward intermediate reasoning steps, ensuring that the generated paths include necessary intermediate computations or logical milestones.
  • Step-wise Reasoning Validity Reward (StepRVR): This reward focuses on maintaining a well-structured and logically consistent flow, by evaluating reasoning completeness and overall logical consistency at each step.

The dual-reward mechanism provided by StepRAR and StepRVR allows the RL algorithm not only to reinforce positive examples but also to penalize incorrect or incomplete reasoning paths. This enables more robust learning in which the model can discern between valid and flawed reasoning trajectories.
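To make the dual-reward mechanism concrete, below is a minimal Python sketch of how rule-based step-wise rewards and group-relative advantages of this kind could be computed. The helper names, the 0.8 soft-matching threshold, the regular-expression structure check, and the equal reward weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of StepGRPO-style step-wise rewards and group-relative
# advantages. Names, thresholds, and the structure check are illustrative
# assumptions, not the paper's exact rules.
import re
from difflib import SequenceMatcher
from statistics import mean, pstdev


def step_rar(reasoning: str, key_steps: list[str], threshold: float = 0.8) -> float:
    """Soft key-step matching: fraction of annotated key steps that are
    approximately contained in the generated reasoning path."""
    if not key_steps:
        return 0.0
    hits = 0
    for step in key_steps:
        best = max(
            (SequenceMatcher(None, step.lower(), line.lower()).ratio()
             for line in reasoning.splitlines() if line.strip()),
            default=0.0,
        )
        hits += best >= threshold
    return hits / len(key_steps)


def step_rvr(reasoning: str) -> float:
    """Reasoning-validity check: reward paths that are complete and well
    structured (here: multiple reasoning lines followed by a final answer)."""
    has_steps = len([l for l in reasoning.splitlines() if l.strip()]) >= 2
    has_answer = re.search(r"(final answer|answer\s*:)", reasoning, re.I) is not None
    answer_last = has_answer and reasoning.lower().rfind("answer") > len(reasoning) // 2
    return 1.0 if (has_steps and has_answer and answer_last) else 0.0


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each sampled path's reward against
    the mean and standard deviation of its own rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]


# Example: score a group of sampled reasoning paths for one image-question pair.
key_steps = ["count the apples in the image", "subtract the eaten apples"]
group = [
    "Step 1: count the apples in the image: 5.\n"
    "Step 2: subtract the eaten apples: 5 - 2 = 3.\nFinal answer: 3",
    "The answer is 3.",
]
rewards = [0.5 * step_rar(p, key_steps) + 0.5 * step_rvr(p) for p in group]
print(group_relative_advantages(rewards))
```

Because each sampled path is scored against its own rollout group, paths that miss key steps or lack a coherent structure end up with below-average advantages, which is how flawed trajectories are discouraged without a learned reward model.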

Experimental Evaluation

Experiments were conducted over eight benchmarks, demonstrating the effectiveness of the StepGRPO framework in enhancing step-by-step reasoning. Key findings include:

  • Strong performance improvements: Across varied benchmarks, the proposed method showed marked improvements in reasoning accuracy and consistency. The dense reward structure helps the model refine its intermediate steps, leading to higher quality final outputs.
  • Comparative analysis: Compared with methods that rely solely on supervised fine-tuning on high-quality CoT data, the R1-VL approach provides enhanced generalization by explicitly shaping the step-wise reasoning process.

These observations underscore the practical efficacy of integrating RL with rule-based rewards in a step-wise manner, suggesting potential improvements in reasoning tasks where intermediate logical steps are critical.

Theoretical and Practical Implications

The paper provides an innovative approach that addresses a key limitation of existing MLLM training paradigms. In particular:

  • Incremental Self-improvement: The online nature of StepGRPO encourages models to incrementally refine their reasoning strategies by providing dense feedback at every step.
  • Decoupling imitation from reasoning: By introducing explicit rewards for reasoning validity and accuracy, R1-VL avoids over-replication of successful outcomes via imitation, enabling models to explore and validate diverse reasoning paths.
  • Application in multimodal contexts: The method’s multimodal capacity suggests that the approach can be generalized across tasks that involve both textual and non-textual inputs, thereby broadening its applicability.

Conclusion

The research presented in “R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization” introduces a robust reinforcement learning mechanism for enhancing reasoning in MLLMs. By leveraging the StepGRPO framework with its dual rewards (StepRAR and StepRVR), the method achieves superior performance over traditional fine-tuning approaches on multiple benchmarks. The detailed experimental results support the claim that incorporating step-wise dense feedback can produce significant improvements in both the accuracy and logical coherence of generated reasoning paths.

In summary, the paper provides a technically sound framework that redefines the approach toward iterative reasoning in multimodal settings, offering a comprehensive method to self-improve reasoning paths through a carefully calibrated reinforcement learning strategy.
