LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL (2503.07536v2)

Published 10 Mar 2025 in cs.CL and cs.AI

Abstract: Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose LMM-R1, a two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL, then the MGT stage generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines in multimodal and text-only benchmarks, respectively, with a 3.63% gain in complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.

Summary

Analysis of "LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL"

The paper introduces LMM-R1, a novel framework aimed at enhancing reasoning capabilities in Large Multimodal Models (LMMs), with a particular focus on models at the 3-billion-parameter scale. The authors tackle the challenge of improving reasoning in compact LMM architectures, which are constrained both by their limited intrinsic capacity and by the complex interplay between visual perception and logical reasoning.

Key Contributions and Methodology

  1. Two-Stage Training Approach:
    • Foundational Reasoning Enhancement (FRE): This stage applies rule-based reinforcement learning (RL) to abundant, high-quality text-only data to fortify the model's reasoning abilities (a minimal sketch of this mechanic appears after this list). The authors argue that by focusing initially on text-based tasks, the model establishes a sound reasoning foundation without the costly requirement of high-quality multimodal data.
    • Multimodal Generalization Training (MGT): Following FRE, the model undergoes continued RL training on a limited set of complex multimodal reasoning tasks, generalizing its reasoning skills to diverse multimodal domains. This stage targets two primary domains: general multimodal reasoning and agent-related reasoning.
  2. Experimental Validation:
    • The authors employed Qwen2.5-VL-Instruct-3B as the baseline and applied the LMM-R1 framework to demonstrate significant improvements across various benchmarks. Notable achievements include average performance increases of 4.83% and 4.5% over baselines in multimodal and text-only benchmarks, respectively, and a 3.63% gain in complex Football Game tasks.
  3. Data Efficiency:
    • LMM-R1's ability to bypass the need for extensive, high-quality multimodal training data is underscored as a significant benefit. The strategic use of rule-based RL enables effective multimodal generalization, suggesting a data-efficient pathway to reasoning enhancement.
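
The paper's training code is not reproduced here, but the core mechanic of rule-based RL is easy to sketch. The snippet below shows a minimal, hypothetical reward function of the kind such pipelines use (a format check plus exact answer matching) and the two-stage FRE-then-MGT schedule it would plug into. The tag conventions, reward weights, and the `train_with_rl` trainer are illustrative assumptions, not the authors' implementation.

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Score a response without a learned reward model.

    Assumed convention (illustrative only): the model reasons inside
    <think>...</think> and puts its final answer in \\boxed{...}.
    """
    reward = 0.0
    # Format reward: the response follows the expected structure.
    if re.search(r"<think>.*?</think>", response, re.DOTALL):
        reward += 0.1
    # Accuracy reward: the extracted final answer matches the reference.
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match and match.group(1).strip() == gold_answer.strip():
        reward += 1.0
    return reward

def train_lmm_r1(model, text_only_data, multimodal_data, train_with_rl):
    """Two-stage schedule: FRE on text, then MGT on multimodal tasks.

    `train_with_rl` stands in for any rule-based RL trainer
    (e.g. a PPO- or GRPO-style loop); it is a placeholder, not a real API.
    """
    # Stage 1 (FRE): strengthen foundational reasoning on abundant
    # text-only data with verifiable answers.
    model = train_with_rl(model, text_only_data, rule_based_reward)
    # Stage 2 (MGT): generalize that reasoning to multimodal domains
    # using a smaller set of complex tasks.
    model = train_with_rl(model, multimodal_data, rule_based_reward)
    return model
```

Because the reward depends only on string matching against a verifiable answer, no reward model or preference data is needed, which is what makes the text-only FRE stage cheap to scale.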

Results and Implications

The results indicate that the LMM-R1 framework not only enhances the reasoning capabilities of LMMs but does so in a resource-efficient manner. The two-stage training leverages existing textual datasets effectively, reducing reliance on multimodal data, which is often challenging to procure.

  • Generalization Across Domains: The paper provides compelling evidence that strengthening a model's foundational reasoning skills through text-only data can lead to robust multimodal reasoning capabilities. This finding is particularly salient for smaller models, such as those at the 3-billion-parameter scale, which have limited capacity for sophisticated reasoning but benefit substantially from a well-chosen training regimen.
  • Impact on Agent-Related Domains: LMM-R1's gains extend to tasks requiring simulation and planning, such as Sokoban and Football. This demonstrates enhanced capability in agent domains and suggests potential for broader applications in AI systems that interact dynamically with an environment (a toy verifier for such tasks is sketched below).
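
Agent-domain tasks like Sokoban have mechanically verifiable outcomes, so the same rule-based reward idea carries over: simulate the model's proposed plan and score the end state. Below is a toy Sokoban-style verifier; the grid encoding, move format, and binary scoring are illustrative assumptions, not the paper's actual environment.

```python
DIRS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def sokoban_reward(grid: list[str], moves: str) -> float:
    """Return 1.0 if the proposed move sequence solves the puzzle, else 0.0."""
    walls, boxes, targets, player = set(), set(), set(), None
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            pos = (r, c)
            if ch == "#":
                walls.add(pos)
            if ch in "B*":           # '*' marks a box already on a target
                boxes.add(pos)
            if ch in "T*+":          # '+' marks the player on a target
                targets.add(pos)
            if ch in "P+":
                player = pos

    for m in moves:
        dr, dc = DIRS[m]
        nxt = (player[0] + dr, player[1] + dc)
        if nxt in walls:
            continue                 # blocked: treat the move as a no-op
        if nxt in boxes:
            beyond = (nxt[0] + dr, nxt[1] + dc)
            if beyond in walls or beyond in boxes:
                continue             # the box cannot be pushed
            boxes.remove(nxt)
            boxes.add(beyond)
        player = nxt

    return 1.0 if boxes == targets else 0.0

# Example: pushing the single box one step right lands it on the target.
level = ["#####",
         "#PBT#",
         "#####"]
assert sokoban_reward(level, "R") == 1.0   # solved
assert sokoban_reward(level, "L") == 0.0   # blocked by the wall; unsolved
```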

Future Directions and Speculations

  • Expansion to Larger Models: While this paper focuses on 3B models, applying the framework to larger LMMs could yield further gains in multimodal reasoning, given the greater capacity of such models.
  • Automated Data Synthesis: Synthesizing high-quality multimodal reasoning datasets to complement the text-based foundation could further improve the framework's effectiveness in multimodal contexts.
  • Integration into Real-World Applications: Given LMM-R1's promising outcomes, integrating such robust reasoning models into real-world applications, including autonomous systems and complex decision-making frameworks, could be highly impactful.

This paper contributes to the ongoing discourse on optimizing model performance through strategic training methodologies, particularly shining a light on the balance between foundational skill enhancement and practical, application-driven performance.