
MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision (2505.13427v2)

Published 19 May 2025 in cs.AI and cs.CV

Abstract: While Multimodal LLMs (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning, often producing logically inconsistent or partially correct solutions. A key limitation lies in the lack of fine-grained supervision over intermediate reasoning steps. To address this, we propose MM-PRM, a process reward model trained within a fully automated, scalable framework. We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data. Then, we construct MM-K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, which serves as seed data. Leveraging a Monte Carlo Tree Search (MCTS)-based pipeline, we generate over 700k step-level annotations without human labeling. The resulting PRM is used to score candidate reasoning paths in the Best-of-N inference setup and achieves significant improvements across both in-domain (MM-K12 test set) and out-of-domain (OlympiadBench, MathVista, etc.) benchmarks. Further analysis confirms the effectiveness of soft labels, smaller learning rates, and path diversity in optimizing PRM performance. MM-PRM demonstrates that process supervision is a powerful tool for enhancing the logical robustness of multimodal reasoning systems. We release all our codes and data at https://github.com/ModalMinds/MM-PRM.

Authors (7)
  1. Lingxiao Du (4 papers)
  2. Fanqing Meng (14 papers)
  3. Zongkai Liu (9 papers)
  4. Zhixiang Zhou (3 papers)
  5. Ping Luo (340 papers)
  6. Qiaosheng Zhang (35 papers)
  7. Wenqi Shao (89 papers)

Summary

Enhancing Multimodal Mathematical Reasoning with MM-PRM

The paper "MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision" presents significant advancements in the domain of multimodal reasoning, particularly focused on the mathematical problem-solving capabilities of AI models. Despite the progress in Multimodal LLMs (MLLMs), which demonstrate prowess in tasks combining both vision and language, they fall short when confronted with complex, multi-step reasoning challenges. These shortcomings often lead to logically inconsistent solutions and inaccuracies in intermediate steps.

To close this gap, the authors propose MM-PRM, a process reward model trained within a scalable, fully automated step-level supervision framework, emphasizing that fine-grained supervision over intermediate steps is key to logical robustness in multimodal reasoning. They first train MM-Policy, a strong multimodal model, on diverse mathematical reasoning data, and then construct MM-K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, as seed data. Using a Monte Carlo Tree Search (MCTS)-based pipeline, they generate over 700,000 step-level annotations without any human labeling.
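At the core of the annotation pipeline is a simple idea: the quality of a partial solution can be estimated from how often the policy, when continuing from that prefix, reaches the verified final answer. The sketch below illustrates this Monte Carlo estimate in isolation; the `complete_fn` and `extract_answer_fn` callables are hypothetical stand-ins for the policy and an answer parser, and the paper's actual pipeline organizes these rollouts with an MCTS-style search so that shared prefixes are expanded and reused efficiently.

```python
from typing import Callable, List

def estimate_step_correctness(
    complete_fn: Callable[[str, List[str]], str],     # samples a full continuation from problem + prefix steps
    extract_answer_fn: Callable[[str], str],          # parses the final answer out of a completion
    problem: str,
    prefix_steps: List[str],
    reference_answer: str,
    num_rollouts: int = 8,
) -> float:
    """Monte Carlo estimate of step correctness: the fraction of rollouts
    from this partial solution that reach the verified answer. The value
    can serve directly as a soft label for the last step of the prefix."""
    correct = 0
    for _ in range(num_rollouts):
        completion = complete_fn(problem, prefix_steps)
        if extract_answer_fn(completion) == reference_answer:
            correct += 1
    return correct / num_rollouts
```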

MM-PRM delivers substantial accuracy improvements on both the in-domain MM-K12 test set and out-of-domain benchmarks such as OlympiadBench and MathVista. Notably, on the MM-K12 test set it raises MM-Policy's accuracy from 33.92% to 42.80% and InternVL2.5-8B's from 27.01% to 37.80%. These results underscore the efficacy of process supervision for improving the logical coherence of solutions produced by MLLMs. The analysis also highlights the benefits of soft labels over hard thresholding, smaller learning rates, and path diversity for reward model performance.
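At inference time, the PRM is used to rank sampled solutions in a Best-of-N setup. Below is a minimal sketch of that selection loop, assuming hypothetical `sample_fn` and `prm_score_fn` interfaces; aggregating step scores by their minimum is one common convention (a single weak step sinks the whole path), not necessarily the exact rule used in the paper.

```python
from typing import Callable, List

def best_of_n(
    sample_fn: Callable[[str], List[str]],            # returns one candidate solution as a list of steps
    prm_score_fn: Callable[[str, List[str]], float],  # PRM score of a partial solution prefix
    problem: str,
    n: int = 16,
) -> List[str]:
    """Sample N candidate solutions and keep the one the PRM ranks highest."""
    candidates = [sample_fn(problem) for _ in range(n)]

    def path_score(steps: List[str]) -> float:
        # Score a path by its weakest step: the minimum PRM score over all prefixes.
        return min(prm_score_fn(problem, steps[: i + 1]) for i in range(len(steps)))

    return max(candidates, key=path_score)
```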

Key Contributions:

  • Data Collection and Release: The authors introduce MM-K12, a 10,000-problem multimodal math dataset with verifiable answers, which serves as the foundation for training MM-Policy and generating step-level annotations.
  • Process Supervision Framework: A fully automated MCTS-based pipeline, combined with MM-Policy, produces large-scale step-level annotations that drive substantial gains in reasoning accuracy.
  • Discussion on PRM Settings: An exploration of training dynamics, including learning rate choices and soft versus hard labels, yields practical guidance for PRM training; a sketch of the soft-label objective follows this list.
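The soft-label finding in particular lends itself to a compact illustration. In the hedged sketch below, the empirical rollout success rate of each step is used directly as the target of a per-step binary cross-entropy, rather than being thresholded to a hard 0/1 label; the tensor shapes and the use of a single scalar logit per step position are assumptions for illustration, not the paper's exact head design.

```python
import torch
import torch.nn.functional as F

def prm_step_loss(step_logits: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    """Per-step binary cross-entropy against soft labels.

    step_logits:  (num_steps,) raw PRM scores, one per step position.
    soft_targets: (num_steps,) empirical rollout success rates in [0, 1],
                  used as-is instead of being thresholded to hard 0/1 labels.
    """
    return F.binary_cross_entropy_with_logits(step_logits, soft_targets)

# Example: three steps whose rollouts succeeded 100%, 75%, and 0% of the time.
loss = prm_step_loss(torch.tensor([2.1, 0.4, -1.3]), torch.tensor([1.0, 0.75, 0.0]))
```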

The implications of this paper for AI research are profound, particularly in the field of educational technologies and intelligent tutoring systems. By reducing logical inaccuracies and improving coherence in multimodal reasoning, tools developed from this research have the potential to offer sophisticated educational support.

Future Directions: This research lays the groundwork for further improvements in AI-driven reasoning. Future work may broaden model coverage, extend process supervision datasets across diverse mathematical domains, and examine the cross-linguistic and cultural adaptability of reasoning models, all of which would further advance AI in educational and reasoning applications.

In conclusion, the paper provides a robust framework and methodology that contribute meaningfully to the development of MLLMs for reasoning tasks, and its results are promising for AI systems that require precise, coherent logic in complex problem solving.
