URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics (2501.04686v5)

Published 8 Jan 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Process Reward Models (PRMs) have shown promise in enhancing the mathematical reasoning capabilities of LLMs through Test-Time Scaling (TTS). However, their integration into multimodal reasoning remains largely unexplored. In this work, we take the first step toward unlocking the potential of PRMs in multimodal mathematical reasoning. We identify three key challenges: (1) the scarcity of high-quality reasoning data constrains the capabilities of foundation Multimodal LLMs (MLLMs), which in turn limits the upper bounds of TTS and reinforcement learning (RL); (2) automated methods for process labeling in multimodal contexts are still lacking; (3) the use of process rewards in unimodal RL faces issues such as reward hacking, which may extend to multimodal scenarios. To address these issues, we introduce URSA, a three-stage Unfolding multimodal Process-Supervision Aided training framework. We first construct MMathCoT-1M, a high-quality large-scale multimodal Chain-of-Thought (CoT) reasoning dataset, to build a stronger math reasoning foundation MLLM, URSA-8B. Subsequently, we use an automatic process to synthesize process supervision data that emphasizes both logical correctness and perceptual consistency, and we introduce DualMath-1.1M to facilitate the training of URSA-8B-RM. Finally, we propose Process-Supervised Group-Relative-Policy-Optimization (PS-GRPO), a pioneering multimodal PRM-aided online RL method that outperforms vanilla GRPO. With PS-GRPO applied, URSA-8B-PS-GRPO outperforms Gemma3-12B and GPT-4o by 8.4% and 2.7%, respectively, on average across 6 benchmarks. Code, data, and checkpoints are available at https://github.com/URSA-MATH.

Tweets

https://twitter.com/TheTuringPost/status/1879492348859818435

https://twitter.com/Pha_Tran_Papers/status/1878044426435490287

https://twitter.com/arXivGPT/status/1877778504328392885

https://twitter.com/rohanpaul_ai/status/1880344742162231456
