
Advancing Multimodal Reasoning Capabilities of Multimodal Large Language Models via Visual Perception Reward (2506.07218v1)

Published 8 Jun 2025 in cs.LG, cs.AI, and cs.CV

Abstract: Enhancing the multimodal reasoning capabilities of Multimodal LLMs (MLLMs) is a challenging task that has attracted increasing attention in the community. Recently, several studies have applied Reinforcement Learning with Verifiable Rewards (RLVR) to the multimodal domain in order to enhance the reasoning abilities of MLLMs. However, these works largely overlook the enhancement of multimodal perception capabilities in MLLMs, which serve as a core prerequisite and foundational component of complex multimodal reasoning. Through McNemar's test, we find that existing RLVR methods fail to effectively enhance the multimodal perception capabilities of MLLMs, thereby limiting their further improvement in multimodal reasoning. To address this limitation, we propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive the visual content accurately, thereby effectively incentivizing both their multimodal perception and reasoning capabilities. Specifically, we first collect textual visual annotations from the CoT trajectories of multimodal problems, which serve as visual references for reward assignment. During RLVR training, we employ a judging LLM to assess the consistency between the visual annotations and the responses generated by the MLLM, and assign the visual perception reward based on these consistency judgments. Extensive experiments on several multimodal reasoning benchmarks demonstrate the effectiveness of our Perception-R1, which achieves state-of-the-art performance on most benchmarks using only 1,442 training data.

Insights into Enhancing Multimodal Reasoning with Visual Perception Reward

Recent developments in the field of Multimodal LLMs (MLLMs) have advanced the capabilities of AI systems to engage in complex multimodal reasoning. The paper "Advancing Multimodal Reasoning Capabilities of Multimodal LLMs via Visual Perception Reward" addresses a critical gap in current methodologies: the enhancement of multimodal perception, an essential precondition for effective multimodal reasoning. This paper introduces Perception-R1, a novel approach that combines reinforcement learning with a unique visual perception reward to bolster the reasoning capabilities of MLLMs.

At the core of this research lies the observation that existing methods employing Reinforcement Learning with Verifiable Rewards (RLVR) inadequately enhance multimodal perception capabilities. The authors substantiate this claim with McNemar's test, highlighting that RLVR-trained MLLMs display no statistically significant improvement in perception compared to their baseline counterparts. Recognizing this bottleneck, the paper posits that rewarding accurate visual perception is fundamental to overcoming the constraints of RLVR and achieving superior multimodal reasoning.
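To make the statistical argument concrete, the sketch below shows how a McNemar's test can be run on paired per-question perception outcomes for a baseline MLLM versus an RLVR-trained MLLM. The data and variable names here are illustrative assumptions, not the authors' evaluation code.

```python
# Minimal sketch: McNemar's test on paired perception outcomes of a baseline
# MLLM vs. an RLVR-trained MLLM over the same set of questions (toy data).
from statsmodels.stats.contingency_tables import mcnemar

# correct_base[i], correct_rlvr[i]: whether each model perceived item i correctly
correct_base = [True, True, False, True, False, True, False, False, True, True]
correct_rlvr = [True, True, True,  True, False, True, False, True,  False, True]

# 2x2 contingency table of paired outcomes:
#                rlvr correct   rlvr wrong
# base correct       n11            n10
# base wrong         n01            n00
n11 = sum(b and r for b, r in zip(correct_base, correct_rlvr))
n10 = sum(b and not r for b, r in zip(correct_base, correct_rlvr))
n01 = sum((not b) and r for b, r in zip(correct_base, correct_rlvr))
n00 = sum((not b) and (not r) for b, r in zip(correct_base, correct_rlvr))

table = [[n11, n10], [n01, n00]]
result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(f"statistic={result.statistic}, p-value={result.pvalue:.4f}")
# A large p-value indicates no statistically significant change in perception accuracy.
```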

Perception-R1 facilitates this by introducing a method to explicitly encourage accurate perception of visual content, integrating it into the RLVR training process. Textual visual annotations, extracted from CoT (Chain of Thought) trajectories, serve as benchmarks for visual content. During training, a judging LLM is tasked with assessing the congruence between these annotations and the MLLM-generated responses. This process forms the basis for assigning a visual perception reward, which, when incorporated into RLVR, demonstrated empirical effectiveness across multiple reasoning benchmarks.
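As a rough sketch of this reward-assignment step (not the authors' implementation), the logic might look like the following. The judge prompt, the `query_judge_llm` helper, and the reward weighting are assumptions introduced here for illustration.

```python
# Sketch of a visual perception reward, assuming a judging LLM that answers
# "consistent" / "inconsistent" for each (visual annotation, model response) pair.
from typing import List

JUDGE_PROMPT = (
    "Visual reference: {annotation}\n"
    "Model response: {response}\n"
    "Does the response describe the visual content consistently with the "
    "reference? Answer 'consistent' or 'inconsistent'."
)

def query_judge_llm(prompt: str) -> str:
    """Placeholder for a call to the judging LLM; returns its raw text answer."""
    raise NotImplementedError

def perception_reward(annotations: List[str], response: str) -> float:
    """Fraction of visual annotations the response is judged consistent with."""
    if not annotations:
        return 0.0
    verdicts = [
        query_judge_llm(JUDGE_PROMPT.format(annotation=a, response=response))
        for a in annotations
    ]
    consistent = sum(v.strip().lower().startswith("consistent") for v in verdicts)
    return consistent / len(annotations)

def total_reward(answer_correct: bool, annotations: List[str], response: str,
                 perception_weight: float = 0.5) -> float:
    """Combine the verifiable answer reward with the perception reward.
    The additive weighting here is an assumption, not the paper's formula."""
    accuracy_reward = 1.0 if answer_correct else 0.0
    return accuracy_reward + perception_weight * perception_reward(annotations, response)
```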

From a methodological perspective, the authors' introduction of the visual perception reward is both innovative and practical. It relies not on proprietary solutions but on publicly available resources and existing models to construct a reward mechanism that is both efficient and scalable. Training on a dataset of only 1,442 samples, Perception-R1 achieved state-of-the-art results across several prominent benchmarks, with improvements in both accuracy and perception. This highlights an exceptional leap in data efficiency, a crucial advantage in a rapidly developing field where resources and training data are often limited.

The paper's results show significant improvements on benchmarks such as MathVista and MathVerse, suggesting that visual perception is indeed a pivotal component of overall reasoning capability. The implications of integrating a perception reward span both theoretical and practical dimensions. Theoretically, it encourages a shift toward a more comprehensive understanding of multimodal tasks, treating perceptual accuracy as a fundamental tenet. Practically, such improvements in MLLMs could enhance real-world applications like AI-assisted analysis of medical imaging, automated captioning systems, and educational tools that require context-driven visual understanding.

However, the research is not without limitations. The reliance on a judging LLM for consistency assessment introduces potential biases and inconsistencies, a challenge the authors acknowledge and one illustrative of the broader difficulties of AI interpretability and reliability. Additionally, the method's applicability beyond the multimodal math domain remains to be thoroughly investigated.

Future developments in AI could see the extension of Perception-R1's principles to a wider range of modalities and use cases, such as real-time decision-making systems or more intricately synthesized media. A possible expansion of this work could involve integrating more nuanced perception checks that account for dynamic visual contexts and evolving multimodal datasets.

In conclusion, the paper underlines the imperative for a refined approach that integrates perception accuracy into the learning framework of MLLMs. The authors' insights broaden the horizon of multimodal reasoning methodologies, establishing a foundation for subsequent innovations in artificial intelligence that prioritize perceptual clarity alongside logical reasoning.

Authors (7)
  1. Tong Xiao (119 papers)
  2. Xin Xu (187 papers)
  3. Zhenya Huang (52 papers)
  4. Hongyu Gao (27 papers)
  5. Quan Liu (116 papers)
  6. Qi Liu (485 papers)
  7. Enhong Chen (242 papers)