Perception-Aware Policy Optimization for Multimodal Reasoning (2507.06448v1)

Published 8 Jul 2025 in cs.CL

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing LLMs with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.

Summary

  • The paper presents PAPO, a reinforcement learning framework that integrates an implicit perception loss into GRPO to enhance visual grounding in multimodal reasoning.
  • It achieves up to 8% performance gains on vision-dependent tasks and reduces perception errors by 30.5%, addressing key shortcomings of prior RLVR methods.
  • The method also demonstrates faster convergence and stable training via double entropy loss regularization, making it compatible with existing RLVR improvements.

Perception-Aware Policy Optimization for Multimodal Reasoning: An Expert Overview

This paper introduces Perception-Aware Policy Optimization (PAPO), a reinforcement learning framework designed to address the unique challenges of multimodal reasoning in large multimodal models (LMMs). The work builds upon the Group Relative Policy Optimization (GRPO) algorithm, a variant of Proximal Policy Optimization (PPO) that has demonstrated strong performance in text-based reasoning tasks, and extends it to better handle the perception bottleneck inherent in multimodal settings.

Motivation and Problem Analysis

The authors conduct a detailed error analysis on multimodal reasoning models trained with GRPO, revealing that 67% of errors are due to perception failures—that is, the model's inability to accurately interpret visual inputs. This is in contrast to text-only domains, where reasoning and calculation errors are more prevalent. The analysis highlights a critical gap: existing RLVR (Reinforcement Learning with Verifiable Rewards) algorithms, when naively applied to multimodal tasks, do not sufficiently incentivize models to ground their reasoning in visual content. Instead, models often exploit textual shortcuts, especially when training data redundantly encodes visual information in text.

PAPO: Methodological Contributions

PAPO introduces an Implicit Perception Loss to the GRPO objective, operationalized as a Kullback–Leibler (KL) divergence between the model’s output distributions conditioned on the original image and a corrupted (masked) version of the image. The core intuition is that a model genuinely leveraging visual information should exhibit a significant change in its output distribution when deprived of visual context. The PAPO objective thus encourages the model to maximize this KL divergence, effectively regularizing the policy to depend on meaningful visual content.
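As a rough illustration of this term (a minimal sketch assuming PyTorch, placeholder tensor shapes, and a precomputed GRPO loss; not the authors' released code), the perception loss can be estimated token by token from two forward passes of the same policy, one conditioned on the original image and one on its masked copy:

```python
# Minimal sketch of the Implicit Perception Loss (illustration only, not the
# authors' implementation; shapes and variable names are assumptions).
import torch
import torch.nn.functional as F

def implicit_perception_kl(logits_full: torch.Tensor,
                           logits_masked: torch.Tensor,
                           response_mask: torch.Tensor) -> torch.Tensor:
    """KL[pi_theta(o | q, I) || pi_theta(o | q, I_mask)], averaged over response tokens.

    logits_full, logits_masked: [batch, seq_len, vocab] from the same policy,
    conditioned on the original and on the masked image respectively.
    response_mask: [batch, seq_len], 1 for generated response tokens, 0 elsewhere.
    """
    log_p_full = F.log_softmax(logits_full, dim=-1)
    log_p_masked = F.log_softmax(logits_masked, dim=-1)
    # Token-level KL(p_full || p_masked) = sum_v p_full(v) * (log p_full(v) - log p_masked(v))
    kl = (log_p_full.exp() * (log_p_full - log_p_masked)).sum(dim=-1)
    return (kl * response_mask).sum() / response_mask.sum().clamp(min=1)

# PAPO maximizes J_GRPO + gamma * KL, so the training loss subtracts the term:
# loss = grpo_loss - gamma * implicit_perception_kl(logits_full, logits_masked, mask)
```

Because the term is computed purely from two forward passes of the model itself, it requires no extra labels or external reward model, consistent with the paper's claim of relying only on internal supervision signals.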

Key Implementation Details

  • Corrupted Visual Input: The corrupted image is generated by masking a substantial portion (typically 60%) of image patches. Both random and semantic-aware masking strategies are explored, with random masking empirically outperforming more complex approaches.
  • Loss Formulation: The PAPO loss augments the GRPO objective with a weighted KL divergence term:
    J_PAPO(θ) = J_GRPO(θ) + γ * KL[π_θ(o | q, I) || π_θ(o | q, I_mask)]
    where γ is a tunable coefficient.
  • Regularization: The authors identify a unique failure mode: model collapse due to over-optimization of the perception loss. To mitigate this, they introduce a Double Entropy Loss regularizer, penalizing high entropy in both the original and masked policy outputs, which stabilizes training in high-γ regimes (both the masking step and this regularizer are sketched after this list).
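
The two remaining ingredients from the list above, random patch masking and the Double Entropy regularizer, can be sketched as follows (helper names, the patch size, and tensor shapes are assumptions for illustration, not the released implementation):

```python
# Illustrative sketch of random patch masking and the Double Entropy regularizer,
# with assumed helper names and shapes (not the authors' released code).
import torch
import torch.nn.functional as F

def random_patch_mask(image: torch.Tensor, patch_size: int = 14,
                      mask_ratio: float = 0.6) -> torch.Tensor:
    """Zero out roughly mask_ratio of non-overlapping patches. image: [C, H, W]."""
    _, h, w = image.shape
    gh, gw = h // patch_size, w // patch_size
    keep = (torch.rand(gh, gw) >= mask_ratio).to(image.dtype)   # 1 = keep patch, 0 = mask
    keep = keep.repeat_interleave(patch_size, dim=0).repeat_interleave(patch_size, dim=1)
    masked = image.clone()
    masked[:, : gh * patch_size, : gw * patch_size] *= keep
    return masked

def double_entropy_loss(logits_full: torch.Tensor,
                        logits_masked: torch.Tensor,
                        response_mask: torch.Tensor) -> torch.Tensor:
    """Mean token entropy of both the full-image and masked-image output distributions."""
    def mean_entropy(logits: torch.Tensor) -> torch.Tensor:
        log_p = F.log_softmax(logits, dim=-1)
        entropy = -(log_p.exp() * log_p).sum(dim=-1)            # [batch, seq_len]
        return (entropy * response_mask).sum() / response_mask.sum().clamp(min=1)
    # Penalizing both entropies discourages the degenerate solution in which the model
    # inflates the KL term by producing noisy, high-entropy outputs (the "loss hacking"
    # failure mode described above).
    return mean_entropy(logits_full) + mean_entropy(logits_masked)
```

In training, the masked image would feed the γ-weighted perception term sketched earlier, and the entropy penalty would be added to the loss when γ is pushed into the unstable regime discussed in the Ablation section.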

Training and Evaluation

  • Models: Qwen2.5-VL-3B and 7B are used as base models.
  • Data: Training is performed on ViRL39K, a diverse multimodal reasoning dataset, without chain-of-thought supervision.
  • Benchmarks: Evaluation spans eight multimodal reasoning datasets, with a focus on both general and vision-dependent tasks.

Empirical Results

  • Performance Gains: PAPO achieves an average relative improvement of 4.4% over GRPO across all benchmarks, with gains approaching 8% on vision-dependent tasks.
  • Error Reduction: There is a 30.5% reduction in perception errors, directly addressing the primary failure mode identified in the error analysis.
  • Convergence: PAPO demonstrates faster convergence, with early-stage gains observable within 25 training steps.
  • Compatibility: PAPO is shown to be compatible with other RLVR algorithmic improvements, such as removing the reference KL penalty, yielding compounded improvements (up to 11.2% on 3B models).

Ablation and Analysis

  • Loss Weighting (γ): Increasing γ up to 0.02 improves performance, but higher values induce model collapse. Larger models are more sensitive to this effect.
  • Masking Strategy: Random masking is both effective and computationally efficient. Masking ratios between 0.6 and 0.8 are optimal; extreme corruption (e.g., blacking out the entire image) is detrimental.
  • Regularization: Double Entropy Loss is the most effective regularizer for preventing collapse, outperforming single-entropy and increased KL penalty approaches.

Computational Considerations

  • Overhead: PAPO introduces a moderate computational overhead due to the additional forward pass with masked images (approximately 49 seconds per step on 3B/7B models using H100 GPUs). This is a practical consideration for large-scale training but is not prohibitive.

Theoretical and Practical Implications

PAPO represents a shift from data- and reward-centric approaches to a deeper integration of perception into the core optimization objective. By directly regularizing the model to be sensitive to visual input, PAPO addresses the fundamental limitation of prior RLVR methods in multimodal domains. This approach is particularly relevant for tasks where visual grounding is essential and cannot be circumvented by textual shortcuts.

From a theoretical perspective, the work demonstrates that multimodal reasoning requires algorithmic adaptations beyond those effective in text-only settings. The explicit modeling of perception as a policy regularizer opens avenues for further research into modality-specific optimization in RL for LMMs.

Future Directions

  • Scalability: Extending PAPO to larger model sizes and diverse architectures (e.g., InternVL) is a natural next step.
  • Algorithmic Synergy: Integrating PAPO with advanced RLVR techniques (e.g., DAPO with dynamic sampling) may yield further gains.
  • Efficiency: Reducing the computational overhead of the additional forward pass is an open engineering challenge.
  • Generalization: Investigating the impact of PAPO on tasks with varying degrees of vision dependency and on out-of-distribution generalization remains an important direction.

Conclusion

PAPO provides a principled and empirically validated approach to enhancing visually grounded reasoning in LMMs. By introducing an implicit perception loss and addressing its associated optimization challenges, the method sets a new standard for RL-based multimodal reasoning. The findings underscore the necessity of perception-aware objectives in the design of future multimodal AI systems.
