- The paper introduces PGPO, a token-level credit assignment method that uses KL divergence to quantify visual dependency in LVLM decoding.
- PGPO employs threshold gating and sum-preserving advantage reallocation to reduce gradient noise and boost visual reasoning accuracy by up to 18.7%.
- Empirical results on benchmarks like MathVerse demonstrate improved training stability and effective visual anchoring to enhance multimodal learning.
Perception-Grounded Policy Optimization for Fine-Grained Credit Assignment in Large Vision-LLMs
Motivation and Problem Statement
Recent advances in Reinforcement Learning from Verifiable Rewards (RLVR) have directly impacted the development of Large Vision-LLMs (LVLMs), yet mainstream RLVR frameworks apply sequence-level, uniform advantage assignment across all output tokens during policy optimization. This indiscriminate credit allocation fundamentally mismatches the nature of multimodal reasoning, where only a sparse subset of tokens within a reasoning trajectory exhibits high, causal dependency on the visual modality. Uniform advantage assignment dilutes critical learning signals indispensable for robust visual reasoning steps and introduces unnecessary gradient noise from language-driven or modality-independent tokens, thus impeding effective multimodal RL.
The authors of "Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-LLMs" (2604.01840) introduce a formalized framework to address this credit assignment flaw on the token level, leveraging information-theoretic quantification of visual dependency to selectively amplify the RL signal on tokens crucial for perception-grounded reasoning.
Quantification of Token Visual Dependency
A central technical contribution is the precise quantification of "Token Visual Dependency" using the Kullback-Leibler (KL) divergence between the model's predictive distribution with and without visual conditioning for a given context:

S(s_t, I) := D_KL( π_θ(· | s_t, I) ∥ π_θ(· | s_t) )

where s_t denotes the autoregressive decoding context at position t and I denotes the input image.
This metric captures the causal information gain ("Bayesian surprise") contributed by the image, is strictly non-negative, and aligns with the conditional mutual information between the image and the generated token given the current context. Empirical studies on the MathVerse benchmark show that the distribution of S is highly skewed: most tokens have negligible visual dependency, while visually critical anchor tokens receive substantially higher scores.
Figure 1: Distribution of token-level visual dependency is long-tailed and sparse, with anchor tokens (e.g., numbers, geometric entities) consistently displaying elevated visual dependency.
Furthermore, analysis confirms that tokens most likely to cause hallucinations (e.g., irrelevant or imagined concepts) exhibit markedly lower S than truthful, visually grounded ones.
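The dependency score above can be sketched in a few lines: two forward passes (with and without the image) yield per-position logits, and S is the per-token KL divergence between the resulting next-token distributions. This is a minimal numpy sketch; the function name and tensor shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def token_visual_dependency(logits_with_image, logits_without_image):
    """Per-token visual dependency S(s_t, I): KL divergence between the
    next-token distributions with and without image conditioning.

    logits_*: (seq_len, vocab) arrays from two forward passes.
    Returns a (seq_len,) array of non-negative scores.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)  # numerical stability
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(logits_with_image)      # log π_θ(· | s_t, I)
    log_q = log_softmax(logits_without_image)   # log π_θ(· | s_t)
    # D_KL(p || q) = Σ_v p(v) (log p(v) − log q(v)), per token position
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
```

A position whose distribution is unchanged by removing the image scores exactly zero, matching the sparsity the paper reports: most tokens contribute no visual information gain.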
Perception-Grounded Policy Optimization (PGPO) Framework
Building on this quantification, the paper introduces Perception-Grounded Policy Optimization (PGPO), a fine-grained, token-level credit assignment algorithm for RLVR. Its mechanism couples two components: threshold gating, which suppresses the advantage assigned to tokens whose visual dependency S falls below a gate, removing gradient noise from language-driven tokens; and sum-preserving advantage reallocation, which redistributes the suppressed credit to high-dependency visual anchor tokens while keeping the total sequence-level advantage unchanged.
Theoretical results show that PGPO reduces gradient variance by scaling down the gradient contributions of visually irrelevant tokens, provably improving the signal-to-noise ratio of policy updates while ensuring monotonicity and rank preservation of the assigned weights.
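The gating-plus-reallocation scheme can be illustrated as follows. This is a hedged sketch: the function name, the linear reweighting, and the uniform fallback are assumptions for illustration; the paper's exact weighting may differ, but the sketch preserves the stated properties (thresholding, monotone rank-preserving weights, and a conserved advantage sum).

```python
import numpy as np

def pgpo_advantages(seq_advantage, vis_dep, tau):
    """Sketch of PGPO-style token-level credit assignment.

    seq_advantage: scalar sequence-level advantage from RLVR.
    vis_dep: (T,) per-token visual dependency scores S(s_t, I).
    tau: gating threshold; tokens below tau receive zero credit.

    Returns (T,) token advantages whose sum equals T * seq_advantage,
    i.e. the total credit of uniform assignment is preserved.
    """
    vis_dep = np.asarray(vis_dep, dtype=float)
    T = len(vis_dep)
    # Threshold gating: zero out weight on visually irrelevant tokens.
    gated = np.where(vis_dep >= tau, vis_dep, 0.0)
    if gated.sum() == 0.0:
        weights = np.ones(T)  # fallback to uniform credit (assumption)
    else:
        # Monotone, rank-preserving weights summing to T (sum-preserving).
        weights = T * gated / gated.sum()
    return seq_advantage * weights
```

Because the weights sum to T, the sequence's total advantage mass is identical to uniform assignment; the gradient signal is merely concentrated on visual anchor tokens rather than rescaled overall.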
Empirical Results and Analysis
Comprehensive evaluation with Qwen2.5-VL-3B/7B models, using the ViRL39K training set and seven rigorous multimodal reasoning benchmarks (Geo3k, MMK12, MathVerse, DynaMath, MathVision, LogicVista, MMMU-Pro), demonstrates substantial improvements over baseline approaches (GRPO, DAPO, PAPO, VPPO).


Figure 3: PGPO consistently yields higher accuracy across training epochs, with accelerated early convergence and enhanced final performance.
PGPO achieves an average +18.7% gain in accuracy relative to vanilla Qwen2.5-VL and surpasses the strongest prior variants, especially on highly vision-centric benchmarks (e.g., MathVerse, MMMU-Pro). Ablation studies confirm the criticality of combining both suppression of noise tokens and boosting of visual anchors, as well as the necessity of sum-preserving normalization for stable optimization.
PGPO also demonstrates:
- Elimination of late-stage training collapse typical in competing approaches (e.g., DAPO), evidenced by avoidance of entropy explosion.
- Rapid and sustained increases in average visual dependency throughout training, signifying effective reallocation of model capacity towards visual perception.
Case Studies
Case study analysis further illustrates that PGPO selectively reinforces reasoning steps tightly grounded in visual information, resulting in significantly improved accuracy and consistency on challenging math and general VQA items versus uniform credit assignment baselines.
Figure 4: On geometry reasoning, PGPO correctly identifies minimal pivotal rotation by emphasizing critical angle tokens, outperforming sequence-level RL baselines.
Figure 5: For architectural plan recognition, PGPO highlights and rewards visually-dependent tokens, yielding correct structural classification, whereas baselines misattribute credit to language-biased tokens.
Figure 6: In artifact identification, PGPO selectively amplifies the RL signal on tokens associated with visual features, resolving ambiguities that confound standard approaches.
Computational and Practical Considerations
PGPO's computational overhead is moderate (roughly 10% above DAPO): KL estimates are computed efficiently via a dedicated attention mask in a second, parallel forward pass, with Monte Carlo estimation used for scalability. An extensive hyperparameter search confirms that the primary threshold and boosting settings generalize robustly across tasks and model sizes.
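To see why Monte Carlo estimation helps at LVLM vocabulary sizes, note that the exact KL requires a sum over the full vocabulary, whereas sampling tokens from the conditioned distribution gives an unbiased estimate of the same quantity. A minimal sketch (the estimator form is an assumption; the paper's exact estimator may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_kl(log_p, log_q, n_samples=256):
    """Monte Carlo estimate of D_KL(p || q) over a large vocabulary:
    draw tokens v ~ p and average log p(v) − log q(v), avoiding a
    full sum over all vocabulary entries."""
    p = np.exp(log_p)
    v = rng.choice(len(p), size=n_samples, p=p)
    return float(np.mean(log_p[v] - log_q[v]))
```

With enough samples the estimate converges to the exact KL, while the per-token cost is independent of vocabulary size.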
Implications and Future Directions
PGPO provides an information-theoretically-grounded foundation for fine-grained credit assignment in multimodal RL with LVLMs. It delivers robust improvements on complex visual reasoning while sharpening visual anchoring without overfitting to spurious textual priors or sacrificing performance on general VQA. This framework is extensible to larger model scales and more diverse multimodal tasks (e.g., document reasoning, GUI understanding), though experiments focused on the 3B/7B scale due to resource constraints.
Theoretical implications include the formal demonstration that a zero-mean advantage is necessary for stable RLVR optimization, and that monotonic, intra-sequence rank-preserving advantage modulation tied to information gain is strictly beneficial. Practically, PGPO's principle of token-sensitive reward assignment aligns with recent trends in LLMs/MLLMs emphasizing sparse, critical token-wise learning over blunt sequence-level updates.
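The zero-mean advantage property referenced above is the same normalization used in GRPO-style group baselines: rewards for a group of sampled responses to the same prompt are centered (and typically scaled) before serving as advantages. A minimal illustration, with the epsilon stabilizer as an assumption:

```python
import numpy as np

def group_normalized_advantages(rewards):
    """GRPO-style zero-mean advantages: subtract the group mean and
    scale by the group standard deviation, so advantages sum to ~0
    and the policy gradient baseline stays unbiased."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

PGPO's sum-preserving reallocation operates on top of such zero-mean sequence advantages, which is why preserving the per-sequence sum matters for stability.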
Conclusion
The perception-grounded PGPO paradigm fundamentally advances fine-grained optimization in multimodal RL and establishes a rigorous methodological bridge between information-theoretic metrics and reinforcement learning operations in LVLMs. Its token-level advantage sculpting yields empirically and theoretically superior reasoning capability, robust training, and improved factuality by ensuring credit is focused where visual grounding is essential.
Reference:
"Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-LLMs" (2604.01840).