- The paper introduces PGPO, a token-level credit assignment method that uses KL divergence to quantify visual dependency in LVLM decoding.
- PGPO employs threshold gating and sum-preserving advantage reallocation to reduce gradient noise and boost visual reasoning accuracy by up to 18.7%.
- Empirical results on benchmarks like MathVerse demonstrate improved training stability and effective visual anchoring to enhance multimodal learning.
Perception-Grounded Policy Optimization for Fine-Grained Credit Assignment in Large Vision-LLMs
Motivation and Problem Statement
Recent advances in Reinforcement Learning from Verifiable Rewards (RLVR) have directly impacted the development of Large Vision-LLMs (LVLMs), yet mainstream RLVR frameworks apply sequence-level, uniform advantage assignment across all output tokens during policy optimization. This indiscriminate credit allocation fundamentally mismatches the nature of multimodal reasoning, where only a sparse subset of tokens within a reasoning trajectory exhibits high, causal dependency on the visual modality. Uniform advantage assignment dilutes critical learning signals indispensable for robust visual reasoning steps and introduces unnecessary gradient noise from language-driven or modality-independent tokens, thus impeding effective multimodal RL.
The authors of "Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-LLMs" (2604.01840) introduce a formalized framework to address this credit assignment flaw on the token level, leveraging information-theoretic quantification of visual dependency to selectively amplify the RL signal on tokens crucial for perception-grounded reasoning.
Quantification of Token Visual Dependency
A central technical contribution is the precise quantification of "Token Visual Dependency" using the Kullback-Leibler (KL) divergence between the model's predictive distribution with and without visual conditioning for a given context:

S(s_t, I) := D_KL( π_θ(· | s_t, I) ∥ π_θ(· | s_t) )

where s_t denotes the autoregressive decoding context at position t and I denotes the input image.
This metric captures the causal information gain ("Bayesian surprise") contributed by the image, is strictly non-negative, and aligns with the conditional mutual information between the image and the generated token given the current context. Empirical studies on the MathVerse benchmark show that the distribution of S is highly skewed: most tokens have negligible visual dependency, while visually critical anchor tokens receive substantially higher scores.
Figure 1: Distribution of token-level visual dependency is long-tailed and sparse, with anchor tokens (e.g., numbers, geometric entities) consistently displaying elevated visual dependency.
Furthermore, analysis confirms that tokens most likely to cause hallucinations (e.g., irrelevant or imagined concepts) exhibit markedly lower S than truthful, visually grounded ones.
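The dependency score above can be sketched in a few lines: two forward passes (with and without the image) yield per-position logits, and S is the per-token KL divergence between the resulting next-token distributions. This is a minimal numpy sketch; the function name and tensor shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def token_visual_dependency(logits_with_image, logits_without_image):
    """Per-token visual dependency S(s_t, I): KL divergence between the
    next-token distributions with and without image conditioning.

    logits_*: (seq_len, vocab) arrays from two forward passes.
    Returns a (seq_len,) array of non-negative scores.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)  # numerical stability
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(logits_with_image)      # log π_θ(· | s_t, I)
    log_q = log_softmax(logits_without_image)   # log π_θ(· | s_t)
    # D_KL(p || q) = Σ_v p(v) (log p(v) − log q(v)), per token position
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
```

A position whose distribution is unchanged by removing the image scores exactly zero, matching the sparsity the paper reports: most tokens contribute no visual information gain.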
Perception-Grounded Policy Optimization (PGPO) Framework
Building on this quantification, the paper introduces Perception-Grounded Policy Optimization (PGPO), a fine-grained, token-level credit assignment algorithm for RLVR. Its mechanism couples two components: threshold gating, which suppresses the advantage assigned to tokens whose visual dependency S falls below a gate, removing gradient noise from language-driven tokens; and sum-preserving advantage reallocation, which redistributes the suppressed credit to high-dependency visual anchor tokens while keeping the total sequence-level advantage unchanged.
Theoretical results show that PGPO reduces gradient variance by scaling down the gradient contributions of visually irrelevant tokens, provably improving the signal-to-noise ratio of policy updates while ensuring monotonicity and rank preservation of the assigned weights.
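The gating-plus-reallocation scheme can be illustrated as follows. This is a hedged sketch: the function name, the linear reweighting, and the uniform fallback are assumptions for illustration; the paper's exact weighting may differ, but the sketch preserves the stated properties (thresholding, monotone rank-preserving weights, and a conserved advantage sum).

```python
import numpy as np

def pgpo_advantages(seq_advantage, vis_dep, tau):
    """Sketch of PGPO-style token-level credit assignment.

    seq_advantage: scalar sequence-level advantage from RLVR.
    vis_dep: (T,) per-token visual dependency scores S(s_t, I).
    tau: gating threshold; tokens below tau receive zero credit.

    Returns (T,) token advantages whose sum equals T * seq_advantage,
    i.e. the total credit of uniform assignment is preserved.
    """
    vis_dep = np.asarray(vis_dep, dtype=float)
    T = len(vis_dep)
    # Threshold gating: zero out weight on visually irrelevant tokens.
    gated = np.where(vis_dep >= tau, vis_dep, 0.0)
    if gated.sum() == 0.0:
        weights = np.ones(T)  # fallback to uniform credit (assumption)
    else:
        # Monotone, rank-preserving weights summing to T (sum-preserving).
        weights = T * gated / gated.sum()
    return seq_advantage * weights
```

Because the weights sum to T, the sequence's total advantage mass is identical to uniform assignment; the gradient signal is merely concentrated on visual anchor tokens rather than rescaled overall.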
Empirical Results and Analysis
Comprehensive evaluation with Qwen2.5-VL-3B/7B models, using the ViRL39K training set and seven rigorous multimodal reasoning benchmarks (Geo3k, MMK12, MathVerse, DynaMath, MathVision, LogicVista, MMMU-Pro), demonstrates substantial improvements over baseline approaches (GRPO, DAPO, PAPO, VPPO).


Figure 3: PGPO consistently yields higher accuracy across training epochs, with accelerated early convergence and enhanced final performance.
PGPO achieves an average +18.7% gain in accuracy relative to vanilla Qwen2.5-VL and surpasses the strongest prior variants, especially on highly vision-centric benchmarks (e.g., MathVerse, MMMU-Pro). Ablation studies confirm the criticality of combining both suppression of noise tokens and boosting of visual anchors, as well as the necessity of sum-preserving normalization for stable optimization.
PGPO also demonstrates:
- Elimination of late-stage training collapse typical in competing approaches (e.g., DAPO), evidenced by avoidance of entropy explosion.
- Rapid and sustained increases in average visual dependency throughout training, signifying effective reallocation of model capacity towards visual perception.
Case Studies
Case study analysis further illustrates that PGPO selectively reinforces reasoning steps tightly grounded in visual information, resulting in significantly improved accuracy and consistency on challenging math and general VQA items versus uniform credit assignment baselines.
Figure 4: On geometry reasoning, PGPO correctly identifies minimal pivotal rotation by emphasizing critical angle tokens, outperforming sequence-level RL baselines.
Figure 5: For architectural plan recognition, PGPO highlights and rewards visually-dependent tokens, yielding correct structural classification, whereas baselines misattribute credit to language-biased tokens.
Figure 6: In artifact identification, PGPO selectively amplifies the RL signal on tokens associated with visual features, resolving ambiguities that confound standard approaches.
Computational and Practical Considerations
PGPO's computational overhead is moderate (roughly 10% above DAPO): KL estimates are computed efficiently via a dedicated attention mask in a second, parallel forward pass, with Monte Carlo estimation used for scalability. An extensive hyperparameter search confirms that the primary threshold and boosting settings generalize robustly across tasks and model sizes.
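To see why Monte Carlo estimation helps at LVLM vocabulary sizes, note that the exact KL requires a sum over the full vocabulary, whereas sampling tokens from the conditioned distribution gives an unbiased estimate of the same quantity. A minimal sketch (the estimator form is an assumption; the paper's exact estimator may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_kl(log_p, log_q, n_samples=256):
    """Monte Carlo estimate of D_KL(p || q) over a large vocabulary:
    draw tokens v ~ p and average log p(v) − log q(v), avoiding a
    full sum over all vocabulary entries."""
    p = np.exp(log_p)
    v = rng.choice(len(p), size=n_samples, p=p)
    return float(np.mean(log_p[v] - log_q[v]))
```

With enough samples the estimate converges to the exact KL, while the per-token cost is independent of vocabulary size.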
Implications and Future Directions
PGPO provides an information-theoretically-grounded foundation for fine-grained credit assignment in multimodal RL with LVLMs. It delivers robust improvements on complex visual reasoning while sharpening visual anchoring without overfitting to spurious textual priors or sacrificing performance on general VQA. This framework is extensible to larger model scales and more diverse multimodal tasks (e.g., document reasoning, GUI understanding), though experiments focused on the 3B/7B scale due to resource constraints.
Theoretical implications include the formal demonstration that a zero-mean advantage is necessary for stable RLVR optimization, and that monotonic, intra-sequence rank-preserving advantage modulation tied to information gain is strictly beneficial. Practically, PGPO's principle of token-sensitive reward assignment aligns with recent trends in LLMs/MLLMs emphasizing sparse, critical token-wise learning over blunt sequence-level updates.
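The zero-mean advantage property referenced above is the same normalization used in GRPO-style group baselines: rewards for a group of sampled responses to the same prompt are centered (and typically scaled) before serving as advantages. A minimal illustration, with the epsilon stabilizer as an assumption:

```python
import numpy as np

def group_normalized_advantages(rewards):
    """GRPO-style zero-mean advantages: subtract the group mean and
    scale by the group standard deviation, so advantages sum to ~0
    and the policy gradient baseline stays unbiased."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

PGPO's sum-preserving reallocation operates on top of such zero-mean sequence advantages, which is why preserving the per-sequence sum matters for stability.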
Conclusion
The perception-grounded PGPO paradigm fundamentally advances fine-grained optimization in multimodal RL and establishes a rigorous methodological bridge between information-theoretic metrics and reinforcement learning operations in LVLMs. Its token-level advantage sculpting yields empirically and theoretically superior reasoning capability, robust training, and improved factuality by ensuring credit is focused where visual grounding is essential.
Reference:
"Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-LLMs" (2604.01840).