Efficacy of PGPO on Large-Scale LVLMs

Determine whether Perception-Grounded Policy Optimization (PGPO) remains effective when applied to Large Vision-Language Models (LVLMs) exceeding 7B parameters, the largest scale tested in the original work.

Background

The paper introduces Perception-Grounded Policy Optimization (PGPO) to improve token-level credit assignment in multimodal reinforcement learning and reports gains using Qwen2.5-VL models up to 7B parameters.
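The paper's exact objective is not reproduced here, but the general idea of token-level credit assignment can be illustrated with a minimal sketch. In the snippet below, the per-token grounding weights and the `token_level_pg_loss` helper are assumptions introduced purely for illustration; PGPO's actual formulation may differ.

```python
# Minimal sketch of token-level credit assignment in a policy-gradient
# loss. The perception-grounding weights are an assumed stand-in for
# whatever visual-evidence signal PGPO actually uses.
import torch

def token_level_pg_loss(logprobs: torch.Tensor,
                        advantages: torch.Tensor,
                        grounding_weights: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style loss where each token's advantage is rescaled by a
    per-token grounding weight in [0, 1] before averaging.

    All arguments have shape (batch, seq_len).
    """
    weighted_adv = advantages * grounding_weights   # redistribute credit
    return -(logprobs * weighted_adv.detach()).mean()

# Toy usage with random tensors standing in for model outputs.
lp = torch.rand(2, 8).clamp_min(1e-6).log()   # fake per-token log-probs
adv = torch.randn(2, 8)                       # fake per-token advantages
w = torch.rand(2, 8)                          # fake grounding weights
print(token_level_pg_loss(lp, adv, w).item())
```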

Due to computational constraints, the experiments did not include larger models. Although the results suggest a positive scaling trend, the authors explicitly note that PGPO's effectiveness at larger scales remains unverified, leaving scalability an open question.
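To make the open question concrete, the sketch below lays out a hypothetical scaling study over the larger public Qwen2.5-VL checkpoints. The checkpoint names are real Hugging Face model IDs, but the `train` and `evaluate` helpers and the benchmark choices are placeholders, not the paper's released tooling.

```python
# Hypothetical scaling grid for verifying PGPO beyond 7B parameters.
# `train` and `evaluate` are assumed callables supplied by the user;
# they are not part of any released codebase.
SCALES = [
    "Qwen/Qwen2.5-VL-3B-Instruct",
    "Qwen/Qwen2.5-VL-7B-Instruct",    # largest scale tested in the paper
    "Qwen/Qwen2.5-VL-32B-Instruct",   # beyond the paper's compute budget
    "Qwen/Qwen2.5-VL-72B-Instruct",
]

def run_scaling_study(train, evaluate, benchmarks):
    """Train each scale with PGPO and a baseline, then record the
    PGPO-minus-baseline score delta on every benchmark."""
    deltas = {}
    for ckpt in SCALES:
        pgpo = train(ckpt, algorithm="pgpo")
        base = train(ckpt, algorithm="baseline")
        deltas[ckpt] = {b: evaluate(pgpo, b) - evaluate(base, b)
                        for b in benchmarks}
    return deltas

# Smoke test with stubbed training/evaluation.
print(run_scaling_study(
    train=lambda ckpt, algorithm: (ckpt, algorithm),
    evaluate=lambda model, bench: 0.0,
    benchmarks=["MathVista", "MMMU"],  # illustrative multimodal suites
))
```

Under this setup, a PGPO-minus-baseline delta that holds or grows at the 32B and 72B scales would support the positive scaling trend the authors report; a shrinking delta would indicate the gains do not transfer.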

References

Furthermore, limited by our computational resources, our experiments validate PGPO on models up to the 7B parameter scale. Although these results indicate a positive scaling trend, its efficacy on large-scale models remains to be verified.