Principled RAC-based checkpoint selection for GRPO-trained VLMs
Develop principled rules and a formal methodology for selecting training checkpoints during GRPO-style reinforcement-learning post-training of vision-language models (VLMs), using only the Reasoning–Answer Consistency (RAC) metric or another single training-time signal, so that a checkpoint can be chosen without downstream validation.
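To make the problem setup concrete, the sketch below shows one naive baseline: pick the checkpoint whose smoothed RAC is highest. This is not the authors' method and not a principled rule, which the source leaves open. The function name `select_checkpoint_by_rac`, the `(step, rac)` history format, and the `window` parameter are all hypothetical and introduced here only for illustration.

```python
from typing import Sequence, Tuple


def select_checkpoint_by_rac(
    history: Sequence[Tuple[int, float]], window: int = 3
) -> int:
    """Return the training step with the highest trailing moving-average RAC.

    `history` is an assumed log of (training_step, rac_value) pairs, ordered
    by step. This is a naive illustrative heuristic, not a validated rule.
    """
    if not history:
        raise ValueError("history must be non-empty")
    if window < 1:
        raise ValueError("window must be >= 1")

    best_step, best_score = history[0][0], float("-inf")
    for i, (step, _) in enumerate(history):
        # Average RAC over the current checkpoint and up to window-1 earlier ones.
        recent = [rac for _, rac in history[max(0, i - window + 1): i + 1]]
        score = sum(recent) / len(recent)
        # Strict comparison keeps the earliest checkpoint on ties.
        if score > best_score:
            best_step, best_score = step, score
    return best_step
```

Even this toy rule depends on an arbitrary smoothing window, and different windows can select different checkpoints from the same RAC curve. That sensitivity is one reason a single training-time signal is hard to turn into a selection rule without downstream validation.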
References
However, formalizing checkpoint selection solely from RAC (or any single training-time signal) remains nontrivial; we leave principled selection rules to future work.
— Puzzle Curriculum GRPO for Vision-Centric Reasoning
(2512.14944 - Jeddi et al., 16 Dec 2025) in Supplementary Section S3: RAC Measurement and Checkpoint Reporting