Principled RAC-based checkpoint selection for GRPO-trained VLMs

Develop principled rules and a formal methodology for selecting training checkpoints during GRPO-style reinforcement-learning post-training of vision-language models, using only the Reasoning–Answer Consistency (RAC) metric or another single training-time signal, so that checkpoints can be chosen without downstream validation.

Background

The paper introduces Reasoning–Answer Consistency (RAC) as a diagnostic signal during GRPO post-training of vision-language models and observes that RAC often rises early and then declines later in training. The authors find that RAC tends to correlate with downstream task accuracy, and that intermediate checkpoints near local RAC peaks can perform well.
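To make the observation concrete, the following is a minimal sketch of one naive single-signal heuristic: flag checkpoints whose (smoothed) RAC is a local peak. This is not the paper's method and not the principled rule the open problem asks for; the function names, smoothing window, and prominence threshold are illustrative assumptions.

```python
# Naive RAC-peak heuristic for flagging candidate checkpoints.
# Illustrative only: the paper does not specify such a rule, and this sketch
# ignores plateaus, overall rise/decline trends, and noise modeling.

from typing import List, Tuple


def smooth(values: List[float], window: int = 3) -> List[float]:
    """Centered moving average to damp step-to-step RAC noise."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def rac_peak_checkpoints(
    rac_log: List[Tuple[int, float]],  # (training step, RAC measured at that step)
    window: int = 3,
    min_prominence: float = 0.01,
) -> List[int]:
    """Return training steps whose smoothed RAC is a local maximum.

    A step is flagged if its smoothed RAC exceeds both neighbors by at
    least `min_prominence` (a hypothetical threshold, not from the paper).
    """
    steps = [s for s, _ in rac_log]
    rac = smooth([r for _, r in rac_log], window)
    candidates = []
    for i in range(1, len(rac) - 1):
        if rac[i] - rac[i - 1] >= min_prominence and rac[i] - rac[i + 1] >= min_prominence:
            candidates.append(steps[i])
    return candidates


if __name__ == "__main__":
    # Toy RAC trajectory that rises early, peaks, then declines, mimicking the
    # qualitative behavior reported in the paper.
    log = [(100 * i, v) for i, v in enumerate([0.50, 0.58, 0.67, 0.72, 0.66, 0.62, 0.58])]
    print(rac_peak_checkpoints(log))  # -> [300]
```

A rule like this still leaves open exactly the questions the paper defers: how to choose the smoothing and prominence parameters without a validation set, and when a local RAC peak is actually predictive of downstream accuracy.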

Despite these observations, the authors do not provide a formal procedure for selecting checkpoints using RAC alone and note that establishing such principled selection rules is nontrivial. They explicitly defer this to future work, indicating a concrete unresolved methodological question.

References

However, formalizing checkpoint selection solely from RAC (or any single training-time signal) remains nontrivial; we leave principled selection rules to future work.

Puzzle Curriculum GRPO for Vision-Centric Reasoning (2512.14944 - Jeddi et al., 16 Dec 2025) in Supplementary Section S3: RAC Measurement and Checkpoint Reporting