Puzzle Curriculum GRPO (PC-GRPO)
- Puzzle Curriculum GRPO (PC-GRPO) is a reinforcement learning framework that extends GRPO by incorporating self-supervised puzzles and reward shaping to boost generalization.
- It employs innovative puzzle environments like PatchFit, Rotation, and Jigsaw with offline scoring and dynamic curriculum scheduling to provide verifiable, annotation-free rewards.
- Empirical findings show that PC-GRPO improves chain-of-thought consistency and multimodal performance, offering a scalable alternative to traditional supervised training.
Puzzle Curriculum Group Relative Policy Optimization (PC-GRPO) is a reinforcement learning post-training paradigm designed to enhance reasoning and generalization capabilities in large models through puzzle-based, curriculum-driven optimization. It extends Group Relative Policy Optimization (GRPO) by integrating self-supervised puzzle environments, offline and dynamic difficulty scoring, reward shaping, and curriculum-induced sample weighting. PC-GRPO is applicable to both intent detection in dialogue systems and vision-centric multimodal reasoning, offering scalable, annotation-free training by leveraging verifiable rewards (Feng et al., 18 Apr 2025, Jeddi et al., 16 Dec 2025).
1. Core Principles and GRPO Foundation
PC-GRPO builds upon GRPO, a reinforcement learning objective employing PPO-style policy updates with group-relative baselines. For a stochastic policy $\pi_\theta$ generating a group of $G$ trajectories $\{o_i\}_{i=1}^{G}$ per prompt $q$ in a reward-yielding environment, the GRPO objective is:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\left(r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\!\left(r_i(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_i\right)\right]$$

where $r_i(\theta) = \pi_\theta(o_i \mid q)/\pi_{\theta_{\text{old}}}(o_i \mid q)$ and the relative advantage is $\hat{A}_i = \left(R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})\right)/\operatorname{std}(\{R_j\}_{j=1}^{G})$, with the group mean serving as the group-specific value baseline. This structure stabilizes policy gradients across diverse puzzle types grouped by difficulty or family (Feng et al., 18 Apr 2025).
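As a concrete sketch (not the authors' code), the group-relative advantage at the heart of this objective can be computed from a group's scalar rewards as follows:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its group's mean and std.

    `rewards` holds the scalar rewards of the G rollouts sampled for one
    prompt; the group mean acts as the value baseline, so no learned
    critic is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline is the group mean, the advantages of any group sum to (numerically) zero: correct rollouts are pushed up exactly as much as incorrect ones are pushed down.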
2. Self-Supervised Puzzle Environments
PC-GRPO redefines outcome supervision by substituting human-labeled or externally verified rewards with self-supervised puzzles possessing programmatically computable, verifiable rewards:
- PatchFit: Given an image with one patch masked, select the correct patch among candidates. Binary reward: $1$ if correct, $0$ otherwise.
- Rotation: Predict the true rotation angle from a set of options. Binary reward: $1$ if the angle matches, $0$ otherwise.
- Jigsaw: Reassemble a randomly permuted grid of $N$ tiles. Graded reward: the fraction of tiles placed in their correct positions, $r = \frac{1}{N}\sum_{t=1}^{N}\mathbb{1}[\hat{\sigma}(t) = \sigma(t)]$, providing partial credit (Jeddi et al., 16 Dec 2025).
Each puzzle is posed as a chain-of-thought (CoT) multiple-choice problem, with $G$ sampled rollouts per prompt.
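A minimal sketch of the verifiable reward functions described above (the scoring code itself is an assumption; only the reward definitions come from the text):

```python
def binary_reward(predicted, target):
    """PatchFit / Rotation: 1 if the chosen option matches the truth, else 0."""
    return 1.0 if predicted == target else 0.0

def jigsaw_reward(predicted_perm, target_perm):
    """Jigsaw: fraction of tiles placed in their correct position,
    granting partial credit for near-correct reassemblies."""
    assert len(predicted_perm) == len(target_perm)
    correct = sum(p == t for p, t in zip(predicted_perm, target_perm))
    return correct / len(target_perm)
```

Since both rewards are computed directly from the known ground-truth permutation or option, no annotator or external verifier is involved.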
3. Curriculum Mechanisms: Offline Scoring and Difficulty Weighting
To maximize learning efficiency and generalization, PC-GRPO employs multi-stage curriculum design:
- Offline scoring: Each puzzle $i$ is rolled out $m$ times with an initial policy; the total reward $s_i = \sum_{k=1}^{m} r_i^{(k)}$ is recorded. Lower $s_i$ indicates higher difficulty (Feng et al., 18 Apr 2025).
- Curriculum scheduling: At training iteration $t$ (of $T$ total), the bottom fraction $p_t$ of hardest puzzles is selected, where $p_t$ grows over training from an initial proportion $p_0$ (e.g., $0.1$); puzzles are sorted by ascending $s_i$.
- Dynamic difficulty weighting: During training, each group's loss contribution is weighted by a function $w(d)$ of intra-group difficulty $d \in [0, 1]$. For binary tasks, $d$ is derived from the group's mean reward; for Jigsaw, $d$ is computed from the permutation diversity of the rollouts. This emphasizes medium-hard samples ($w(d)$ peaks at $d = 0.5$) and suppresses flat/easy or fully intractable puzzles (Jeddi et al., 16 Dec 2025).
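The curriculum selection and difficulty weighting above can be sketched as follows. Both the linear growth of the selected fraction and the quadratic weight $w(d) = 4d(1-d)$ are illustrative assumptions consistent with the description (a fraction growing from $p_0$, a weight peaking at $d = 0.5$), not the papers' exact forms:

```python
def hardest_fraction(t, T, p0=0.1):
    """Grow the selected fraction linearly from p0 toward 1 over T iterations
    (assumed schedule; the source only states that the fraction grows)."""
    return p0 + (1.0 - p0) * t / T

def select_curriculum(puzzles, scores, t, T, p0=0.1):
    """Keep the bottom p_t fraction of puzzles by ascending offline score s_i
    (low total reward = hard puzzle)."""
    p_t = hardest_fraction(t, T, p0)
    k = max(1, int(p_t * len(puzzles)))
    ranked = sorted(zip(scores, puzzles))  # hardest (lowest s_i) first
    return [p for _, p in ranked[:k]]

def difficulty_weight(d):
    """Assumed quadratic weight peaking at d = 0.5: medium-hard groups
    dominate the loss, while trivially easy or intractable groups vanish."""
    return 4.0 * d * (1.0 - d)
```

Early in training only the hardest tenth of puzzles is visited; by the final iteration the whole pool is in play, while the per-group weight keeps the gradient signal concentrated on groups whose rollouts are neither all-correct nor all-wrong.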
4. PC-GRPO Policy Update and Reward Shaping
The full PC-GRPO policy optimization is:

$$\mathcal{J}_{\text{PC-GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} w(d)\,\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\!\left(\rho_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\!\left(\rho_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_i\right)\right]$$

where $\rho_{i,t}(\theta) = \pi_\theta(o_{i,t} \mid q, o_{i,<t})/\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})$ is the token-level importance weight and $\hat{A}_i$ is the group-relative advantage.
Reward shaping and auxiliary objectives enforce both valid intermediate reasoning and correct final answers. For example, in intent detection, the shaped reward combines an answer-correctness term with a format-validity term, optionally with penalties for thought-chain length or step sparsity (Feng et al., 18 Apr 2025).
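An illustrative sketch (plain Python, not the reference implementation) of the clipped, difficulty-weighted per-token surrogate described in this section:

```python
def pc_grpo_token_loss(ratio, advantage, weight, eps=0.2):
    """Clipped PPO-style surrogate for one token, scaled by the group's
    curriculum weight w(d).

    `ratio` is pi_theta / pi_theta_old at that token; `advantage` is the
    group-relative advantage, shared across all tokens of the rollout.
    Returns a loss (negated objective) to be minimized.
    """
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return -weight * min(unclipped, clipped)
```

The clip keeps a single update from moving the policy too far: once the importance ratio leaves $[1-\epsilon,\,1+\epsilon]$, the objective stops rewarding further movement in that direction, and the curriculum weight rescales the whole group's contribution.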
5. Chain-of-Thought Integration and Consistency Diagnostics
All variants can instantiate CoT by prompting the model to generate explicit reasoning steps (e.g., adding a “Thought:” line). Empirically, CoT substantially improves generalization on hard intent detection and vision tasks, providing higher advantage estimates and improved gradient signal (Feng et al., 18 Apr 2025).
PC-GRPO further tracks Reasoning–Answer Consistency (RAC), a diagnostic metric for ensuring that the generated chain-of-thought entails the final answer. RAC is measured by a frozen judge model that verifies if the rationale in rollouts explicitly supports the answer:

$$\text{RAC} = \frac{1}{G}\sum_{i=1}^{G} c_i$$

where $c_i \in \{0, 1\}$ indicates consistency at rollout $i$. RAC typically rises and then degrades in vanilla GRPO; the curriculum in PC-GRPO flattens and sustains the RAC curve, and lightweight consistency-aware reward schemes further improve RAC, correlating with downstream accuracy (Jeddi et al., 16 Dec 2025).
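Given the frozen judge's per-rollout verdicts, RAC is simply their mean; a minimal sketch:

```python
def rac(consistency_flags):
    """Reasoning-Answer Consistency: fraction of rollouts whose
    chain-of-thought is judged to entail the final answer.

    `consistency_flags` holds the judge's 0/1 verdict c_i for each of the
    G rollouts of a prompt; averaging over prompts gives the tracked curve.
    """
    if not consistency_flags:
        return 0.0
    return sum(consistency_flags) / len(consistency_flags)
```

Tracked over training iterations, this scalar is what exhibits the rise-then-degrade pattern under vanilla GRPO and the flatter, sustained curve under the PC-GRPO curriculum.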
6. Implementation and Practical Adaptation
Table: Major PC-GRPO Parameters in Vision-Centric Experiments
| Component | Setting | Value |
|---|---|---|
| Backbone | Qwen-VL-2.5-Instruct (7B, 3B) | 7B/3B |
| Puzzle Data | COCO-2014: Jigsaw (15K), PatchFit (15K), Rotation (10K) | 40K total |
| RL Config | PPO clip $\epsilon$, no KL anchor ($\beta = 0$), batch size 16/8 (CARE) | – |
| Post-Procs | GRPO-CARE: EMA 0.995, bonus coeff 0.5, conf. 0.95 | – |
| Hardware | 8×A100 GPUs | ~100 hr (Jigsaw) |
Models are trained with token-level updates, sampling $G$ rollouts per prompt; all puzzle rewards are self-generated, requiring no human annotation or external verifier. PC-GRPO applies across intent detection (Feng et al., 18 Apr 2025) and vision-language tasks (Jeddi et al., 16 Dec 2025), with hyperparameters adjusted per domain.
7. Empirical Findings and Benchmarking
PC-GRPO yields the following empirical results:
- Puzzle-specific gains: Post-training on a given puzzle (e.g., Jigsaw) increases test accuracy from approximately 25% to 36%. Gains do not significantly transfer across puzzle types without a mixed curriculum.
- Curriculum and RAC: Curriculum schedules maintain higher reward variance and delay the decline in RAC compared to vanilla GRPO. Addition of a consistency-aware reward (GRPO-CARE) maximizes RAC and training stability.
- Downstream multimodal tasks: On multimodal benchmarks (MME, MMStar, POPE, MMT-Bench, CV-Bench-2D, MMVP, ColorBench, LISA-Grounding, SEED-Bench), PC-GRPO exhibits superior chain-of-thought and direct-answer performance compared to both annotation-free (Vision-Zero, ViCrit, Visual Jigsaw, VisualSphinx) and annotation-reliant baselines, even outperforming GRPO-CARE trained with human labels in several cases (Jeddi et al., 16 Dec 2025).
- Benchmark cleaning: Removal of mislabelled or ambiguous samples from popular benchmarks further accentuates the gains from PC-GRPO.
A plausible implication is that the annotation-free, verifiable reward structure, in conjunction with medium-difficulty-focused curriculum weighting, is critical for both stable learning and interpretable, scalable policy post-training.
8. Significance and Future Directions
PC-GRPO demonstrates a principled, supervision-free recipe for reinforcement learning with verifiable rewards, circumventing the costs and inconsistencies of annotation or external verification. Its curriculum mechanisms address vanishing group-relative advantages, while self-supervised puzzle environments enable scalable application to both language and vision-LLMs. RAC provides a generalizable, post-hoc diagnostic for reasoning fidelity.
The framework offers a pathway for training robust, adaptable policies in scenarios where explicit annotation is impractical or reward signals are otherwise sparse. Future research may expand puzzle environments for more diverse reasoning modalities, refine curriculum schedules dynamically, or further correlate RAC to advanced downstream metrics (Feng et al., 18 Apr 2025, Jeddi et al., 16 Dec 2025).