
Puzzle Curriculum GRPO (PC-GRPO)

Updated 18 December 2025
  • Puzzle Curriculum GRPO (PC-GRPO) is a reinforcement learning framework that extends GRPO by incorporating self-supervised puzzles and reward shaping to boost generalization.
  • It employs innovative puzzle environments like PatchFit, Rotation, and Jigsaw with offline scoring and dynamic curriculum scheduling to provide verifiable, annotation-free rewards.
  • Empirical findings show that PC-GRPO improves chain-of-thought consistency and multimodal performance, offering a scalable alternative to traditional supervised training.

Puzzle Curriculum Group Relative Policy Optimization (PC-GRPO) is a reinforcement learning post-training paradigm designed to enhance reasoning and generalization capabilities in large models through puzzle-based, curriculum-driven optimization. It extends Group Relative Policy Optimization (GRPO) by integrating self-supervised puzzle environments, offline and dynamic difficulty scoring, reward shaping, and curriculum-induced sample weighting. PC-GRPO is applicable to both intent detection in dialogue systems and vision-centric multimodal reasoning, offering scalable, annotation-free training by leveraging verifiable rewards (Feng et al., 18 Apr 2025, Jeddi et al., 16 Dec 2025).

1. Core Principles and GRPO Foundation

PC-GRPO builds upon GRPO, a reinforcement learning objective employing PPO-style policy updates with group-relative baselines. For a stochastic policy $\pi_\theta(a \mid s)$ generating trajectories $\tau$ in a reward-yielding environment, the GRPO objective is:

$$L_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{\tau\sim\pi_{\theta_{\mathrm{old}}}} \left[ \min\left(r_t(\theta)\,\hat A_t,\ \mathrm{clip}\bigl(r_t(\theta),\,1-\epsilon,\,1+\epsilon\bigr)\,\hat A_t\right) \right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$ and the relative advantage is $\hat A_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k} - V^g_{\theta_{\mathrm{old}}}(s_t)$, with $V^g$ the group-specific value baseline. This structure stabilizes policy gradients across diverse puzzle types grouped by difficulty or family (Feng et al., 18 Apr 2025).
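A minimal sketch of this clipped surrogate, assuming per-token log-probabilities under the current and behavior policies and precomputed group-relative advantages are already available (names and shapes are illustrative, not taken from the papers):

```python
import numpy as np

def grpo_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Clipped GRPO surrogate for one group of rollouts.

    logp_new, logp_old: per-token log-probs under current / behavior policy.
    advantages: group-relative advantage estimates, broadcastable over tokens.
    """
    ratio = np.exp(logp_new - logp_old)          # r_t(theta)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)   # clip(r_t, 1 - eps, 1 + eps)
    # Pessimistic (min) combination of the unclipped and clipped terms.
    return np.minimum(ratio * advantages, clipped * advantages).mean()
```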

2. Self-Supervised Puzzle Environments

PC-GRPO redefines outcome supervision by substituting human-labeled or externally verified rewards with self-supervised puzzles possessing programmatically computable, verifiable rewards:

  • PatchFit: Given an image with one patch masked, select the correct patch among $D+1$ candidates. Binary reward: $r_i^{\mathrm{PF}} = 1$ if correct, $0$ otherwise.
  • Rotation: Predict the true rotation angle from $K$ options. Binary reward: $r_i^{\mathrm{ROT}} = 1$ if the angle matches, $0$ otherwise.
  • Jigsaw: Reassemble a randomly permuted $M \times N$ grid. Graded reward: $r_i^{\mathrm{JIG}} = \frac{\mathrm{correct}(o_i)}{T}$ with $T = M \cdot N$ tiles, providing partial credit (Jeddi et al., 16 Dec 2025).

Each puzzle is posed as a chain-of-thought (CoT) multiple-choice problem, with $G$ sampled rollouts per prompt.
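The three reward functions are simple to compute programmatically; the sketch below assumes each rollout has already been parsed into a discrete prediction (function names and signatures are hypothetical):

```python
def patchfit_reward(pred_idx: int, true_idx: int) -> float:
    # Binary reward: 1 if the chosen candidate is the masked patch.
    return 1.0 if pred_idx == true_idx else 0.0

def rotation_reward(pred_angle: int, true_angle: int) -> float:
    # Binary reward: 1 if the predicted rotation angle matches.
    return 1.0 if pred_angle == true_angle else 0.0

def jigsaw_reward(pred_perm: list[int], true_perm: list[int]) -> float:
    # Graded reward: fraction of the T = M*N tiles placed correctly.
    T = len(true_perm)
    correct = sum(p == t for p, t in zip(pred_perm, true_perm))
    return correct / T
```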

3. Curriculum Mechanisms: Offline Scoring and Difficulty Weighting

To maximize learning efficiency and generalization, PC-GRPO employs multi-stage curriculum design:

  • Offline scoring: Each puzzle is rolled out $G$ times with an initial policy, giving total reward $\mathrm{Score}_i = \sum_{j=1}^{G} R^{i,j}$. Lower $\mathrm{Score}_i$ indicates higher difficulty (Feng et al., 18 Apr 2025).
  • Curriculum scheduling: At training iteration $t$ (of $T$ total), the bottom $p(t)$ fraction of hardest puzzles is selected:

$$p(t) = p_0 + (1 - p_0)\,\frac{t}{T}$$

where $p_0$ (e.g., $0.1$) is the initial proportion; puzzles are sorted by ascending $\mathrm{Score}_i$.

  • Dynamic difficulty weighting: During training, each group's loss contribution is weighted by a function $w(d(q))$ of the intra-group difficulty $d(q)$. For binary tasks, $d = \bar r = \frac{1}{G}\sum_{i} r_i$; for Jigsaw, permutation diversity is used:

$$d = \frac{M-1}{G-1},\qquad w(d) = 4\sigma\, d(1 - d),\qquad \sigma = 1.8$$

This emphasizes medium-hard samples ($w$ peaks at $d = 0.5$) and suppresses trivially easy or fully intractable puzzles (Jeddi et al., 16 Dec 2025); a short code sketch of both mechanisms follows below.
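A compact sketch of the offline curriculum selection and the difficulty weight, assuming puzzles have already been scored with $G$ rollouts each (variable names are illustrative):

```python
import numpy as np

def select_curriculum(scores, t, T, p0=0.1):
    """Keep the hardest p(t) fraction of puzzles at iteration t.

    scores[i] = sum of the G rollout rewards for puzzle i; lower means harder.
    """
    p_t = p0 + (1.0 - p0) * t / T
    order = np.argsort(scores)            # ascending score: hardest first
    k = max(1, int(p_t * len(scores)))
    return order[:k]                      # indices of the selected puzzles

def difficulty_weight(d, sigma=1.8):
    """w(d) = 4*sigma*d*(1-d): peaks at d = 0.5, downweights trivial or intractable groups."""
    return 4.0 * sigma * d * (1.0 - d)
```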

4. PC-GRPO Policy Update and Reward Shaping

The full PC-GRPO policy optimization is:

$$\mathcal{J}_{\mathrm{PC\text{-}GRPO}}(\theta) = \mathbb{E}_{(q,a),\,\{o_i\}} \left[ w\bigl(d(q)\bigr)\, \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\left(\rho_{i,t}\,\hat{A}_i,\ \tilde{\rho}_{i,t}\,\hat{A}_i\right) \right]$$

where $\rho_{i,t}$ is the token-level importance weight, $\tilde{\rho}_{i,t}$ its clipped counterpart (following the GRPO clipping above), and $\hat{A}_i$ is the group-relative advantage.
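A schematic, non-vectorized version of this objective for a single prompt's group is sketched below; it assumes the token-level ratios and group-relative advantages are precomputed and treats $\tilde{\rho}_{i,t}$ as the clipped ratio, consistent with the GRPO rule above (illustrative only):

```python
import numpy as np

def pc_grpo_group_objective(rho, adv, d_q, eps=0.2, sigma=1.8):
    """PC-GRPO contribution of one prompt q.

    rho: list of per-rollout arrays of token importance ratios rho_{i,t}.
    adv: per-rollout group-relative advantages A_hat_i.
    d_q: intra-group difficulty of this prompt.
    """
    w = 4.0 * sigma * d_q * (1.0 - d_q)                 # curriculum weight w(d(q))
    total = 0.0
    for rho_i, A_i in zip(rho, adv):
        rho_clip = np.clip(rho_i, 1 - eps, 1 + eps)     # tilde rho_{i,t}
        # Token-average of the clipped objective, i.e. (1/|o_i|) * sum_t min(...)
        total += np.minimum(rho_i * A_i, rho_clip * A_i).mean()
    return w * total / len(rho)                         # average over G rollouts
```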

Reward shaping and auxiliary objectives enforce both valid intermediate reasoning and correct final answers. For example, in intent detection, the shaped reward is $\lambda_{\mathrm{fmt}} R_{\mathrm{fmt}} + \lambda_{\mathrm{ans}} R_{\mathrm{ans}}$, optionally with penalties for thought-chain length or step sparsity (Feng et al., 18 Apr 2025).
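A minimal sketch of such a shaped reward for intent detection, with placeholder format and answer checks (the regexes and weights are assumptions, not the papers' implementation):

```python
import re

def has_valid_format(completion: str) -> bool:
    # Hypothetical format check: a "Thought:" section followed by an "Answer:" line.
    return bool(re.search(r"Thought:.*Answer:", completion, flags=re.S))

def extract_answer(completion: str) -> str:
    # Pull out the text after "Answer:" as the predicted intent label.
    m = re.search(r"Answer:\s*(.+)", completion)
    return m.group(1).strip() if m else ""

def shaped_reward(completion: str, gold_answer: str,
                  lam_fmt: float = 0.1, lam_ans: float = 1.0) -> float:
    # R = lambda_fmt * R_fmt + lambda_ans * R_ans (weights are placeholders).
    r_fmt = 1.0 if has_valid_format(completion) else 0.0
    r_ans = 1.0 if extract_answer(completion) == gold_answer else 0.0
    return lam_fmt * r_fmt + lam_ans * r_ans
```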

5. Chain-of-Thought Integration and Consistency Diagnostics

All variants can instantiate CoT by prompting the model to generate explicit reasoning steps (e.g., adding a “Thought:” line). Empirically, CoT substantially improves generalization on hard intent detection and vision tasks, providing higher advantage estimates and improved gradient signal (Feng et al., 18 Apr 2025).

PC-GRPO further tracks Reasoning–Answer Consistency (RAC), a diagnostic metric for ensuring that the generated chain-of-thought entails the final answer. RAC is measured by a frozen judge model that verifies whether the rationale in each rollout explicitly supports the answer:

$$\mathrm{RAC}(t) = \frac{1}{N}\sum_{j=1}^{N} c_j$$

where $c_j \in \{0,1\}$ indicates consistency at rollout $j$. RAC typically rises and then degrades in vanilla GRPO; the curriculum in PC-GRPO flattens and sustains the RAC curve, and lightweight consistency-aware reward schemes further improve RAC, correlating with downstream accuracy (Jeddi et al., 16 Dec 2025).
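RAC can be computed post hoc over a batch of rollouts with any frozen judge; the sketch below leaves the judge as an abstract callable, since its implementation is not specified here:

```python
from typing import Callable, Sequence

def rac(rollouts: Sequence[tuple[str, str]],
        judge: Callable[[str, str], int]) -> float:
    """Reasoning-Answer Consistency over N rollouts.

    rollouts: (chain_of_thought, final_answer) pairs.
    judge: returns 1 if the rationale entails the answer, else 0.
    """
    c = [judge(cot, ans) for cot, ans in rollouts]
    return sum(c) / len(c)
```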

6. Implementation and Practical Adaptation

Table: Major PC-GRPO parameters in vision-centric experiments

| Component | Setting | Value |
|---|---|---|
| Backbone | Qwen-VL-2.5-Instruct | 7B / 3B |
| Puzzle data | COCO-2014: Jigsaw (15K), PatchFit (15K), Rotation (10K) | 40K total |
| RL config | PPO clip $\epsilon = 0.2$, no KL anchor ($\beta = 0$), batch size 16/8 (CARE), $G = 8$ | |
| Post-processing | GRPO-CARE: EMA 0.995, bonus coeff 0.5, confidence 0.95 | |
| Hardware | 8×A100 GPUs | ~100 hr (Jigsaw) |

Models are trained with token-level updates, sampling $G$ rollouts per prompt; all puzzle rewards are self-generated, requiring no human annotation or external verifier. PC-GRPO applies across intent detection (Feng et al., 18 Apr 2025) and vision-language tasks (Jeddi et al., 16 Dec 2025), with hyperparameters adjusted per domain.
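The settings in the table map naturally onto a small configuration object; the sketch below is an illustrative grouping of those hyperparameters rather than a published configuration file:

```python
from dataclasses import dataclass, field

@dataclass
class PCGRPOConfig:
    backbone: str = "Qwen-VL-2.5-Instruct"   # 7B or 3B variant
    puzzle_data: dict = field(default_factory=lambda: {
        "jigsaw": 15_000, "patchfit": 15_000, "rotation": 10_000})  # COCO-2014 prompts
    clip_eps: float = 0.2      # PPO clip epsilon
    kl_beta: float = 0.0       # no KL anchor
    group_size: int = 8        # G rollouts per prompt
    batch_size: int = 16       # 8 for the CARE variant
    care_ema: float = 0.995    # GRPO-CARE EMA decay
    care_bonus: float = 0.5    # GRPO-CARE bonus coefficient
    care_confidence: float = 0.95
```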

7. Empirical Findings and Benchmarking

PC-GRPO yields the following empirical results:

  • Puzzle-specific gains: Post-training on a given puzzle (e.g., Jigsaw) increases test accuracy from approximately 25% to 36%. Gains do not significantly transfer across puzzle types without a mixed curriculum.
  • Curriculum and RAC: Curriculum schedules maintain higher reward variance and delay the decline in RAC compared to vanilla GRPO. Addition of a consistency-aware reward (GRPO-CARE) maximizes RAC and training stability.
  • Downstream multimodal tasks: On multimodal benchmarks (MME, MMStar, POPE, MMT-Bench, CV-Bench-2D, MMVP, ColorBench, LISA-Grounding, SEED-Bench), PC-GRPO exhibits superior chain-of-thought and direct-answer performance compared to both annotation-free (Vision-Zero, ViCrit, Visual Jigsaw, VisualSphinx) and annotation-reliant baselines, even outperforming GRPO-CARE trained with human labels in several cases (Jeddi et al., 16 Dec 2025).
  • Benchmark cleaning: Removal of mislabelled or ambiguous samples from popular benchmarks further accentuates the gains from PC-GRPO.

A plausible implication is that the annotation-free, verifiable reward structure, in conjunction with medium-difficulty-focused curriculum weighting, is critical for both stable learning and interpretable, scalable policy post-training.

8. Significance and Future Directions

PC-GRPO demonstrates a principled, supervision-free recipe for reinforcement learning with verifiable rewards, circumventing the costs and inconsistencies of annotation or external verification. Its curriculum mechanisms address vanishing group-relative advantages, while self-supervised puzzle environments enable scalable application to both language models and vision-language models. RAC provides a generalizable, post-hoc diagnostic for reasoning fidelity.

The framework offers a pathway for training robust, adaptable policies in scenarios where explicit annotation is impractical or reward signals are otherwise sparse. Future research may expand puzzle environments for more diverse reasoning modalities, refine curriculum schedules dynamically, or further correlate RAC to advanced downstream metrics (Feng et al., 18 Apr 2025, Jeddi et al., 16 Dec 2025).
