Process-Aware Group Relative Policy Optimization
- PA-GRPO is a reinforcement learning framework that combines process mining with group relative policy optimization to enhance multi-step reasoning in large reasoning models.
- It supplements standard correctness and formatting rewards with a conformance signal derived from event log alignment between student and teacher models.
- Empirical evaluations show PA-GRPO outperforms conventional approaches on mathematical reasoning benchmarks, with optimal beta tuning yielding improved performance.
Process-Aware Group Relative Policy Optimization (PA-GRPO), interchangeably referred to as PM4GRPO, is a reinforcement learning (RL) post-training framework targeting the enhancement of large reasoning models (LRMs) for multi-step tasks. Distinct from outcome-centric approaches, PA-GRPO integrates process mining techniques to supplement standard correctness and format-driven rewards with an additional conformance signal reflecting the procedural similarity of model reasoning to a pretrained teacher. This scalar conformance reward leverages event log analysis and process alignment metrics to quantify and incentivize expert-like reasoning traces. Empirical results demonstrate that PA-GRPO outperforms conventional GRPO post-training methods, particularly on challenging mathematical reasoning benchmarks (Park et al., 29 Oct 2025).
1. Formal Foundations: From PPO to GSPO and PA-GRPO
Standard Proximal Policy Optimization (PPO) maximizes a clipped surrogate objective at the token level:

$$\mathcal{J}_{\text{PPO}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right],$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the likelihood ratio and $\hat{A}_t$ is the advantage estimate. Optionally, PPO can be regularized using a KL-divergence constraint.
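To make the clipped surrogate concrete, here is a minimal NumPy sketch of the token-level objective; the function name, toy inputs, and default clipping value are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Token-level PPO clipped surrogate (to be maximized).

    logp_new, logp_old: per-token log-probabilities under the current and
    behavior policies; advantages: per-token advantage estimates.
    """
    ratio = np.exp(logp_new - logp_old)              # r_t(θ)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)   # clip(r_t, 1−ε, 1+ε)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Toy usage with three tokens
logp_new = np.log(np.array([0.30, 0.55, 0.20]))
logp_old = np.log(np.array([0.25, 0.50, 0.25]))
advantages = np.array([1.0, -0.5, 0.8])
print(ppo_clipped_objective(logp_new, logp_old, advantages))
```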
Group Sequence Policy Optimization (GSPO), inspired by DeepSeek-R1 and related work, generalizes PPO by operating at the sequence rather than the token level. For a query $x$, the old policy samples a group of $G$ reasoning sequences $\{y_i\}_{i=1}^{G}$, each receiving a length-normalized, sequence-level importance ratio

$$s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}\right)^{1/|y_i|}.$$

The corresponding objective uses a group-relative advantage

$$\hat{A}_i = R(x, y_i) - \frac{1}{G}\sum_{j=1}^{G} R(x, y_j).$$

GSPO's surrogate is

$$\mathcal{J}_{\text{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\!\big(s_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\right)\right].$$

By moving optimization to the sequence level, GSPO aligns the reward structure with long reasoning chains, improving stability and properly weighting off-policy samples.
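As a rough illustration of the sequence-level shift, the sketch below computes length-normalized sequence importance ratios and mean-centered group advantages for one query's group of samples; all function and variable names, and the toy numbers, are assumptions for exposition.

```python
import numpy as np

def gspo_surrogate(seq_logp_new, seq_logp_old, seq_lens, rewards, eps=0.2):
    """Sequence-level GSPO-style surrogate for one query's group of G samples.

    seq_logp_new/old: summed log-probabilities of each full sequence y_i,
    seq_lens: |y_i| used for length normalization,
    rewards: scalar rewards R(x, y_i) for the group.
    """
    # s_i(θ) = (π_θ(y_i|x) / π_θ_old(y_i|x))^(1/|y_i|)
    ratios = np.exp((seq_logp_new - seq_logp_old) / seq_lens)
    # Group-relative advantage: Â_i = R_i − mean_j R_j (as in the pseudocode below)
    advantages = rewards - rewards.mean()
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratios * advantages, clipped * advantages))

# Toy group of G = 4 sampled sequences
seq_logp_new = np.array([-42.0, -55.3, -48.1, -60.7])
seq_logp_old = np.array([-43.5, -54.0, -49.0, -59.9])
seq_lens     = np.array([30, 41, 35, 44], dtype=float)
rewards      = np.array([2.6, 1.1, 2.9, 0.4])
print(gspo_surrogate(seq_logp_new, seq_logp_old, seq_lens, rewards))
```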
2. Process Mining Integration and Conformance Reward Construction
Conventional GRPO methods use reward signals based only on final correctness or format. PA-GRPO integrates a conformance reward by comparing the chain-of-thought (CoT) traces of student and teacher models using process mining.
Given a query $x$, the student policy $\pi_\theta$ produces reasoning traces $\sigma_i = \sigma(y_i)$ for its sampled sequences, and the frozen teacher $\pi_R$ provides a reference trace $\sigma_R$. Both are treated as event logs. The Inductive Miner (IM) extracts a process model $M_i = \mathrm{IM}(\sigma_i)$ from the student log, and conformance checking (CC) aligns this model with the teacher's log $\sigma_R$.
Alignment-based conformance yields two metrics per sequence:

$$(\text{fitness}_i,\ \text{precision}_i) = \mathrm{CC}(M_i, \sigma_R).$$

Here, fitness quantifies accurate reproduction of the reference trace, and precision penalizes extra, unreferenced behavior allowed by $M_i$. These are merged using an F1-style metric:

$$R_i^{c} = \frac{2 \cdot \text{fitness}_i \cdot \text{precision}_i}{\text{fitness}_i + \text{precision}_i},$$

which forms the core process-aware signal in PA-GRPO.
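A conformance reward of this shape could be computed with the pm4py library, as in the sketch below. The event-log construction from reasoning-step labels, the use of pm4py's simplified interface (`discover_petri_net_inductive`, `fitness_alignments`, `precision_alignments`), and the returned dictionary key are assumptions that may need adjusting to the pm4py version at hand; the paper's own pipeline may differ in detail.

```python
# pip install pm4py pandas
import pandas as pd
import pm4py

def trace_to_log(steps, case_id="case_1"):
    """Wrap an ordered list of reasoning-step labels as a pm4py event log
    (a DataFrame with the standard case / activity / timestamp columns)."""
    return pm4py.format_dataframe(
        pd.DataFrame({
            "case_id": [case_id] * len(steps),
            "activity": steps,
            "timestamp": pd.date_range("2025-01-01", periods=len(steps), freq="s"),
        }),
        case_id="case_id", activity_key="activity", timestamp_key="timestamp",
    )

def conformance_reward(student_steps, teacher_steps):
    student_log = trace_to_log(student_steps, "student")
    teacher_log = trace_to_log(teacher_steps, "teacher")

    # M_i ← IM(σ_i): mine a Petri net from the student's trace
    net, im, fm = pm4py.discover_petri_net_inductive(student_log)

    # CC(M_i, σ_R): align the teacher's reference log against the student model
    fitness = pm4py.fitness_alignments(teacher_log, net, im, fm)["log_fitness"]  # key may vary by version
    precision = pm4py.precision_alignments(teacher_log, net, im, fm)

    # F1-style merge: R_i^c = 2·fitness·precision / (fitness + precision)
    return 2 * fitness * precision / (fitness + precision + 1e-9)

print(conformance_reward(
    ["parse_problem", "set_up_equation", "solve", "verify", "answer"],
    ["parse_problem", "set_up_equation", "simplify", "solve", "answer"],
))
```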
3. Combined Reward Formulation and Training Workflow
The complete PM4GRPO reward for each generated reasoning sequence $y_i$ combines three elements:

$$R(x, y_i) = R_i^{f} + R_i^{a} + R_i^{c},$$

where $R_i^{f}$ is a format reward, $R_i^{a}$ is answer correctness, and $R_i^{c}$ is process conformance. Generalizing, one can weight the conformance term:

$$R(x, y_i) = R_i^{f} + R_i^{a} + \beta\, R_i^{c}.$$

In typical experiments the format and answer terms enter unweighted as above, while $\beta$ is varied; performance was found robust across the swept range of $\beta$, with slight gains at the upper end and overfitting to teacher behavior when $\beta$ is excessive.
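As a trivial illustration of the combined signal, the helper below sums the three components with a tunable conformance weight; the function name and the example values are illustrative, not figures from the paper.

```python
def pm4grpo_reward(r_format, r_answer, r_conformance, beta=1.0):
    """R(x, y_i) = R_f + R_a + β·R_c; β = 1 recovers the unweighted sum above."""
    return r_format + r_answer + beta * r_conformance

# e.g., well-formatted output, correct answer, moderately teacher-conformant trace
print(pm4grpo_reward(1.0, 1.0, 0.72))  # -> 2.72
```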
Training Loop (High-Level Pseudocode)
```
Initialize policy parameters θ ← θ₀
Freeze teacher policy π_R
for iteration in 1…N:
    Sample batch of queries {x_b} from dataset D
    for each x_b:
        for i in 1…G:
            y_i ∼ π_θ(·|x_b)                         # sample G reasoning sequences
            Compute R_i^f(y_i), R_i^a(y_i)           # format and answer rewards
            Extract σ_i(y_i) as event log
            M_i ← IM(σ_i(y_i))                       # Inductive Miner on student trace
            (fitness_i, precision_i) ← CC(M_i, σ_R)  # conformance vs. teacher trace
            R_i^c ← 2·fitness_i·precision_i / (fitness_i + precision_i)
            R(x_b, y_i) ← R_i^f + R_i^a + R_i^c
        μ_R ← mean over j of R(x_b, y_j)
        for i in 1…G:
            Â_i ← R(x_b, y_i) − μ_R                  # group-relative advantage
            r_i ← (π_θ(y_i|x_b)/π_{θ_old}(y_i|x_b))**(1/len(y_i))
            L_i ← min(r_i·Â_i, clip(r_i, 1−ε, 1+ε)·Â_i)
    θ ← θ + η·∇_θ (mean over all L_i)
    θ_old ← θ
```
4. Empirical Results Across Mathematical Reasoning Benchmarks
PA-GRPO was evaluated on five established mathematical reasoning benchmarks: MATH500, OlympiadBench, MinervaMath, AIME24, and AIME25. Models at the 7B and 1.5B parameter scales were compared against contemporary baselines, including R1-Distill-Qwen, DeepMath-Zero, Skywork-OR1, LEAD, DRGRPO, PRIME, P-GRPO, Graph-R1, STILL-3, and EXGRPO.
Held-out test accuracy (problem solved exactly) is reported below:
7B-Model Performance (accuracy %):
| Model | MATH500 | Olympiad | Minerva | AIME24 | AIME25 |
|---|---|---|---|---|---|
| R1-Distill-Qwen | 90.0 | 58.5 | 49.6 | 42.5 | 33.1 |
| DeepMath-Zero | 81.6 | 47.3 | 40.4 | 13.3 | 10.0 |
| Skywork-OR1 | 87.1 | 51.9 | 46.0 | 36.0 | 27.1 |
| LEAD | 84.6 | 52.3 | 47.4 | 40.0 | 26.7 |
| DRGRPO | 80.2 | 42.5 | 43.0 | 30.0 | 6.7 |
| PRIME | 79.2 | – | 38.6 | 26.7 | – |
| P-GRPO | 83.0 | – | 38.2 | 33.3 | – |
| PM4GRPO | 91.1 | 61.1 | 49.3 | 45.6 | 35.0 |
1.5B-Model Performance (accuracy %):
| Model | MATH500 | Olympiad | Minerva | AIME24 | AIME25 |
|---|---|---|---|---|---|
| R1-Distill-Qwen | 80.4 | 46.1 | 33.1 | 22.9 | 21.5 |
| Graph-R1 | 42.1 | 15.5 | 13.9 | 1.2 | 1.0 |
| STILL-3 | 83.4 | 51.0 | 36.5 | 29.2 | 23.5 |
| EXGRPO | 69.6 | 34.0 | 30.4 | 10.6 | 8.3 |
| PM4GRPO | 83.9 | 52.7 | 37.9 | 26.7 | 21.7 |
PM4GRPO attains the best or near-best accuracy across all benchmarks, with its largest margins at the 7B scale on the most challenging problem sets (AIME24/25).
5. Ablation and Sensitivity Analyses
Systematic ablations tested the contribution of the conformance reward. Disabling process alignment (i.e., removing the conformance term $R^c$) resulted in a performance drop of 1.8–3.2 percentage points on MATH500 and OlympiadBench, evidencing the benefit of process-aware signals.
Sensitivity sweeps over the conformance weight $\beta$ produced stable plateaus in performance, with:
- a lower $\beta$ setting: 90.2% (−0.9 pp vs. default)
- the default $\beta$: 91.1%
- a higher $\beta$ setting: 91.4% (+0.3 pp, with observed overfitting to teacher behavior and increased reasoning-chain length)
These results suggest that tuning $\beta$ within the swept range achieves robust performance without compromising generalization.
6. Limitations and Future Extensions
Conformance checking in PA-GRPO introduces computational overhead proportional to the square of trace length, presenting scalability challenges for very long reasoning chains. The dependency on a pretrained teacher for process mining constrains the reward to teacher-aligned reasoning, thus failing to incentivize novel but correct reasoning strategies outside the teacher’s style.
Proposed future directions include:
- Learnable Process Models: Transitioning from fixed IM + CC pipelines to differentiable process-model learners.
- Hierarchical Conformance: Applying process-alignment rewards at granular reasoning step levels, such as sub-theorem validation within proofs.
- Multi-teacher Aggregation: Incorporating multiple teacher traces to encourage diverse yet valid reasoning.
A plausible implication is that continued refinement of conformance metrics and process models could further broaden the expressive and generalization capabilities of LRMs under RL paradigms. PA-GRPO establishes the methodological value of aligning chain-of-thought with expert process logs in rigorous reasoning domains (Park et al., 29 Oct 2025).