
Process-Aware Group Relative Policy Optimization

Updated 19 November 2025
  • PA-GRPO is a reinforcement learning framework that combines process mining with group relative policy optimization to enhance multi-step reasoning in large reasoning models.
  • It supplements standard correctness and formatting rewards with a conformance signal derived from event log alignment between student and teacher models.
  • Empirical evaluations show PA-GRPO outperforms conventional GRPO post-training methods on mathematical reasoning benchmarks, with performance robust to the choice of the conformance weight β across a broad range.

Process-Aware Group Relative Policy Optimization (PA-GRPO), interchangeably referred to as PM4GRPO, is a reinforcement learning (RL) post-training framework targeting the enhancement of large reasoning models (LRMs) for multi-step tasks. Distinct from outcome-centric approaches, PA-GRPO integrates process mining techniques to supplement standard correctness and format-driven rewards with an additional conformance signal reflecting the procedural similarity of model reasoning to a pretrained teacher. This scalar conformance reward leverages event log analysis and process alignment metrics to quantify and incentivize expert-like reasoning traces. Empirical results demonstrate that PA-GRPO outperforms conventional GRPO post-training methods, particularly on challenging mathematical reasoning benchmarks (Park et al., 29 Oct 2025).

1. Formal Foundations: From PPO to GSPO and PA-GRPO

Standard Proximal Policy Optimization (PPO) maximizes a clipped surrogate objective at the token level:

$$\mathcal{L}^{\rm PPO}(\theta) = \mathbb{E}_{x,y \sim \pi_{\theta_{\rm old}}}\Big[\min\big(r(\theta)A,\ \mathrm{clip}(r(\theta),1-\epsilon,1+\epsilon)A\big)\Big]$$

where $r(\theta)$ is the likelihood ratio and $A$ is the advantage estimate. Optionally, PPO can be regularized with a KL-divergence constraint.
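
To make the clipping concrete, the following minimal NumPy sketch evaluates the clipped surrogate term for a batch of token-level ratios and advantages; the function and variable names are illustrative, not taken from the paper.

import numpy as np

def ppo_clipped_surrogate(ratio, advantage, eps=0.2):
    """Token-level PPO term: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

# Illustrative ratios and advantages for three tokens.
ratio = np.array([0.9, 1.3, 1.05])
advantage = np.array([0.5, -0.2, 1.0])
objective_term = ppo_clipped_surrogate(ratio, advantage)  # scalar to be maximized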

Group Sequence Policy Optimization (GSPO), inspired by the group-based post-training popularized by DeepSeek-R1, generalizes PPO by operating at the sequence rather than the token level. For a query $x$, $G$ sampled reasoning sequences $\{y_i\}_{i=1}^G$ receive length-normalized, group-wise importance ratios:

$$r_i(\theta) = \left( \frac{\pi_{\theta}(y_i \mid x)}{\pi_{\theta_{\rm old}}(y_i \mid x)} \right)^{1 / |y_i|}$$

The corresponding objective uses a group-relative advantage $\widehat{A}_i$:

$$\widehat{A}_i = R(x, y_i) - \frac{1}{G} \sum_{j=1}^{G} R(x, y_j)$$

GSPO's surrogate is

$$\mathcal{J}_{\rm GSPO}(\theta) = \mathbb{E}_{x, \{y_i\}} \left[ \frac{1}{G} \sum_{i=1}^G \min \left( r_i(\theta)\widehat{A}_i,\ \mathrm{clip}(r_i(\theta), 1 - \epsilon, 1 + \epsilon)\widehat{A}_i \right) \right]$$

By moving optimization to the sequence level, GSPO aligns the reward structure with long reasoning chains, improving stability and properly weighting off-policy samples.
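
The two quantities defined above can be illustrated directly; this is a small NumPy sketch over per-sequence summed log-probabilities and rewards, with illustrative values that are not from the paper.

import numpy as np

def group_relative_advantages(rewards):
    """A_hat_i = R(x, y_i) - mean_j R(x, y_j) over a group of G sequences."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

def sequence_importance_ratio(logp_new, logp_old, seq_len):
    """Length-normalized ratio (pi_theta(y|x) / pi_theta_old(y|x))^(1/|y|),
    computed from summed token log-probabilities for numerical stability."""
    return np.exp((logp_new - logp_old) / seq_len)

# A group of G = 3 sampled sequences for one query.
rewards = [2.4, 1.1, 2.9]                    # R(x, y_i), e.g. R_f + R_a + R_c
advantages = group_relative_advantages(rewards)
r_1 = sequence_importance_ratio(logp_new=-120.3, logp_old=-121.0, seq_len=200)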

2. Process Mining Integration and Conformance Reward Construction

Conventional GRPO methods use reward signals based only on final correctness or format. PA-GRPO integrates a conformance reward by comparing the chain-of-thought (CoT) traces of student and teacher models using process mining.

Given a query $x$, the student policy produces a reasoning trace $\sigma_i(\pi_\theta) = \langle a^i_1, a^i_2, \ldots, a^i_{T_i} \rangle$ and the teacher provides a reference trace $\sigma_R$. Both are treated as event logs. The Inductive Miner (IM) extracts a process model $\mathcal{M}_i$ from the student log, and conformance checking (CC) aligns this model with the teacher's log.
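
As an illustration, a chain-of-thought trace can be flattened into a pm4py-compatible event log (one row per reasoning step). The step labels and the conversion below are assumptions for demonstration only; the paper does not specify this exact encoding.

import pandas as pd

def trace_to_event_log(steps, case_id):
    """Turn an ordered list of reasoning-step labels into an event-log
    DataFrame with standard XES-style column names."""
    return pd.DataFrame({
        "case:concept:name": [case_id] * len(steps),   # one case per trace
        "concept:name": steps,                          # activity labels
        "time:timestamp": pd.Timestamp("2025-01-01")
                          + pd.to_timedelta(range(len(steps)), unit="s"),
    })

# Hypothetical step labels extracted from student and teacher chains of thought.
student_log = trace_to_event_log(
    ["restate_problem", "set_up_equation", "solve", "verify", "answer"], "student_1")
teacher_log = trace_to_event_log(
    ["restate_problem", "set_up_equation", "solve", "answer"], "teacher_ref")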

Alignment-based conformance yields two metrics per sequence:

$$(\text{fitness}_i,\ \text{precision}_i) = \mathrm{CC}\big(\mathrm{IM}(\sigma_i(\pi_\theta)),\ \sigma_R\big)$$

Here, fitness quantifies how accurately the reference trace is reproduced, and precision penalizes extra, unreferenced behavior allowed by $\mathcal{M}_i$. The two are merged with an F1-style metric:

$$R_i^c = \mathrm{F1}(\text{fitness}_i,\ \text{precision}_i) = \frac{2 \cdot \text{fitness}_i \cdot \text{precision}_i}{\text{fitness}_i + \text{precision}_i}$$

which forms the core process-aware signal in PA-GRPO.
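
A minimal sketch of this computation using pm4py's simplified interface is given below, continuing from the event logs constructed above. The key names returned by fitness_alignments differ across pm4py versions, so the extraction is hedged accordingly.

import pm4py

def conformance_reward(student_log, teacher_log):
    """R_i^c = F1(fitness, precision): mine a model from the student trace,
    then align the teacher's reference log against it."""
    # Inductive Miner: discover a Petri net from the student event log.
    net, im, fm = pm4py.discover_petri_net_inductive(student_log)

    # Alignment-based conformance of the teacher log against that model.
    fit = pm4py.fitness_alignments(teacher_log, net, im, fm)
    fitness = fit.get("averageFitness", fit.get("average_trace_fitness", 0.0))
    precision = pm4py.precision_alignments(teacher_log, net, im, fm)

    if fitness + precision == 0:
        return 0.0
    return 2 * fitness * precision / (fitness + precision)

# r_c = conformance_reward(student_log, teacher_log)  # scalar in [0, 1]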

3. Combined Reward Formulation and Training Workflow

The complete PM4GRPO reward for each generated reasoning sequence combines three elements:

$$R(x, y_i) = R_i^f + R_i^a + R_i^c$$

where $R_i^f$ is a format reward, $R_i^a$ is answer correctness, and $R_i^c$ is process conformance. Generalizing, one can write

$$R_{\rm total} = \alpha R_{\rm answer} + \beta R_{\rm process}$$

Typical experiments set $\alpha = 1$ and $\beta = 1$; performance is robust for $\beta$ across $[0.5, 2.0]$, with slight gains at the upper end and overfitting to teacher behavior when $\beta$ is made excessively large.
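
The weighting can be written as a one-line helper; this is a minimal sketch that keeps the format term alongside the generalized α/β weighting, with the defaults matching the $\alpha=\beta=1$ setting reported above.

def pm4grpo_reward(r_format, r_answer, r_conformance, alpha=1.0, beta=1.0):
    """Combined sequence reward: R_f + alpha * R_a + beta * R_c.
    With alpha = beta = 1 this reduces to R_f + R_a + R_c."""
    return r_format + alpha * r_answer + beta * r_conformance

# Example: well-formatted output, correct answer, conformance F1 of 0.8.
reward = pm4grpo_reward(1.0, 1.0, 0.8)  # 2.8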

Training Loop (High-Level Pseudocode)

Initialize policy parameters θ ← θ_init
Freeze teacher policy π_R

for iteration in 1..N:
    Sample batch of queries {x_b} from dataset D
    for each x_b:
        for i in 1..G:
            y_i ~ π_θ(·|x_b)
            Compute R_i^f(y_i), R_i^a(y_i)
            Extract σ_i(y_i) as event log
            M_i ← IM(σ_i(y_i))
            (fitness_i, precision_i) ← CC(M_i, σ_R)
            R_i^c ← 2·fitness_i·precision_i / (fitness_i + precision_i)
            R(x_b, y_i) = R_i^f + R_i^a + R_i^c
        μ_R ← mean over R(x_b, y_j), j = 1..G
        for i in 1..G:
            Â_i ← R(x_b, y_i) − μ_R
            r_i ← (π_θ(y_i|x_b) / π_{θ_old}(y_i|x_b))^(1/len(y_i))
            L_i ← min(r_i·Â_i, clip(r_i, 1−ε, 1+ε)·Â_i)
    θ ← θ + η·∇_θ (mean over all L_i)
    θ_old ← θ
The conformance reward is computed post-sequence, and all rewards remain at the sequence level.
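
For concreteness, here is a PyTorch-flavored sketch of the sequence-level update for one query, given per-sequence summed log-probabilities and precomputed rewards; tensor names, shapes, and values are assumptions rather than the authors' implementation.

import torch

def pm4grpo_loss(logp_new, logp_old, seq_lens, rewards, eps=0.2):
    """Sequence-level clipped surrogate over a group of G samples.
    logp_new / logp_old: summed token log-probs under the current and old
    policies; rewards: R_f + R_a + R_c per sequence."""
    adv = rewards - rewards.mean()                         # group-relative advantage
    ratio = torch.exp((logp_new - logp_old) / seq_lens)    # length-normalized ratio
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)
    return -surrogate.mean()                               # negate to minimize

# A group of G = 3 sequences with illustrative numbers.
loss = pm4grpo_loss(
    logp_new=torch.tensor([-118.0, -241.5, -97.2], requires_grad=True),
    logp_old=torch.tensor([-119.1, -240.0, -98.0]),
    seq_lens=torch.tensor([200.0, 410.0, 155.0]),
    rewards=torch.tensor([2.8, 1.0, 2.3]),
)
loss.backward()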

4. Empirical Results Across Mathematical Reasoning Benchmarks

PA-GRPO was evaluated on five established mathematical reasoning benchmarks: MATH500, OlympiadBench, MinervaMath, AIME24, and AIME25. Models at the 7B and 1.5B parameter scales were compared against contemporary baselines, including R1-Distill-Qwen, DeepMath-Zero, Skywork-OR1, LEAD, DRGRPO, PRIME, P-GRPO, Graph-R1, STILL-3, and EXGRPO.

Held-out test accuracy (exact answer match) is reported below:

7B-Model Performance (accuracy %):

Model MATH500 Olympiad Minerva AIME24 AIME25
R1-Distill-Qwen 90.0 58.5 49.6 42.5 33.1
DeepMath-Zero 81.6 47.3 40.4 13.3 10.0
Skywork-OR1 87.1 51.9 46.0 36.0 27.1
LEAD 84.6 52.3 47.4 40.0 26.7
DRGRPO 80.2 42.5 43.0 30.0 6.7
PRIME 79.2 38.6 26.7
P-GRPO 83.0 38.2 33.3
PM4GRPO (ours) 91.1 61.1 49.3 45.6 35.0

1.5B-Model Performance (accuracy %):

Model MATH500 Olympiad Minerva AIME24 AIME25
R1-Distill-Qwen 80.4 46.1 33.1 22.9 21.5
Graph-R1 42.1 15.5 13.9 1.2 1.0
STILL-3 83.4 51.0 36.5 29.2 23.5
EXGRPO 69.6 34.0 30.4 10.6 8.3
PM4GRPO (ours) 83.9 52.7 37.9 26.7 21.7

PM4GRPO attains the best or near-best accuracy on every benchmark, with its clearest advantage at the 7B scale on the most challenging problem sets (AIME24/25).

5. Ablation and Sensitivity Analyses

Systematic ablations tested the contribution of the conformance reward. Disabling process alignment ($\beta = 0$) resulted in a performance drop of 1.8–3.2 percentage points on MATH500 and OlympiadBench, evidencing the benefit of process-aware signals.

Sensitivity sweeps over $\beta$ produced stable plateaus in performance, with:

  • $\beta = 0.5$: 90.2% (–0.9 pp vs. default)
  • $\beta = 1.0$: 91.1% (default)
  • $\beta = 2.0$: 91.4% (+0.3 pp, with observed overfitting to teacher behavior and longer reasoning chains)

These results suggest that tuning $\beta$ within $[1, 2]$ achieves robust performance without compromising generalization.

6. Limitations and Future Extensions

Conformance checking in PA-GRPO introduces computational overhead proportional to the square of trace length, presenting scalability challenges for very long reasoning chains. The dependency on a pretrained teacher for process mining constrains the reward to teacher-aligned reasoning, thus failing to incentivize novel but correct reasoning strategies outside the teacher’s style.

Proposed future directions include:

  1. Learnable Process Models: Transitioning from fixed IM + CC pipelines to differentiable process-model learners.
  2. Hierarchical Conformance: Applying process-alignment rewards at granular reasoning step levels, such as sub-theorem validation within proofs.
  3. Multi-teacher Aggregation: Incorporating multiple teacher traces to encourage diverse yet valid reasoning.

A plausible implication is that continued refinement of conformance metrics and process models could further broaden the expressive and generalization capabilities of LRMs under RL paradigms. PA-GRPO establishes the methodological value of aligning chain-of-thought with expert process logs in rigorous reasoning domains (Park et al., 29 Oct 2025).
