
Process-Aware Group Relative Policy Optimization

Updated 19 November 2025
  • PA-GRPO is a reinforcement learning framework that combines process mining with group relative policy optimization to enhance multi-step reasoning in large reasoning models.
  • It supplements standard correctness and formatting rewards with a conformance signal derived from event log alignment between student and teacher models.
  • Empirical evaluations show PA-GRPO outperforms conventional GRPO post-training methods on mathematical reasoning benchmarks, with performance robust to the choice of the conformance weight β across a broad range.

Process-Aware Group Relative Policy Optimization (PA-GRPO), interchangeably referred to as PM4GRPO, is a reinforcement learning (RL) post-training framework targeting the enhancement of large reasoning models (LRMs) for multi-step tasks. Distinct from outcome-centric approaches, PA-GRPO integrates process mining techniques to supplement standard correctness and format-driven rewards with an additional conformance signal reflecting the procedural similarity of model reasoning to a pretrained teacher. This scalar conformance reward leverages event log analysis and process alignment metrics to quantify and incentivize expert-like reasoning traces. Empirical results demonstrate that PA-GRPO outperforms conventional GRPO post-training methods, particularly on challenging mathematical reasoning benchmarks (Park et al., 29 Oct 2025).

1. Formal Foundations: From PPO to GSPO and PA-GRPO

Standard Proximal Policy Optimization (PPO) maximizes a clipped surrogate objective at the token level:

$$\mathcal{L}^{\rm PPO}(\theta) = \mathbb{E}_{x,y \sim \pi_{\theta_{\rm old}}}\Big[\min\big(r(\theta)A,\ \mathrm{clip}(r(\theta),1-\epsilon,1+\epsilon)A\big)\Big]$$

where $r(\theta)$ is the likelihood ratio and $A$ is the advantage estimate. Optionally, PPO can be regularized with a KL-divergence constraint.
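
To make the clipping concrete, the following minimal NumPy sketch evaluates the clipped surrogate term for a batch of token-level ratios and advantages; the function and variable names are illustrative, not taken from the paper.

import numpy as np

def ppo_clipped_surrogate(ratio, advantage, eps=0.2):
    """Token-level PPO term: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

# Illustrative ratios and advantages for three tokens.
ratio = np.array([0.9, 1.3, 1.05])
advantage = np.array([0.5, -0.2, 1.0])
objective_term = ppo_clipped_surrogate(ratio, advantage)  # scalar to be maximized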

Group Sequence Policy Optimization (GSPO), inspired by the group-based post-training popularized by DeepSeek-R1, generalizes PPO by operating at the sequence rather than the token level. For a query $x$, $G$ sampled reasoning sequences $\{y_i\}_{i=1}^G$ receive length-normalized, group-wise importance ratios:

$$r_i(\theta) = \left( \frac{\pi_{\theta}(y_i \mid x)}{\pi_{\theta_{\rm old}}(y_i \mid x)} \right)^{1 / |y_i|}$$

The corresponding objective uses a group-relative advantage $\widehat{A}_i$:

$$\widehat{A}_i = R(x, y_i) - \frac{1}{G} \sum_{j=1}^{G} R(x, y_j)$$

GSPO's surrogate is

$$\mathcal{J}_{\rm GSPO}(\theta) = \mathbb{E}_{x, \{y_i\}} \left[ \frac{1}{G} \sum_{i=1}^G \min \left( r_i(\theta)\widehat{A}_i,\ \mathrm{clip}(r_i(\theta), 1 - \epsilon, 1 + \epsilon)\widehat{A}_i \right) \right]$$

By moving optimization to the sequence level, GSPO aligns the reward structure with long reasoning chains, improving stability and properly weighting off-policy samples.
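
The two quantities defined above can be illustrated directly; this is a small NumPy sketch over per-sequence summed log-probabilities and rewards, with illustrative values that are not from the paper.

import numpy as np

def group_relative_advantages(rewards):
    """A_hat_i = R(x, y_i) - mean_j R(x, y_j) over a group of G sequences."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

def sequence_importance_ratio(logp_new, logp_old, seq_len):
    """Length-normalized ratio (pi_theta(y|x) / pi_theta_old(y|x))^(1/|y|),
    computed from summed token log-probabilities for numerical stability."""
    return np.exp((logp_new - logp_old) / seq_len)

# A group of G = 3 sampled sequences for one query.
rewards = [2.4, 1.1, 2.9]                    # R(x, y_i), e.g. R_f + R_a + R_c
advantages = group_relative_advantages(rewards)
r_1 = sequence_importance_ratio(logp_new=-120.3, logp_old=-121.0, seq_len=200)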

2. Process Mining Integration and Conformance Reward Construction

Conventional GRPO methods use reward signals based only on final correctness or format. PA-GRPO integrates a conformance reward by comparing the chain-of-thought (CoT) traces of student and teacher models using process mining.

Given a query $x$, the student policy produces a reasoning trace $\sigma_i(\pi_\theta) = \langle a^i_1, a^i_2, \ldots, a^i_{T_i} \rangle$ and the teacher provides a reference trace $\sigma_R$. Both are treated as event logs. The Inductive Miner (IM) extracts a process model $\mathcal{M}_i$ from the student log, and conformance checking (CC) aligns this model with the teacher's log.
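
As an illustration, a chain-of-thought trace can be flattened into a pm4py-compatible event log (one row per reasoning step). The step labels and the conversion below are assumptions for demonstration only; the paper does not specify this exact encoding.

import pandas as pd

def trace_to_event_log(steps, case_id):
    """Turn an ordered list of reasoning-step labels into an event-log
    DataFrame with standard XES-style column names."""
    return pd.DataFrame({
        "case:concept:name": [case_id] * len(steps),   # one case per trace
        "concept:name": steps,                          # activity labels
        "time:timestamp": pd.Timestamp("2025-01-01")
                          + pd.to_timedelta(range(len(steps)), unit="s"),
    })

# Hypothetical step labels extracted from student and teacher chains of thought.
student_log = trace_to_event_log(
    ["restate_problem", "set_up_equation", "solve", "verify", "answer"], "student_1")
teacher_log = trace_to_event_log(
    ["restate_problem", "set_up_equation", "solve", "answer"], "teacher_ref")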

Alignment-based conformance yields two metrics per sequence:

$$(\text{fitness}_i,\ \text{precision}_i) = \mathrm{CC}\big(\mathrm{IM}(\sigma_i(\pi_\theta)),\ \sigma_R\big)$$

Here, fitness quantifies how accurately the reference trace is reproduced, and precision penalizes extra, unreferenced behavior allowed by $\mathcal{M}_i$. The two are merged with an F1-style metric:

$$R_i^c = \mathrm{F1}(\text{fitness}_i,\ \text{precision}_i) = \frac{2 \cdot \text{fitness}_i \cdot \text{precision}_i}{\text{fitness}_i + \text{precision}_i}$$

which forms the core process-aware signal in PA-GRPO.
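
A minimal sketch of this computation using pm4py's simplified interface is given below, continuing from the event logs constructed above. The key names returned by fitness_alignments differ across pm4py versions, so the extraction is hedged accordingly.

import pm4py

def conformance_reward(student_log, teacher_log):
    """R_i^c = F1(fitness, precision): mine a model from the student trace,
    then align the teacher's reference log against it."""
    # Inductive Miner: discover a Petri net from the student event log.
    net, im, fm = pm4py.discover_petri_net_inductive(student_log)

    # Alignment-based conformance of the teacher log against that model.
    fit = pm4py.fitness_alignments(teacher_log, net, im, fm)
    fitness = fit.get("averageFitness", fit.get("average_trace_fitness", 0.0))
    precision = pm4py.precision_alignments(teacher_log, net, im, fm)

    if fitness + precision == 0:
        return 0.0
    return 2 * fitness * precision / (fitness + precision)

# r_c = conformance_reward(student_log, teacher_log)  # scalar in [0, 1]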

3. Combined Reward Formulation and Training Workflow

The complete PM4GRPO reward for each generated reasoning sequence combines three elements:

$$R(x, y_i) = R_i^f + R_i^a + R_i^c$$

where $R_i^f$ is a format reward, $R_i^a$ is answer correctness, and $R_i^c$ is process conformance. Generalizing, one can write

$$R_{\rm total} = \alpha R_{\rm answer} + \beta R_{\rm process}$$

Typical experiments set $\alpha = 1$ and $\beta = 1$; performance is robust for $\beta$ across $[0.5, 2.0]$, with slight gains at the upper end and overfitting to teacher behavior when $\beta$ is made excessively large.
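
The weighting can be written as a one-line helper; this is a minimal sketch that keeps the format term alongside the generalized α/β weighting, with the defaults matching the $\alpha=\beta=1$ setting reported above.

def pm4grpo_reward(r_format, r_answer, r_conformance, alpha=1.0, beta=1.0):
    """Combined sequence reward: R_f + alpha * R_a + beta * R_c.
    With alpha = beta = 1 this reduces to R_f + R_a + R_c."""
    return r_format + alpha * r_answer + beta * r_conformance

# Example: well-formatted output, correct answer, conformance F1 of 0.8.
reward = pm4grpo_reward(1.0, 1.0, 0.8)  # 2.8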

Training Loop (High-Level Pseudocode)

Initialize policy parameters θ ← θ_init
Freeze teacher policy π_R

for iteration in 1..N:
    Sample batch of queries {x_b} from dataset D
    for each x_b:
        for i in 1..G:
            y_i ~ π_θ(·|x_b)
            Compute R_i^f(y_i), R_i^a(y_i)
            Extract σ_i(y_i) as event log
            M_i ← IM(σ_i(y_i))
            (fitness_i, precision_i) ← CC(M_i, σ_R)
            R_i^c ← 2·fitness_i·precision_i / (fitness_i + precision_i)
            R(x_b, y_i) = R_i^f + R_i^a + R_i^c
        μ_R ← mean over R(x_b, y_j), j = 1..G
        for i in 1..G:
            Â_i ← R(x_b, y_i) − μ_R
            r_i ← (π_θ(y_i|x_b) / π_{θ_old}(y_i|x_b))^(1/len(y_i))
            L_i ← min(r_i·Â_i, clip(r_i, 1−ε, 1+ε)·Â_i)
    θ ← θ + η·∇_θ (mean over all L_i)
    θ_old ← θ
The conformance reward is computed post-sequence, and all rewards remain at the sequence level.
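
For concreteness, here is a PyTorch-flavored sketch of the sequence-level update for one query, given per-sequence summed log-probabilities and precomputed rewards; tensor names, shapes, and values are assumptions rather than the authors' implementation.

import torch

def pm4grpo_loss(logp_new, logp_old, seq_lens, rewards, eps=0.2):
    """Sequence-level clipped surrogate over a group of G samples.
    logp_new / logp_old: summed token log-probs under the current and old
    policies; rewards: R_f + R_a + R_c per sequence."""
    adv = rewards - rewards.mean()                         # group-relative advantage
    ratio = torch.exp((logp_new - logp_old) / seq_lens)    # length-normalized ratio
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)
    return -surrogate.mean()                               # negate to minimize

# A group of G = 3 sequences with illustrative numbers.
loss = pm4grpo_loss(
    logp_new=torch.tensor([-118.0, -241.5, -97.2], requires_grad=True),
    logp_old=torch.tensor([-119.1, -240.0, -98.0]),
    seq_lens=torch.tensor([200.0, 410.0, 155.0]),
    rewards=torch.tensor([2.8, 1.0, 2.3]),
)
loss.backward()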

4. Empirical Results Across Mathematical Reasoning Benchmarks

PA-GRPO was evaluated on five established mathematical reasoning benchmarks: MATH500, OlympiadBench, MinervaMath, AIME24, and AIME25. Models at the 7B and 1.5B parameter scales were compared against contemporary baselines, including R1-Distill-Qwen, DeepMath-Zero, Skywork-OR1, LEAD, DRGRPO, PRIME, P-GRPO, Graph-R1, STILL-3, and EXGRPO.

Held-out test accuracy (exact answer match) is reported below:

7B-Model Performance (accuracy %):

Model MATH500 Olympiad Minerva AIME24 AIME25
R1-Distill-Qwen 90.0 58.5 49.6 42.5 33.1
DeepMath-Zero 81.6 47.3 40.4 13.3 10.0
Skywork-OR1 87.1 51.9 46.0 36.0 27.1
LEAD 84.6 52.3 47.4 40.0 26.7
DRGRPO 80.2 42.5 43.0 30.0 6.7
PRIME 79.2 38.6 26.7
P-GRPO 83.0 38.2 33.3
PM4GRPO (ours) 91.1 61.1 49.3 45.6 35.0

1.5B-Model Performance (accuracy %):

Model MATH500 Olympiad Minerva AIME24 AIME25
R1-Distill-Qwen 80.4 46.1 33.1 22.9 21.5
Graph-R1 42.1 15.5 13.9 1.2 1.0
STILL-3 83.4 51.0 36.5 29.2 23.5
EXGRPO 69.6 34.0 30.4 10.6 8.3
PM4GRPO (ours) 83.9 52.7 37.9 26.7 21.7

PM4GRPO attains the best or near-best accuracy on every benchmark, with its clearest advantage at the 7B scale on the most challenging problem sets (AIME24/25).

5. Ablation and Sensitivity Analyses

Systematic ablations tested the contribution of the conformance reward. Disabling process alignment ($\beta = 0$) resulted in a performance drop of 1.8–3.2 percentage points on MATH500 and OlympiadBench, evidencing the benefit of process-aware signals.

Sensitivity sweeps over $\beta$ produced stable plateaus in performance, with:

  • $\beta = 0.5$: 90.2% (–0.9 pp vs. default)
  • $\beta = 1.0$: 91.1% (default)
  • $\beta = 2.0$: 91.4% (+0.3 pp, with observed overfitting to teacher behavior and longer reasoning chains)

These results suggest that tuning $\beta$ within $[1, 2]$ achieves robust performance without compromising generalization.

6. Limitations and Future Extensions

Conformance checking in PA-GRPO introduces computational overhead proportional to the square of trace length, presenting scalability challenges for very long reasoning chains. The dependency on a pretrained teacher for process mining constrains the reward to teacher-aligned reasoning, thus failing to incentivize novel but correct reasoning strategies outside the teacher’s style.

Proposed future directions include:

  1. Learnable Process Models: Transitioning from fixed IM + CC pipelines to differentiable process-model learners.
  2. Hierarchical Conformance: Applying process-alignment rewards at granular reasoning step levels, such as sub-theorem validation within proofs.
  3. Multi-teacher Aggregation: Incorporating multiple teacher traces to encourage diverse yet valid reasoning.

A plausible implication is that continued refinement of conformance metrics and process models could further broaden the expressive and generalization capabilities of LRMs under RL paradigms. PA-GRPO establishes the methodological value of aligning chain-of-thought with expert process logs in rigorous reasoning domains (Park et al., 29 Oct 2025).
