Adversarially Trained PRMs (APRM)
- APRM reframes reward modeling as a two-player adversarial game in which a generator crafts plausible yet incorrect reasoning steps, enabling robust training without expensive human step-level annotation.
- APRM uses a dynamic adversarial curriculum: alternating optimization with PPO and OGDA progressively confronts the detector with harder negatives and improves generalization on complex tasks.
- Empirical results in math and science report higher solver accuracy and better cross-domain robustness than baseline PRMs such as ReST-MCTS.
Adversarially Trained PRMs (APRM) are a class of process reward models for LLMs in which reward modeling is framed as a two-player adversarial game between a generator that produces plausible but incorrect reasoning steps and a detector (process reward model, PRM) that learns to discriminate between correct and adversarially generated incorrect steps. This framework eliminates the need for expensive human step-level annotation by using algorithmic oracles for correctness, and leverages dynamic, adversarial curricula to drive improved generalization and robustness on complex reasoning tasks (Juneja et al., 28 Nov 2025).
1. Formal Framework: Game-Theoretic Formulation
APRM recasts PRM training as a two-player, general-sum, strongly monotone game:
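- Generator $G_\theta$: Given a valid reasoning step $s$ drawn from the data distribution $\mathcal{D}$, generates a perturbed step $\tilde{s}$. The generator's objective is to produce incorrect but plausible steps that can deceive the detector.
- Reward Model $R_\phi$: Receives the (original step, perturbed step) pair $(s, \tilde{s})$ and predicts a binary label $\hat{y}$ (correct/incorrect). The true correctness label $y$ is supplied by a symbolic oracle.
- Reward Definitions:
  - PRM reward $r_R$: positive if $\hat{y} = y$, 0 otherwise (reward to the PRM for correct detection).
  - Generator reward $r_G$: takes one of three values, one if $\tilde{s}$ is incorrect and $R_\phi$ labels it correct (generator fools the PRM), a second if $\tilde{s}$ is incorrect and $R_\phi$ flags it, and a third if $\tilde{s}$ is still correct (generator fails to corrupt).
- Utilities:
  - $U_G(\theta, \phi) = \mathbb{E}_{s \sim \mathcal{D},\ \tilde{s} \sim G_\theta(\cdot \mid s)}\left[ r_G \right]$
  - $U_R(\theta, \phi) = \mathbb{E}_{s \sim \mathcal{D},\ \tilde{s} \sim G_\theta(\cdot \mid s)}\left[ r_R \right]$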
- Regularization: KL-divergence to fixed reference policies and entropy regularizers are added to both objectives for exploration and stability.
- Optimization: Both players are trained alternately (using PPO and Optimistic Gradient Descent-Ascent, OGDA) to minimize their respective losses. Training converges to a Nash equilibrium in mixed strategies.
This adversarial setup ensures the PRM is confronted with increasingly subtle errors, challenging the model beyond the limitations of static, hand-annotated datasets (Juneja et al., 28 Nov 2025).
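The reward structure and regularized objectives described above can be sketched in a few lines of Python. This is an illustrative sketch only: the numeric reward values (`FOOL_REWARD`, `CAUGHT_REWARD`, `NO_CORRUPTION_REWARD`), the function names, and the weights `kl_weight` and `entropy_weight` are assumptions introduced here, not values taken from the paper.

```python
import math

# Hypothetical reward values, chosen only for illustration.
FOOL_REWARD = 1.0           # step is wrong and the PRM accepts it
CAUGHT_REWARD = -1.0        # step is wrong and the PRM flags it
NO_CORRUPTION_REWARD = 0.0  # step is still correct after perturbation


def prm_reward(predicted_correct: bool, oracle_correct: bool) -> float:
    """Reward the PRM when its label agrees with the oracle label."""
    return 1.0 if predicted_correct == oracle_correct else 0.0


def generator_reward(predicted_correct: bool, oracle_correct: bool) -> float:
    """Three-case generator reward: fooled PRM, caught, or failed to corrupt."""
    if oracle_correct:
        return NO_CORRUPTION_REWARD
    return FOOL_REWARD if predicted_correct else CAUGHT_REWARD


def regularized_utility(expected_reward, policy, reference,
                        kl_weight, entropy_weight):
    """Expected reward minus a KL penalty to a reference policy plus an
    entropy bonus, for discrete distributions given as probability lists."""
    kl = sum(p * math.log(p / q) for p, q in zip(policy, reference) if p > 0)
    entropy = -sum(p * math.log(p) for p in policy if p > 0)
    return expected_reward - kl_weight * kl + entropy_weight * entropy
```

With these illustrative values the two rewards do not sum to a constant, which is consistent with the general-sum framing: for example, when the generator fails to corrupt a step and the PRM labels it correct, the PRM is rewarded while the generator receives the neutral value.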
2. Model Architectures and Inputs
Both the generator $G_\theta$ and the reward model $R_\phi$ are parameterized as decoder-only transformers, initialized from a pretrained checkpoint (Llama-3.1-8B):
- Generator Inputs: Tokenized correct step $s$ plus a “perturb-this-step” prompt. Output is a corrupted or altered reasoning step $\tilde{s}$.
- Reward Model Inputs: Full question $q$, partial chain $s_{<t}$ up to the current step, and candidate step $s_t$. Output is a scalar logit for the probability that the candidate step is correct.
- No extra heads or adapters are used; the entire model is fine-tuned end-to-end.
The design allows for direct transfer and compatibility with a wide range of transformer backbone architectures (Juneja et al., 28 Nov 2025).
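As a rough illustration of the two input formats, the sketch below assembles text prompts for each model and maps a scalar logit to a probability. The prompt wording, the field labels, and the use of a sigmoid to interpret the logit are assumptions made for this example; the source does not specify the templates.

```python
import math

# Hypothetical instruction text; the actual prompt is not given in the source.
PERTURB_INSTRUCTION = (
    "Rewrite the step below so that it looks plausible but is incorrect."
)


def build_generator_prompt(correct_step: str) -> str:
    """Correct step plus a perturb-this-step instruction."""
    return f"{PERTURB_INSTRUCTION}\nStep: {correct_step}\nPerturbed step:"


def build_reward_model_prompt(question: str, previous_steps: list,
                              candidate_step: str) -> str:
    """Full question, the partial chain so far, and the candidate step."""
    lines = [f"Question: {question}"]
    lines += [f"Step {i}: {s}" for i, s in enumerate(previous_steps, start=1)]
    lines.append(f"Candidate step {len(previous_steps) + 1}: {candidate_step}")
    return "\n".join(lines)


def correctness_probability(logit: float) -> float:
    """Numerically stable sigmoid of the reward model's scalar logit."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)
```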
3. Training Algorithm and Curriculum
APRM employs a continually evolving adversarial curriculum built from two components:
- Alternating optimization: 5 PPO steps for each player alternately, freezing the other. The adversarial negative buffer maintains hard negatives across training and prevents forgetting.
- Emergent supervision: The generator produces adversarial steps judged by a symbolic oracle; the PRM is trained using a balanced mix of gold (human or algorithmic) and adversarial data.
This iterative curriculum generates a dynamic set of ever-harder negatives, facilitating robust step-level detection (Juneja et al., 28 Nov 2025).
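The bookkeeping behind this curriculum can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's algorithm: the names `update_schedule`, `HardNegativeBuffer`, and `balanced_batch` are hypothetical, the choice to let the generator move first is arbitrary, and the buffer here simply stores every oracle-incorrect step up to a fixed capacity. The PPO updates themselves are omitted.

```python
import random
from collections import deque


def update_schedule(num_rounds: int, steps_per_turn: int = 5):
    """Yield the player that takes each PPO step while the other is frozen."""
    for _ in range(num_rounds):
        for player in ("generator", "reward_model"):
            for _ in range(steps_per_turn):
                yield player


class HardNegativeBuffer:
    """Bounded store of adversarial steps that the oracle marked incorrect."""

    def __init__(self, capacity: int):
        self._items = deque(maxlen=capacity)

    def add(self, perturbed_step: str, oracle_correct: bool) -> None:
        if not oracle_correct:  # keep only genuine negatives
            self._items.append(perturbed_step)

    def sample(self, k: int, rng: random.Random) -> list:
        return rng.sample(list(self._items), min(k, len(self._items)))

    def __len__(self) -> int:
        return len(self._items)


def balanced_batch(gold_steps: list, buffer: HardNegativeBuffer,
                   batch_size: int, rng: random.Random) -> list:
    """Mix gold steps (label 1) and buffered adversarial steps (label 0)."""
    half = batch_size // 2
    positives = rng.sample(gold_steps, min(half, len(gold_steps)))
    negatives = buffer.sample(half, rng)
    batch = [(s, 1) for s in positives] + [(s, 0) for s in negatives]
    rng.shuffle(batch)
    return batch
```

Sampling PRM batches from the buffer rather than only from the generator's latest outputs is one simple way to keep older hard negatives in play, which matches the stated goal of preventing forgetting.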
4. Emergent Step-Level Supervision and Oracle Evaluation
Unlike conventional PRMs that require hand-labeled step-level correctness, APRM relies exclusively on:
- Algorithmic oracle supervision: Correctness labels are computed via symbolic equivalence checks, entity matching, or other non-LLM static oracles.
- Adversarial generator: The generator $G_\theta$ learns to craft increasingly subtle semantic or contextual errors not present in the pretraining distribution or static curated datasets.
- Fine-grained detection: The reward model $R_\phi$ must develop rich, contextualized step representations to separate “correct” steps from a growing diversity of “adversarial” negatives.
This process enables robust step-level supervision without manual annotation, and the reward model learns to recognize both syntactic and semantic invalidity (Juneja et al., 28 Nov 2025).
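To make the idea of a non-LLM correctness oracle concrete, here is a minimal sketch for purely arithmetic steps of the form `lhs = rhs`. It is an assumption-laden toy, not the oracle used in the paper: the function name `step_is_correct` is hypothetical, only `+`, `-`, `*`, `/` and numeric literals are supported, and exact rational arithmetic stands in for a full symbolic equivalence check.

```python
import ast
import operator
from fractions import Fraction

# Arithmetic operators the toy oracle is willing to evaluate.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST) -> Fraction:
    """Evaluate a parsed arithmetic expression exactly, rejecting anything else."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return Fraction(str(node.value))
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left, right = _evaluate(node.left), _evaluate(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")


def step_is_correct(step: str) -> bool:
    """Return True only if both sides of 'lhs = rhs' evaluate to the same value."""
    sides = step.split("=")
    if len(sides) != 2:
        return False
    try:
        left, right = (
            _evaluate(ast.parse(side.strip(), mode="eval").body) for side in sides
        )
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False
    return left == right
```

For example, `step_is_correct("1/3 + 1/6 = 1/2")` holds because both sides are compared as exact fractions, while a perturbed step such as `"3*(2+4) = 20"` is rejected. Steps the toy cannot parse are treated as incorrect, a conservative choice made for this sketch.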
5. Empirical Benchmarks and Performance
APRM demonstrates improved solver and detector performance across multiple domains:
| Setting | APRM gain (pp) | Baseline | Metric / context |
|---|---|---|---|
| Math (avg 5 tasks) | +3.4 | ReST-MCTS | Solver accuracy |
| OOD: JEEBench | +5.3 | ReST-MCTS | Solver accuracy |
| SciBench (GPT-OSS-20B) | +1.2 (63.0 vs 61.8) | ReST-MCTS | Cross-domain science |
| RL posttraining | +6.8 | ReST-MCTS, Outcome sup. | RL with GRPO |
- Ablations reveal entropy regularization and OGDA are critical for solver accuracy: removing both reduces performance by 6.8 percentage points.
- Generalization: APRM-PRMs achieve gains on out-of-distribution (OOD) tasks (e.g., JEEBench, cross-domain SciBench) not seen during training.
- Adversarial Curriculum: The generator produces hard negatives, including semantically valid but contextually false reasoning steps, that static data augmentation does not yield (Juneja et al., 28 Nov 2025).
6. Robustness, Generalization, and Theoretical Guarantees
APRM leverages strong monotonicity via regularized game dynamics and OGDA optimization:
- Nash equilibrium: Training converges to a stable, non-cycling fixed point in the space of generator-detector policies.
- Adaptive negative mining: Generator scale correlates with error diversity; larger models produce subtler errors, further enhancing PRM robustness.
- Cross-domain transfer: APRM reward models generalize to identifying non-obvious scientific errors (e.g., physics unit-conversion errors in chemistry problems) without domain-specific annotation.
- Theoretical stability: The Nash equilibrium property provides guarantees of convergence and robustness, in contrast to heuristic or purely supervised PRM pipelines (Juneja et al., 28 Nov 2025).
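The role of regularization and optimism can be seen on a toy problem that is unrelated to language models. The sketch below, an illustration rather than the paper's training procedure, runs plain gradient descent-ascent (GDA) and OGDA on the scalar game $f(x, y) = xy + \tfrac{\mu}{2}x^2 - \tfrac{\mu}{2}y^2$, where $x$ minimizes and $y$ maximizes. The function name `simulate`, the step size, and the value of $\mu$ are assumptions chosen for the demonstration; $\mu > 0$ plays the part of the regularizer that makes the game strongly monotone.

```python
import math


def simulate(mu: float, optimistic: bool, steps: int = 2000,
             eta: float = 0.1) -> float:
    """Run GDA or OGDA on f(x, y) = x*y + (mu/2)*x**2 - (mu/2)*y**2.

    The unique equilibrium is (0, 0); the return value is the final
    distance to it, starting from (1, 1).
    """
    x, y = 1.0, 1.0
    # Initialize the "previous" gradients so the first OGDA step is plain GDA.
    prev_grad_x, prev_grad_y = y + mu * x, x - mu * y
    for _ in range(steps):
        grad_x = y + mu * x   # df/dx
        grad_y = x - mu * y   # df/dy
        if optimistic:
            # OGDA: extrapolate using the previous gradient.
            step_x = 2.0 * grad_x - prev_grad_x
            step_y = 2.0 * grad_y - prev_grad_y
        else:
            step_x, step_y = grad_x, grad_y
        x, y = x - eta * step_x, y + eta * step_y  # descent on x, ascent on y
        prev_grad_x, prev_grad_y = grad_x, grad_y
    return math.hypot(x, y)


gda_unregularized = simulate(mu=0.0, optimistic=False)
ogda_regularized = simulate(mu=0.1, optimistic=True)
```

On the unregularized bilinear game, plain GDA spirals away from the equilibrium, so `gda_unregularized` ends up far larger than the starting distance. With the regularizer and the optimistic update, `ogda_regularized` shrinks toward zero, which is the non-cycling behavior the convergence claims above refer to.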
7. Significance and Extensions
APRM establishes a data-efficient paradigm for robust step-level validation in complex reasoning tasks, removing dependence on human annotation and static datasets. Its key contributions include:
- Dynamic, adversarially constructed training curricula, yielding a “moving target” for the reward model.
- Empirically validated gains in both mathematical and scientific domains, including OOD robustness.
- General applicability to any process reward model context where step-level perturbations can be algorithmically evaluated.
- Theoretical underpinnings rooted in monotone games and convergence under OGDA, supporting further research into scalable, stable adversarial training for model-based oversight (Juneja et al., 28 Nov 2025).
A plausible implication is that adversarial training of PRMs represents a scalable pathway to robust automated oversight of LLM-based reasoners, particularly in domains with limited annotated step-level data and evolving error distributions.