- The paper presents a novel framework that reframes stepwise reward modeling as a meta-reasoning task by training generative chain-of-thought judges with reinforcement learning.
- The paper details a self-segmentation method combined with Monte Carlo Q-value estimation to accurately label and evaluate reasoning chunks.
- The paper demonstrates improved error detection and multi-step reasoning accuracy on ProcessBench, significantly outperforming discriminative baselines.
StepWiser: Stepwise Generative Judges for Wiser Reasoning
The paper addresses a central challenge in the alignment and training of LLMs for complex multi-step reasoning tasks: the supervision of intermediate reasoning steps. While process reward models (PRMs) have been proposed to provide stepwise feedback, existing approaches are predominantly discriminative classifiers trained via supervised fine-tuning (SFT) on static datasets. These methods lack explanatory power and exhibit limited generalization to novel reasoning patterns. The authors propose reframing stepwise reward modeling as a reasoning task, introducing a generative judge that reasons about the policy model's reasoning steps (meta-reasoning) and outputs explicit chain-of-thought (CoT) rationales before delivering a verdict. The StepWiser framework is trained via reinforcement learning (RL) using dense, stepwise signals derived from Monte Carlo rollouts.
StepWiser Methodology
The StepWiser pipeline consists of three key components:
- Self-Segmentation of Reasoning Trajectories: The base policy model is fine-tuned to segment its CoT into coherent, logically complete chunks (chunks-of-thought), using SFT data generated by LLMs prompted with explicit segmentation rules. This produces more informative steps and reduces the number of segments per trajectory, which is critical for efficient annotation and evaluation.
- Stepwise Data Annotation via Monte Carlo Q-Value Estimation: For each chunk, the expected final reward (Q-value) is estimated by generating multiple rollouts from that point and computing the average success rate. Binary labels are assigned to each chunk using absolute Q-value thresholding (Abs-Q), relative improvement (Rel-Ratio), or effective reward (Rel-Effective), capturing both correctness and progress in reasoning.
- Online RL Training of Generative Judges: The judge model is trained to generate a CoT analysis of each chunk and then output a final judgment. The reward signal is 1 if the judgment matches the Monte Carlo-derived label, 0 otherwise. Training is performed online using GRPO, with prompt dataset balancing to mitigate class imbalance and stabilize learning.
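The annotation step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, thresholds, and the exact inequalities for the three labeling schemes are illustrative choices of this summary.

```python
def mc_q_value(rollout_fn, prefix, num_rollouts=8):
    """Estimate the Q-value of a reasoning prefix as the empirical
    success rate of num_rollouts continuations sampled from it.
    rollout_fn(prefix) should return True if a sampled completion
    reaches the correct final answer."""
    wins = sum(bool(rollout_fn(prefix)) for _ in range(num_rollouts))
    return wins / num_rollouts

def label_chunk(q_prev, q_curr, scheme="rel_effective",
                abs_threshold=0.5, ratio_threshold=0.8):
    """Binary chunk label from consecutive Q-values. The three schemes
    mirror the paper's variants (Abs-Q, Rel-Ratio, Rel-Effective);
    thresholds and exact forms here are paraphrased, not the paper's
    reported settings."""
    if scheme == "abs_q":
        # Good iff the chunk's absolute success probability is high enough.
        return int(q_curr >= abs_threshold)
    if scheme == "rel_ratio":
        # Good iff the chunk preserves enough of the prefix's success rate.
        return int(q_prev > 0 and q_curr / q_prev >= ratio_threshold)
    if scheme == "rel_effective":
        # Good iff the chunk keeps some chance of success and does not
        # degrade relative to the previous chunk.
        return int(q_curr > 0 and (q_prev == 0 or q_curr / q_prev >= ratio_threshold))
    raise ValueError(f"unknown scheme: {scheme}")
```

The relative schemes capture the intuition that a chunk should be rewarded for progress, not merely for sitting inside a trajectory that happens to succeed.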

Figure 1: Overview of the StepWiser training method, illustrating segmentation, Monte Carlo rollouts for Q-value estimation, and RL training of the generative judge.
Experimental Results and Ablations
ProcessBench Evaluation
StepWiser is evaluated on ProcessBench, which tests the ability to identify the first incorrect step in math reasoning trajectories. The RL-trained generative judge consistently outperforms SFT-trained discriminative baselines and other community models across all learning signals and model scales. For example, with Qwen2.5-7B-chunk and Rel-Effective labeling, StepWiser achieves an average score of 61.9, compared to 39.7 for the discriminative baseline. Majority voting at inference time yields modest further improvements, indicating that the binary nature of stepwise judgments limits the benefit of output aggregation.
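The majority-voting aggregation described above is straightforward; the sketch below assumes a `sample_fn` that returns one sampled verdict string per call (the name and verdict strings are illustrative). Because verdicts are binary, extra samples mostly confirm the modal answer, which matches the modest gains reported.

```python
from collections import Counter

def majority_verdict(sample_fn, k=5):
    """Aggregate k independently sampled judge verdicts by majority vote.
    With binary verdicts, the gain over a single sample is inherently
    limited once the judge is already well calibrated."""
    votes = Counter(sample_fn() for _ in range(k))
    return votes.most_common(1)[0][0]
```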
Ablation Studies
Ablations reveal that both the generative CoT reasoning and online RL training are essential for optimal performance. Offline rejection sampling fine-tuning (RS-FT) saturates quickly and yields inferior results, while discriminative judges trained with RL but without CoT reasoning underperform the full StepWiser pipeline. Prompt dataset balancing is critical: omitting it induces strong class bias and model collapse.

Figure 2: Stepwise accuracy and training loss curves for various judge setups, demonstrating the necessity of generative CoT and RL training.
Training Dynamics
The training loss curves for discriminative judges under different learning signals show rapid convergence and limited expressivity, especially for larger models. The use of entropy regularization and techniques such as "clip higher" are necessary to maintain exploration and prevent mode collapse during RL training.
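The "clip higher" technique can be illustrated with an asymmetric version of the standard PPO clipped surrogate; the epsilon values below are illustrative, not the paper's settings.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped objective with an asymmetric ('clip higher')
    upper bound: a larger eps_high lets low-probability tokens gain
    more probability mass on positive advantages, which helps preserve
    exploration and delay mode collapse during RL training."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic minimum, as in standard PPO.
    return min(ratio * advantage, clipped * advantage)
```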

Figure 3: Training loss curves for discriminative stepwise judges under different learning signals and model sizes.
Inference-Time Search and Data Selection
StepWiser is applied to chunk-reset reasoning, where the judge evaluates each chunk during generation and triggers re-sampling upon detecting errors. This approach yields consistent improvements in final solution accuracy without increasing the length of accepted responses, demonstrating superior error detection and correction capabilities. For data selection, StepWiser's average chunk score is used to select high-quality training examples, resulting in improved downstream model performance compared to outcome-based or discriminative selection.
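The chunk-reset loop can be sketched as follows. All names here are illustrative: `gen_chunk(prefix)` is assumed to return a `(chunk, done)` pair, and `judge` to return True for an accepted chunk; the retry policy is a simplifying assumption of this summary.

```python
def chunk_reset_generate(gen_chunk, judge, max_chunks=16, max_retries=4):
    """Inference-time chunk-reset search: generate one chunk at a time,
    query the stepwise judge, and re-sample a rejected chunk instead of
    appending it, so accepted responses do not grow longer."""
    prefix = []
    for _ in range(max_chunks):
        for _ in range(max_retries):
            chunk, done = gen_chunk(prefix)
            if judge(prefix, chunk):
                break  # accept this chunk
        prefix.append(chunk)  # keep the last attempt if all were rejected
        if done:
            break
    return prefix
```

Because rejected chunks are discarded rather than corrected in place, the final response length stays comparable to ordinary sampling.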
Implementation Considerations
- Computational Requirements: Monte Carlo annotation is resource-intensive, requiring extensive rollouts per chunk. Self-segmentation reduces the number of chunks and thus the annotation cost.
- RL Training Stability: Prompt balancing and entropy regularization are necessary to prevent class bias and mode collapse.
- Scalability: The method is demonstrated on both 1.5B and 7B models, with clear scaling trends. The approach is compatible with larger models and more complex reasoning tasks.
- Deployment: StepWiser judges can be used both during training (for reward modeling and data selection) and at inference (for error correction and search).
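The prompt-balancing point above can be made concrete with a simple downsampling sketch: each verdict class is reduced to the size of the smallest class so the judge sees a balanced label distribution during RL. This is an illustrative stand-in, not the paper's exact balancing procedure.

```python
import random
from collections import defaultdict

def balance_by_label(examples, key=lambda ex: ex["label"], seed=0):
    """Downsample each label class to the smallest class size, then
    shuffle, so RL training does not drift toward the majority verdict."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        buckets[key(ex)].append(ex)
    n = min(len(b) for b in buckets.values())
    balanced = []
    for b in buckets.values():
        balanced.extend(rng.sample(b, n))
    rng.shuffle(balanced)
    return balanced
```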
Theoretical and Practical Implications
The paper provides strong empirical evidence that explicit meta-reasoning via generative CoT and dense, stepwise RL signals yields superior reward models for multi-step reasoning. The approach leverages the intrinsic reasoning capabilities of LLMs, aligning the training of judges with the training of policy models. The use of relative progress signals (Rel-Ratio, Rel-Effective) is shown to be more effective than absolute correctness, suggesting that reward modeling should account for incremental improvements in reasoning trajectories.

Figure 4: Test stepwise accuracy for different judge architectures and labeling methods, highlighting the superiority of RL-trained generative judges.
Future Directions
- Generalization to Other Domains: While the focus is on mathematical reasoning, the methodology is applicable to any domain requiring multi-step reasoning and intermediate supervision.
- Integration with Tool-Augmented Agents: StepWiser could be extended to agents interacting with external tools, where stepwise verification is critical.
- Active Data Selection and Curriculum Learning: The judge's scores could be used to drive active learning and curriculum design for more efficient model training.
- Scaling to Larger Models and More Complex Tasks: Further work is needed to optimize annotation and training pipelines for models with longer reasoning chains and more diverse error modes.
Conclusion
StepWiser introduces a principled framework for stepwise generative reward modeling in LLM reasoning, combining self-segmentation, Monte Carlo annotation, and RL training of meta-reasoning judges. The empirical results demonstrate substantial improvements over existing baselines in both evaluation and practical applications. The findings underscore the importance of explicit reasoning and dense, stepwise supervision for robust alignment and error detection in multi-step reasoning tasks. The approach sets a new standard for process reward modeling and opens avenues for further research in scalable, interpretable, and generalizable reward models for LLMs.