- The paper surveys Process Reward Models (PRMs), which shift supervision from outcome signals to detailed process-level feedback to improve LLM reasoning.
- It details diverse data generation methods, including human annotation, automated supervision, and semi-automated approaches to balance quality and scalability.
- The survey discusses practical applications and benchmarks across domains like math, code, and robotics, emphasizing PRMs’ potential to improve diagnostic precision.
A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for LLMs
Introduction
The paper "A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for LLMs" (2510.08049) presents a comprehensive survey of Process Reward Models (PRMs), which aim to enhance alignment in LLMs by shifting focus from final outcome judgments to detailed evaluations of intermediate reasoning steps. This transition from Outcome Reward Models (ORMs) to PRMs addresses the inadequacies of static outcome-centric approaches, particularly in capturing and guiding stepwise reasoning processes essential for complex reasoning tasks.
Generating Process Data
Data generation for PRMs is categorized into human annotation, automated supervision, and semi-automated approaches. Human annotation, though resource-intensive, provides high-fidelity signals crucial for benchmarking. Automated supervision employs symbolic verification, execution feedback, and Monte Carlo Tree Search (MCTS) to scale data generation without human intervention. Semi-automated methods combine human-created seed data with automated expansion, optimizing resource use while maintaining quality.
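To make the Monte Carlo flavor of automated supervision concrete, the sketch below labels each step by rolling out completions from its prefix and checking whether they reach a verified answer. This is a minimal illustration, not the survey's prescribed procedure: `sample_completions` and `is_correct` are hypothetical helpers standing in for a policy model and an answer verifier, and the hard-labeling rule (positive if any rollout succeeds) is one common convention.

```python
def label_steps_by_rollout(problem, steps, sample_completions, is_correct, n_rollouts=8):
    """Estimate step-level labels via Monte Carlo rollouts (a common automated-supervision recipe).

    For each solution prefix, sample continuations from the policy model and record
    how often they reach a verified correct final answer. Steps whose prefix can
    still lead to a correct answer are labeled positive.
    """
    labels = []
    for k in range(1, len(steps) + 1):
        prefix = steps[:k]                                    # reasoning up to and including step k
        rollouts = sample_completions(problem, prefix, n=n_rollouts)
        success_rate = sum(is_correct(problem, r) for r in rollouts) / n_rollouts
        labels.append({
            "step": steps[k - 1],
            "mc_value": success_rate,                         # soft value estimate for the step
            "label": 1 if success_rate > 0 else 0,            # hard label: any successful rollout
        })
    return labels
```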
Building Process Reward Models
PRMs are classified into discriminative, generative, and implicit variants, plus other architectural innovations. Discriminative PRMs use pointwise and pairwise losses to score intermediate steps. Generative PRMs adopt a "think-then-judge" approach, producing extended reasoning chains and improved semantic comprehension. Implicit PRMs leverage indirect supervision, eschewing explicit step-level labels. Architectural innovations, such as graph-based and multimodal designs, enrich how reasoning is represented and scored.
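As an illustration of the discriminative, pointwise-loss variant, here is a minimal PyTorch sketch. It is written under stated assumptions: `backbone` is taken to be a HuggingFace-style encoder exposing `last_hidden_state`, and `step_end_positions` marks the token index where each reasoning step ends; both are placeholders, not interfaces from the surveyed papers.

```python
import torch
import torch.nn as nn


class PointwisePRM(nn.Module):
    """Discriminative PRM: scores each reasoning step with an independent scalar."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask, step_end_positions):
        # (batch, seq_len, hidden) token representations from the encoder
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        batch_idx = torch.arange(hidden.size(0), device=hidden.device).unsqueeze(-1)
        # Gather the hidden state at the end of each step: (batch, num_steps, hidden)
        step_states = hidden[batch_idx, step_end_positions]
        return self.score_head(step_states).squeeze(-1)       # (batch, num_steps) logits


def pointwise_loss(step_logits, step_labels, step_mask):
    """Binary cross-entropy over per-step correctness labels, ignoring padded steps."""
    loss = nn.functional.binary_cross_entropy_with_logits(
        step_logits, step_labels.float(), reduction="none")
    return (loss * step_mask).sum() / step_mask.sum()
```

A pairwise variant would instead compare the scores of a correct and an incorrect step (or trajectory) with a ranking loss; the pointwise form above is simply the easier one to sketch.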
Application of PRMs
PRMs have broad applicability across domains such as math, code, multimodal reasoning, text, robotics, and interactive agents. In mathematical reasoning they support grading, tutoring, and checks on logical consistency; in code generation they improve robustness; and in multimodal tasks they strengthen interpretability and factual consistency. They also hold promise in high-stakes sectors like finance and medicine, where precise, validated reasoning is critical.
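A typical inference-time use across these domains is PRM-guided best-of-N reranking, sketched below. The function `prm_score_steps` is a hypothetical scorer returning one score per step, and aggregating with `min` (as opposed to a product or mean) is one design choice among several, chosen here because it penalizes any single faulty step.

```python
def rerank_with_prm(problem, candidates, prm_score_steps, aggregate=min):
    """Pick the best of N candidate solutions by an aggregated process reward.

    `candidates` is a list of solutions, each a list of reasoning steps.
    `prm_score_steps(problem, steps)` is assumed to return one score per step.
    """
    scored = []
    for steps in candidates:
        step_scores = prm_score_steps(problem, steps)
        scored.append((aggregate(step_scores), steps))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]  # candidate with the highest trajectory-level score
```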
Benchmarking
Recent benchmarks like PRMBench, ProcessBench, and ViLBench evaluate PRMs across various dimensions including reasoning style, multimodal tasks, and long-horizon decision-making. These benchmarks test the robustness, generalization, and adaptability of PRMs, providing a structured framework for comparison and evaluation.
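One recurring evaluation protocol in these benchmarks is first-error localization: given a solution whose earliest erroneous step is annotated, check whether the PRM flags that step (or correctly reports no error). The sketch below illustrates the idea only; the field names (`problem`, `steps`, `first_error`) and the fixed threshold are assumptions for illustration, not any benchmark's actual schema.

```python
def first_error_accuracy(examples, prm_score_steps, threshold=0.5):
    """Fraction of examples where the PRM's first low-scoring step matches the annotated first error.

    `first_error` is the index of the first wrong step, or None for fully correct solutions.
    """
    hits = 0
    for ex in examples:
        scores = prm_score_steps(ex["problem"], ex["steps"])
        predicted = next((i for i, s in enumerate(scores) if s < threshold), None)
        hits += int(predicted == ex["first_error"])
    return hits / len(examples)
```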
Discussion
The survey highlights PRMs' transformative potential in evolving LLMs from coarse outcome judgment to nuanced process-level diagnostics and optimization. Because step-level annotation is resource-intensive, the paper argues for integrating automated data generation to contain costs. It also addresses open challenges such as error propagation in automated labeling and the need for cross-domain generalization.
Conclusion
Process Reward Models represent a significant advance in aligning LLMs through fine-grained supervision. By improving diagnostic capability and robustness in reasoning tasks, PRMs facilitate enhanced alignment and performance in complex problem-solving contexts. Future challenges include optimizing annotation processes, enhancing generalization, and integrating with advanced planning and memory systems to further improve reasoning fidelity and applicability across domains.
This survey provides a structured insight into PRMs, guiding future research towards developing efficient, scalable, and interpretable reasoning alignment techniques for LLMs.