Training process reward models for long LLM reasoning traces

Determine effective and reliable methodologies for training explicit value functions (also known as process reward models) over long reasoning traces generated by large language models, so that step-level credit assignment can be performed accurately in long-horizon reasoning tasks.

Background

The paper highlights that outcome-reward reinforcement learning assigns credit only at the final answer, creating a need for step-level credit assignment along long reasoning traces. A common proposed solution is to learn explicit value functions—often called process reward models (PRMs)—that assess intermediate steps.

However, while prior work has attempted to amortize step-level evaluation by training PRMs, the authors note that the community lacks clear, effective procedures for training these models over long reasoning traces produced by LLMs. This uncertainty motivates their alternative approach, Intervention Training (InT), which sidesteps explicit PRM training.
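The paper leaves the PRM training recipe open, but a common baseline in the process-reward literature is to label each intermediate step with a Monte Carlo estimate of success (the fraction of completions sampled from that step that reach a correct final answer) and fit a value head to those labels. The minimal sketch below assumes a linear value head, hand-crafted step features, and toy rollout data; none of these choices come from the paper, and a real PRM would use an LLM backbone rather than a linear model.

```python
import math

def mc_step_label(rollout_outcomes):
    """Monte Carlo label for a step: fraction of sampled completions
    from that step that end in a correct final answer (1) vs. not (0)."""
    return sum(rollout_outcomes) / len(rollout_outcomes)

def train_prm(traces, dim, lr=0.5, epochs=300):
    """Fit a linear step-value head v(s) = sigmoid(w . phi(s)) to
    Monte Carlo step labels with binary cross-entropy.

    traces: list of (step_features, step_labels) pairs, one per trace,
    where step_features is a list of feature vectors (one per step).
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for steps, labels in traces:
            for phi, y in zip(steps, labels):
                z = sum(wi * xi for wi, xi in zip(w, phi))
                p = 1.0 / (1.0 + math.exp(-z))
                g = p - y  # gradient of BCE w.r.t. the logit z
                for i in range(dim):
                    w[i] -= lr * g * phi[i]
    return w

def step_value(w, phi):
    """Predicted probability that continuing from this step succeeds."""
    z = sum(wi * xi for wi, xi in zip(w, phi))
    return 1.0 / (1.0 + math.exp(-z))

# Toy data (illustrative): phi = [bias, "step looks sound" indicator].
# The sound step's rollouts mostly succeed; the flawed step's mostly fail.
traces = [
    ([[1.0, 1.0], [1.0, 0.0]],
     [mc_step_label([1, 1, 1, 1]), mc_step_label([0, 0, 0, 1])]),
]
w = train_prm(traces, dim=2)
```

After training, `step_value(w, [1.0, 1.0])` is high and `step_value(w, [1.0, 0.0])` is low, giving a per-step signal that could, in principle, drive step-level credit assignment; the open question the paper raises is precisely whether such labels and objectives remain reliable over very long traces.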

References

Prior work attempts to amortize this process by training explicit value functions (or process reward models)~\citep{setlur2024rewarding,luo2024improvemathematicalreasoninglanguage}, but how we should train such functions over long reasoning traces remains an open question.

InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning  (2601.14209 - Yang et al., 20 Jan 2026) in Section 1, Introduction