Training process reward models for long LLM reasoning traces
Determine effective and reliable methodologies for training explicit value functions (also referred to as process reward models) over long reasoning traces generated by large language models, so that step-level credit assignment can be performed accurately in long-horizon reasoning tasks.
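One common recipe for such value functions, not prescribed by the source, is to regress a per-step prediction onto Monte Carlo estimates of eventual success from each step prefix. The sketch below is a toy illustration of that idea only: the data is entirely synthetic, the per-step value targets stand in for Monte Carlo rollout estimates, and a scalar linear-plus-sigmoid head stands in for an LLM-based scorer.

```python
import math
import random

random.seed(0)

# Toy data: each "reasoning trace" is a list of scalar step features, and
# each step prefix carries a value target in [0, 1] standing in for a
# Monte Carlo estimate (fraction of sampled continuations from that
# prefix that reach a correct final answer). Entirely synthetic.
def make_trace(n_steps):
    feats, targets = [], []
    quality = 0.0
    for _ in range(n_steps):
        step = random.gauss(0.0, 1.0)   # hypothetical step feature
        quality += step
        feats.append(step)
        # "true" prefix value: probability of eventual success
        targets.append(1.0 / (1.0 + math.exp(-quality)))
    return feats, targets

traces = [make_trace(random.randint(3, 8)) for _ in range(200)]

# Tiny value head: sigmoid(w * prefix_sum + b), trained with SGD on
# squared error against the per-step value targets.
w, b, lr = 0.0, 0.0, 0.1

def predict(prefix_sum):
    return 1.0 / (1.0 + math.exp(-(w * prefix_sum + b)))

def epoch_loss():
    total, count = 0.0, 0
    for feats, targets in traces:
        prefix = 0.0
        for f, t in zip(feats, targets):
            prefix += f
            total += (predict(prefix) - t) ** 2
            count += 1
    return total / count

before = epoch_loss()
for _ in range(20):
    for feats, targets in traces:
        prefix = 0.0
        for f, t in zip(feats, targets):
            prefix += f
            p = predict(prefix)
            grad = 2 * (p - t) * p * (1 - p)   # d(sq. error)/d(logit)
            w -= lr * grad * prefix
            b -= lr * grad
after = epoch_loss()
print(f"loss before={before:.4f} after={after:.4f}")
```

The open question the section raises is precisely what this sketch glosses over: for long traces, the Monte Carlo targets become noisy and expensive to obtain, and it is unclear how best to define and supervise per-step values at scale.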
References
Prior work attempts to amortize this process by training explicit value functions (or process reward models)~\citep{setlur2024rewarding,luo2024improvemathematicalreasoninglanguage}, but how we should train such functions over long reasoning traces remains an open question.
— InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning
(2601.14209 - Yang et al., 20 Jan 2026) in Section 1, Introduction