- The paper demonstrates that the token-level log-probability ratio between RL-trained and reference policies recovers the optimal advantage function.
- It employs diverse aggregation strategies over checkpoints to improve evaluation, sample selection, and uncertainty estimation in agentic settings.
- Empirical results show significant boosts in task success rates and reliable error localization across multiple benchmarks and LLM backbones.
Implicit Process Reward for LLM Agents via Progress Advantage
Introduction
This paper targets the central challenge of process-level credit assignment in LLM agents operating in stochastic, long-horizon environments. Standard reinforcement learning (RL) post-training pipelines provide only outcome-level reward feedback, while process reward modeling (PRM), the construction of fine-grained step-level reward signals, remains prohibitively expensive and largely unexplored for complex agentic architectures. Agentic tasks involve stochastic, multi-turn exchanges with external environments, which make both annotated reward collection and Monte Carlo estimation impractical for trajectory evaluation at scale. The authors propose an annotation-free approach built on "progress advantage": a log-probability ratio between the RL-trained and reference policies, theoretically shown to recover the optimal advantage function under KL-regularized RL and empirically validated for agentic evaluation and monitoring.
Theoretical Framework
Problem Setting
The agent setting is cast as a token-level Markov Decision Process (MDP) with stochastic transitions, where states represent action-observation histories, actions are generated tokens, and external environment feedback drives the dynamics. The policy is optimized for cumulative reward under a KL regularization constraint, as is standard in RL fine-tuning of LLMs.
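For concreteness, one common way to write a KL-regularized objective of this kind is sketched below. The symbols $r$, $\beta$, and $T$, and the undiscounted finite-horizon form, are assumptions made here for illustration and may differ from the paper's exact formulation.

```latex
\max_{\pi}\;
\mathbb{E}_{\tau \sim \pi}\!\left[
  \sum_{t=0}^{T} \Bigl( r(s_t, a_t)
  - \beta\, \mathrm{KL}\bigl(\pi(\cdot \mid s_t)\,\Vert\,\pi_{\mathrm{ref}}(\cdot \mid s_t)\bigr) \Bigr)
\right]
```

Here $\beta > 0$ controls how strongly the trained policy is kept close to the reference policy $\pi_{\mathrm{ref}}$.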
Progress Advantage
The paper proves that, for any policy obtained by KL-regularized RL (including both explicit KL regularization and clipping-based surrogates), the step-level log-probability ratio between the RL-trained behavior policy and its reference policy exactly recovers the optimal advantage function:
$$A^*(s,a) = Q^*(s,a) - V^*(s) = \beta \log \frac{\pi^*(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}$$
This applies to the general stochastic MDP setting, extending earlier work which only held in deterministic or non-interactive regimes. The key insight is that, while the reward itself cannot be recovered due to the non-cancellation of value terms under stochasticity, the advantage function remains accessible since the log ratio absorbs the expectation over future values.
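The argument can be sketched with the standard soft-Bellman relations for KL-regularized RL. This is an illustrative outline rather than the paper's proof; the reward $r$, the transition kernel $P$, and the undiscounted form are notation assumed here.

```latex
\begin{align*}
Q^*(s,a) &= r(s,a) + \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\bigl[V^*(s')\bigr], \\
V^*(s) &= \beta \log \sum_{a} \pi_{\mathrm{ref}}(a \mid s)\,
          \exp\!\bigl(Q^*(s,a)/\beta\bigr), \\
\pi^*(a \mid s) &= \pi_{\mathrm{ref}}(a \mid s)\,
          \exp\!\bigl((Q^*(s,a) - V^*(s))/\beta\bigr), \\
\beta \log \frac{\pi^*(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}
  &= Q^*(s,a) - V^*(s) = A^*(s,a), \\
r(s,a) &= \beta \log \frac{\pi^*(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}
          + V^*(s) - \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\bigl[V^*(s')\bigr].
\end{align*}
```

The fourth line follows from taking the logarithm of the third and involves no expectation over next states. The last line shows why the reward is harder to recover: in a deterministic MDP the expectation collapses to $V^*$ at the realized next state, so the value terms telescope along a trajectory, whereas under stochastic transitions they do not.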
The progress advantage can therefore be directly computed from available RL checkpoint pairs (behavior and reference policies), with no additional training or reward model inference, and is applicable across domains and mainline RL algorithms.
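A minimal sketch of the per-token computation, assuming the log-probabilities of the same generated tokens have already been obtained from both checkpoints. The function name, the `beta` default, and the example numbers are hypothetical and not taken from the paper.

```python
import numpy as np


def token_progress_advantage(logp_behavior, logp_reference, beta=1.0):
    """Per-token beta * (log pi_behavior - log pi_reference).

    Both inputs hold log-probabilities of the SAME generated tokens,
    one value per token, scored under each checkpoint.
    """
    logp_behavior = np.asarray(logp_behavior, dtype=float)
    logp_reference = np.asarray(logp_reference, dtype=float)
    if logp_behavior.shape != logp_reference.shape:
        raise ValueError("log-prob arrays must have the same shape")
    return beta * (logp_behavior - logp_reference)


# Hypothetical log-probs for three generated tokens.
adv = token_progress_advantage([-0.1, -2.0, -0.5], [-0.4, -1.0, -0.5])
print(adv)
```

A positive entry means the RL-trained policy assigns the token more probability than the reference does; a negative entry means less.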
Practical Instantiation
Translation into practice requires careful selection of behavior and reference policy checkpoints. The reference must be neither too distant nor too proximate to the behavior policy to avoid signal dilution or irrelevance. Aggregation strategies over per-token advantages (sum, mean, step-wise min/max, or weighted combinations) are required to score steps or trajectories, and the choice impacts effectiveness per application. Top-k smoothed variants of the log-probability can improve stability in noisy settings.
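The plain aggregation choices mentioned above (sum, mean, min, max) might be sketched as follows. The two-level scheme, which first reduces tokens within a step and then reduces steps within a trajectory, and the default reducers are assumptions chosen for illustration. Top-k smoothing is not covered here.

```python
import numpy as np

# Supported reductions; each maps an array of scores to one number.
_REDUCERS = {"sum": np.sum, "mean": np.mean, "min": np.min, "max": np.max}


def aggregate(values, how="mean"):
    """Collapse a non-empty sequence of scores into a single float."""
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        raise ValueError("cannot aggregate an empty sequence")
    if how not in _REDUCERS:
        raise ValueError(f"unknown aggregation: {how!r}")
    return float(_REDUCERS[how](values))


def score_trajectory(steps, token_how="mean", step_how="min"):
    """Score each step from its token advantages, then the whole trajectory.

    `steps` is a list of per-step lists of token-level advantages.
    Returns (trajectory_score, per_step_scores).
    """
    step_scores = [aggregate(tokens, token_how) for tokens in steps]
    return aggregate(step_scores, step_how), step_scores


# Hypothetical trajectory with three steps.
total, per_step = score_trajectory([[0.2, 0.4], [-1.0, 0.0], [0.5]])
```

With `step_how="min"` a trajectory is only as good as its weakest step, which suits error detection; `"sum"` or `"mean"` rewards overall progress instead.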
Empirical Evaluation
Benchmarks and Model Families
Progress advantage was evaluated across four agentic benchmarks (BFCLv4-MT, WebShop, AgentDojo, τ-bench) and four diverse LLM backbones (Gemma4, Qwen3.5, Qwen3, Olmo3). Key tasks covered include multi-turn tool use, web navigation, general task completion, and customer-service conversations.
Test-Time Scaling (Best-of-N Selection)
Progress advantage-guided reranking of multiple candidate agent trajectories outperformed self-confidence scoring, pre-trained reward models, task-specialized PRMs, and proprietary LLM scoring in boosting task success rates under high-entropy, exploratory sampling. The improvement margins (e.g., +15.5% over the best self-confidence baseline on Gemma4 models) are robust across models and tasks.
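Best-of-N reranking itself is simple once a trajectory score is available. The sketch below assumes each candidate is represented by its list of token-level advantages and uses the mean as the score; both choices are illustrative, not the paper's stated configuration.

```python
from statistics import fmean


def best_of_n(candidates, score_fn):
    """Return (best_index, scores) for the highest-scoring candidate.

    Ties are broken in favour of the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [score_fn(c) for c in candidates]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Hypothetical token-level advantages for three sampled trajectories.
candidates = [[0.1, -0.3], [0.4, 0.2], [-0.5, 0.9]]
best_index, scores = best_of_n(candidates, fmean)
```

Here the second trajectory is selected, since its mean advantage is the largest of the three.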
Uncertainty Quantification
Progress advantage yields superior AUROC for predicting agent success/failure over trajectory samples, outperforming all confidence-based and reward model baselines, including LLM-as-a-Judge methods. It also generalizes as an effective off-the-shelf scorer for evaluating trajectories sampled from other policy backbones, demonstrating broad transferability.
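To make the metric concrete, AUROC can be computed directly from its pairwise definition: the probability that a randomly chosen successful trajectory receives a higher score than a randomly chosen failed one. The sketch below is a generic implementation with hypothetical scores and labels, not the paper's evaluation code.

```python
def auroc(scores, labels):
    """Pairwise AUROC; labels are 1 (success) or 0 (failure).

    Each (success, failure) pair counts 1 if the success scores higher,
    0.5 on a tie, and 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical trajectory scores and outcomes.
perfect = auroc([0.9, 0.2, 0.6, 0.1], [1, 0, 1, 0])
chance = auroc([0.9, 0.2, 0.3, 0.6], [1, 1, 0, 0])
```

A value of 1.0 means the score separates successes from failures perfectly, while 0.5 corresponds to chance-level ranking. The quadratic pairwise loop is adequate for small evaluation sets; a rank-based formula would be preferable at scale.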
Failure Attribution
The method shows strong step-level localization of critical model errors in long trajectories, matching or exceeding the accuracy of RL-trained, task-specific failure attribution models.
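One simple way to turn step-level scores into an error location is to flag the lowest-scoring step. This argmin heuristic is an assumption made for illustration; the summary does not specify the paper's exact attribution rule.

```python
def locate_failure(step_scores):
    """Return the index of the lowest-scoring step.

    The flagged step is a candidate for the critical error; ties go to
    the earliest step.
    """
    if not step_scores:
        raise ValueError("need at least one step score")
    return min(range(len(step_scores)), key=step_scores.__getitem__)


# Hypothetical per-step progress advantages for a failed trajectory.
suspect = locate_failure([0.3, 0.1, -1.2, 0.4])
```

In this example the third step (index 2) is flagged, since it is where the trained policy's preference over the reference drops most sharply.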
Discriminative Power and Aggregation
Progress advantage consistently demonstrates better reliability and discriminative power than the log-probabilities of either the behavior or reference model alone. The empirical utility is sensitive to aggregation strategy: mean, min, or max selection at token/step level must be tailored per application. Top-k smoothing has nuanced effects and may be optimal for tasks requiring sharper step attribution.
Implications
Annotation-Free, Domain-Agnostic PRM
Progress advantage provides a zero-annotation, domain-agnostic, and training-free PRM replacement for LLM agents, enabling scalable process-level evaluation, test-time inference reranking, uncertainty estimation, and failure attribution. It also provides a template for robust runtime monitoring and post-deployment intervention without application-specific adaptation.
Theoretical Unification
The derivation unifies implicit process reward modeling across deterministic and stochastic MDPs, establishing the theoretical sufficiency of contrasting post-trained and reference policy distributions for extracting policy advantage.
Model Merging and Reference Selection
The efficacy of progress advantage hinges on the choice of reference policy. Advanced model merging and interpolation methods (e.g., TIES, WISE) can further sharpen the discriminative power of the progress signal, suggesting new research directions for scalable and robust agentic PRM construction.
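As a rough illustration of the interpolation idea, the sketch below linearly blends two checkpoints parameter by parameter to produce an intermediate reference. This is plain linear interpolation, the simplest member of this family, and not an implementation of TIES; the dictionary-of-arrays checkpoint format and the `alpha` parameter are hypothetical.

```python
import numpy as np


def interpolate_checkpoints(ref_weights, rl_weights, alpha):
    """Return (1 - alpha) * ref + alpha * rl for every parameter.

    alpha = 0 reproduces the reference checkpoint and alpha = 1 the
    RL-trained one; values in between give intermediate references.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if ref_weights.keys() != rl_weights.keys():
        raise ValueError("checkpoints must share parameter names")
    return {
        name: (1.0 - alpha) * ref_weights[name] + alpha * rl_weights[name]
        for name in ref_weights
    }


# Hypothetical single-parameter checkpoints.
ref = {"w": np.array([0.0, 2.0])}
rl = {"w": np.array([1.0, 4.0])}
merged = interpolate_checkpoints(ref, rl, alpha=0.25)
```

Sweeping `alpha` gives a family of references at varying distance from the behavior policy, which is one way to probe the "neither too distant nor too proximate" trade-off described earlier.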
Reusable Training Artifacts
This work advocates for open access to intermediate model checkpoints, so that community members can construct inference-time reward signals and monitoring instruments from standard RL training artifacts. The push for fully transparent model development pipelines aligns with broader sustainable and updatable machine learning trends.
Modality Generalization
Since progress advantage fundamentally relies on policy probabilities, it is likely extensible to multimodal agent architectures, including vision-language-action (VLA), embodied, and multi-agent systems.
Future Directions
Future research should:
- Systematically explore reference policy selection, merging, and interpolation for sharper advantage signals,
- Control for deviation from RL optimality in deployed checkpoints,
- Scale to multimodal, embodied, and interactive agent scenarios,
- Integrate progress advantage-driven signals into closed-loop self-improvement and automatic reward distillation pipelines.
Conclusion
Progress advantage presents a theoretically grounded and empirically robust implicit reward formulation for process-level evaluation of RL-trained LLM agents under stochastic, long-horizon interactions. It bypasses the prohibitive costs of manual annotation and task-specific reward modeling, while consistently improving sample selection, monitoring, and error diagnosis across multiple agentic tasks and architectures. This marks a significant advance in practical, scalable PRM construction for real-world LLM agents and lays the foundation for broader adoption of sustainable "free lunch" artifacts in AI development.