
Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

Published 24 Jun 2026 in cs.LG and cs.AI | (2606.26080v1)

Abstract: Process reward models enable fine-grained, step-level evaluation of LLMs, yet building them for agentic settings remains prohibitively difficult: long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale. In this work, we show that reinforcement learning (RL) post-training already provides the ingredients for effective step-level scoring, eliminating the need for dedicated reward model training altogether. Concretely, we derive an implicit advantage under a general stochastic Markov decision process, which we term progress advantage -- log-probability ratio between the RL-trained policy and its reference policy exactly recovers the optimal advantage function. This formulation makes the resulting signal annotation-free, domain-agnostic, and available as a byproduct of the standard RL post-training pipeline. We validate the effectiveness of the progress advantage across three different applications: test-time scaling, uncertainty quantification, and failure attribution on five benchmarks and four model families. Across all settings, it consistently outperforms confidence-based baselines and, despite requiring no task-specific training, surpasses dedicated trained reward models. We complement these results with deeper analyses on characteristics of progress advantage, offering practical guidance for adoption in real-world agentic systems.

Summary

  • The paper demonstrates that the token-level log-probability ratio between RL-trained and reference policies recovers the optimal advantage function.
  • It aggregates per-token advantages computed from RL checkpoint pairs (sum, mean, min/max, or weighted combinations) to support trajectory evaluation, sample selection, and uncertainty estimation in agentic settings.
  • Empirical results show significant boosts in task success rates and reliable error localization across multiple benchmarks and LLM backbones.

Implicit Process Reward for LLM Agents via Progress Advantage

Introduction

This paper targets process-level credit assignment in LLM agents operating in stochastic, long-horizon environments. Standard reinforcement learning (RL) post-training pipelines provide outcome-level reward feedback, but process reward modeling (PRM), the construction of fine-grained step-level reward signals, remains prohibitively expensive and largely unexplored for complex agentic architectures. Agentic tasks involve stochastic, multi-turn exchanges with external environments, which make both human annotation and Monte Carlo estimation infeasible at scale for trajectory evaluation. The authors propose an annotation-free approach built on the "progress advantage": the log-probability ratio between the RL-trained and reference policies, theoretically shown to recover the optimal advantage function under KL-regularized RL and empirically validated for agentic evaluation and monitoring.

Theoretical Framework

Problem Setting

The agent setting is cast as a token-level Markov Decision Process (MDP) with stochastic transitions, where states represent action-observation histories, actions are generated tokens, and external environment feedback drives dynamics. The policy is optimized for cumulative reward under a KL regularization constraint—standard in RL fine-tuning of LLMs.

Progress Advantage

The paper proves that, under KL-regularized RL (including both explicit KL regularization and clipping-based surrogates), the log-probability ratio between the optimal policy, approximated in practice by the RL-trained behavior policy, and its reference policy exactly recovers the optimal advantage function:

$$A^*(s,a) = Q^*(s,a) - V^*(s) = \beta \log \frac{\pi^*(a \mid s)}{\pi_\mathrm{ref}(a \mid s)}$$

This result applies to the general stochastic MDP setting, extending earlier work that held only in deterministic or non-interactive regimes. The key insight is that, while the reward itself cannot be recovered because the value terms do not cancel under stochastic transitions, the advantage function remains accessible since the log ratio absorbs the expectation over future values.
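The argument can be sketched with the standard soft-Bellman relations for KL-regularized RL. The symbols $r$ (reward), $P$ (transition kernel), and the undiscounted formulation are assumptions made here for illustration and may differ from the paper's notation.

```latex
\begin{align}
Q^*(s,a) &= r(s,a) + \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[V^*(s')\big], \\
V^*(s) &= \beta \log \sum_{a} \pi_\mathrm{ref}(a \mid s)\,
          \exp\!\big(Q^*(s,a)/\beta\big), \\
\pi^*(a \mid s) &= \pi_\mathrm{ref}(a \mid s)\,
          \exp\!\big((Q^*(s,a) - V^*(s))/\beta\big).
\end{align}
% Taking the logarithm of the last line gives the progress advantage:
\begin{equation}
\beta \log \frac{\pi^*(a \mid s)}{\pi_\mathrm{ref}(a \mid s)}
  = Q^*(s,a) - V^*(s)
  = r(s,a) + \mathbb{E}_{s'}\big[V^*(s')\big] - V^*(s).
\end{equation}
```

With deterministic transitions, $\mathbb{E}_{s'}[V^*(s')] = V^*(s_{t+1})$, so summing the log ratio along a trajectory telescopes and leaves the cumulative reward plus boundary value terms. With stochastic transitions, the realized $V^*(s_{t+1})$ generally differs from its expectation, so the sum no longer telescopes and the reward is not identifiable, yet each individual log ratio still equals $A^*(s,a)$.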

The progress advantage can therefore be directly computed from available RL checkpoint pairs (behavior and reference policies), with no additional training or reward model inference, and is applicable across domains and mainline RL algorithms.
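As a minimal sketch of this computation, assume the per-token log-probabilities of the same generated tokens have already been obtained from both checkpoints. The function name `progress_advantage` and the default `beta` are hypothetical choices for illustration, not the paper's implementation.

```python
from typing import List, Sequence


def progress_advantage(
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    beta: float = 1.0,
) -> List[float]:
    """Per-token beta * (log pi(a|s) - log pi_ref(a|s)).

    Both inputs are log-probabilities of the SAME tokens, scored once by
    the RL-trained checkpoint and once by the reference checkpoint.
    `beta` is the KL coefficient; 1.0 is an assumed placeholder.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("both checkpoints must score the same tokens")
    return [beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]


# Toy values: the first token became more likely after RL, the second less.
token_adv = progress_advantage([-1.0, -2.0], [-1.5, -1.0], beta=2.0)
```

A positive entry means the RL-trained policy raised the probability of that token relative to the reference, and a negative entry means it lowered it.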

Practical Instantiation

Putting this into practice requires careful selection of the behavior and reference policy checkpoints. The reference must be neither too far from nor too close to the behavior policy; otherwise the signal becomes diluted or irrelevant. Per-token advantages must then be aggregated (sum, mean, step-wise min/max, or weighted combinations) to score steps or trajectories, and the best choice depends on the application. Top-k smoothed variants of the log-probability can improve stability in noisy settings.
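The aggregation choices named above can be illustrated as follows. This is a sketch under assumptions: the helper names `aggregate` and `score_trajectory`, and the default pairing of a token-level sum with a step-level min, are hypothetical and not taken from the paper. Top-k smoothing is omitted because its exact definition is not given here.

```python
from typing import Callable, Dict, Sequence

# Supported reductions over a non-empty sequence of floats.
_REDUCERS: Dict[str, Callable[[Sequence[float]], float]] = {
    "sum": lambda xs: float(sum(xs)),
    "mean": lambda xs: float(sum(xs)) / len(xs),
    "min": lambda xs: float(min(xs)),
    "max": lambda xs: float(max(xs)),
}


def aggregate(values: Sequence[float], how: str = "mean") -> float:
    """Reduce a sequence of advantages to a single score."""
    if not values:
        raise ValueError("cannot aggregate an empty sequence")
    if how not in _REDUCERS:
        raise ValueError(f"unknown aggregation: {how}")
    return _REDUCERS[how](values)


def score_trajectory(
    step_token_adv: Sequence[Sequence[float]],
    token_agg: str = "sum",
    step_agg: str = "min",
) -> float:
    """Aggregate tokens within each step, then steps within the trajectory."""
    step_scores = [aggregate(tokens, token_agg) for tokens in step_token_adv]
    return aggregate(step_scores, step_agg)


# Two steps: the second contains a strongly negative token.
score = score_trajectory([[1.0, 1.0], [-3.0, 1.0]])
```

A step-level min emphasizes the single weakest step, whereas a mean rewards trajectories that are good on average; which behaves better is an empirical question per application.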

Empirical Evaluation

Benchmarks and Model Families

Progress advantage was evaluated on agentic benchmarks including BFCLv4-MT, WebShop, AgentDojo, and τ-bench, using four LLM backbones (Gemma4, Qwen3.5, Qwen3, Olmo3). The tasks covered include multi-turn tool use, web navigation, general task completion, and customer-service conversations.

Test-Time Scaling (Best-of-N Selection)

Reranking multiple candidate agent trajectories by progress advantage outperformed self-confidence baselines, pre-trained reward models, task-specialized PRMs, and proprietary LLM scoring in boosting task success rates under high-entropy, exploratory sampling. The improvement margins (e.g., +15.5% over the best self-confidence baseline on Gemma4 models) are robust across models and tasks.
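Best-of-N selection with such a score reduces to an argmax over candidates. The sketch below assumes each candidate trajectory is already represented by its per-token progress advantages and uses a mean as the trajectory score; both the function names and the choice of mean are illustrative assumptions.

```python
from typing import List, Sequence


def trajectory_score(token_adv: Sequence[float]) -> float:
    """Mean per-token progress advantage (an assumed scoring rule)."""
    if not token_adv:
        raise ValueError("trajectory has no tokens")
    return sum(token_adv) / len(token_adv)


def best_of_n(candidates: Sequence[Sequence[float]]) -> int:
    """Return the index of the highest-scoring candidate trajectory.

    Ties are broken in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    scores: List[float] = [trajectory_score(c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Three sampled trajectories; the second has the highest mean advantage.
chosen = best_of_n([[0.1, -0.4], [0.5, 0.3], [-1.0]])
```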

Uncertainty Quantification

Progress advantage yields superior AUROC for predicting agent success/failure over trajectory samples, outperforming all confidence-based and reward model baselines, including LLM-as-a-Judge methods. It also generalizes as an effective off-the-shelf scorer for evaluating trajectories sampled from other policy backbones, demonstrating broad transferability.
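For readers unfamiliar with the metric, AUROC here is the probability that a randomly chosen successful trajectory receives a higher score than a randomly chosen failed one. The following sketch computes it by direct pairwise comparison; the function name `auroc` and the toy inputs are illustrative and not the paper's evaluation code.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: P(score of a success > score of a failure).

    `labels` uses 1 for success and 0 for failure. Ties count as half.
    Runs in O(n_pos * n_neg), which is fine for small evaluation sets.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Trajectory-level scores with ground-truth success labels (toy values).
value = auroc([0.9, 0.2, 0.5, 0.1], [1, 1, 0, 0])
```

A value of 0.5 corresponds to a score that carries no information about success, and 1.0 to a score that separates successes from failures perfectly.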

Failure Attribution

The method shows strong step-level localization of critical model errors in long trajectories, matching or exceeding the accuracy of RL-trained, task-specific failure attribution models.
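One simple way to turn step scores into a failure location is to flag the step with the lowest aggregated advantage. This rule, the mean aggregation, and the name `locate_failure_step` are assumptions for illustration; the paper's attribution procedure may differ.

```python
from typing import Sequence


def locate_failure_step(step_token_adv: Sequence[Sequence[float]]) -> int:
    """Return the index of the step with the lowest mean token advantage.

    Each inner sequence holds the per-token progress advantages of one
    agent step. Ties are broken in favor of the earliest step.
    """
    if not step_token_adv:
        raise ValueError("trajectory has no steps")
    if any(len(tokens) == 0 for tokens in step_token_adv):
        raise ValueError("every step must contain at least one token")
    step_scores = [sum(tokens) / len(tokens) for tokens in step_token_adv]
    return min(range(len(step_scores)), key=step_scores.__getitem__)


# The second step (index 1) has the lowest mean advantage.
suspect = locate_failure_step([[0.2, 0.1], [-1.5, -0.5], [0.3]])
```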

Discriminative Power and Aggregation

Progress advantage consistently demonstrates better reliability and discriminative power than the log-probabilities of either the behavior or reference model alone. The empirical utility is sensitive to aggregation strategy: mean, min, or max selection at token/step level must be tailored per application. Top-k smoothing has nuanced effects and may be optimal for tasks requiring sharper step attribution.

Implications

Annotation-Free, Domain-Agnostic PRM

Progress advantage provides a zero-annotation, domain-agnostic, and training-free PRM replacement for LLM agents, enabling scalable process-level evaluation, test-time inference reranking, uncertainty estimation, and failure attribution. It also provides a template for robust runtime monitoring and post-deployment intervention without application-specific adaptation.

Theoretical Unification

The derivation unifies implicit process reward modeling across deterministic and stochastic MDPs, establishing the theoretical sufficiency of contrasting post-trained and reference policy distributions for extracting policy advantage.

Model Merging and Reference Selection

The efficacy of progress advantage hinges on the choice of reference policy. Advanced model merging and interpolation methods (e.g., TIES, WISE) can further sharpen the discriminative power of the progress signal, suggesting new research directions for scalable and robust agentic PRM construction.

Reusable Training Artifacts

This work advocates for open access to intermediate model checkpoints, so that community members can construct inference-time reward signals and monitoring instruments from standard RL training artifacts. The push for fully transparent model development pipelines aligns with broader sustainable and updatable machine learning trends.

Modality Generalization

Since progress advantage fundamentally relies on policy probabilities, it is likely extensible to multimodal agent architectures, including vision-language-action (VLA), embodied, and multi-agent systems.

Future Directions

Future research should:

  • Systematically explore reference policy selection, merging, and interpolation for sharper advantage signals,
  • Control for deviation from RL optimality in deployed checkpoints,
  • Scale to multimodal, embodied, and interactive agent scenarios,
  • Integrate progress advantage-driven signals into closed-loop self-improvement and automatic reward distillation pipelines.

Conclusion

Progress advantage presents a theoretically grounded and empirically robust implicit reward formulation for process-level evaluation of RL-trained LLM agents under stochastic, long-horizon interactions. It bypasses the prohibitive costs of manual annotation and task-specific reward modeling, while consistently improving sample selection, monitoring, and error diagnosis across multiple agentic tasks and architectures. This marks a significant advance in practical and scalable PRM construction for real-world LLM agents and lays the foundation for broader adoption of sustainable "free lunch" artifacts in scalable AI development.
