Agent-Centric Reward Models

Updated 18 April 2026

Agent-centric reward models are frameworks where multiple specialized evaluators dynamically shape reward signals to enhance robustness and interpretability in reinforcement learning.
They utilize process-oriented scoring and sophisticated aggregation techniques, such as weighted fusion and gated multi-objective methods, to provide localized, high-fidelity feedback.
These models are applied in RLHF, multi-tool reasoning, and embodied agent tasks, demonstrating significant improvements in sample efficiency and performance across benchmarks.

Agent-centric reward models refer to reward modeling frameworks in which the reward signal is constructed, shaped, or aggregated explicitly from the perspective of the agent—often leveraging multiple agentic evaluators, structured process feedback, or even reward function self-adaptation—to enhance RL pipeline robustness, generalization, and interpretability. This stands in marked contrast to classical scalar reward functions, which typically provide a single, opaque supervisory signal without interpretable decomposition or dynamic adaptation. Agent-centric models are deployed extensively in advanced RLHF, multi-tool reasoning, embodied multimodal agents, and automated task planning for LLM agents.

1. Multi-Agent and Agentic Reward Model Architectures

Contemporary agent-centric reward models decompose the reward modeling problem into multiple specialized evaluators or "agent" components, each tasked with scoring a particular dimension of the agent's output, such as factuality, safety, logical coherence, or robust exploration. In the CRM (Multi-Agent Collaborative Reward Model) framework (Yang et al., 20 Nov 2025), the reward system comprises:

Specialist Evaluator Agents:
- Data Analyzer: applies repetition penalties and stability checks
- Data Optimizer: promotes rollout diversity and exploration bonuses
- Quality Assessor: scores factuality, coherence, and step-wise correctness
- Data Synthesizer: injects counterfactual perturbations for robustness
Global Evaluators:
- Ranker-based rewards $R_{\text{ranker}}(o)$, learned via pairwise preference modeling
- Embedding-similarity rewards $R_{\text{sim}}(o)=\cos(\mathbf{h}_{\text{pred}},\mathbf{h}_{\text{ref}})$

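A centralized aggregator computes the final agent reward by a (potentially nonlinear) weighted fusion of all partial signals: $r_t = \mathcal{F}\left(\alpha\,r_{\text{ranker}}^t + \beta\,r_{\text{sim}}^t + \sum_{i=1}^K \lambda_i\,r_i^t\right)$, where $\alpha$, $\beta$, and $\lambda_i$ are the fusion weights, $r_i^t$ is the score from the $i$-th specialist evaluator, and $\mathcal{F}$ is an optional nonlinear scaling. Learned weights, together with the optional nonlinear scaling or agreement-enforcing penalties, encourage robustness and interpretability.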
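As a rough illustration of the fusion rule above, the following Python sketch combines a ranker score, an embedding-similarity score, and K specialist scores. The function names, the default weights, and the choice of `tanh` for $\mathcal{F}$ are assumptions made for illustration; the source does not specify them.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """R_sim: cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for degenerate embeddings
    return dot / (norm_u * norm_v)


def fuse_rewards(
    r_ranker: float,
    r_sim: float,
    specialist_scores: Sequence[float],
    lambdas: Sequence[float],
    alpha: float = 0.5,   # assumed weight, not from the source
    beta: float = 0.3,    # assumed weight, not from the source
    squash: Callable[[float], float] = math.tanh,  # assumed form of F
) -> float:
    """r_t = F(alpha * r_ranker + beta * r_sim + sum_i lambda_i * r_i)."""
    if len(specialist_scores) != len(lambdas):
        raise ValueError("need exactly one weight per specialist score")
    total = alpha * r_ranker + beta * r_sim
    total += sum(lam * r for lam, r in zip(lambdas, specialist_scores))
    return squash(total)
```

Passing `squash=lambda x: x` recovers the purely linear case, which makes it easy to inspect how much each evaluator contributes before any nonlinear scaling is applied.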

RLAR (Feng et al., 28 Feb 2026) extends agent-centricity to the meta-level: the reward system itself is orchestrated by LLM agents that dynamically select or synthesize new reward tools in response to distributional shifts, leading to a self-evolving system that can seamlessly interpolate between static and on-the-fly specialized scoring.

2. Process and Step-wise Reward Modeling

Agent-centric reward models frequently adopt step-wise or process-oriented scoring mechanisms that provide localized feedback at each decision point, rather than a single trajectory-level or final-outcome reward. Notable instantiations include:

AgentPRM (Process Reward Models) (Xi et al., 11 Nov 2025): Each state–action pair $(s_t, a_t)$ is scored by a learned function $M_\phi(s_t, a_t)$ approximating the action-value (promise) $Q_\pi(s_t, a_t)$ and per-step advantage (progress):

$L_{Q}(\phi) = \mathbb{E}\left[ \tfrac12 (M_\phi(s_t, a_t) - \widehat{Q}(s_t, a_t))^2 \right]$

$L_A(\phi) = \mathbb{E}\left[ \tfrac12 ((M_\phi(s_t, a_t) - M_\phi(s_{t-1}, a_{t-1})) - (\widehat{Q}(s_t, a_t) - \widehat{Q}(s_{t-1}, a_{t-1})))^2 \right]$

The reward model provides local, progress-sensitive guidance and supports policy optimization and test-time search.
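The two objectives can be made concrete with a small Python sketch that evaluates them on a single trajectory, replacing the expectation with an empirical mean over steps. The function names are hypothetical, and the target estimates $\widehat{Q}$ are assumed to be given, however they are obtained.

```python
from typing import Sequence


def q_loss(m: Sequence[float], q_hat: Sequence[float]) -> float:
    """Empirical L_Q: mean over steps of 0.5 * (M_t - Q_hat_t)^2."""
    if len(m) != len(q_hat) or not m:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(0.5 * (mi - qi) ** 2 for mi, qi in zip(m, q_hat)) / len(m)


def advantage_loss(m: Sequence[float], q_hat: Sequence[float]) -> float:
    """Empirical L_A: matches step-to-step changes of M to those of Q_hat."""
    if len(m) != len(q_hat) or len(m) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    terms = [
        0.5 * ((m[t] - m[t - 1]) - (q_hat[t] - q_hat[t - 1])) ** 2
        for t in range(1, len(m))
    ]
    return sum(terms) / len(terms)
```

The split is informative: a model whose scores are offset from $\widehat{Q}$ by a constant incurs a nonzero `q_loss` but zero `advantage_loss`, since the latter depends only on per-step progress.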

CUARewardBench (Lin et al., 21 Oct 2025): Explicitly separates outcome (ORM) and process (PRM) labels for step-level accuracy, with UPE (Unanimous Prompt Ensemble) providing high-precision step/trajectory labeling by ensemble voting across diverse VLMs and prompt templates.
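One way to read the unanimity idea is sketched below in Python: a label is accepted only when every judge in the ensemble agrees, and the ensemble abstains otherwise. This is a minimal sketch under that assumption; the function name and the abstention behaviour (`None`) are illustrative and not taken from the CUARewardBench paper.

```python
from typing import Optional, Sequence


def unanimous_vote(votes: Sequence[bool]) -> Optional[bool]:
    """Return the shared label if all judges agree, otherwise abstain (None).

    Each vote is one (model, prompt template) judgment on a step or
    trajectory: True for "correct", False for "incorrect".
    """
    if not votes:
        return None  # nothing to aggregate
    first = votes[0]
    return first if all(v == first for v in votes) else None
```

Requiring full agreement trades coverage for precision: fewer items receive a label, but the labels that are emitted have survived every judge in the ensemble.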

Step-level agent-centric reward models enable granular assessment of reasoning, encourage compositional skill acquisition, and avoid the sparsity/ambiguity problems of trajectory-only reward systems.

3. Aggregation and Multi-Objective Fusion

Agent-centric architectures must reconcile conflicting objectives (e.g., accuracy, safety, helpfulness). The norm is to aggregate signals by explicit weighted sums, nonlinear gating, or Pareto-inspired scalarizations:

Linear/Near-linear Aggregation:

E.g., CRM employs

$R_{\text{collab}}(o_t) = \alpha\,r_{\text{acc}}^t + \beta\,r_{\text{sim}}^t + \gamma\,r_{\text{fmt}}^t + \delta\,r_{\text{step}}^t - \eta\,r_{\text{rep}}^t$

Gated Multi-objective Fusion:

Argos (Tan et al., 3 Dec 2025) uses fixed thresholds and weighted aggregation across outcome accuracy, spatiotemporal grounding, and reasoning-trace quality:

$R_{\text{final}} = \begin{cases} R_{\text{acc}}, & R_{\text{acc}} < \tau \\ \frac{w_a R_{\text{acc}} + w_s R_{\text{spatial}} + w_r R_{\text{reasoning}}}{w_a + w_s + w_r}, & R_{\text{acc}} \geq \tau \end{cases}$
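The gating rule translates directly into code. The Python sketch below follows the piecewise definition; the function name and the default threshold and weights are placeholders chosen for illustration, not values reported by the Argos authors.

```python
def gated_reward(
    r_acc: float,
    r_spatial: float,
    r_reasoning: float,
    tau: float = 0.5,   # assumed threshold
    w_a: float = 1.0,   # assumed weights
    w_s: float = 1.0,
    w_r: float = 1.0,
) -> float:
    """Return R_acc alone below the gate, else the normalized weighted mean."""
    if r_acc < tau:
        # Outcome accuracy is too low: auxiliary signals are ignored.
        return r_acc
    total_weight = w_a + w_s + w_r
    if total_weight <= 0.0:
        raise ValueError("weights must sum to a positive value")
    weighted = w_a * r_acc + w_s * r_spatial + w_r * r_reasoning
    return weighted / total_weight
```

With the defaults, `gated_reward(0.2, 1.0, 1.0)` returns `0.2` because the gate is closed, while `gated_reward(1.0, 0.5, 0.0)` returns the mean `0.5`. The gate keeps strong grounding or reasoning scores from compensating for a wrong final answer.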

Pareto-optimality theory underlies the guarantee that such aggregation, when deployed in batch sampling, can recover high-performing, multi-objective-optimal solutions (Tan et al., 3 Dec 2025).

Dynamic Routing and Specialization:

RLAR and RewardAgent (Peng et al., 26 Feb 2025) include an agentic router that, for each query/context, selects among a pool of correctness agents (factuality, instruction following, etc.) and combines signals adaptively.
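The routing pattern can be sketched in Python as a function that asks a router which evaluators apply to a query and then combines only those scores. Everything in this sketch is a stand-in: in the systems described above the router and the evaluators are LLM-based, whereas here the router is a keyword heuristic, the agents are trivial string checks, and the combination is an unweighted mean.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]  # (query, response) -> score in [0, 1]


def keyword_router(query: str) -> List[str]:
    """Toy router: pick evaluators from surface cues in the query."""
    selected = []
    if "?" in query:
        selected.append("factuality")
    if "exactly" in query.lower():
        selected.append("instruction_following")
    return selected


# Toy evaluators, hard-coded for the example query used below.
TOY_AGENTS: Dict[str, Scorer] = {
    "factuality": lambda q, r: 1.0 if "Paris" in r else 0.0,
    "instruction_following": lambda q, r: 1.0 if len(r.split()) == 1 else 0.0,
}


def route_and_score(
    query: str,
    response: str,
    agents: Dict[str, Scorer],
    router: Callable[[str], List[str]],
) -> float:
    """Score a response with only the evaluators the router selects."""
    selected = [name for name in router(query) if name in agents]
    if not selected:
        return 0.0  # assumed fallback when no evaluator applies
    scores = [agents[name](query, response) for name in selected]
    return sum(scores) / len(scores)
```

For the query "What is the capital of France? Answer in exactly one word.", both toy evaluators are selected: the response "Paris" scores 1.0, while "Paris is the capital" scores 0.5 because it fails the one-word check.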

4. Meta-Learning, Self-Evolving, and Intrinsic Agent-Driven Rewards

Agent-centric reward modeling is not limited to multi-component scoring but can also encompass self-adaptive and meta-learned reward functions.

Agent-Driven Meta-Reward Shaping:

In EXPLORS (Devidze, 27 Mar 2025), the agent meta-learns a parameterized intrinsic reward jointly with the policy.

Inner-loop policy optimization and outer-loop reward parameter updates (via meta-gradients) maximize long-term task performance or exploration value, absent any external teacher.

Representational Empowerment (Internal Knowledge-centric Reward):

Instead of relying on an externally anchored reward, the agent maximizes the channel capacity between its own knowledge-modification operations and its future internal states.

The objective is to cultivate an internal representation library that is both diverse and controllable (Zhou et al., 29 Jul 2025).

RLAR’s Agentic Reward Synthesis:

RLAR (Feng et al., 28 Feb 2026) takes an explicitly agent-driven approach, where LLM agents synthesize or retrieve new verifier RMs to match evolving task distributions, outperforming static RMs on out-of-distribution tasks.

These paradigms generalize agent-centrism beyond fixed signal aggregation to continual, self-modifying, or introspective reward design.

5. Trajectory-Level and Robust Multi-Agent Reward Modeling

Agent-centric reward models extend naturally to the trajectory level and to multi-agent scenarios:

Plan-RewardBench & Trajectory-Level Preference Models:

Trajectory-level RMs operate on full agent plans, assigning a scalar score to each trajectory and learning from pairwise preferences between trajectories.
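A common way to turn scalar trajectory scores into a pairwise preference model is the Bradley-Terry formulation, in which the probability that one trajectory is preferred is the logistic function of the score difference. The Python sketch below shows that formulation; whether the reward models evaluated on Plan-RewardBench use exactly this objective is an assumption, and the function names are hypothetical.

```python
import math


def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """P(chosen preferred over rejected) = sigmoid(score difference)."""
    d = score_chosen - score_rejected
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)  # d < 0, so exp(d) cannot overflow
    return e / (1.0 + e)


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the observed preference.

    Computes log(1 + exp(-d)) in a numerically stable form,
    where d = score_chosen - score_rejected.
    """
    d = score_chosen - score_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))
```

The loss shrinks as the preferred trajectory's score exceeds the rejected one's and grows roughly linearly when the ordering is inverted, so training pushes the scalar scores apart in the direction of the labeled preference.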

Empirical results show that both discriminative and generative RMs face severe degradation on long-horizon, complex, tool-augmented trajectories, emphasizing the need for specialized training and agentic supervision (Wang et al., 9 Apr 2026).

Resilience and Mixed-Motive Multi-Agent Environments:

Cooperative resilience is formalized by tracking global well-being indicators across disruptions, with reward parameters (linear or neural) learned from behavioral ranking data.

This hybrid agent-centric reward enables improved robustness under disruption and reduces catastrophic failures (Chacon-Chamorro et al., 29 Jan 2026).

6. Integration, Empirical Findings, and Benchmarks

Empirical benchmarks demonstrate that agent-centric reward models confer marked gains in generalization, robustness, and sample efficiency:

CRM (4-agent): GSM8K accuracy of ~27.6%, RewardBench reasoning score ≈0.690, with observed improvements in reasoning metrics and training stability versus scalar baselines (Yang et al., 20 Nov 2025).
RLAR: 10–60% improvement across math, code, translation, and dialogue; best agent-based selection accuracy of 90.44% vs. 87.19% (static) on RewardBench-V2 (Feng et al., 28 Feb 2026).
Agent-RewardBench and CUARewardBench define explicit, granular evaluation protocols for agent-centric (step-wise, multi-dimensional) rewards in multimodal and computer-using agent settings (Men et al., 26 Jun 2025, Lin et al., 21 Oct 2025).
AgentRM demonstrates that explicit process reward modeling provides +8.8 points over the greedy policy, along with superior scaling properties (Xia et al., 25 Feb 2025).
Argos and Agent-RRM show that aligned agent-centric, multi-faceted rewards unlock state-of-the-art performance in multimodal reasoning, robotics, and reasoning-intensive agent tasks (Tan et al., 3 Dec 2025, Fan et al., 29 Jan 2026).

A representative list of frameworks and benchmarks:

Model or Benchmark	Key Features	Reference
CRM	Multi-agent, partial-signal aggregation, RLHF	(Yang et al., 20 Nov 2025)
RLAR	Dynamic tool synthesis, agentic retrieval	(Feng et al., 28 Feb 2026)
AgentPRM	Process, step-wise Q-value/advantage	(Xi et al., 11 Nov 2025)
AgentRM	Explicit/implicit RM, LLM-judge, beam search/Best-of-N	(Xia et al., 25 Feb 2025)
RewardAgent	Human pref. + factual/IF agents	(Peng et al., 26 Feb 2025)
CUARewardBench	Step/outcome labels, UPE on vision agents	(Lin et al., 21 Oct 2025)
Agent-RewardBench	Multimodal, step-pair, three-dimension eval	(Men et al., 26 Jun 2025)
Plan-RewardBench	Trajectory-level, planning, safety/robustness	(Wang et al., 9 Apr 2026)

7. Future Directions and Open Challenges

Open directions in agent-centric reward modeling include:

Enhancing aggregation schemes for more complex multi-objective trade-offs and better theoretical guarantees (Pareto-optimality, explicit fairness, coverage).
Enabling fully self-adaptive, introspective reward shaping without performance collapse or reward hacking, especially under sparse, delayed, or adversarial environments.
Extending step-level or agent-centric reward models to continuous domains, richer perceptual spaces, and large-scale multi-agent or human-in-the-loop systems.
Developing scalable, robust agentic verifiers and evaluators for safety-critical, open-ended, or knowledge-intensive tasks.
Benchmarking under novel, long-horizon, and tool-integrated scenarios to further expose the limitations and strengths of agent-centric designs.

The agent-centric paradigm is converging toward modular, interpretable, and extensible reward composition, where specialization, dynamic adaptation, and structured feedback are paramount for aligning complex autonomous agents with both human preferences and verifiable constraints (Yang et al., 20 Nov 2025, Feng et al., 28 Feb 2026, Lin et al., 21 Oct 2025, Xi et al., 11 Nov 2025).