Co-Reward: Joint Reward Optimization

Updated 4 July 2026

Co-Reward is a design paradigm that treats rewards as mutable variables, jointly optimized with evolving agents, morphologies, and evaluators.
It encompasses methods like morphology-aware optimization, collaborative reward modeling, and policy–reward co-evolution to adapt reward signals dynamically.
Practical applications include robot co-design, adaptive multi-agent structures, and reward sharing in game theory, yielding robust performance improvements.

Co-Reward denotes a family of reward formulations in which the reward mechanism is not treated as fixed ex ante, but is jointly constructed, adapted, or allocated together with another evolving object such as morphology, policy, evaluator ensemble, institutional state, or other agents’ payoffs. In the recent literature, this broad idea appears in robot co-design through morphology–reward joint optimization, in LLM post-training through policy–reward-model co-evolution, in collaborative reward modeling through multi-evaluator or dual-model filtering, and in game-theoretic settings through adaptive institutional incentives and reward sharing across agents (Fang et al., 30 May 2025, Shi et al., 17 May 2025, Yang et al., 20 Nov 2025, Hua et al., 2023, Kölle et al., 2023).

1. Conceptual scope

Across these works, Co-Reward can be read as an umbrella concept rather than a single formalism. The common move is to reject the assumption that reward is a stationary scalar target. Instead, reward is treated as a design variable, a learned evaluator, a shared resource, or a feedback-controlled institution.

Mode	Coupled object	Representative papers
Morphology-aware reward	Body and objective	(Fang et al., 30 May 2025, Huang et al., 2024)
Collaborative reward construction	Multiple evaluators or peer RMs	(Yang et al., 20 Nov 2025, Zhang et al., 15 May 2025)
Policy–reward co-evolution	Policy and internal/RM-based reward	(Wang et al., 3 Apr 2026, Shi et al., 17 May 2025, Hong et al., 7 Aug 2025, Liu et al., 26 Sep 2025, Guan et al., 15 Jan 2026, Tian et al., 19 Jun 2026, Altmann et al., 2023)
Social and institutional coupling	Population state or other agents’ payoffs	(He et al., 2020, Kölle et al., 2023, Szolnoki et al., 2010, Sasaki et al., 2013, Hua et al., 2023)

A recurring misconception is that Co-Reward is synonymous with ordinary reward shaping. The surveyed papers are narrower and more structured. In some cases, reward is executable code synthesized online rather than a fixed weighted template; in others, “collaboration” refers not to collaborating agents in the environment, but to collaborating evaluators or reward models; and in several social-dilemma models, co-reward means that payoff itself is redistributed or adaptively regulated rather than merely augmented by a heuristic bonus (Fang et al., 30 May 2025, Yang et al., 20 Nov 2025, Hua et al., 2023).

2. Reward as a design variable

A direct formulation of Co-Reward appears in robot co-design. RoboMoRe formalizes co-design as

$P=\langle \mathcal{M}, \Theta, \mathcal{R}, \mathcal{A}, F \rangle,$

where $\Theta$ is morphology space, $\mathcal{R}$ reward space, $\mathcal{A}(\theta,R)$ the learning procedure, and $F$ the fitness function. Standard co-design is written as

$\theta^* = \arg\max_{\theta \in \Theta} F(\pi_{\theta,R_0}),$

with a fixed reward $R_0$, whereas RoboMoRe replaces this with joint optimization over morphology and reward,

$\theta^*, R^* = \arg\max_{\theta \in \Theta, R \in \mathcal{R}} F(\pi_{\theta,R}), \qquad \pi_{\theta,R}:=\mathcal A(\theta,R).$

The practical consequence is that reward becomes morphology-aware: the paper’s motivating example reports that a short-leg ant prefers a rolling-oriented reward, whereas a long-leg ant prefers a jumping-oriented reward. Methodologically, RoboMoRe uses coarse joint exploration of masked MuJoCo morphology parameters and executable Python reward code, followed by alternating local edits of morphology and reward. It reports that, without task-specific prompting or predefined templates, the method outperforms human-designed and competing baselines across eight tasks (Fang et al., 30 May 2025).
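The coarse-then-alternating structure described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: a morphology is reduced to a single normalized leg length, a reward to a pair of weights `(w_roll, w_jump)`, and the hypothetical `train_and_evaluate` function collapses $\mathcal{A}$ and $F$ into a toy closed-form score. RoboMoRe itself trains RL policies under LLM-generated reward code, which this sketch does not attempt to reproduce.

```python
import itertools
import random


def train_and_evaluate(leg_length, reward_weights):
    """Toy surrogate for F(A(theta, R)); NOT the RoboMoRe training loop.

    Assumed landscape: short legs benefit from a rolling-oriented reward,
    long legs from a jumping-oriented one.
    """
    w_roll, w_jump = reward_weights
    roll_gain = w_roll * (1.0 - leg_length)
    jump_gain = w_jump * leg_length
    # Quadratic penalty keeps the toy objective bounded in the reward weights.
    return roll_gain + jump_gain - 0.5 * (w_roll + w_jump - 1.0) ** 2


def coarse_search(morphologies, rewards):
    """Coarse stage: score every (morphology, reward) pair and keep the best."""
    return max(itertools.product(morphologies, rewards),
               key=lambda pair: train_and_evaluate(*pair))


def alternating_refinement(theta, reward, rounds=20, step=0.05, seed=0):
    """Fine stage: alternate local edits of morphology and reward.

    An edit is kept only if it improves the score, so the returned score
    is never worse than the starting one.
    """
    rng = random.Random(seed)
    best = train_and_evaluate(theta, reward)
    for i in range(rounds):
        if i % 2 == 0:  # edit morphology, hold reward fixed
            cand_theta = min(1.0, max(0.0, theta + rng.uniform(-step, step)))
            cand_reward = reward
        else:  # edit reward, hold morphology fixed
            cand_theta = theta
            cand_reward = tuple(max(0.0, w + rng.uniform(-step, step))
                                for w in reward)
        score = train_and_evaluate(cand_theta, cand_reward)
        if score > best:
            theta, reward, best = cand_theta, cand_reward, score
    return theta, reward, best


if __name__ == "__main__":
    rewards = [(1.0, 0.0), (0.0, 1.0)]
    theta0, reward0 = coarse_search([0.2, 0.8], rewards)
    print(alternating_refinement(theta0, reward0))
```

In this toy landscape the preferred reward flips with leg length, which mirrors the short-leg versus long-leg ant example only qualitatively.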

A closely related argument appears in ROSKA, which rejects the idea that one must find a universally good reward for training from random initialization. Instead, it proposes reward-policy co-evolution in which the next reward need only improve the current learner. Reward candidates are generated as

$\mathbf{R}_\text{DP}^{m} = \text{LLM}\left(I_d, I_e, R_\text{best}^{m-1}, V(\theta)\right),$

and candidate policies are initialized by weight-space fusion of the previous best policy and a random initialization,

$\theta_{f}^m(\alpha) = \alpha \cdot \theta_\text{best}^{m-1} + (1-\alpha)\cdot \theta_0.$

Short-Cut Bayesian Optimization then searches over $\alpha$ using truncated training, and the selected reward–policy pair is propagated to the next round. On six Isaac Gym tasks, ROSKA reports an average normalized improvement relative to Eureka, with data usage expressed as a fraction of Eureka’s budget (Huang et al., 2024).
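The fusion rule $\theta_{f}^m(\alpha) = \alpha \cdot \theta_\text{best}^{m-1} + (1-\alpha)\cdot \theta_0$ is simple enough to sketch directly. In the Python below, parameters are flat lists of floats, and a plain grid search over $\alpha$ stands in for the Bayesian optimization step; the grid, the function names, and the `short_eval` callback (a cheap, truncated evaluation) are all assumptions for illustration rather than ROSKA's implementation.

```python
def fuse_weights(theta_best, theta_init, alpha):
    """Elementwise fusion: alpha * theta_best + (1 - alpha) * theta_init."""
    if len(theta_best) != len(theta_init):
        raise ValueError("parameter vectors must have the same length")
    return [alpha * b + (1.0 - alpha) * r
            for b, r in zip(theta_best, theta_init)]


def pick_alpha(theta_best, theta_init, short_eval,
               alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid-search stand-in for the BO step (an assumed simplification).

    Each candidate alpha is scored by running the cheap `short_eval`
    on the fused initialization; the best-scoring alpha is returned.
    """
    return max(alphas,
               key=lambda a: short_eval(fuse_weights(theta_best, theta_init, a)))


if __name__ == "__main__":
    theta_best = [1.0, 2.0]   # previous round's best policy (toy values)
    theta_init = [0.0, 0.0]   # fresh initialization (toy values)
    target = [0.5, 1.0]       # hypothetical optimum under the new reward

    def short_eval(weights):
        # Toy score: negative squared distance to the hypothetical optimum.
        return -sum((w - t) ** 2 for w, t in zip(weights, target))

    print(pick_alpha(theta_best, theta_init, short_eval))  # prints 0.5
```

Setting $\alpha = 1$ reuses the previous policy unchanged and $\alpha = 0$ restarts from scratch, so the search over $\alpha$ decides how much prior competence to carry into the next reward.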

Taken together, these papers instantiate Co-Reward as objective search conditioned on embodiment or current competence, rather than as a one-time specification of task reward.

3. Collaborative reward construction

The acronym CRM is overloaded in this literature. In one line of work, CRM denotes a Multi-Agent Collaborative Reward Model; in another, it denotes Collaborative Reward Modeling with dual reward models. The commonality is modular or peer-structured reward formation.

In the multi-evaluator CRM/MARM framework, a monolithic reward model is replaced by specialist evaluators and global evaluators. The paper describes four specialist agents (Data Optimizer, Quality Assessor, Data Synthesizer, and Data Analyzer) together with a ranker-based evaluator and an embedding-similarity evaluator. Their partial signals are aggregated into a collaborative reward, which is then fused into a single scalar at each timestep.

Although several component definitions and the fusion operator remain under-specified, the conceptual shift is clear: reward is decomposed into interpretable dimensions such as accuracy, similarity, format, reasoning-step organization, and repetition penalties, with the final scalar assembled from multiple evaluators rather than emitted by one opaque model (Yang et al., 20 Nov 2025).
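Because the fusion operator is under-specified, any concrete example has to assume one. The Python sketch below uses a normalized weighted sum, which is only one plausible choice, and two hypothetical evaluators (`format_score` and `repetition_score`) that stand in for the format and repetition dimensions mentioned above. None of these names or scoring rules come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Evaluator:
    name: str
    score: Callable[[str], float]  # maps a response to a score in [0, 1]
    weight: float


def format_score(response: str) -> float:
    """Hypothetical format check: 1.0 if the response ends with a period."""
    return 1.0 if response.strip().endswith(".") else 0.0


def repetition_score(response: str) -> float:
    """Hypothetical repetition penalty: fraction of distinct words."""
    words = response.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def collaborative_reward(
    response: str, evaluators: List[Evaluator]
) -> Tuple[float, Dict[str, float]]:
    """Fuse per-evaluator scores with a normalized weighted sum.

    The weighted sum is an assumed fusion operator, not the paper's.
    The per-dimension scores are returned alongside the scalar so the
    reward stays inspectable.
    """
    total_weight = sum(e.weight for e in evaluators)
    if total_weight <= 0:
        raise ValueError("evaluator weights must sum to a positive value")
    parts = {e.name: e.score(response) for e in evaluators}
    reward = sum(e.weight * parts[e.name] for e in evaluators) / total_weight
    return reward, parts


EVALUATORS = [
    Evaluator("format", format_score, weight=1.0),
    Evaluator("repetition", repetition_score, weight=1.0),
]

if __name__ == "__main__":
    print(collaborative_reward("go go", EVALUATORS))
```

Returning the per-dimension breakdown together with the scalar is the point of the design: the final reward can be traced back to the evaluator that raised or lowered it.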

A different collaborative construction appears in dual-RM denoising. “Two Minds Better Than One” maintains two reward models, each of which scores every preference pair in a batch. Each model selects the fraction of pairs it judges most likely to be clean, but that subset is used to update the peer rather than the model that selected it. This peer-review mechanism is coupled with curriculum learning over easy-to-hard preferences. Under an extreme level of synthetic noise, the method reports improved RewardBench scores. Here co-reward means that reward modeling itself is collaboratively filtered and denoised, not that environment agents share reward (Zhang et al., 15 May 2025).
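The cross-update at the heart of this peer-review mechanism can be sketched in a few lines of Python. The sketch assumes a small-loss criterion for "likely clean" (pairs on which a model's loss is lowest), which is a common choice in noisy-label learning but is an assumption here, as are the function names and the `keep_fraction` parameter. Curriculum ordering and the actual reward-model updates are omitted.

```python
def select_clean(losses, keep_fraction):
    """Indices of the keep_fraction of pairs with the smallest loss.

    Small loss is used as a proxy for 'likely clean' (an assumption).
    At least one index is always kept.
    """
    k = max(1, int(keep_fraction * len(losses)))
    ranked = sorted(range(len(losses)), key=losses.__getitem__)
    return sorted(ranked[:k])


def peer_review_step(losses_a, losses_b, keep_fraction):
    """Return (indices that update model A, indices that update model B).

    Each model trains on the subset chosen by its peer, so a model does
    not reinforce its own selection errors.
    """
    chosen_by_a = select_clean(losses_a, keep_fraction)
    chosen_by_b = select_clean(losses_b, keep_fraction)
    return chosen_by_b, chosen_by_a


if __name__ == "__main__":
    # Per-pair losses of the two reward models on one batch (toy values).
    losses_a = [0.1, 2.0, 0.3, 1.5]
    losses_b = [1.9, 0.2, 0.4, 0.1]
    update_a, update_b = peer_review_step(losses_a, losses_b, keep_fraction=0.5)
    print(update_a, update_b)  # prints [1, 3] [0, 2]
```

The toy losses are deliberately chosen so that the two models disagree; if both models ranked the batch identically, the cross-update would reduce to ordinary self-selection.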

These two strands establish an important distinction: Co-Reward can refer either to aggregation across multiple evaluators or to reciprocal supervision between reward models. In both cases, reward is explicitly modular rather than monolithic.

4. Policy–reward co-evolution

A large recent cluster of Co-Reward work couples policy improvement directly to reward adaptation. One variant uses self-generated internal reward. Self-Guide inserts a verbal self-guidance variable before every action and then converts the same signal into an internal reward. The reported schedule delays and later anneals this internal reward, because the shaping term is explicitly not potential-based. Across ALFWorld, ScienceWorld, and WebShop, jointly evolving policy and internal reward with GRPO yields an average improvement over GRPO trained only with environment reward (Wang et al., 3 Apr 2026). DIRECT is similar in spirit but uses a co-trained discriminator over beneficial historical trajectories and mixes its output with the environment reward, with a separate setting for sparse-reward tasks; the reward signal therefore co-trains with the policy rather than remaining fixed (Altmann et al., 2023). SPARK internalizes this idea within RLVR itself: the same model is trained from on-policy rollouts to answer, judge correctness, compare answers, and reflect, and the paper reports gains on reasoning, reward, and general benchmarks for SPARK-VL-7B (Liu et al., 26 Sep 2025).

A second variant retains an explicit reward model but updates it online. Mutual-Taught alternates an E-step that updates the policy by DPO under the current RM and an M-step that updates the RM from pseudo-preference pairs built from responses sampled from the post-update and pre-update policies. The paper reports the resulting length-controlled AlpacaEval-2 win rate for the 8B policy and an 8B RM that performs on par with GPT-4o-2024-08-06 on RewardBench (Shi et al., 17 May 2025). Cooper also alternates policy and RM updates, but anchors the RM with high-precision rule-based positives and reference-conditioned scoring, reporting a gain in average accuracy on Qwen2.5-1.5B-Instruct while alleviating reward hacking (Hong et al., 7 Aug 2025). EAPO extends GRPO for long-context reasoning by incorporating a group-relative evidence reward, produced by a reward model that is updated every 20 RL steps on outcome-consistent rollouts; its results are reported as an average across eight benchmarks (Guan et al., 15 Jan 2026).

ARCO makes the co-evolution step-level and interpretable. A same-scale model generates a local rubric for each reasoning step, scores the step against that rubric, and is trained so that the summed step rewards match the terminal outcome. This decomposition constraint provides step-level credit assignment without step labels, while the policy and the rubric-based reward model are jointly updated on on-policy data. Across HotpotQA, 2WikiMultiHopQA, and MuSiQue, ARCO improves the best EM in every setting over outcome-, rubric-, and process-reward baselines (Tian et al., 19 Jun 2026).
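The requirement that summed step rewards match the terminal outcome can be written as a penalty on their gap. The Python sketch below assumes a squared-error form and treats the step rewards as free variables, both of which are simplifications for illustration: in ARCO the step rewards come from a learned model, and the exact loss may differ.

```python
def decomposition_loss(step_rewards, outcome):
    """Squared gap between the summed step rewards and the terminal outcome.

    The squared-error form is an assumption, not ARCO's stated loss.
    """
    return (sum(step_rewards) - outcome) ** 2


def gradient_step(step_rewards, outcome, lr=0.1):
    """One gradient-descent step on the step rewards themselves.

    For every step t, d/dr_t (sum(r) - y)^2 = 2 * (sum(r) - y), so this
    constraint alone shifts all steps by the same amount.
    """
    residual = sum(step_rewards) - outcome
    return [r - lr * 2.0 * residual for r in step_rewards]


if __name__ == "__main__":
    rewards = [0.2, 0.1, 0.1]   # toy per-step rewards for one trajectory
    outcome = 1.0               # terminal outcome, e.g. a correct final answer
    before = decomposition_loss(rewards, outcome)
    after = decomposition_loss(gradient_step(rewards, outcome), outcome)
    print(before > after)
```

In this toy the constraint spreads the residual uniformly, so it only fixes the total; which step receives more credit has to come from the per-step scoring, which is the role the rubric plays in the description above.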

In this line of work, Co-Reward no longer means merely “better reward shaping.” It means that reward generation, reward evaluation, and policy behavior are updated in a coupled loop.

5. Social and institutional coupling

In multi-agent reinforcement learning, Co-Reward often takes the form of explicit payoff coupling. The Organization domain models agent $i$’s return as the sum of a group reward, an individual reward, and a history-dependent bonus. The resulting objective is neither fully shared nor fully private. In the two-agent case, the optimal policy prescribes different behavior in weak organizational states than in stronger ones, showing that co-reward can encode mixed cooperative-competitive incentives rather than pure team reward (He et al., 2020).

A more structural coupling appears in reward-share trading. Each agent initially owns all of its own reward stream, then may buy fractions of others’ rewards. In the two-agent formulation, after the agents exchange shares, each realized payoff becomes a linear mixture of retained own reward and acquired foreign reward. The paper studies this in the iterated Prisoner’s Dilemma and Cleanup, and also uses the fully symmetric special case, in which rewards are split equally, as an equal-distribution baseline. Empirically, trading reward shares produces mutual cooperation in IPD and role specialization in Cleanup, because the benefit one agent creates for another is no longer fully externalized (Kölle et al., 2023).
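A small worked example shows why share trading changes incentives. The Python sketch below uses hypothetical notation (`sold_by_1` is the fraction of agent 1's reward stream now held by agent 2, and symmetrically for `sold_by_2`) and the conventional Prisoner's Dilemma payoffs 5, 3, 1, 0, which are assumed here rather than taken from the paper.

```python
# One-shot Prisoner's Dilemma payoffs (assumed textbook values):
# keys are (action of agent 1, action of agent 2).
PD = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def mixed_payoffs(r1, r2, sold_by_1, sold_by_2):
    """Realized payoffs after share trading.

    Each agent keeps the unsold part of its own reward and receives the
    part of the other agent's reward that it acquired. Total reward is
    conserved: p1 + p2 == r1 + r2.
    """
    p1 = (1 - sold_by_1) * r1 + sold_by_2 * r2
    p2 = (1 - sold_by_2) * r2 + sold_by_1 * r1
    return p1, p2


def best_response_of_agent_1(action_2, sold_by_1, sold_by_2):
    """Action of agent 1 that maximizes its realized payoff."""
    return max("CD", key=lambda a: mixed_payoffs(
        *PD[(a, action_2)], sold_by_1, sold_by_2)[0])


if __name__ == "__main__":
    for shares in (0.0, 0.5):
        responses = [best_response_of_agent_1(a2, shares, shares) for a2 in "CD"]
        print(shares, responses)
    # prints:
    # 0.0 ['D', 'D']
    # 0.5 ['C', 'C']
```

With no shares traded, defection is the best response to either action. Under the equal split, defecting against a cooperator yields 2.5 instead of 3, so cooperation becomes the best response in both cases.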

Public-goods models provide several institution-level forms of Co-Reward. In the spatial public goods game with rewarding cooperators, players who reward cooperation at a personal cost can promote cooperation when synergy is weak, but pure cooperators become second-order free-riders, and moderate rewards may outperform high rewards because of cyclic dominance among the strategies (Szolnoki et al., 2010). In threshold public goods games with voluntary reward funds, rewarders contribute to a pool that is redistributed only among contributors, so reward can bootstrap a population out of the defection basin and later disappear while cooperation remains (Sasaki et al., 2013). At the institutional scale, adaptive reward intensity can itself be made a state variable that evolves together with the population state. This coupled system yields stable interior coexistence, and full cooperation at the minimum reward level is stable under a parameter condition given in the paper, so long-run full cooperation requires only the minimum reward level rather than permanently maximal subsidy (Hua et al., 2023).

These social and multi-agent papers make clear that Co-Reward can mean shared reward, traded reward claims, or institutional feedback control. The unifying feature is that reward depends on collective state or on other agents’ payoffs rather than remaining strictly private and static.

6. Limitations, failure modes, and open questions

Despite broad empirical gains, the current Co-Reward literature remains methodologically uneven. Several prominent systems are only partly formalized. RoboMoRe’s “optimization momentum” terms are explicitly heuristic rather than analytic derivatives; CRM/MARM leaves the central fusion operator and several evaluator terms only conceptually specified; EAPO does not fully specify the exact normalization of evidence scores or the precise outcome-consistency filter; and ARCO itself notes that richer rubric-consistency losses could improve decomposition quality (Fang et al., 30 May 2025, Yang et al., 20 Nov 2025, Guan et al., 15 Jan 2026, Tian et al., 19 Jun 2026).

Reward hacking is also not eliminated, only reconfigured. Cooper shows the clearest pathology: a fixed VerifyRM can drive accuracy down while training reward continues to rise, motivating dynamic RM updates (Hong et al., 7 Aug 2025). Self-Guide states directly that its internal reward is not potential-based and therefore anneals it late to restore alignment with environment reward (Wang et al., 3 Apr 2026). DIRECT depends on the quality and diversity of its beneficial-trajectory buffer, so a narrow or stagnant buffer can misguide the learned reward (Altmann et al., 2023). Collaborative Reward Modeling still requires a prior estimate of the noise rate to set its selection fraction, and peer review can only help insofar as the two reward models make sufficiently different mistakes (Zhang et al., 15 May 2025).

Computational cost remains substantial. RoboMoRe’s coarse stage evaluates many morphology–reward pairs per task, each initially trained for a fixed number of steps, and one such policy takes about 15 minutes on 24 CPU cores plus one RTX 4080 Super (Fang et al., 30 May 2025). ROSKA adds repeated RL runs, dynamic LLM reward generation, and BO-based policy search (Huang et al., 2024). ARCO incurs per-step rubric generation and reward-model training overhead (Tian et al., 19 Jun 2026). In mixed-motive MARL, the I-POMDP-based IA2C results are reported only up to four agents, and the paper itself identifies scaling beyond four as future work (He et al., 2020).

The surveyed literature therefore suggests that Co-Reward is less a single theory than a design pattern: reward is treated as mutable, compositional, or state-dependent, and learning improves when that mutability is coupled to embodiment, policy distribution, evaluator structure, or collective state. What remains open is how to retain those advantages while recovering formal guarantees, scalable update rules, and clearer protections against new forms of reward exploitation.