SPARK: Synergistic Policy and Reward Co-Evolution

Updated 29 September 2025
  • The paper introduces SPARK, a framework that jointly evolves policy and reward models using recycled rollout supervision for enhanced learning efficiency.
  • It employs on-policy rollout recycling and auxiliary objectives for dual optimization of generative and evaluative functions in diverse agent architectures.
  • SPARK demonstrates superior sample efficiency and performance in multi-agent, multi-task, and robotic applications compared to traditional reinforcement learning methods.

The Synergistic Policy And Reward Co-Evolving Framework (SPARK) defines a class of reinforcement learning methodologies in which both the policy and the reward model are concurrently adapted and improved through mutually reinforcing feedback loops. Unlike conventional RL approaches, which rely either on fixed, externally provided reward functions or on separately trained reward models, SPARK unifies the learning of the generative (policy) and evaluative (reward) aspects within the same model or interlinked modules, recycling rich supervision signals that would otherwise be discarded and facilitating both efficient training and robust generalization. This paradigm can be instantiated across a wide variety of agent architectures, ranging from LLMs and large vision–language models (LVLMs) in post-pretraining optimization (Liu et al., 26 Sep 2025), to multi-agent and multi-task systems, and robotic skill acquisition frameworks (Huang et al., 18 Dec 2024, Iqbal et al., 2019, Fang et al., 30 May 2025).

1. Core Principles and Motivation

SPARK is characterized by the tight coupling of policy evolution and reward modeling under an overarching goal: maximizing learning efficiency and robustness while minimizing wasted supervision and reliance on costly externally annotated data. Traditional RL paradigms, such as RL from human feedback (RLHF) and RL with verifiable rewards (RLVR), rely on separate reward models, hand-crafted rewards, or costly preference annotations. In contrast, SPARK recycles all generated rollouts and correctness signals, using them to inform not only the policy gradient but also the auxiliary optimization of the model's own evaluative (reward) function (Liu et al., 26 Sep 2025).

This tightly coupled learning protocol results in a positive feedback loop: improved reward accuracy yields higher-quality policy gradients; better policies generate higher-quality rollouts; richer rollouts further refine the reward function, and so on. SPARK generalizes beyond single-agent contexts to support multi-agent coordinated exploration through intrinsic rewards (Iqbal et al., 2019) and multi-task learning with centralized knowledge sharing (Ma et al., 20 Aug 2024).

2. Methodological Structure

SPARK’s algorithmic workflow is distinguished by its systematic recycling of rollout and reward data for dual purposes.

Rollout Recycling and Auxiliary Objectives

  1. On-Policy Rollout Generation: The model generates candidate outputs in response to task queries.
  2. Reward Calculation: Each candidate is assigned a verifiable reward (e.g., based on correctness or adherence to reference).
  3. Dual Use of Rollouts (illustrated in the sketch after this list):
    • Policy Update: Advantages computed per rollout (normalized, e.g., $A_i = \frac{r_i - \bar{r}}{s + \epsilon}$ with $s$ the standard deviation of the group's rewards) drive policy gradient optimization, optionally regularized via KL divergence to a reference policy.
    • Reward Model Training: The same rollouts (with their correctness) serve as supervision for generative reward modeling via multiple auxiliary objectives:
      • Pointwise: Learning to assign correct labels to individual outputs.
      • Pairwise: Learning comparative judgment between multiple candidates.
      • Reflection/Self-Correction: Generating improved output via model-internal evaluation and iterative refinement.
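
The following Python sketch illustrates this dual use of rollouts under a generic GRPO-style setup. It assumes a binary exact-match verifier and hypothetical methods (`generate`, `update_policy`, `update_reward_head`) invented for illustration; it is a minimal sketch, not the authors' implementation.

```python
import numpy as np

def spark_iteration(model, queries, reference_answers, eps=1e-6):
    """One illustrative SPARK-style step: the same rollouts feed both the
    policy update and the model's own reward (judge) training."""
    policy_batch, reward_batch = [], []

    for q, a in zip(queries, reference_answers):
        # 1. On-policy rollout generation (a group of candidates per query).
        rollouts = model.generate(q, num_samples=8)            # hypothetical API

        # 2. Verifiable reward, e.g., exact match against the reference answer.
        rewards = np.array([float(o.strip() == a.strip()) for o in rollouts])

        # 3a. Policy update: group-normalized advantages A_i = (r_i - mean) / (std + eps).
        advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
        policy_batch.append((q, rollouts, advantages))

        # 3b. Reward-model supervision recycled from the same rollouts:
        #     pointwise correctness labels and pairwise (correct vs. incorrect) pairs.
        reward_batch += [("pointwise", q, o, r) for o, r in zip(rollouts, rewards)]
        reward_batch += [("pairwise", q, win, lose)
                         for win, rw in zip(rollouts, rewards) if rw == 1.0
                         for lose, rl in zip(rollouts, rewards) if rl == 0.0]

    model.update_policy(policy_batch)        # KL-regularized policy-gradient step
    model.update_reward_head(reward_batch)   # pointwise / pairwise / reflection objectives
```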

Co-Evolving Feedback Loop

This data recycling enables SPARK to connect policy and reward improvement: advances in reward judgment translate to sharper policy updates, producing further candidate improvement—a self-reinforcing learning circuit (Liu et al., 26 Sep 2025).

Unified Model Architecture

SPARK may instantiate both the policy and reward evaluator within a single model, eliminating the need for a separate reward module. This approach is used in both LLM–based reasoning systems (Liu et al., 26 Sep 2025) and co-design frameworks blending morphology and reward shaping in robotics (Fang et al., 30 May 2025).
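
As a rough illustration of this single-model design, the same network can be prompted in two modes, once as a generator and once as a judge. The prompt templates and the `chat` method below are assumptions made for the sketch, not an interface specified by the cited works.

```python
def generate_answer(model, question):
    """Policy mode: the model produces a candidate solution."""
    return model.chat(f"Solve the problem step by step.\n\nProblem: {question}")

def judge_answer(model, question, candidate):
    """Reward mode: the same parameters score a candidate solution."""
    verdict = model.chat(
        "You are a strict grader. Reply with exactly one word: correct or incorrect.\n\n"
        f"Problem: {question}\nCandidate solution: {candidate}"
    )
    return 1.0 if verdict.strip().lower().startswith("correct") else 0.0
```

Because the generator and judge share parameters, the auxiliary reward objectives described above directly strengthen the judgment used for self-reflection at test time.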

3. Domain Extensions and Variants

SPARK’s conceptual apparatus generalizes to systems ranging from single-agent LLMs to multi-agent and robotic policy optimization.

Multi-Agent Synergy

Papers such as (Iqbal et al., 2019, Chitnis et al., 2020) extend SPARK to multi-agent RL via intrinsic rewards shaped by shared or cross-agent exploration statistics. Agents coordinate to maximize joint exploration utility, implementing hierarchical or meta-control structures that dynamically select exploration modalities based on evolving task demands.
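
One way to make this concrete is a count-based novelty bonus discounted by teammates' visitation, so that agents divide the exploration labor. The formula below is an illustrative simplification, not the exact intrinsic reward used in the cited papers.

```python
from collections import Counter

class CoordinatedNoveltyBonus:
    """Illustrative cross-agent intrinsic reward favoring states no agent has covered."""

    def __init__(self, num_agents, beta=0.1):
        self.beta = beta
        self.visit_counts = [Counter() for _ in range(num_agents)]

    def intrinsic_reward(self, agent_id, state_key):
        self.visit_counts[agent_id][state_key] += 1
        own = self.visit_counts[agent_id][state_key]
        others = sum(c[state_key] for i, c in enumerate(self.visit_counts) if i != agent_id)
        # The bonus decays with the agent's own visits and is suppressed when
        # teammates have already explored the state, encouraging division of labor.
        return self.beta / (own ** 0.5) / (1.0 + others)
```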

Multi-Task Transfer and Knowledge Sharing

Frameworks incorporating a centralized reward agent (CRA) (Ma et al., 20 Aug 2024) extend SPARK by applying knowledge distillation and reward shaping across multiple tasks, using auxiliary agents to distribute dense, informative rewards that facilitate transfer and convergence in sparse-reward settings.
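
As a hedged sketch of how a centralized module might densify sparse task rewards, the snippet below uses potential-based shaping driven by a value estimator shared across tasks; it illustrates the general idea rather than the specific mechanism of the cited CRA framework.

```python
class CentralizedRewardAgent:
    """Illustrative central agent that augments sparse task rewards with a dense
    shaping signal derived from knowledge shared across tasks (hypothetical)."""

    def __init__(self, shared_value_estimator, weight=0.1, gamma=0.99):
        self.value = shared_value_estimator   # e.g., a network distilled from all tasks
        self.weight = weight
        self.gamma = gamma

    def shaped_reward(self, task_id, state, next_state, env_reward):
        # Potential-based shaping leaves each task's optimal policy unchanged
        # while providing denser feedback in sparse-reward settings.
        phi = self.value(task_id, state)
        phi_next = self.value(task_id, next_state)
        return env_reward + self.weight * (self.gamma * phi_next - phi)
```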

Robotic Co-Design

SPARK underpins concurrent optimization of policy, reward, and robot morphology (Fang et al., 30 May 2025). Here, LLM-driven diversity reflection and alternating refinement stages jointly evolve structure and reward, enabling robots to discover motion behaviors suited to their unique morphologies.
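
A high-level sketch of such an alternating loop is given below; the `llm` and `simulator` interfaces are placeholders invented for illustration and do not correspond to the cited paper's actual code.

```python
def co_design_loop(llm, simulator, num_rounds=5, population_size=8):
    """Illustrative alternating refinement of morphology, reward, and policy."""
    # The LLM proposes an initial, deliberately diverse population of designs.
    population = [llm.propose_design(diversity_hint=i) for i in range(population_size)]

    for _ in range(num_rounds):
        results = []
        for design in population:
            # Train a control policy for this (morphology, reward) pair in simulation.
            policy = simulator.train_policy(design.morphology, design.reward_fn)
            fitness = simulator.evaluate(policy, metric="normalized_distance")
            results.append((design, fitness))
        # Reflection: the LLM inspects fitness scores and refines both the
        # reward functions and the morphologies for the next round.
        population = llm.refine_designs(results)
    return max(results, key=lambda item: item[1])
```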

Preference Learning and Resistance to Reward Hacking

SPARK's co-evolution of policy and reward models mitigates distribution shift and reward hacking in LLMs (Shi et al., 17 May 2025, Hong et al., 7 Aug 2025): the evaluative model is updated dynamically to track improvements in policy outputs, ensuring better alignment with human preferences or rule-based correctness.

4. Empirical Performance and Scalability

Across benchmarks covering mathematical reasoning, multimodal understanding, and robotics, SPARK consistently demonstrates improved performance and sample efficiency.

  • In (Liu et al., 26 Sep 2025), SPARK-VL-7B achieves a 9.7% average gain on math reasoning tasks, 12.1% improvement on reward model benchmarks, and 1.5% on general multimodal tasks.
  • Robotic skill acquisition frameworks (Huang et al., 18 Dec 2024) report an average normalized improvement of 95.3% over baselines, using only 89% of the data compared to alternative co-evolution strategies.
  • Highway driving reward evolution (Han et al., 15 Jun 2024) via LLMs yields a 22% higher average success rate on simulated safety tasks compared to handcrafted rewards.
  • In reinforcement learning with LLMs, dynamic co-optimization resists reward hacking and achieves accuracy improvements (e.g., 0.54% gain in (Hong et al., 7 Aug 2025)).

Scalability is supported by the elimination of separate reward modules and external annotations; on-policy, recycled supervision enables efficient training and self-reflective test-time scaling without additional cost.

5. Mathematical Formulation

SPARK formalizes the dual optimization objective as follows. For model parameters $\theta$ and a reference policy $\pi_\text{ref}$, the policy component maximizes

$$\mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)} \left[ R(q, o) \right] - \lambda \, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid q) \,\|\, \pi_\text{ref}(\cdot \mid q) \right),$$

where $R(q, o)$ is a verifiable (possibly reference-based) reward, often $R(q, o) = \mathbb{I}\{o = a\}$ for ground truth $a$.

Auxiliary reward objectives include the following (a sketch of the pointwise and pairwise losses follows the list):

  • Pointwise: binary classification or regression of reward labels,
  • Pairwise: Bradley-Terry or contrastive objectives,
  • Self-reflective update: iterative output refinement based on internal model judgment.
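
For concreteness, a minimal PyTorch sketch of the pointwise and pairwise auxiliary losses is shown below, assuming the reward head emits a scalar logit per rollout; this is a generic formulation rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def pointwise_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Pointwise objective: binary classification of each rollout's correctness."""
    return F.binary_cross_entropy_with_logits(scores, labels)

def bradley_terry_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise objective: maximize the log-probability that a correct rollout
    outscores an incorrect one (Bradley-Terry model)."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```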

In multi-agent and co-design extensions, the joint objective becomes $\theta^*, r^* = \underset{\theta \in \Theta,\, r \in \mathcal{R}}{\arg\max} \, F(\pi_{\theta, r})$, as in (Fang et al., 30 May 2025), with $F$ a domain-specific fitness function (e.g., normalized locomotion distance).

6. Comparative Analysis and Practical Considerations

SPARK contrasts with RLHF and RLVR by eliminating the dependency on human preference data and external reward models, thereby reducing computational and annotation costs. Co-evolving reward and policy components foster resilience against reward hacking and distribution shift, as demonstrated in LLM preference optimization (Shi et al., 17 May 2025).

In multi-agent domains, SPARK’s use of intrinsic rewards coordinated across agents addresses redundancy and facilitates division of exploration labor (Iqbal et al., 2019, Chitnis et al., 2020). Centralized reward agents enable knowledge transfer in multi-task RL (Ma et al., 20 Aug 2024).

Robotic instantiations of SPARK harness LLMs for automated reward and morphology design, yielding morphologically diverse, task-optimal systems without manual templates (Fang et al., 30 May 2025, Huang et al., 18 Dec 2024).

The unification of policy and reward training, test-time self-reflection, and recycling of all supervision signals represents a theoretical and practical advance over prior RL frameworks, supporting robust, scalable, and generalizable learning across domains.

7. Applications and Future Directions

SPARK is applicable to safety-critical LLM reasoning, visual question answering, multi-agent coordination, autonomous driving, robot skill learning, and multi-task RL. By enabling on-the-fly reward adaptation and coordinated policy improvement, SPARK provides a foundation for agents that continuously self-improve and autonomously align with evolving task objectives.

Future research within the SPARK paradigm includes:

  • Extension to action sequence and temporal reward shaping in multi-agent synergy (Chitnis et al., 2020).
  • Scaling to larger agent populations and real-world robotic systems.
  • Integration with institutional and population-level adaptive reward mechanisms for stable social cooperation (Hua et al., 2023).
  • Exploration of reflexive and self-repairing evaluative processes in large-scale generative modeling and dialog agents.

SPARK’s principles are increasingly reflected in state-of-the-art frameworks for policy learning, reward modeling, and intelligent system design. Its capacity for unified, synergistic evolution of policy and reward mechanisms defines a general strategy for efficient, robust, and autonomous artificial intelligence development.
