Differentiable Evolutionary Reinforcement Learning (2512.13399v1)

Published 15 Dec 2025 in cs.AI and cs.CL

Abstract: The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bi-level framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike previous evolutionary approaches, DERL is differentiable in its meta-optimization: it treats the inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the "meta-gradient" of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agents (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8K, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods relying on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.

Summary

  • The paper redefines reward discovery as a bi-level differentiable optimization, using meta-gradients to directly link reward modifications with agent performance improvements.
  • The methodology composes atomic reward primitives into symbolic reward functions, enabling robust and interpretable policy training across diverse and challenging domains.
  • Empirical results show DERL doubles out-of-distribution performance in robotics and surpasses traditional heuristics in mathematical reasoning tasks.

Differentiable Evolutionary Reinforcement Learning: A Bi-level Framework for Automated Reward Discovery

Problem Setting and Motivation

Reward function design remains a major obstacle in reinforcement learning, fundamentally impacting agent alignment and generalization, especially in complex tasks that require compositional reasoning. Traditional hand-crafted reward engineering suffers from brittleness and reward hacking, while human-in-the-loop paradigms such as RLHF incur substantial annotation costs and impede scalability. Standard evolutionary approaches, though automating some aspects of configuration search, operate in a derivative-free, black-box regime and are unable to capture the causal relationships between reward structures and downstream agent performance. Consequently, these methods are sample-inefficient and fail to leverage the optimization landscape's structure, constraining their applicability in settings with long horizons or compositional objectives.

DERL Framework

Differentiable Evolutionary Reinforcement Learning (DERL) addresses these limitations by recasting reward discovery as a bi-level, differentiable optimization problem:

  • Inner Loop: The policy model is trained under a parameterized reward function, leveraging atomic reward primitives (e.g., partial correctness, format compliance, process heuristics) evaluated over the agent's trajectory.
  • Outer Loop: A Meta-Optimizer, implemented as a trainable policy (e.g., LLM or lightweight model), generates symbolic reward definitions by composing atomic primitives. The Meta-Optimizer is updated using policy gradients driven by the inner-loop policy's validation performance.

Crucially, this bi-level arrangement enables the outer loop to learn a "meta-gradient", directly tying modifications in the reward parameterization to improvements in agent performance. Rather than relying purely on outcome signals or heuristic aggregation of reward signals, DERL autonomously discovers dense, actionable feedback signals, facilitating the emergence of reward structures aligned with true task objectives without human intervention.
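
The loop below is a minimal sketch of this bi-level arrangement, written in Python. The names `meta_optimizer.propose`, `train_policy`, and `evaluate` are assumed helpers introduced for illustration, not the authors' API; the sketch only shows how held-out validation performance becomes the training signal for the Meta-Optimizer.

```python
# Minimal sketch of DERL's bi-level loop (illustrative; helper functions are assumed).

def derl(meta_optimizer, primitives, train_policy, evaluate,
         n_generations, group_size):
    for _ in range(n_generations):
        # Outer loop: the Meta-Optimizer proposes a group of symbolic reward
        # functions composed from the atomic primitives.
        reward_fns = [meta_optimizer.propose(primitives) for _ in range(group_size)]

        # Inner loop: each candidate reward trains a policy, which is then
        # scored on a held-out validation set.
        val_scores = [evaluate(train_policy(fn)) for fn in reward_fns]

        # The validation scores act as the Meta-Optimizer's own reward, so a
        # policy-gradient update approximates the "meta-gradient" of task
        # success with respect to the reward parameterization.
        meta_optimizer.update(reward_fns, val_scores)
```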

Reward Parameterization

Reward functions are constructed as symbolic compositions over a set of atomic primitives G = {g_1, g_2, ..., g_k}, each providing interpretable and task-relevant feedback. The Meta-Optimizer predicts both the structure and weighting over these primitives, resulting in composite rewards that can express linear combinations, normalizations, logical constraints, and other mathematical relationships. This compositional approach expands the reward search space substantially beyond outcome-based or averaged heuristics, while retaining tractability and interpretability.
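
As a concrete illustration, the snippet below composes two hypothetical primitives into one candidate reward. The primitive names, trajectory fields, and weights are assumptions made for exposition and are not taken from the paper's grammar.

```python
# Illustrative atomic primitives over an agent trajectory (field names assumed).

def partial_correctness(trajectory) -> float:
    # Fraction of steps that satisfied a sub-goal.
    return sum(step["goal_met"] for step in trajectory) / max(len(trajectory), 1)

def format_compliance(trajectory) -> float:
    # 1.0 if the final output follows the required format, else 0.0.
    return 1.0 if trajectory[-1]["well_formatted"] else 0.0

def composite_reward(trajectory, w1=0.7, w2=0.3) -> float:
    # One composition the Meta-Optimizer could emit: a bounded linear combination.
    return min(1.0, w1 * partial_correctness(trajectory) + w2 * format_compliance(trajectory))
```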

Optimization Procedure

  • Inner Loop Training: Implements Group Relative Policy Optimization (GRPO), whereby policy updates are guided by group-wise advantage estimates computed from the current parameterized reward.
  • Outer Loop Evolution: Operates through batch rollouts of reward configurations, evaluating each by training an associated policy agent and scoring on a held-out validation set. The Meta-Optimizer is then updated using group-wise advantages with respect to the achieved validation scores.
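
Both loops rely on group-wise advantages. The function below is a sketch of the standard group-relative normalization used by GRPO; in the inner loop the scores are per-rollout rewards, in the outer loop they are per-configuration validation scores. Exact estimator details (clipping, KL terms) are omitted and may differ from the paper.

```python
import numpy as np

def group_relative_advantages(scores: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Standardize each member's score against its group: members above the
    # group mean receive positive advantage, members below receive negative.
    return (scores - scores.mean()) / (scores.std() + eps)
```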

Initialization strategies include standard reinitialization of the policy model per iteration or a population-based variant (DERL-pop.) where the best-performing model from the prior generation seeds the next, enabling curriculum-like adaptation of rewards as policies mature.
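
A sketch of the population-based variant, reusing the assumed helpers from the earlier loop and additionally assuming that `train_policy` accepts an `init` argument for warm-starting; the only conceptual change is that the best inner-loop policy of each generation seeds the next.

```python
def derl_pop(meta_optimizer, primitives, base_policy,
             train_policy, evaluate, n_generations, group_size):
    # DERL-pop.: seed each generation's inner loop with the best policy from
    # the previous generation instead of reinitializing the base model.
    best_policy = base_policy
    for _ in range(n_generations):
        reward_fns = [meta_optimizer.propose(primitives) for _ in range(group_size)]
        scored = []
        for fn in reward_fns:
            policy = train_policy(fn, init=best_policy)   # warm-start (assumed kwarg)
            scored.append((evaluate(policy), policy))
        meta_optimizer.update(reward_fns, [s for s, _ in scored])
        best_policy = max(scored, key=lambda x: x[0])[1]  # carry forward the best
    return best_policy
```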

Empirical Evaluation

DERL is validated across three distinct and demanding domains:

  • Robotic Agents (ALFWorld): Embodied task completion under language and visual instructions, with stringent out-of-distribution (OOD) generalization requirements.
  • Scientific Simulation (ScienceWorld): Agents must reason about interactive scientific environments, again under both in-distribution and OOD splits.
  • Mathematical Reasoning (GSM8K, MATH): LLMs solve math problems where dense and reliable reward feedback is vital, yet standard binary outcome signals are sparse.

On all benchmarks, DERL and its population-based variant (DERL-pop.) achieve state-of-the-art success rates, including under OOD conditions where standard heuristics collapse. On ALFWorld and ScienceWorld, DERL doubles OOD performance relative to outcome-reward baselines (e.g., 65.0% vs. 29.7% on ALFWorld-L2). For mathematical reasoning, DERL consistently exceeds outcome and heuristic rewards (60.2% vs. 58.8% on MATH), successfully navigating the trade-off between feedback density and fidelity.

Analysis of Evolution Dynamics

The evolutionary trajectory of generated reward structures reveals rapid convergence toward numerically stable and robust combinations—primarily linear normalizations and bounded aggregations. Unstable or adversarial forms, such as unbounded products or reward "vetoes", are quickly suppressed by the meta-gradient-driven outer loop, even without explicit penalization. This indicates that DERL's Meta-Optimizer internalizes the structural properties required for effective agent training and generalization.

Furthermore, visualization of training dynamics demonstrates monotonic improvement in both inner- and outer-loop task performance, ruling out overfitting to the exploration set or degenerate meta-discovery.

Implications and Limitations

Theoretical and Practical Implications

The formulation of reward search as a differentiable, bi-level optimization problem constitutes a marked shift from black-box evolutionary algorithms. DERL validates that meta-gradients—analogous to those exploited in meta-learning and differentiable architecture search—can be reliably estimated and effectively utilized for reward modeling, with convergence toward meaningful, generalizable structures.

Practically, DERL eliminates the need for human-in-the-loop reward design or large-scale annotation, instead enabling scalable, autonomous agent training in domains where reward hacking, credit assignment, or distribution shift are salient challenges. This is particularly impactful with LLM-based agents tackling tool use, code generation, or scientific discovery, where reward design is a core bottleneck.

Computational and Expressivity Constraints

The principal limitation of DERL is computational: each outer-loop update requires training multiple (potentially large) inner-loop agents, resulting in significant compute cost. Strategies such as population initialization (DERL-pop.), parallelization, sampling-based outer-loop optimization, or lightweight proxy-based bootstrapping are identified as paths to improved efficiency.

Expressivity is bounded by the defined set of atomic primitives; the Meta-Optimizer cannot invent new forms of reward feedback outside the initial grammar. Automatically expanding the primitive set or learning primitives from demonstrations remains an open area.

Credit assignment over extremely long horizons, where evaluation signal propagation is weak, may still present challenges, motivating investigation into intermediate supervision or hierarchical meta-optimization.

Future Directions

Possible extensions of DERL include:

  • Incorporation of meta-learned or self-extracted atomic primitives to increase reward space expressivity.
  • Integration with lightweight or amortized RL in the inner loop for sample and compute efficiency.
  • Employing intermediate meta-supervision heuristics or subsampling to address horizon scaling.
  • Deployment in large-scale, open-ended agentic systems and continual learning scenarios, evaluating the robustness and adaptability of meta-optimized reward structures over time.

Conclusion

Differentiable Evolutionary Reinforcement Learning introduces a principled, end-to-end differentiable strategy for reward discovery in reinforcement learning, leveraging bi-level optimization to align autonomous agents with complex objectives in a scalable, annotation-free manner. The approach bridges the gap between evolutionary black-box search and gradient-based structure discovery, facilitating dense, generalizable feedback signals essential for advanced reasoning and robust policy training. Empirical results across multiple domains substantiate DERL's theoretical foundations and practical efficacy, positioning it as a strong step toward fully autonomous reinforcement learning systems capable of self-directed reward modeling and curriculum generation.
