RLEIF: Reinforcement Learning from Evol-Instruct Feedback
- RLEIF is a methodology that merges iterative instruction evolution with reinforcement learning to align model behavior with human preferences.
- It systematically combines instruction mutation, process-level reward modeling, and policy updates (using methods like PPO) to enhance performance across domains such as autonomous driving and math reasoning.
- Empirical results demonstrate that RLEIF-derived rewards and instructions outperform expert-crafted rewards and conventional RLHF approaches, offering robust, interpretable, and generalizable improvements.
Reinforcement Learning from Evol-Instruct Feedback (RLEIF) is a methodology for aligning machine-learned behaviors or model outputs with human preferences via iterative, evolution-inspired feedback and instruction refinement. RLEIF systematically combines instruction evolution, process-level reward modeling, and reinforcement learning (often Proximal Policy Optimization, PPO), operationalized either through explicit reward function code or high-quality, evolving data instructions. It has been applied to diverse domains such as autonomous driving reward design and mathematical reasoning in LLMs, offering quantifiable improvements over both manually crafted and previous learning-from-human-feedback approaches.
1. Conceptual Foundations
RLEIF extends the paradigm of reinforcement learning from human feedback (RLHF) by integrating the notion of "evolving instructions" through an outer loop guided by targeted human (or synthetic) feedback. The key distinction lies in treating the generation, modification, and selection of instructions or reward functions as an iterative search problem, where the instruction set itself undergoes evolution, influenced by explicit or process-level assessment of the resulting behaviors.
The principal workflow alternates between:
- Instruction or reward generation/mutation via LLMs or generative algorithms.
- Policy or model training on the candidate instructions/rewards.
- Collection of feedback (human preference, scalar ratings, or fine-grained process supervision).
- Aggregation of feedback into fitness or reward scores, often using ranking systems such as Elo.
- Evolutionary selection, propagation, or mutation of superior instructions/reward functions.
This paradigm enables the automated discovery and continual improvement of reward definitions or data curricula which encode otherwise tacit human knowledge, overcoming the limitations of static, expert-specified objectives.
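To make the alternation concrete, the following minimal Python sketch illustrates the generic RLEIF outer loop. The helpers `mutate_instruction`, `train_policy`, and `collect_feedback` are hypothetical placeholders standing in for the LLM-based mutator, the inner-loop trainer, and the human- or model-based evaluator described above; this is a schematic sketch, not the implementation from the cited papers.

```python
import random

# Schematic RLEIF outer loop with hypothetical placeholder functions.

def mutate_instruction(instruction: str) -> str:
    # Placeholder: in RLEIF this would be an LLM-driven mutation or crossover.
    return instruction + " [mutated]"

def train_policy(instruction: str) -> dict:
    # Placeholder: train an RL policy or fine-tune a model on this instruction/reward.
    return {"instruction": instruction}

def collect_feedback(policy: dict) -> float:
    # Placeholder: human preference, scalar rating, or process-level score.
    return random.random()

def rleif_outer_loop(seed_instructions, generations: int = 5, keep: int = 4):
    population = [(inst, 0.0) for inst in seed_instructions]
    for _ in range(generations):
        candidates = [mutate_instruction(inst) for inst, _ in population]
        scored = []
        for inst in candidates:
            policy = train_policy(inst)           # inner-loop training
            fitness = collect_feedback(policy)    # feedback aggregated into fitness
            scored.append((inst, fitness))
        # Evolutionary selection: keep the highest-fitness instructions.
        population = sorted(population + scored, key=lambda x: x[1], reverse=True)[:keep]
    return population

if __name__ == "__main__":
    print(rleif_outer_loop(["Design a reward for safe driving."]))
```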
2. RLEIF in Reward Design: The REvolve Framework
REvolve instantiates RLEIF in the context of autonomous driving, humanoid locomotion, and dexterous manipulation, focusing on synthesizing interpretable reward function code aligned with human evaluative criteria.
Core Algorithmic Loop
REvolve maintains a population of candidate reward functions distributed across a set of islands (sub-populations). Each iteration proceeds as follows:
- An LLM (e.g., GPT-4) proposes new reward functions either via mutation or crossover, informed by bundled natural language (NL) critiques accumulated in previous generations.
- For each candidate reward function $R$, an RL agent is trained (using, for example, Clipped Double DQN with dueling networks and Prioritized Experience Replay). The agent's behavioral trajectories are then evaluated.
- Human evaluators perform pairwise comparisons of agent behaviors, assigning preferences and providing qualitative comments. These are mapped into fitness scores using an Elo rating system (see the Elo update sketch after this list), updating each candidate's scalar fitness and its bundle of NL feedback $\lambda$.
- Islands are propagated using evolutionary operators, favoring candidates that raise average fitness, thus selecting for reward functions that elicit more human-preferred behaviors.
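As noted above, pairwise human preferences are converted into scalar fitness through Elo updates. The sketch below shows a standard Elo update applied to a pair of candidate reward functions; the K-factor and initial rating are illustrative assumptions rather than values reported for REvolve.

```python
# Standard Elo update applied to pairwise preference outcomes.
# K-factor and initial rating are illustrative assumptions.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that candidate A is preferred over candidate B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, outcome: float, k: float = 32.0):
    """outcome = 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (outcome - e_a)
    r_b_new = r_b + k * ((1.0 - outcome) - (1.0 - e_a))
    return r_a_new, r_b_new

# Example: the behavior produced by reward function A is preferred over B's.
ratings = {"A": 1000.0, "B": 1000.0}
ratings["A"], ratings["B"] = elo_update(ratings["A"], ratings["B"], outcome=1.0)
print(ratings)  # A's fitness rises, B's falls
```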
Pseudocode Overview (simplified):

```
for generation in 1..N:
    for i in 1..K:
        if mutate:
            R_new ← LLM.mutate(island[P], feedback=λ)
        else:
            R_new ← LLM.crossover(island[P], feedback=λ)
        π_new ← TrainPolicy(R_new)
        score π_new via human Elo ratings + NL feedback
        replace the worst island member if π_new improves average fitness
```
The mathematical objective is to find $R^{*} = \arg\max_{R} F(\pi_R)$, where the fitness $F$ derives from human preference aggregation and $\pi_R$ denotes the policy trained under candidate reward $R$.
Policy Training
The evolved reward is directly embedded into the RL loop. Reward assignment at each timestep follows $r_t = R(s_t, a_t)$, with value updates (for the DQN family) performed by minimizing the temporal-difference error, written here in its standard one-step form:
$$\mathcal{L}(\theta) = \mathbb{E}\Big[\big(r_t + \gamma \max_{a'} Q_{\theta^{-}}(s_{t+1}, a') - Q_\theta(s_t, a_t)\big)^{2}\Big].$$
Here, the dueling Q-architecture decomposes $Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')$.
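The dueling decomposition and TD target can be sketched in PyTorch as follows. The network sizes, the single target network, and the `td_loss` helper are simplifying assumptions for illustration; REvolve's actual setup (Clipped Double DQN with Prioritized Experience Replay) differs in its target computation and replay sampling.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling Q-network: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # state value V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # advantages A(s, a)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.trunk(obs)
        v = self.value(h)
        a = self.advantage(h)
        return v + a - a.mean(dim=-1, keepdim=True)

def td_loss(q_net, target_net, batch, gamma: float = 0.99):
    """One-step TD error; the evolved reward is already baked into batch['r']."""
    s, a, r = batch["s"], batch["a"], batch["r"]
    s_next, done = batch["s_next"], batch["done"]
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Double-DQN-style target: online net selects, target net evaluates.
        next_a = q_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, next_a).squeeze(1)
        target = r + gamma * (1.0 - done) * q_next
    return nn.functional.mse_loss(q_sa, target)

# Example usage with random data standing in for replay-buffer samples.
q, tq = DuelingQNet(8, 4), DuelingQNet(8, 4)
batch = {"s": torch.randn(32, 8), "a": torch.randint(0, 4, (32,)),
         "r": torch.randn(32), "s_next": torch.randn(32, 8),
         "done": torch.zeros(32)}
print(td_loss(q, tq, batch))
```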
Empirical Results
On AirSim autonomous driving, REvolve-overseen RL achieves substantially longer average episode lengths (steps without a crash) than expert-designed rewards and a higher normalized Elo fitness (0.84 versus roughly 0.65 for the expert baseline; see the table in Section 5), outperforming all prior baselines, including greedy LLM-based search (Eureka) and automatic code-only fitness (REvolve Auto). Out-of-distribution evaluation also shows REvolve-derived rewards generalize more robustly.
Ablation studies confirm NL feedback's importance (full REvolve outperforms REvolve Auto), and the benefits of evolutionary search over greedy optimization.
3. RLEIF in Instruction-Induced Reasoning: The WizardMath Approach
WizardMath applies RLEIF to mathematical reasoning in LLMs, integrating curriculum-like instruction evolution and granular, process-based reward modeling.
Training Pipeline
(A) Supervised Fine-Tuning (SFT)
Initial SFT is performed on instruction–solution pairs, filtered for correctness and CoT structure. The loss is the standard token-level cross-entropy $\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t} \log p_\theta\left(y_t \mid x, y_{<t}\right)$ over solution tokens $y$ given instruction $x$ (a sketch follows below).
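A minimal sketch of that token-level cross-entropy objective, assuming a causal-LM setup where instruction (prompt) tokens are masked out of the loss with a label value of -100; the tensor shapes and masking convention are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Token-level cross-entropy for supervised fine-tuning.

    logits: (batch, seq_len, vocab) next-token predictions.
    labels: (batch, seq_len) target ids; positions set to -100
            (e.g., the instruction prompt) are ignored.
    """
    # Shift so that tokens < t predict token t.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )

# Example with random tensors standing in for model output.
logits = torch.randn(2, 16, 100)
labels = torch.randint(0, 100, (2, 16))
labels[:, :6] = -100  # mask the instruction portion
print(sft_loss(logits, labels))
```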
(B) Reward Model Training
Two separate models:
- Instruction Reward Model (IRM, $r_I$): scores instruction quality, trained via pairwise ranking (hinge loss) $\mathcal{L}_{\text{IRM}} = \max\big(0,\ m - r_I(x^{+}) + r_I(x^{-})\big)$, where $x^{+}$ denotes the instruction ranked above $x^{-}$ and $m$ is a margin. Scores are normalized before being combined into the final reward (see the loss sketch after this list).
- Process-supervised Reward Model (PRM, $r_P$): scores the correctness of each CoT step, trained via binary cross-entropy $\mathcal{L}_{\text{PRM}} = -\sum_{i}\big[y_i \log r_P(s_i) + (1 - y_i)\log\big(1 - r_P(s_i)\big)\big]$ with per-step correctness labels $y_i$. The overall answer reward aggregates these per-step scores across the full solution.
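The two reward-model losses can be sketched as follows; the margin value, tensor shapes, and helper names are illustrative assumptions rather than WizardMath's exact implementation.

```python
import torch
import torch.nn.functional as F

def irm_pairwise_hinge(score_better: torch.Tensor, score_worse: torch.Tensor,
                       margin: float = 1.0) -> torch.Tensor:
    """Pairwise ranking hinge loss for the instruction reward model (IRM):
    the preferred instruction should score higher by at least `margin`."""
    return torch.clamp(margin - (score_better - score_worse), min=0.0).mean()

def prm_step_bce(step_logits: torch.Tensor, step_labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy for the process-supervised reward model (PRM):
    one correctness label per chain-of-thought step."""
    return F.binary_cross_entropy_with_logits(step_logits, step_labels)

# Example with random scores standing in for reward-model outputs.
better, worse = torch.randn(8), torch.randn(8)
step_logits, step_labels = torch.randn(8, 5), torch.randint(0, 2, (8, 5)).float()
print(irm_pairwise_hinge(better, worse), prm_step_bce(step_logits, step_labels))
```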
(C) Evol-Instruct + PPO
Over 8 "evolution turns," the instruction pool expands from $15k$ to , combining upward (constraint-adding, difficulty-increasing) and downward mutations. The final scalar reward is . Policies are updated using PPO with a clipped surrogate and KL penalty: with typical hyperparameters , .
Results and Ablation
WizardMath 70B achieves a pass@1 of 81.6% on GSM8k and 22.7% on MATH (see the table in Section 5), exceeding both Llama-2 70B and several proprietary LLMs, including GPT-3.5-Turbo on GSM8k. Ablations show that step-level process supervision (PRM) is critical, yielding substantial absolute improvements on challenging tasks, with the full IRM+PRM+PPO pipeline delivering cumulative gains.
4. Mathematical Formalism
Both REvolve and WizardMath instantiate RLEIF as a constrained maximization of expected reward/fitness over an evolving instruction or reward-function population,
$$\max_{\theta,\ \mathcal{I}}\ \mathbb{E}_{x \sim \mathcal{I},\ \tau \sim \pi_\theta(\cdot \mid x)}\big[F(x, \tau)\big],$$
where $F$ is human- or model-derived fitness, $\mathcal{I}$ the evolving population of instructions or reward functions, and $\tau$ a trajectory or solution produced by the policy $\pi_\theta$. For WizardMath, the instruction-level and process-level rewards are multiplicative, $r = r_I \cdot r_P$, with PPO updates regularized by KL divergence to stabilize learning.
The key technical feature is the synthesis of evolutionary optimization (mutation/crossover/evaluation) with gradient-based policy improvement, where "success" is empirically verified either by human raters (REvolve) or reward models trained on granular, process-level data (WizardMath).
5. Empirical Insights and Comparative Analysis
Empirical investigations highlight several consistent findings:
- Evolutionary outer-loop search (using populations and selective pressure) delivers more robust and generalizable reward/instruction candidates than greedy or purely numeric search.
- Process-level feedback (e.g., stepwise correctness in math) yields markedly better performance compared to outcome-only supervision.
- Human natural language comments, when incorporated into mutation/crossover prompts, induce superior improvements, as evidenced by comparative results for REvolve with and without NL feedback (full REvolve vs. REvolve Auto).
- The best RLEIF-trained agents and models approach or exceed human and expert-crafted baselines in complex, subjective, or open-ended domains.
The following comparison synthesizes domain-specific benchmarking from the cited papers:
| System | Domain | Main Metric | Baseline Perf. | RLEIF Perf. |
|---|---|---|---|---|
| REvolve | Autonomous driving (AirSim) | Elo fitness (normalized) | ~0.65 (expert) | 0.84 |
| WizardMath 70B | GSM8k (math) | Pass@1 (%) | 56.8 (Llama-2 70B) | 81.6 |
| WizardMath 70B | MATH (math) | Pass@1 (%) | 13.5 (Llama-2 70B) | 22.7 |
6. Significance, Limitations, and Implications
RLEIF establishes a formalism and set of methods for closing the loop between machine learning systems and human operators, specifically in domains where reward design or data creation demands iterative refinement. Its empirical successes derive from the dual mechanisms of population-based search (diversity, robustness) and fine-grained/process-level evaluative models (granular alignment).
Notable limitations include the potential human-in-the-loop bottleneck (for pairwise or process feedback), the need for non-differentiable fitness propagation, and reliance on LLM inference for reward or instruction mutations. In WizardMath, step annotation uses synthetic labels (via ChatGPT), which may introduce subtle misalignments compared to true human preferences; in REvolve, scaling to domains requiring extensive human evaluation may become resource-intensive.
A plausible implication is that RLEIF-style frameworks could supplant black-box RLHF models in settings demanding transparent, editable reward/instruction specifications with direct human interpretability.
7. Connections to Related Paradigms
RLEIF intersects with and diverges from several strands of research:
- RLHF: While RLHF typically employs learned reward models as latent proxies, RLEIF externalizes reward/instruction evolution, enabling human-interpretable objectives.
- Curriculum learning: RLEIF’s iterative expansion and mutation of tasks/instructions echo curriculum learning, but with structured evaluation and evolutionary operators.
- Neuroevolution: The use of population-based search and selection parallels neuroevolution, but at the meta-instruction or meta-reward level rather than directly at the level of policy parameters.
By bridging human tacit knowledge and explicit, codified supervision, RLEIF provides a scalable pathway for high-fidelity, robust, and interpretable alignment across a broad spectrum of RL and LLM tasks (Hazra et al., 3 Jun 2024, Luo et al., 2023).