
RLAnything: Dynamic RL for Co-Evolutionary Learning

Updated 3 February 2026
  • RLAnything is a reinforcement learning framework that dynamically co-evolves policy, reward model, and environment through closed feedback loops.
  • It integrates step-wise evaluations with outcome signals to automatically adjust task difficulty, boosting performance across various benchmarks.
  • The framework employs language model prompts for adaptive environment reconfiguration, establishing both theoretical criteria and empirical gains in RL.

RLAnything is a reinforcement learning (RL) framework designed for dynamically and simultaneously optimizing the three principal modules of an RL system—policy, reward model, and environment—using a closed-loop, critic-driven co-evolutionary process. Targeted at LLM and agentic scenarios, RLAnything provides fine-grained learning signals and automatic environment adaptation, thereby amplifying policy improvement and reward-model generalization. The framework integrates policy training with step-wise and outcome feedback, develops a reward model using consistency-driven RL, and adapts task difficulty through automated LLM prompts, all within mutually reinforcing optimization cycles. RLAnything demonstrates empirical gains across various benchmarks, including desktop GUI environments, text-based games, and coding tasks, and establishes theoretical criteria for adaptive environment construction and reward-model precision (Wang et al., 2 Feb 2026).

1. Modular Co-Evolution via Closed Feedback Loops

In RLAnything, the environment $\mathcal{Q}$ (task set), the policy $\pi_\theta$, and the reward model $r_\phi$ are interlinked in a feedback loop, each driving the development of the others:

  • Policy module ($\pi_\theta$): Given a task $q \in \mathcal{Q}$, the policy generates a trajectory $\tau = (\tau_1, \ldots, \tau_T)$.
  • Reward model ($r_\phi$): Provides fine-grained, step-level feedback on each trajectory by evaluating task progress at every step; step-wise signals $S_{\tau_i, j} \in \{-1, 1\}$ represent assessments of the quality of each action or reasoning segment.
  • Environment module ($\mathcal{Q}$): The task set is dynamically adapted based on “critic feedback” drawn from performance indicators and error summaries; an LLM is used to propose easier or harder versions of tasks as needed.

The system utilizes three key supervisory signals:

  • Outcome signal ($O_\tau$): Binary indicator of the entire trajectory's success or failure.
  • Step-wise signals ($S_{\tau_i, j}$): Multiple independent evaluations (indexed by $j$) for each step.
  • Consistency feedback: Agreement between step evaluation signals and the aggregate step quality, quantifying reward-model reliability.

Environment adaptation employs post-episode rollout accuracy measures: if a task's success rate $\mathrm{acc}(q)$ lies outside predefined thresholds $[\alpha_{\mathrm{low}}, \alpha_{\mathrm{high}}]$, targeted rewrites are solicited from an LLM, using reward-model-generated failure summaries as minimal diagnostic input.

2. Formalism of Policy, Reward, and Environment Dynamics

RLAnything employs interconnected RL objectives across its modules:

  • Integrated policy optimization:

$$J(\pi_\theta) = \mathbb{E}_{q \sim \mathcal{Q},\, \tau \sim \pi_\theta(\cdot \mid q)} \left[ \sum_{t=1}^{T} \gamma^{t-1} r_t \right], \quad \gamma = 1$$

Here, the per-step reward $r_t$ is replaced with the integrated reward

$$R_{\tau_i} = O_\tau + \frac{\lambda}{m} \sum_{j=1}^{m} S_{\tau_i, j}, \quad \lambda > 0$$

Advantages $A^\pi_{\tau_i}$ are computed by standardizing $R_{\tau_i}$ across batch trajectories, with the policy updated using PPO or similar on-policy algorithms.
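The integrated reward and batch-standardized advantages can be sketched in a few lines. The function names, array shapes, and the normalization constant are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def integrated_rewards(outcome, step_signals, lam=1.0):
    """Combine the trajectory-level outcome O_tau with the mean of m
    step-wise signals S_{tau_i,j} in {-1,+1} into R_{tau_i}."""
    step_signals = np.asarray(step_signals, dtype=float)  # shape (T, m)
    return outcome + lam * step_signals.mean(axis=1)      # shape (T,)

def standardized_advantages(batch_rewards):
    """Standardize integrated rewards across all steps of all trajectories
    in a batch to obtain advantages A_{tau_i} (PPO-style baseline)."""
    flat = np.concatenate(batch_rewards)
    mu, sigma = flat.mean(), flat.std() + 1e-8  # epsilon avoids div-by-zero
    return [(r - mu) / sigma for r in batch_rewards]
```

For example, a successful trajectory ($O_\tau = 1$) whose first step receives unanimous positive signals and whose second is contested gets per-step rewards $[2.0, 1.0]$ with $\lambda = 1$.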

  • Reward model optimization:

$$L_R(\phi) = -\mathbb{E}_{q, \tau, S \sim r_\phi}\left[R_{\tau_i} \cdot S_{\tau_i, j}\right] + \lambda_{\mathrm{cons}}\, C(\phi)$$

Here, $C(\phi)$ enforces self-consistency in step-level labels, and PPO is applied analogously to maximize agreement between $S_{\tau_i, j}$ and $R_{\tau_i}$.
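A toy rendering of this objective follows, using the variance across the $m$ judgments of each step as a stand-in for $C(\phi)$; that choice and the $\lambda_{\mathrm{cons}}$ value are assumptions, since the paper's exact consistency term is not reproduced here:

```python
import numpy as np

def reward_model_loss(R, S, lam_cons=0.1):
    """Sketch of L_R(phi): reward agreement R_{tau_i} * S_{tau_i,j} and
    penalize disagreement among the m evaluations of the same step.
    R: (T,) integrated rewards; S: (T, m) step signals in {-1,+1}."""
    R, S = np.asarray(R, dtype=float), np.asarray(S, dtype=float)
    agreement = (R[:, None] * S).mean()     # estimates E[R * S]
    consistency = S.var(axis=1).mean()      # proxy for C(phi): 0 if unanimous
    return -agreement + lam_cons * consistency
```

Perfectly aligned, unanimous signals minimize the loss; internally contradictory judgments raise it even when the average signal is unchanged.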

  • Environment adaptation:

$\mathcal{Q}$ is updated by minimizing

$$q^* = \arg\min_{q' \in \mathrm{cand}(q)} L_{\mathrm{env}}(q'; \pi_\theta, r_\phi)$$

subject to $\alpha_{\mathrm{low}} < \mathrm{acc}(q') < \alpha_{\mathrm{high}}$. Candidate tasks are generated using LLMs with prompts informed by failure-mode summaries; acceptance is conditional on efficacy and appropriate difficulty calibration.
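The accuracy-band acceptance rule can be sketched as below. The threshold values and the `propose_variants` / `estimate_acc` callables are hypothetical placeholders for the LLM rewrite step and fresh rollout evaluation:

```python
def adapt_task(q, rollout_acc, propose_variants, estimate_acc,
               alpha_low=0.2, alpha_high=0.8):
    """Accuracy-band adaptation sketch: keep a task whose success rate is
    inside (alpha_low, alpha_high); otherwise ask for rewrites in the
    needed direction and accept the first candidate inside the band."""
    if alpha_low < rollout_acc < alpha_high:
        return q  # difficulty already calibrated; keep the task
    direction = "harder" if rollout_acc >= alpha_high else "easier"
    for cand in propose_variants(q, direction):  # stands in for LLM rewrites
        if alpha_low < estimate_acc(cand) < alpha_high:
            return cand  # accept: difficulty moved into the target band
    return q  # no acceptable candidate; fall back to the original task
```

This mirrors the constraint $\alpha_{\mathrm{low}} < \mathrm{acc}(q') < \alpha_{\mathrm{high}}$: tasks the policy always fails are softened, tasks it always solves are hardened.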

3. Systemic Algorithmic Structure

A single optimization run comprises interleaved updates:

  1. Sampling:
    • Sample tasks from $\mathcal{Q}$
    • Generate $N$ rollouts per task with policy $\pi_\theta$
    • Record outcomes and step-wise reward model signals
    • Compute integrated rewards for each trajectory step
  2. Policy update:
    • Standardize integrated rewards to compute advantages
    • Update policy via PPO or compatible on-policy methods
  3. Reward model update:
    • Compute advantages for reward model signals
    • Update $r_\phi$ using PPO, maximizing expected signal agreement
  4. Environment adaptation:
    • Compute per-task rollout accuracy
    • For $\mathrm{acc}(q) > \alpha_{\mathrm{high}}$, propose harder variants; for $\mathrm{acc}(q) < \alpha_{\mathrm{low}}$, propose simpler ones
    • Accept new tasks only if they move difficulty within the target accuracy band, preserving proximity to the learner’s current capabilities

Critic feedback for environment adaptation is constructed from consolidated explanations accompanying negative reward model evaluations, guiding minimal, context-targeted environment adjustments.
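The four interleaved stages above can be summarized as one optimization round. The `.rollout`, `.score`, `.update`, and `.adapt` interfaces are hypothetical, chosen only to make the control flow concrete:

```python
def rlanything_step(tasks, policy, reward_model, env, n_rollouts=8):
    """One interleaved RLAnything round (steps 1-4), assuming hypothetical
    policy/reward-model/environment objects with the methods used below."""
    batch = []
    for q in tasks:                                  # 1. sampling
        for _ in range(n_rollouts):
            traj, outcome = policy.rollout(q)        # trajectory + O_tau
            signals = reward_model.score(traj)       # step-wise S_{tau_i,j}
            batch.append((q, traj, outcome, signals))
    policy.update(batch)        # 2. PPO on standardized integrated rewards
    reward_model.update(batch)  # 3. PPO maximizing expected signal agreement
    env.adapt(batch)            # 4. accuracy-band task rewrites
```

Because every module consumes the same rollout batch, each round tightens the loop: fresher rewards sharpen the policy gradient, and fresher failure statistics steer the next environment rewrite.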

4. Theoretical Criteria for Reward Precision and Balance

Two foundational theorems underpin RLAnything’s adaptive mechanisms:

  • Theorem 1 (Reward Precision):

Given $p_+ = P_{r_\phi}[S_{\tau_i^+, j} = +1]$, $p_- = P_{r_\phi}[S_{\tau_i^-, j} = -1]$, and $\mu = p_+ + p_-$, the probability that the mean step score for successful transitions ($\tau_i^+$) exceeds that for failures ($\tau_i^-$) converges to 1 iff $\mu > 1$ as $m \to \infty$:

$$\mathbb{P}(\bar{S}_{\tau_i^+} > \bar{S}_{\tau_i^-}) \geq 1 - e^{-m(\mu - 1)^2/4}$$

This sets a sufficient precision criterion for reward model quality with respect to batch balancing.
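A quick Monte-Carlo check makes the bound concrete. The probabilities $p_+ = 0.8$, $p_- = 0.7$ (so $\mu = 1.5 > 1$) are illustrative choices, not values from the paper:

```python
import random

def empirical_separation(p_plus, p_minus, m, trials=20000, seed=0):
    """Estimate P(mean step score of a successful step exceeds that of a
    failed step) when each of m independent evaluators labels a successful
    step +1 with prob p_plus and a failed step -1 with prob p_minus."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # comparing sums is equivalent to comparing means for equal m
        s_pos = sum(1 if rng.random() < p_plus else -1 for _ in range(m))
        s_neg = sum(-1 if rng.random() < p_minus else 1 for _ in range(m))
        wins += s_pos > s_neg
    return wins / trials
```

With $m = 16$ the theorem's bound is $1 - e^{-16 \cdot 0.25/4} = 1 - e^{-1} \approx 0.63$, and the simulated separation probability comfortably exceeds it, illustrating that $\mu > 1$ suffices for reliable step-level ranking.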

  • Theorem 2 (Importance-Weight Imbalance):

With $\lambda = 1$, the expected reward for the reward model can be decomposed as

$$\mathbb{E}[R_{S_{\tau_i, j}}] = 4\, \mathbb{E}_q\left[\langle p_+, f_+ \rangle + \langle p_-, f_- \rangle\right] + \mathrm{const}$$

where $f_+$ and $f_-$ are importance-weighting functions over policy-generated samples. If a task becomes too easy or too difficult ($P(O_\tau = \pm 1 \mid q) \to 1$), the balance of sample frequencies is lost and the precision $\mu$ degrades. A plausible implication is that balanced task distributions across outcome classes are necessary for maximal reward-model informativeness.

5. Empirical Evidence and Benchmarking

Empirical evaluations span three application categories, each using foundation models as policy/reward modules and measuring the effect of adding RLAnything components incrementally:

| Setting | Before RL | +Policy RL | +Reward Model | +Env Adapt (Full RLAnything) |
|---|---|---|---|---|
| OSWorld In | 40.4% | 48.3% | 49.6% | 52.1% (+11.7) |
| OSWorld OOD | 16.1% | 19.8% | 20.0% | 21.3% (+5.2) |
| AlfWorld In | 39.0% | 51.1% | 55.6% | 60.2% (+21.2) |
| AlfWorld OOD | 44.9% | 59.3% | 61.8% | 63.6% (+18.7) |
| LiveBench Code | 31.3% | 38.8% | 40.0% | 43.2% (+11.9) |
| LiveBench UT | 27.8% | 27.8% | 73.3% | 78.9% (+51.1) |
| LiveBench Detect | 19.6% | 19.6% | 37.9% | 48.5% (+28.9) |

The RLAnything system consistently outperforms baseline and isolated RL variants, with environment adaptation producing marked improvements in both in-domain and out-of-domain settings. Ablation studies show that integrated step-wise and outcome rewards significantly surpass approaches based only on sparse trajectory endpoints. The choice of the integration hyperparameter $\lambda$ is critical, with $\lambda = 1$ yielding peak performance. Environment adaptation is validated by linear growth in accepted tasks and high pass rates under independent verifiers ($\geq 94\%$ with 32B-model checks).

6. Research Implications and Limitations

The RLAnything framework formalizes the dynamic co-evolution of policy, reward model, and environment within RL, emphasizing the necessity of balanced, adaptive data for effective reward learning and policy optimization. Its use of LLM-based environment reconfiguration distinguishes it from prior static or semi-automated curriculum approaches. The theoretical results supply new guarantees for reward model reliability that are conditional on outcome-class balance, while empirical data establish practical efficacy across diverse, high-complexity domains.

A plausible implication is that similar closed-loop adaptation could be ported to other high-capacity agentic learning systems, provided fine-grained evaluation and task reconfiguration mechanisms are tractable.

Potential limitations pertain to dependence on high-quality reward model calibration, the requirement for scalable LLM intervention for environment adaptation, and possible overfitting to synthetic task distributions due to repeated adaptation cycles.


Reference: "RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System" (Wang et al., 2 Feb 2026)
