
RLAnything: Dynamic RL for Co-Evolutionary Learning

Updated 3 February 2026
  • RLAnything is a reinforcement learning framework that dynamically co-evolves policy, reward model, and environment through closed feedback loops.
  • It integrates step-wise evaluations with outcome signals to automatically adjust task difficulty, boosting performance across various benchmarks.
  • The framework employs language model prompts for adaptive environment reconfiguration, establishing both theoretical criteria and empirical gains in RL.

RLAnything is a reinforcement learning (RL) framework designed for dynamically and simultaneously optimizing the three principal modules of an RL system—policy, reward model, and environment—using a closed-loop, critic-driven co-evolutionary process. Targeted at LLM and agentic scenarios, RLAnything provides fine-grained learning signals and automatic environment adaptation, thereby amplifying policy improvement and reward-model generalization. The framework integrates policy training with step-wise and outcome feedback, develops a reward model using consistency-driven RL, and adapts task difficulty through automated LLM prompts, all within mutually reinforcing optimization cycles. RLAnything demonstrates empirical gains across various benchmarks, including desktop GUI environments, text-based games, and coding tasks, and establishes theoretical criteria for adaptive environment construction and reward-model precision (Wang et al., 2 Feb 2026).

1. Modular Co-Evolution via Closed Feedback Loops

In RLAnything, the environment $\mathcal{Q}$ (task set), the policy $\pi_\theta$, and the reward model $r_\phi$ are interlinked in a feedback loop, each driving the development of the others:

  • Policy module ($\pi_\theta$): Given a task $q \in \mathcal{Q}$, the policy generates a trajectory $\tau = (\tau_1, \ldots, \tau_T)$.
  • Reward model ($r_\phi$): Provides fine-grained, step-level feedback on each trajectory by evaluating task progress at every step; step-wise signals $S_{\tau_i, j} \in \{-1, 1\}$ represent assessments of the quality of each action or reasoning segment.
  • Environment module ($\mathcal{Q}$): The task set is dynamically adapted based on “critic feedback” drawn from performance indicators and error summaries; an LLM is used to propose easier or harder versions of tasks as needed.

The system utilizes three key supervisory signals:

  • Outcome signal ($O_\tau$): Binary indicator of the entire trajectory's success or failure.
  • Step-wise signals ($S_{\tau_i, j}$): Multiple independent evaluations (indexed by $j$) for each step.
  • Consistency feedback: Agreement between step evaluation signals and the aggregate step quality, quantifying reward-model reliability.

Environment adaptation employs post-episode rollout accuracy measures: if a task's success rate $\mathrm{acc}(q)$ lies outside predefined thresholds $[\alpha_{\mathrm{low}}, \alpha_{\mathrm{high}}]$, targeted rewrites are solicited from an LLM, using reward-model-generated failure summaries as minimal diagnostic input.

2. Formalism of Policy, Reward, and Environment Dynamics

RLAnything employs interconnected RL objectives across its modules:

  • Integrated policy optimization:

$$J(\pi_\theta) = \mathbb{E}_{q \sim \mathcal{Q},\, \tau \sim \pi_\theta(\cdot \mid q)} \left[ \sum_{t=1}^{T} \gamma^{t-1} r_t \right], \quad \gamma = 1$$

Here, the per-step reward $r_t$ is replaced with the integrated reward

$$R_{\tau_i} = O_\tau + \frac{\lambda}{m} \sum_{j=1}^{m} S_{\tau_i, j}, \quad \lambda > 0$$

Advantages $A^\pi_{\tau_i}$ are computed by standardizing $R_{\tau_i}$ across batch trajectories, with the policy updated using PPO or similar on-policy algorithms.
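The integrated reward and batch-standardized advantages can be sketched in a few lines. The function names, array shapes, and the normalization constant are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def integrated_rewards(outcome, step_signals, lam=1.0):
    """Combine the trajectory-level outcome O_tau with the mean of m
    step-wise signals S_{tau_i,j} in {-1,+1} into R_{tau_i}."""
    step_signals = np.asarray(step_signals, dtype=float)  # shape (T, m)
    return outcome + lam * step_signals.mean(axis=1)      # shape (T,)

def standardized_advantages(batch_rewards):
    """Standardize integrated rewards across all steps of all trajectories
    in a batch to obtain advantages A_{tau_i} (PPO-style baseline)."""
    flat = np.concatenate(batch_rewards)
    mu, sigma = flat.mean(), flat.std() + 1e-8  # epsilon avoids div-by-zero
    return [(r - mu) / sigma for r in batch_rewards]
```

For example, a successful trajectory ($O_\tau = 1$) whose first step receives unanimous positive signals and whose second is contested gets per-step rewards $[2.0, 1.0]$ with $\lambda = 1$.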

  • Reward model optimization:

$$L_R(\phi) = -\mathbb{E}_{q, \tau, S \sim r_\phi}\left[R_{\tau_i} \cdot S_{\tau_i, j}\right] + \lambda_{\mathrm{cons}}\, C(\phi)$$

Here, $C(\phi)$ enforces self-consistency in step-level labels, and PPO is applied analogously to maximize agreement between $S_{\tau_i, j}$ and $R_{\tau_i}$.
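A toy rendering of this objective follows, using the variance across the $m$ judgments of each step as a stand-in for $C(\phi)$; that choice and the $\lambda_{\mathrm{cons}}$ value are assumptions, since the paper's exact consistency term is not reproduced here:

```python
import numpy as np

def reward_model_loss(R, S, lam_cons=0.1):
    """Sketch of L_R(phi): reward agreement R_{tau_i} * S_{tau_i,j} and
    penalize disagreement among the m evaluations of the same step.
    R: (T,) integrated rewards; S: (T, m) step signals in {-1,+1}."""
    R, S = np.asarray(R, dtype=float), np.asarray(S, dtype=float)
    agreement = (R[:, None] * S).mean()     # estimates E[R * S]
    consistency = S.var(axis=1).mean()      # proxy for C(phi): 0 if unanimous
    return -agreement + lam_cons * consistency
```

Perfectly aligned, unanimous signals minimize the loss; internally contradictory judgments raise it even when the average signal is unchanged.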

  • Environment adaptation:

$\mathcal{Q}$ is updated by minimizing

$$q^* = \arg\min_{q' \in \mathrm{cand}(q)} L_{\mathrm{env}}(q'; \pi_\theta, r_\phi)$$

subject to $\alpha_{\mathrm{low}} < \mathrm{acc}(q') < \alpha_{\mathrm{high}}$. Candidate tasks are generated using LLMs with prompts informed by failure-mode summaries; acceptance is conditional on efficacy and appropriate difficulty calibration.
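The accuracy-band acceptance rule can be sketched as below. The threshold values and the `propose_variants` / `estimate_acc` callables are hypothetical placeholders for the LLM rewrite step and fresh rollout evaluation:

```python
def adapt_task(q, rollout_acc, propose_variants, estimate_acc,
               alpha_low=0.2, alpha_high=0.8):
    """Accuracy-band adaptation sketch: keep a task whose success rate is
    inside (alpha_low, alpha_high); otherwise ask for rewrites in the
    needed direction and accept the first candidate inside the band."""
    if alpha_low < rollout_acc < alpha_high:
        return q  # difficulty already calibrated; keep the task
    direction = "harder" if rollout_acc >= alpha_high else "easier"
    for cand in propose_variants(q, direction):  # stands in for LLM rewrites
        if alpha_low < estimate_acc(cand) < alpha_high:
            return cand  # accept: difficulty moved into the target band
    return q  # no acceptable candidate; fall back to the original task
```

This mirrors the constraint $\alpha_{\mathrm{low}} < \mathrm{acc}(q') < \alpha_{\mathrm{high}}$: tasks the policy always fails are softened, tasks it always solves are hardened.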

3. Systemic Algorithmic Structure

A single optimization run comprises interleaved updates:

  1. Sampling:
    • Sample tasks from $\mathcal{Q}$
    • Generate $N$ rollouts per task with policy $\pi_\theta$
    • Record outcomes and step-wise reward model signals
    • Compute integrated rewards for each trajectory step
  2. Policy update:
    • Standardize integrated rewards to compute advantages
    • Update policy via PPO or compatible on-policy methods
  3. Reward model update:
    • Compute advantages for reward model signals
    • Update $r_\phi$ using PPO, maximizing expected signal agreement
  4. Environment adaptation:
    • Compute per-task rollout accuracy
    • For $\mathrm{acc}(q) > \alpha_{\mathrm{high}}$, propose harder variants; for $\mathrm{acc}(q) < \alpha_{\mathrm{low}}$, propose simpler ones
    • Accept new tasks only if they move difficulty within the target accuracy band, preserving proximity to the learner’s current capabilities

Critic feedback for environment adaptation is constructed from consolidated explanations accompanying negative reward model evaluations, guiding minimal, context-targeted environment adjustments.
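The four interleaved stages above can be summarized as one optimization round. The `.rollout`, `.score`, `.update`, and `.adapt` interfaces are hypothetical, chosen only to make the control flow concrete:

```python
def rlanything_step(tasks, policy, reward_model, env, n_rollouts=8):
    """One interleaved RLAnything round (steps 1-4), assuming hypothetical
    policy/reward-model/environment objects with the methods used below."""
    batch = []
    for q in tasks:                                  # 1. sampling
        for _ in range(n_rollouts):
            traj, outcome = policy.rollout(q)        # trajectory + O_tau
            signals = reward_model.score(traj)       # step-wise S_{tau_i,j}
            batch.append((q, traj, outcome, signals))
    policy.update(batch)        # 2. PPO on standardized integrated rewards
    reward_model.update(batch)  # 3. PPO maximizing expected signal agreement
    env.adapt(batch)            # 4. accuracy-band task rewrites
```

Because every module consumes the same rollout batch, each round tightens the loop: fresher rewards sharpen the policy gradient, and fresher failure statistics steer the next environment rewrite.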

4. Theoretical Criteria for Reward Precision and Balance

Two foundational theorems underpin RLAnything’s adaptive mechanisms:

  • Theorem 1 (Reward Precision):

Given $p_+ = P_{r_\phi}[S_{\tau_i^+, j} = +1]$, $p_- = P_{r_\phi}[S_{\tau_i^-, j} = -1]$, and $\mu = p_+ + p_-$, the probability that the mean step score for successful transitions ($\tau_i^+$) exceeds that for failures ($\tau_i^-$) converges to 1 iff $\mu > 1$ as $m \to \infty$:

$$\mathbb{P}(\bar{S}_{\tau_i^+} > \bar{S}_{\tau_i^-}) \geq 1 - e^{-m(\mu - 1)^2/4}$$

This sets a sufficient precision criterion for reward model quality with respect to batch balancing.
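A quick Monte-Carlo check makes the bound concrete. The probabilities $p_+ = 0.8$, $p_- = 0.7$ (so $\mu = 1.5 > 1$) are illustrative choices, not values from the paper:

```python
import random

def empirical_separation(p_plus, p_minus, m, trials=20000, seed=0):
    """Estimate P(mean step score of a successful step exceeds that of a
    failed step) when each of m independent evaluators labels a successful
    step +1 with prob p_plus and a failed step -1 with prob p_minus."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # comparing sums is equivalent to comparing means for equal m
        s_pos = sum(1 if rng.random() < p_plus else -1 for _ in range(m))
        s_neg = sum(-1 if rng.random() < p_minus else 1 for _ in range(m))
        wins += s_pos > s_neg
    return wins / trials
```

With $m = 16$ the theorem's bound is $1 - e^{-16 \cdot 0.25/4} = 1 - e^{-1} \approx 0.63$, and the simulated separation probability comfortably exceeds it, illustrating that $\mu > 1$ suffices for reliable step-level ranking.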

  • Theorem 2 (Importance-Weight Imbalance):

With $\lambda = 1$, the expected reward for the reward model can be decomposed as

$$\mathbb{E}[R_{S_{\tau_i, j}}] = 4\, \mathbb{E}_q\left[\langle p_+, f_+ \rangle + \langle p_-, f_- \rangle\right] + \mathrm{const}$$

where $f_+$ and $f_-$ are importance-weighting functions over policy-generated samples. If a task becomes too easy or too difficult ($P(O_\tau = \pm 1 \mid q) \to 1$), the balance of sample frequencies is lost and the precision $\mu$ degrades. A plausible implication is that balanced task distributions across outcome classes are necessary for maximal reward-model informativeness.

5. Empirical Evidence and Benchmarking

Empirical evaluations span three application categories, each using foundation models as policy/reward modules and measuring the effect of adding RLAnything components incrementally:

| Setting | Before RL | +Policy RL | +Reward Model | +Env Adapt (Full RLAnything) |
|---|---|---|---|---|
| OSWorld In | 40.4% | 48.3% | 49.6% | 52.1% (+11.7) |
| OSWorld OOD | 16.1% | 19.8% | 20.0% | 21.3% (+5.2) |
| AlfWorld In | 39.0% | 51.1% | 55.6% | 60.2% (+21.2) |
| AlfWorld OOD | 44.9% | 59.3% | 61.8% | 63.6% (+18.7) |
| LiveBench Code | 31.3% | 38.8% | 40.0% | 43.2% (+11.9) |
| LiveBench UT | 27.8% | 27.8% | 73.3% | 78.9% (+51.1) |
| LiveBench Detect | 19.6% | 19.6% | 37.9% | 48.5% (+28.9) |

The RLAnything system consistently outperforms baseline and isolated RL variants, with environment adaptation producing marked improvements in both in-domain and out-of-domain settings. Ablation studies show that integrated step-wise and outcome rewards significantly surpass approaches based only on sparse trajectory endpoints. The choice of the integration hyperparameter $\lambda$ is critical, with $\lambda = 1$ yielding peak performance. Environment adaptation is validated by linear growth in accepted tasks and high pass rates under independent verifiers ($\geq 94\%$ with 32B-model checks).

6. Research Implications and Limitations

The RLAnything framework formalizes the dynamic co-evolution of policy, reward model, and environment within RL, emphasizing the necessity of balanced, adaptive data for effective reward learning and policy optimization. Its use of LLM-based environment reconfiguration distinguishes it from prior static or semi-automated curriculum approaches. The theoretical results supply new guarantees for reward model reliability that are conditional on outcome-class balance, while empirical data establish practical efficacy across diverse, high-complexity domains.

A plausible implication is that similar closed-loop adaptation could be ported to other high-capacity agentic learning systems, provided fine-grained evaluation and task reconfiguration mechanisms are tractable.

Potential limitations pertain to dependence on high-quality reward model calibration, the requirement for scalable LLM intervention for environment adaptation, and possible overfitting to synthetic task distributions due to repeated adaptation cycles.


Reference: "RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System" (Wang et al., 2 Feb 2026)
