RLAnything: Dynamic RL for Co-Evolutionary Learning
- RLAnything is a reinforcement learning framework that dynamically co-evolves policy, reward model, and environment through closed feedback loops.
- It integrates step-wise evaluations with outcome signals to automatically adjust task difficulty, boosting performance across various benchmarks.
- The framework employs language model prompts for adaptive environment reconfiguration, establishing both theoretical criteria and empirical gains in RL.
RLAnything is a reinforcement learning (RL) framework designed for dynamically and simultaneously optimizing the three principal modules of an RL system—policy, reward model, and environment—using a closed-loop, critic-driven co-evolutionary process. Targeted at LLM and agentic scenarios, RLAnything provides fine-grained learning signals and automatic environment adaptation, thereby amplifying policy improvement and reward-model generalization. The framework integrates policy training with step-wise and outcome feedback, develops a reward model using consistency-driven RL, and adapts task difficulty through automated LLM prompts, all within mutually reinforcing optimization cycles. RLAnything demonstrates empirical gains across various benchmarks, including desktop GUI environments, text-based games, and coding tasks, and establishes theoretical criteria for adaptive environment construction and reward-model precision (Wang et al., 2 Feb 2026).
1. Modular Co-Evolution via Closed Feedback Loops
In RLAnything, the environment (task set) $\mathcal{E}$, the policy $\pi_\theta$, and the reward model $r_\phi$ are interlinked in a feedback loop, each driving the development of the others:
- Policy module ($\pi_\theta$): Given a task $x \sim \mathcal{E}$, the policy generates a trajectory $\tau$.
- Reward model ($r_\phi$): Provides fine-grained, step-level feedback on each trajectory by evaluating task progress at every step; step-wise signals represent assessments of the quality of each action or reasoning segment.
- Environment module ($\mathcal{E}$): The task set is dynamically adapted based on “critic feedback” drawn from performance indicators and error summaries; an LLM is used to propose easier or harder versions of tasks as needed.
The system utilizes three key supervisory signals:
- Outcome signal ($o(\tau) \in \{0, 1\}$): Binary indicator of the entire trajectory's success or failure.
- Step-wise signals ($s_t^{(k)}$): Multiple independent evaluations (indexed by $k$) of each step $t$.
- Consistency feedback: Agreement between each step-evaluation signal and the aggregate step quality, quantifying reward-model reliability.
Environment adaptation employs post-episode rollout accuracy measures; if a task’s success rate lies outside predefined thresholds $[\beta_{\min}, \beta_{\max}]$, targeted rewrites are solicited from an LLM, using reward-model-generated failure summaries as minimal diagnostic input.
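The thresholding logic behind environment adaptation can be sketched as follows; the function name, the default band $[0.2, 0.8]$, and the returned labels are illustrative assumptions, not the paper's interface:

```python
def adapt_task(success_rate, beta_low=0.2, beta_high=0.8):
    """Decide how to adapt a task from its post-episode rollout accuracy.

    Returns "harder" when the policy already solves the task too often,
    "easier" when it almost never succeeds, and None inside the band.
    (Threshold values here are assumed defaults for illustration.)
    """
    if success_rate > beta_high:
        return "harder"   # task too easy: solicit a harder LLM rewrite
    if success_rate < beta_low:
        return "easier"   # task too hard: solicit a simpler LLM rewrite
    return None           # within the target band: keep the task as-is
```

In the full system, the returned label would drive an LLM rewrite prompt augmented with the reward model's failure summaries for that task.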
2. Formalism of Policy, Reward, and Environment Dynamics
RLAnything employs interconnected RL objectives across its modules:
- Integrated policy optimization:

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{E},\; \tau \sim \pi_\theta(\cdot \mid x)} \Big[ \textstyle\sum_t r_t \Big]$$

Here, the per-step reward is replaced with an integrated signal blending the outcome with the step-wise assessments,

$$r_t = \lambda\, o(\tau) + (1 - \lambda)\, \bar{s}_t, \qquad \bar{s}_t = \frac{1}{K} \sum_{k=1}^{K} s_t^{(k)},$$

where $\lambda$ is the integration hyperparameter. Advantages are computed by standardizing $r_t$ across batch trajectories, with the policy updated using PPO or similar on-policy algorithms.
- Reward model optimization:

$$\max_\phi \; \mathbb{E} \Big[ \textstyle\sum_{t,k} c\big(s_t^{(k)}, \bar{s}_t\big) \Big]$$

Here, the consistency term $c$ enforces self-consistency in step-level labels, and PPO is applied analogously to maximize agreement between each $s_t^{(k)}$ and the aggregate $\bar{s}_t$.
- Environment adaptation:

$\mathcal{E}$ is updated by minimizing the deviation of per-task rollout accuracy from the target band,

$$\min_{\mathcal{E}} \; \sum_{x \in \mathcal{E}} \mathrm{dist}\big(\mathrm{acc}(x), [\beta_{\min}, \beta_{\max}]\big),$$

subject to task-validity constraints. Candidate tasks are generated using LLMs with prompts informed by failure-mode summaries; acceptance is conditional on efficacy and appropriate difficulty calibration.
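The integrated reward and batch-standardized advantages can be sketched as below; the convex-combination rule with a single $\lambda$ and the function names are assumptions standing in for the paper's exact formulation:

```python
from statistics import mean, pstdev

def integrated_rewards(outcome, step_signals, lam=0.5):
    """Blend the trajectory-level outcome with mean step-wise signals.

    outcome: 1.0 for success, 0.0 for failure (applied to every step).
    step_signals: per-step lists of K independent reward-model evaluations.
    lam: assumed integration hyperparameter weighting outcome vs. steps.
    """
    return [lam * outcome + (1 - lam) * mean(sigs) for sigs in step_signals]

def standardized_advantages(rewards):
    """Standardize integrated rewards across the batch (zero mean, unit std)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

The standardized values then serve as per-step advantages for a PPO-style policy update.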
3. Systemic Algorithmic Structure
A single optimization run comprises interleaved updates:
- Sampling:
- Sample tasks $x$ from $\mathcal{E}$
- Generate $n$ rollouts per task with policy $\pi_\theta$
- Record outcomes $o(\tau)$ and step-wise reward-model signals $s_t^{(k)}$
- Compute integrated rewards for each trajectory step
- Policy update:
- Standardize integrated rewards to compute advantages
- Update policy via PPO or compatible on-policy methods
- Reward model update:
- Compute advantages for reward model signals
- Update $r_\phi$ using PPO, maximizing expected signal agreement
- Environment adaptation:
- Compute per-task rollout accuracy $\mathrm{acc}(x)$
- For $\mathrm{acc}(x) > \beta_{\max}$, propose harder variants; for $\mathrm{acc}(x) < \beta_{\min}$, propose simpler ones
- Accept new tasks only if they move difficulty within the target accuracy band, preserving proximity to the learner’s current capabilities
Critic feedback for environment adaptation is constructed from consolidated explanations accompanying negative reward model evaluations, guiding minimal, context-targeted environment adjustments.
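One round of the interleaved updates above can be outlined as follows; the helpers `rollout`, `ppo_update`, and `propose_variant` are hypothetical placeholders for components the paper leaves unspecified, as are the hyperparameter names:

```python
def training_round(tasks, policy, reward_model, rollout, ppo_update,
                   propose_variant, n_rollouts=8, beta=(0.2, 0.8), lam=0.5):
    """One interleaved round: sample, update both models, adapt the task set.

    rollout(policy, reward_model, task) is assumed to return
    (trajectory, outcome in {0.0, 1.0}, per-step aggregated signals).
    """
    batch, accuracy = [], {}
    for task in tasks:
        outcomes = []
        for _ in range(n_rollouts):
            traj, outcome, step_signals = rollout(policy, reward_model, task)
            # integrated per-step reward: outcome blended with step signals
            rewards = [lam * outcome + (1 - lam) * s for s in step_signals]
            batch.append((traj, rewards, step_signals))
            outcomes.append(outcome)
        accuracy[task] = sum(outcomes) / n_rollouts
    ppo_update(policy, [(t, r) for t, r, _ in batch])        # policy step
    ppo_update(reward_model, [(t, s) for t, _, s in batch])  # consistency step
    new_tasks = []
    for task, acc in accuracy.items():
        if acc > beta[1]:                       # too easy: harder variant
            new_tasks.append(propose_variant(task, "harder"))
        elif acc < beta[0]:                     # too hard: simpler variant
            new_tasks.append(propose_variant(task, "easier"))
        else:                                   # in band: keep as-is
            new_tasks.append(task)
    return new_tasks
```

A real system would additionally gate `propose_variant` outputs behind the acceptance check described above, only admitting tasks whose measured difficulty lands inside the target band.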
4. Theoretical Criteria for Reward Precision and Balance
Two foundational theorems underpin RLAnything’s adaptive mechanisms:
- Theorem 1 (Reward Precision): Let $n$ denote the number of sampled transitions per outcome class and $p$ the reward model's per-step precision. The probability that the mean step score for successful transitions ($\bar{s}^{+}$) exceeds that for failures ($\bar{s}^{-}$) converges to 1 if and only if $p > \tfrac{1}{2}$ as $n \to \infty$:

$$\Pr\big[\bar{s}^{+} > \bar{s}^{-}\big] \longrightarrow 1 \iff p > \tfrac{1}{2}.$$

This sets a sufficient precision criterion for reward-model quality with respect to batch balancing.
- Theorem 2 (Importance-Weight Imbalance):
With task success rate $q$, the expected reward for the reward model can be decomposed as

$$\mathbb{E}[r_\phi] = w^{+}(q)\, \mathbb{E}\big[s \mid o = 1\big] + w^{-}(q)\, \mathbb{E}\big[s \mid o = 0\big],$$

where $w^{+}$ and $w^{-}$ are importance-weighting functions over policy-generated samples. If a task becomes too easy or too difficult ($q \to 1$ or $q \to 0$), the balance of sample frequencies is lost and precision degrades. A plausible implication is that balanced task distributions across outcome classes are necessary for maximum reward-model informativeness.
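The intuition behind Theorem 1 can be checked with a small simulation: when a reward model scores steps correctly with probability $p > 1/2$, the mean score over successful transitions exceeds that over failed ones with probability approaching 1 as the batch grows. The setup and parameter values below are illustrative, not taken from the paper:

```python
import random

def mean_gap_wins(p, n, trials=500, seed=0):
    """Fraction of trials where successes outscore failures on average.

    Each trial draws n successful and n failed transitions; the reward
    model scores a transition correctly (1 for a success, 0 for a
    failure) with probability p, and flips the label otherwise.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        succ = [1 if rng.random() < p else 0 for _ in range(n)]
        fail = [0 if rng.random() < p else 1 for _ in range(n)]
        wins += sum(succ) / n > sum(fail) / n
    return wins / trials
```

With $p = 0.8$ and a few hundred transitions per class, `mean_gap_wins` approaches 1, while at $p = 0.5$ it hovers near chance, matching the theorem's better-than-chance precision requirement.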
5. Empirical Evidence and Benchmarking
Empirical evaluations span three application categories, each using foundation models as policy/reward modules and measuring the effect of adding RLAnything components incrementally (success rates; parenthesized values are percentage-point gains over Before RL):
| Setting | Before RL | +Policy RL | +Reward Model | +Env Adapt (Full RLAnything) |
|---|---|---|---|---|
| OSWorld In | 40.4% | 48.3% | 49.6% | 52.1% (+11.7) |
| OSWorld OOD | 16.1% | 19.8% | 20.0% | 21.3% (+5.2) |
| AlfWorld In | 39.0% | 51.1% | 55.6% | 60.2% (+21.2) |
| AlfWorld OOD | 44.9% | 59.3% | 61.8% | 63.6% (+18.7) |
| LiveBench Code | 31.3% | 38.8% | 40.0% | 43.2% (+11.9) |
| LiveBench UT | 27.8% | 27.8% | 73.3% | 78.9% (+51.1) |
| LiveBench Detect | 19.6% | 19.6% | 37.9% | 48.5% (+28.9) |
The RLAnything system consistently outperforms baseline and isolated RL variants, with environment adaptation producing marked improvements in both in-domain and out-of-domain settings. Ablation studies show that integrated step-wise and outcome rewards significantly surpass approaches based only on sparse trajectory endpoints. The choice of the integration hyperparameter balancing outcome and step-wise rewards is critical, with an intermediate value yielding peak performance. Environment adaptation is further validated by linear growth in accepted tasks and high pass rates under independent verification with 32B-model checks.
6. Research Implications and Limitations
The RLAnything framework formalizes the dynamic co-evolution of policy, reward model, and environment within RL, emphasizing the necessity of balanced, adaptive data for effective reward learning and policy optimization. Its use of LLM-based environment reconfiguration distinguishes it from prior static or semi-automated curriculum approaches. The theoretical results supply new guarantees for reward model reliability that are conditional on outcome-class balance, while empirical data establish practical efficacy across diverse, high-complexity domains.
A plausible implication is that similar closed-loop adaptation could be ported to other high-capacity agentic learning systems, provided fine-grained evaluation and task reconfiguration mechanisms are tractable.
Potential limitations pertain to dependence on high-quality reward model calibration, the requirement for scalable LLM intervention for environment adaptation, and possible overfitting to synthetic task distributions due to repeated adaptation cycles.
Reference: "RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System" (Wang et al., 2 Feb 2026)