OpenClaw-RL: Unified Reinforcement Learning
- OpenClaw-RL is a unified reinforcement learning framework that leverages every interaction—ranging from shell commands to GUI actions—as live training signals with both scalar rewards and corrective hints.
- It deploys an asynchronous architecture with decoupled components for policy serving, environment responses, PRM judging, and policy training using a mixed PPO objective.
- Empirical results show that combining binary process rewards with token-level directive supervision significantly enhances performance across multi-modal and long-horizon tasks.
OpenClaw-RL is a unified reinforcement learning (RL) framework that leverages every agent–environment interaction—ranging from personal conversations and shell commands to GUI actions and tool calls—as a live, online training signal. The core innovation is its treatment of each next-state signal (e.g., user reply or environment response) as a source of both evaluative (scalar reward) and directive (corrective, token-level hint) supervision, enabling the simultaneous training of agentic policies across diverse domains using a single infrastructure. This asynchronous, coordination-free architecture is designed for both rapid personalization and scalable multi-environment RL, with empirical results demonstrating its efficacy in both conversational and general agentic settings (Wang et al., 10 Mar 2026).
1. The Unified Agent–Environment Perspective
OpenClaw-RL formalizes every agent–environment loop as a Markov Decision Process (MDP) (S, A, T, R), where:
- S: the state space, typically the full interaction history or environment snapshot at time t,
- A: the action space, often sequences of tokens a_t = (a_{t,1}, ..., a_{t,n}),
- T: the (deterministic) transition function generating the next-state signal s_{t+1},
- R: the scalar reward extracted by a PRM (Process Reward Model) judge.
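The unified MDP record can be sketched as a minimal Python structure; the field and class names here are illustrative, not the framework's API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Transition:
    """One agent-environment step in the unified MDP view."""
    state: str                     # s_t: interaction history or environment snapshot
    action: str                    # a_t: sampled token sequence (message, command, GUI action)
    next_state: str                # s_{t+1}: user reply or environment response
    reward: Optional[int] = None   # r_t, filled in later by the PRM judge
    hint: Optional[str] = None     # directive correction extracted from s_{t+1}

# A shell interaction folds into the same record shape as a chat turn would.
step = Transition(
    state="$ ls\nproject.txt",
    action="cat project.txt",
    next_state="cat: project.txt: Permission denied",
)
```

Both `reward` and `hint` stay unset until the judging stage runs, which is what lets serving and judging proceed independently.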
The agent–environment interaction is asynchronously decoupled into four components:
- Policy Serving (SGLang): Samples actions under the current policy, returns log-probabilities.
- Environment Server (HTTP/API): Emits next-state signals from various environments.
- PRM Judge (SGLang/API): Computes scalar rewards and extracts directive hints.
- Policy Trainer (Megatron): Consumes collected data for gradient updates.
Each component runs continuously and independently, with zero coordination overhead. The architecture allows live online learning while serving requests, collecting feedback, and updating policies in parallel.
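As an illustration of this decoupling, the sketch below stands in for the services with threads and an in-process queue; the real system uses separate SGLang, HTTP, and Megatron processes, and all names here are hypothetical:

```python
import queue
import threading

trainer_queue: "queue.Queue[dict]" = queue.Queue()

def judge_worker(sample: dict) -> None:
    """Stands in for PRM judging: finishes independently, then pushes its sample."""
    sample["reward"] = 1  # placeholder for the PRM majority vote
    trainer_queue.put(sample)

def trainer_loop(n: int) -> list:
    """Stands in for the trainer: consumes samples whenever they arrive,
    with no rollout barrier or batch coordination."""
    return [trainer_queue.get() for _ in range(n)]

# Four judgments complete in arbitrary order; the trainer never waits on a barrier.
threads = [threading.Thread(target=judge_worker, args=({"t": t},)) for t in range(4)]
for th in threads:
    th.start()
batch = trainer_loop(4)
for th in threads:
    th.join()
```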
2. Extraction and Utilization of Training Signals
OpenClaw-RL distinguishes two types of supervisory signals from each next-state:
(a) Evaluative (Process) Rewards
After each action a_t, m independent PRM prompt calls evaluate the pair (a_t, s_{t+1}) for a scalar reward r_t ∈ {0, 1}, determined by majority vote:

r_t = majority_vote({PRM_j(a_t, s_{t+1})}_{j=1..m})

This reward is used as the sequence-level advantage in a PPO-style update.
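The vote reduction itself is straightforward; a minimal sketch, assuming binary votes as in the pseudocode of Section 4:

```python
from collections import Counter

def majority_vote(votes):
    """Reduce m independent binary PRM judgments to one process reward r_t."""
    return Counter(votes).most_common(1)[0][0]

# Five parallel PRM calls on the same (a_t, s_{t+1}) pair.
r_t = majority_vote([1, 1, 0, 1, 0])
```

Using an odd m avoids ties; on an even split, `Counter.most_common` falls back to first-seen order.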
(b) Directive Signals via Hindsight-Guided OPD
If the next-state includes a textual correction (e.g., “you should have checked the file first”), an OPD (On-Policy Distillation) module extracts directive hints, applying token-level, directional advantage supervision. The process consists of:
- Hint extraction: Parallel prompt calls generate scored candidate hints {(score_j, h_j)}; only positive, sufficiently long hints are considered.
- Enhanced teacher context: The original context s_t is augmented with the selected [HINT] to form the enhanced context s_enh.
- Directional advantage: Calculated per token as A_dir,i = log πθ(a_{t,i} | s_enh) − log πθ(a_{t,i} | s_t), i.e., the teacher log-probability under the hint-enhanced context minus the log-probability under the original context.
This yields a per-token advantage integrated into policy optimization.
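Given per-token log-probabilities with and without the hint (the pseudocode's logp_teacher and logp_old), the directive signal reduces to an elementwise difference. The numbers below are made up for illustration:

```python
def directional_advantage(logp_teacher, logp_old):
    """Per-token directive signal: how much more likely each action token
    becomes once the extracted hint is prepended to the context."""
    return [lt - lo for lt, lo in zip(logp_teacher, logp_old)]

# Hypothetical log-probs for a 4-token action, with and without the hint.
a_dir = directional_advantage(
    logp_teacher=[-0.2, -1.0, -0.5, -0.1],
    logp_old=[-0.9, -1.4, -0.5, -0.3],
)
# Positive entries mark tokens the hint makes more likely; zero entries
# leave the corresponding tokens untouched by the directive term.
```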
3. Combined Training Objective and Optimization
The combined learning signal merges sequence-level binary rewards and token-level OPD guidance via a mixed advantage:

A_{t,i} = λ_eval · r_t + λ_dir · A_dir,{t,i}

with fixed default weights λ_eval and λ_dir. The loss function is a clipped PPO surrogate with KL-regularization:

L(θ) = −E[min(ρ_{t,i} · A_{t,i}, clip(ρ_{t,i}, 1 − ε, 1 + ε) · A_{t,i})] + β_KL · KL(πθ ‖ π_ref)

where ρ_{t,i} = πθ(a_{t,i} | s_t) / π_old(a_{t,i} | s_t) is the importance ratio, ε the clipping threshold, and β_KL the KL penalty coefficient. This approach exploits the strengths of dense (frequent, evaluative) and sparse (rare, directive) feedback for robust optimization across action granularities.
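A single-token sketch of the mixed advantage and the clipped objective follows; the eps, beta_kl, and lam_* values are illustrative placeholders, not the framework's defaults:

```python
import math

def mixed_advantage(r_t, a_dir_token, lam_eval=1.0, lam_dir=1.0):
    """Sequence-level process reward plus the directive term for one token
    (the directive part is dropped when no hint was extracted)."""
    directive = a_dir_token if a_dir_token is not None else 0.0
    return lam_eval * r_t + lam_dir * directive

def ppo_token_loss(logp_new, logp_old, advantage, eps=0.2, beta_kl=0.01):
    """Clipped PPO surrogate with a simple KL penalty for a single token."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    kl = logp_old - logp_new  # one-sample estimate of KL against the old policy
    return -surrogate + beta_kl * kl

adv = mixed_advantage(r_t=1, a_dir_token=0.4)
loss = ppo_token_loss(logp_new=-0.5, logp_old=-0.6, advantage=adv)
```

In a full implementation these per-token losses are averaged over the batch before the gradient step, as in the pseudocode of Section 4.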
4. Asynchronous Dataflow and Pseudocode
All OpenClaw-RL components operate asynchronously, eliminating the need for synchronous rollouts or batch coordination. The following summarizes the dataflow:
```
// Serving via SGLang
on_receive_request(session_id, s_t):
    a_t, logp_old ← πθ.sample_with_logp(s_t)
    send_response(a_t)
    buffer.store(session_id, t, s_t, a_t, logp_old)

// Next-State & PRM Judging
on_next_state(session_id, t, s_{t+1}):
    r_votes ← parallel_call_m(PRM, a_t, s_{t+1})
    r_t ← majority_vote(r_votes)
    hint_votes ← parallel_call_m(Judge, a_t, s_{t+1})
    hints_pos ← {h | (score = +1, h) ∈ hint_votes, |h| > 10}
    if hints_pos ≠ ∅:
        hint ← argmax_length(hints_pos)
        s_enh ← append_hint_to_context(s_t, hint)
        logp_teacher ← πθ.force_logp(s_enh, a_t)
        A_dir ← logp_teacher − logp_old   // per-token
    else:
        A_dir ← null
    trainer_queue.push({s_t, a_t, logp_old, r_t, A_dir})

// Training Loop (Megatron)
repeat:
    batch ← trainer_queue.pop_batch()
    for each sample in batch:
        A_t ← r_t + Σ_i A_dir_{t,i}   // if A_dir present
        compute ρ_t, clipped surrogate, KL
    L ← mean PPO loss + β_KL · KL_penalty
    θ ← θ − lr · ∇θ L
```
This implementation ensures data is consumed whenever available, supporting live model serving and continual policy refinement.
5. Architectural and Implementation Considerations
- Policy backbone: Qwen3 or compatible open-source LLM families, fine-tuned with either LoRA or full-parameter updates using Megatron-Deepspeed.
- Serving infrastructure: SGLang provides HTTP API interfaces for both personal agents (OpenClaw) and large-scale, cloud-hosted deployments.
- OPD hint extraction: Leverages a separate PRM prompt configured with temperature 0.6 and output cap (max 8192 tokens), filtering out short or low-quality hints.
- Supervision: Token-level directionality is imposed directly within the main PPO objective—no second model or distillation head required.
- Hyperparameters: Default learning rates differ between the personal and general settings; batch sizes are 16, or 8–32 per environment; the PRM vote count m differs between GUI and other environments; and training updates run every 16 personal-agent turns or continuously for general agents.
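The hint-filtering rule used during judging (positive score, length above a threshold, longest candidate wins) can be sketched as follows; whether the |h| > 10 cutoff from the pseudocode counts tokens or characters is an assumption here:

```python
def select_hint(hint_votes, min_len=10):
    """Keep positively scored hints longer than min_len characters,
    then take the longest survivor as the teacher hint."""
    positive = [h for score, h in hint_votes if score == 1 and len(h) > min_len]
    return max(positive, key=len) if positive else None

hint = select_hint([
    (+1, "check file permissions before reading"),
    (-1, "bad"),
    (+1, "retry"),   # positive but below the length cutoff, so filtered out
])
```

When no hint survives, the sample falls back to pure evaluative training, matching the A_dir ← null branch in the pseudocode.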
6. Empirical Performance and Generalization
(a) Personal Conversational Agent Adaptation
On the GSM8K dataset, OpenClaw-RL was tested with simulated "Student" and "Teacher" user types. Metrics show:
| Method | Score after 8 updates | Score after 16 updates |
|---|---|---|
| Binary RL | 0.25 | 0.23 |
| OPD only | 0.25 | 0.72 |
| Combined | 0.76 | 0.81 |
OPD alone produces delayed but substantial improvements due to sparse hints; binary RL yields only marginal improvement. The combined approach rapidly personalizes behavior, attaining high task scores within 36 interactions.
(b) Generalized Agentic RL Across Modalities
Large-scale evaluation was performed across terminal (SETA), GUI (OSWorld), software engineering (SWE-Bench), and tool-calling (DAPO) environments using 8B–32B parameter models. Integrated process rewards demonstrate an advantage on long-horizon tasks:
| Setting | Integrated | Outcome Only |
|---|---|---|
| Tool‐call | 0.30 | 0.17 |
| GUI | 0.33 | 0.31 |
Integrated stepwise PRM and terminal outcome rewards consistently exceed outcome-only training, particularly for complex tasks requiring sustained interaction chains.
This suggests that recovering and merging both evaluative and directive information from every next-state enables not only rapid learning from user interactions but also generalizes to diverse agent microtask environments.
7. Significance and Implications
OpenClaw-RL is distinguished by its universal, asynchronous RL design that merges scalar evaluative feedback and token-level corrections from every next-state across modalities. The methodology allows any agent—personal or general—to improve simply through ongoing usage, without requiring separate infrastructure or problem-specific engineering for each interaction type. A plausible implication is the unification of interactive RL in both personal and scalable agentic domains, leveraging routine human and tool feedback as ordinary experience signals for continual learning (Wang et al., 10 Mar 2026).