
OpenClaw-RL: Unified Reinforcement Learning

Updated 12 March 2026
  • OpenClaw-RL is a unified reinforcement learning framework that leverages every interaction—ranging from shell commands to GUI actions—as live training signals with both scalar rewards and corrective hints.
  • It deploys an asynchronous architecture with decoupled components for policy serving, environment responses, PRM judging, and policy training using a mixed PPO objective.
  • Empirical results show that combining binary process rewards with token-level directive supervision significantly enhances performance across multi-modal and long-horizon tasks.

OpenClaw-RL is a unified reinforcement learning (RL) framework that leverages every agent–environment interaction—ranging from personal conversations and shell commands to GUI actions and tool calls—as a live, online training signal. The core innovation is its treatment of each next-state signal (e.g., user reply or environment response) as a source of both evaluative (scalar reward) and directive (corrective, token-level hint) supervision, enabling the simultaneous training of agentic policies across diverse domains using a single infrastructure. This asynchronous, coordination-free architecture is designed for both rapid personalization and scalable multi-environment RL, with empirical results demonstrating its efficacy in both conversational and general agentic settings (Wang et al., 10 Mar 2026).

1. The Unified Agent–Environment Perspective

OpenClaw-RL formalizes every agent–environment loop as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, \mathcal{T}, r)$, where:

  • $\mathcal{S}$: the state space, typically the full interaction history or environment snapshot at time $t$,
  • $\mathcal{A}$: the action space, often sequences of tokens $a_t = (a_{t,1}, \dots, a_{t,n})$,
  • $\mathcal{T}(s_{t+1} \mid s_t, a_t)$: the (deterministic) transition generating the next-state signal $s_{t+1}$,
  • $r(a_t, s_{t+1}) \in \mathbb{R}$: the scalar reward extracted by a PRM (Process Reward Model) judge.
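The formalism maps directly onto one interaction step. A minimal sketch of that loop follows; all names here are illustrative, not the framework's actual API:

```python
from typing import Callable, NamedTuple, Sequence

class Transition(NamedTuple):
    s_t: str                # state: interaction history / environment snapshot
    a_t: Sequence[str]      # action: a sequence of sampled tokens
    s_next: str             # next-state signal emitted by the environment
    r_t: int                # scalar PRM reward in {+1, 0, -1}

def step(policy: Callable, env: Callable, prm_judge: Callable, s_t: str) -> Transition:
    """One agent-environment loop iteration: act, observe, judge."""
    a_t = policy(s_t)              # a_t ~ pi(. | s_t)
    s_next = env(s_t, a_t)         # deterministic transition T(s_{t+1} | s_t, a_t)
    r_t = prm_judge(a_t, s_next)   # r(a_t, s_{t+1})
    return Transition(s_t, a_t, s_next, r_t)
```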

The agent–environment interaction is asynchronously decoupled into four components:

  1. Policy Serving (SGLang): Samples actions under the current policy, returns log-probabilities.
  2. Environment Server (HTTP/API): Emits next-state signals from various environments.
  3. PRM Judge (SGLang/API): Computes scalar rewards and extracts directive hints.
  4. Policy Trainer (Megatron): Consumes collected data for gradient updates.

Each component runs continuously and independently, with zero coordination overhead. The architecture allows live online learning while serving requests, collecting feedback, and updating policies in parallel.
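The decoupling above can be sketched with queues as the only shared state: each component runs in its own thread and never blocks on the others beyond queue hand-offs. The names below are illustrative stand-ins for SGLang serving, the PRM judge, and the Megatron trainer, not the framework's implementation:

```python
import queue
import threading

rollout_q: "queue.Queue[dict]" = queue.Queue()
judged_q: "queue.Queue[dict]" = queue.Queue()
trained = []

def policy_server():
    # Sample actions under the current policy; hand rollouts to the judge.
    for t in range(3):
        rollout_q.put({"s_t": f"state-{t}", "a_t": f"action-{t}"})

def prm_judge():
    # Attach a scalar reward (stand-in for a PRM majority vote).
    for _ in range(3):
        item = rollout_q.get()
        item["r_t"] = 1
        judged_q.put(item)

def trainer():
    # Consume judged samples whenever they become available.
    for _ in range(3):
        trained.append(judged_q.get())

threads = [threading.Thread(target=f) for f in (policy_server, prm_judge, trainer)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```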

2. Extraction and Utilization of Training Signals

OpenClaw-RL distinguishes two types of supervisory signals from each next-state:

(a) Evaluative (Process) Rewards

After each action, $m$ independent PRM prompt calls evaluate $(a_t, s_{t+1})$ for a scalar reward $r_t \in \{+1, 0, -1\}$, determined by majority vote:

$$\{r_i\}_{i=1}^{m} = \mathrm{PRM}_i(a_t, s_{t+1}), \qquad r_t = \mathrm{MajorityVote}(r_1, \ldots, r_m)$$

This reward is used as the sequence-level advantage in a PPO-style update.
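A minimal sketch of the voting step, assuming ties abstain to 0 (the text specifies majority voting but not a tie-break rule, so that fallback is an assumption):

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common vote among {-1, 0, +1}.

    Ties fall back to 0 (abstain) -- an assumption, since the text
    only specifies majority voting.
    """
    top = Counter(votes).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return 0
    return top[0][0]
```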

(b) Directive Signals via Hindsight-Guided OPD

If the next-state includes a textual correction (e.g., “you should have checked the file first”), an OPD (On-Policy Distillation) module extracts directive hints, applying token-level, directional advantage supervision. The process consists of:

  • Hint extraction: Parallel prompt calls generate pairs $(\mathrm{score}_i, \mathrm{hint}_i)$; only positive, sufficiently long hints are considered.
  • Enhanced teacher context: The original context $s_t$ is augmented with the selected [HINT] to form $s^{\mathrm{enh}}_t$.
  • Directional advantage: Calculated as

$$A^{\mathrm{dir}}_{t,i} = \log\pi_{\mathrm{teacher}}(a_{t,i} \mid s^{\mathrm{enh}}_t) - \log\pi_\theta(a_{t,i} \mid s_t)$$

This yields a per-token advantage integrated into policy optimization.
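Under the assumptions in the list above, hint selection and the per-token advantage can be sketched as follows; the function names and the length threshold of 10 characters are illustrative choices, not the framework's API:

```python
def extract_hint(judge_outputs, min_len=10):
    """Pick the longest positively scored hint, or None if none qualifies.

    judge_outputs: list of (score, hint) pairs from parallel judge calls.
    The min_len filter and longest-wins rule are illustrative assumptions.
    """
    positive = [h for score, h in judge_outputs if score == 1 and len(h) > min_len]
    return max(positive, key=len) if positive else None

def directional_advantage(logp_teacher, logp_old):
    """Per-token directional advantage.

    A_dir[i] = log pi_teacher(a_i | s_enh) - log pi_theta(a_i | s_t):
    positive where the hint-augmented context makes the sampled token
    more likely, negative where it makes it less likely.
    """
    return [lt - lo for lt, lo in zip(logp_teacher, logp_old)]
```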

3. Combined Training Objective and Optimization

The combined learning signal merges sequence-level binary rewards and token-level OPD guidance via a mixed advantage:

$$A_t = w_{\mathrm{binary}}\, r_t + w_{\mathrm{opd}} \sum_{i=1}^{|a_t|} A^{\mathrm{dir}}_{t,i}$$

Defaults are $w_{\mathrm{binary}} = w_{\mathrm{opd}} = 1$. The loss function is a clipped PPO surrogate with KL-regularization:

$$\mathcal{L}_{\mathrm{PPO},t} = -\min\!\left(\rho_t A_t,\ \mathrm{clip}(\rho_t,\, 1-\varepsilon,\, 1+\varepsilon_{\mathrm{high}})\, A_t\right)$$

$$L(\theta) = \mathbb{E}_t\!\left[\mathcal{L}_{\mathrm{PPO},t}\right] + \beta_{\mathrm{KL}}\, D_{\mathrm{KL}}\!\left[\pi_{\mathrm{old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\right]$$

Defaults are $\varepsilon = 0.2$, $\varepsilon_{\mathrm{high}} = 0.28$, and $\beta_{\mathrm{KL}} = 0.01$. This approach exploits the strengths of both dense (frequent, evaluative) and sparse (rare, directive) feedback for robust optimization across action granularities.
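Using the defaults quoted above, the per-token clipped surrogate can be sketched as follows; this is a minimal illustration with assumed function names, not the framework's implementation:

```python
def ppo_loss_token(ratio, advantage, eps=0.2, eps_high=0.28):
    """Clipped PPO surrogate for a single token.

    ratio: importance ratio rho_t = pi_theta(a|s) / pi_old(a|s).
    The clip range is asymmetric per the defaults in the text:
    [1 - eps, 1 + eps_high].
    """
    clipped = min(max(ratio, 1 - eps), 1 + eps_high)
    return -min(ratio * advantage, clipped * advantage)
```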

4. Asynchronous Dataflow and Pseudocode

All OpenClaw-RL components operate asynchronously, eliminating the need for synchronous rollouts or batch coordination. The following summarizes the dataflow:

// Serving via SGLang
on_receive_request(session_id, s_t):
    a_t, logp_old ← πθ.sample_with_logp(s_t)
    send_response(a_t)
    buffer.store(session_id, t, s_t, a_t, logp_old)

// Next-State & PRM Judging
on_next_state(session_id, t, s_{t+1}):
    r_votes ← parallel_call_m_times PRM(a_t, s_{t+1})
    r_t ← majority_vote(r_votes)
    hint_votes ← parallel_call_m_times Judge(a_t, s_{t+1})
    hints_pos ← {h | (score = +1, h) ∈ hint_votes, |h| > 10}
    if hints_pos ≠ ∅:
        hint ← argmax_length(hints_pos)
        s_enh ← append_hint_to_context(s_t, hint)
        logp_teacher ← πθ.force_logp(s_enh, a_t)
        A_dir ← logp_teacher − logp_old   // per-token
    else:
        A_dir ← null
    trainer_queue.push({s_t, a_t, logp_old, r_t, A_dir})

// Training Loop (Megatron)
repeat:
    batch ← trainer_queue.pop_batch()
    for each sample in batch:
        compute A_t = r_t + Σ_i A_dir_{t,i}   (if A_dir present)
        compute ρ_t, clipped surrogate, KL
    L ← mean PPO loss + β_KL · KL_penalty
    θ ← θ − lr · ∇θ L

This implementation ensures data is consumed whenever available, supporting live model serving and continual policy refinement.

5. Architectural and Implementation Considerations

  • Policy backbone: Qwen3 or compatible open-source LLM families, fine-tuned with either LoRA or full-parameter updates using Megatron-Deepspeed.
  • Serving infrastructure: SGLang provides HTTP API interfaces for both personal agents (OpenClaw) and large-scale, cloud-hosted deployments.
  • OPD hint extraction: Leverages a separate PRM prompt configured with temperature 0.6 and output cap (max 8192 tokens), filtering out short or low-quality hints.
  • Supervision: Token-level directionality is imposed directly within the main PPO objective—no second model or distillation head required.
  • Hyperparameters: Defaults include learning rates $1\times10^{-5}$ (personal) and $1\times10^{-6}$ (general), batch size 16 or 8–32 (per environment), PRM votes $m=3$ (GUI) or $m=1$ (others), and training updates every 16 personal-agent turns or continuously for general agents.
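For reference, the defaults scattered through the list above can be gathered into a single config sketch; the field names are assumptions for illustration, not the framework's actual configuration schema:

```python
# Defaults as quoted in the text; key names are illustrative assumptions.
DEFAULTS = {
    "lr_personal": 1e-5,
    "lr_general": 1e-6,
    "batch_size": (8, 32),            # 16 typical; 8-32 per environment
    "prm_votes": {"gui": 3, "default": 1},
    "hint_temperature": 0.6,          # OPD hint-extraction sampling
    "hint_max_tokens": 8192,          # OPD hint-extraction output cap
    "eps": 0.2,                       # PPO clip, lower bound
    "eps_high": 0.28,                 # PPO clip, upper bound
    "beta_kl": 0.01,                  # KL-regularization weight
    "w_binary": 1.0,                  # weight on scalar process reward
    "w_opd": 1.0,                     # weight on directional OPD advantage
}
```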

6. Empirical Performance and Generalization

(a) Personal Conversational Agent Adaptation

On the GSM8K dataset, OpenClaw-RL was tested with simulated "Student" and "Teacher" user types. Metrics show:

| Method    | after 8 updates | after 16 updates |
|-----------|-----------------|------------------|
| Binary RL | 0.25            | 0.23             |
| OPD only  | 0.25            | 0.72             |
| Combined  | 0.76            | 0.81             |

OPD alone produces delayed but substantial improvements due to sparse hints; binary RL yields only marginal improvement. The combined approach rapidly personalizes behavior, attaining high task scores within 36 interactions.

(b) Generalized Agentic RL Across Modalities

Large-scale evaluation was performed across terminal (SETA), GUI (OSWorld), software engineering (SWE-Bench), and tool-calling (DAPO) environments using 8B–32B parameter models. Integrated process rewards demonstrate an advantage on long-horizon tasks:

| Setting   | Integrated | Outcome Only |
|-----------|------------|--------------|
| Tool-call | 0.30       | 0.17         |
| GUI       | 0.33       | 0.31         |

Integrated stepwise PRM and terminal outcome rewards consistently exceed outcome-only training, particularly for complex tasks requiring sustained interaction chains.

This suggests that recovering and merging both evaluative and directive information from every next-state enables not only rapid learning from user interactions but also generalizes to diverse agent microtask environments.

7. Significance and Implications

OpenClaw-RL is distinguished by its universal, asynchronous RL design that merges scalar evaluative feedback and token-level corrections from every next-state across modalities. The methodology allows any agent—personal or general—to improve simply through ongoing usage, without requiring separate infrastructure or problem-specific engineering for each interaction type. A plausible implication is the unification of interactive RL in both personal and scalable agentic domains, leveraging routine human and tool feedback as ordinary experience signals for continual learning (Wang et al., 10 Mar 2026).
