DART-GUI-7B: Decoupled VLM for GUI Automation

Updated 5 October 2025
  • DART-GUI-7B is an open-source vision-language GUI agent that leverages a decoupled RL training pipeline to automatically perform complex desktop and mobile operations.
  • The system employs asynchronous modules for environment simulation, trajectory rollout, data management, and step-wise GRPO training, enhancing computational throughput.
  • Empirical evaluations demonstrate a 42.13% task success rate on OSWorld, outperforming prior models by over 7 percentage points with improved resource utilization.

DART-GUI-7B is an open-source, vision-language model (VLM)–based GUI agent trained with a decoupled, asynchronous reinforcement learning (RL) framework and an adaptive data curation pipeline. It is designed to efficiently automate complex desktop and mobile tasks across diverse GUI environments, achieving state-of-the-art success rates through system-level architectural innovations. DART-GUI-7B reaches a 42.13% task success rate on OSWorld with 30 steps per episode, outperforming both its base model UI-TARS-1.5-7B (27.52%) and the prior open-source SOTA by over 7 percentage points (Li et al., 28 Sep 2025).

1. Decoupled Reinforcement Learning Architecture

DART-GUI-7B is trained under the DART (Decoupled Agentic RL Training) framework, which splits the RL pipeline into four fully asynchronous modules:

  • Environment Cluster: Runs hundreds of concurrent isolated GUI desktop instances for parallelized agent-environment interaction.
  • Rollout Service: Dynamically allocates computation resources to generate agent trajectories (i.e., sequences of observations, thoughts, actions, and resulting states) in parallel, decoupled from training.
  • Data Manager: Aggregates, filters, and strategically routes completed trajectories for use in policy updates, leveraging custom rules to optimize data quality.
  • Trainer: Independently updates model parameters using step-wise Group Relative Policy Optimization (GRPO), with weights synchronized asynchronously to the rollout workers.

This decoupling avoids blocking on slow environment I/O or long inference times: the modules communicate through non-blocking channels, maximizing both compute and environment utilization (see the sketch after the module table below).

Module | Function | Asynchronous?
--- | --- | ---
Environment Cluster | Desktop instance orchestration | Yes
Rollout Service | Agent trajectory collection | Yes
Data Manager | Experience pool aggregation, filtering/curation, routing | Yes
Trainer | Model parameter updates via GRPO | Yes
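
The decoupling can be pictured as independent loops connected by queues. The following is a minimal sketch under assumed interfaces: `run_episode`, `filter_and_route`, and `grpo_update` are hypothetical placeholders standing in for the Environment Cluster/Rollout Service, Data Manager, and Trainer logic, not the released code.

```python
import asyncio
import random

# Hypothetical stand-ins for the real components; names are illustrative.
async def run_episode(task, weights):
    # Simulate a slow GUI episode (environment I/O dominates wall-clock time).
    await asyncio.sleep(random.random() * 0.2)
    return {"task": task, "steps": [], "reward": random.random()}

def filter_and_route(trajectory):
    # Placeholder curation rule: only route trajectories worth training on.
    return [trajectory] if trajectory["reward"] > 0.3 else []

def grpo_update(weights, batch):
    # Placeholder "update": the real Trainer runs step-wise GRPO here.
    return weights + 1

async def rollout_loop(task_q, traj_q, policy):
    # Rollout Service: pull a task, run one episode against an isolated GUI
    # environment instance, push the trajectory; never blocks on the Trainer.
    while True:
        task = await task_q.get()
        traj_q.put_nowait(await run_episode(task, policy["weights"]))
        task_q.put_nowait(task)  # re-enqueue so the demo keeps running

async def data_manager_loop(traj_q, train_q):
    # Data Manager: aggregate, filter, and route completed trajectories.
    while True:
        trajectory = await traj_q.get()
        for batch in filter_and_route(trajectory):
            train_q.put_nowait(batch)

async def trainer_loop(train_q, policy):
    # Trainer: updates parameters; rollout workers pick up fresh weights
    # asynchronously instead of waiting at a synchronous barrier.
    while True:
        batch = await train_q.get()
        policy["weights"] = grpo_update(policy["weights"], batch)

async def main():
    task_q, traj_q, train_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    policy = {"weights": 0}
    for t in range(8):
        task_q.put_nowait(f"task-{t}")
    loops = [rollout_loop(task_q, traj_q, policy) for _ in range(4)]
    loops += [data_manager_loop(traj_q, train_q), trainer_loop(train_q, policy)]
    tasks = [asyncio.create_task(loop) for loop in loops]
    await asyncio.sleep(1.0)  # let the pipeline run briefly
    for t in tasks:
        t.cancel()
    print("updates applied:", policy["weights"])

if __name__ == "__main__":
    asyncio.run(main())
```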

2. System Efficiency and Resource Utilization

DART-GUI-7B’s DART framework design results in:

  • 1.6× rollout GPU utilization: Asynchronous inference prevents idling, with multiple workers generating trajectories even while others are blocked on weight updates.
  • 1.9× training throughput: Training actions per minute (APM) nearly double, e.g., from 22.6 to 43.6 APM, compared to prior batch-synchronous strategies.
  • 5.5× environment utilization: Fine-grained, rollout-wise scheduling allows immediate reassignment of environment slots, eliminating waits for entire batches to complete.

Efficiency gains are quantified by metrics such as GPU utilization, environment slot occupancy, and overall training speed, as demonstrated in the paper’s benchmarks and comparison tables.

The step-wise GRPO objective used for RL is:

J(\theta) = \mathbb{E}_{(h, s, a, R) \sim \mathcal{D}} \left[ \min \left( \frac{\pi^\text{Train}_\theta(a \mid h,s)}{\pi^\text{Train}_\text{old}(a \mid h,s)}\, A,\; \operatorname{clip} \left( \frac{\pi^\text{Train}_\theta(a \mid h,s)}{\pi^\text{Train}_\text{old}(a \mid h,s)},\, 1-\epsilon_\text{low},\, 1+\epsilon_\text{high} \right) A \right) - \beta\, D_\text{KL} \left( \pi^\text{Train}_\theta(a \mid h,s) \,\|\, \pi^\text{Ref}(a \mid h,s) \right) \right]

where $A$ denotes the grouped step-wise advantage, and the mismatch between the trainer policy and the (possibly stale) rollout policy is corrected via truncated importance sampling for decoupled synchronization.
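
A minimal PyTorch sketch of such a clipped, KL-regularized step-wise loss is shown below; it assumes per-step log-probabilities have already been gathered from the trainer, rollout, and reference policies, and the function name, tensor layout, and default hyperparameters are illustrative rather than the paper’s implementation.

```python
import torch

def stepwise_grpo_loss(logp_train, logp_old, logp_ref, advantages,
                       eps_low=0.2, eps_high=0.2, beta=0.01, ratio_cap=10.0):
    """Clipped, KL-regularized surrogate over a batch of steps (illustrative).

    logp_train / logp_old / logp_ref: per-step log pi(a|h,s) under the current
    trainer, rollout, and reference policies; advantages: grouped step-wise A.
    """
    # Importance ratio between trainer and (possibly stale) rollout policy,
    # truncated to bound off-policy drift from asynchronous rollouts.
    ratio = torch.exp(logp_train - logp_old).clamp(max=ratio_cap)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantages
    surrogate = torch.minimum(unclipped, clipped)

    # Simple per-sample estimate of the KL term against the reference policy.
    kl = logp_train - logp_ref

    # Negate because optimizers minimize, while J(theta) is maximized.
    return -(surrogate - beta * kl).mean()

# Usage on dummy per-step data (32 steps).
logp_train = torch.randn(32, requires_grad=True)
loss = stepwise_grpo_loss(logp_train, torch.randn(32), torch.randn(32), torch.randn(32))
loss.backward()
```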

3. Adaptive Data Curation Mechanisms

DART-GUI-7B’s adaptive curation pipeline supports effective learning in sparse-reward, multi-step environments:

  • Pre-collection of successful trajectories: For tasks with low online success rates, an Experience Pool maintains a buffer of successful trajectories for injection into training batches, which is crucial for complex, long-horizon tasks (see the curation sketch after this list).
  • Dynamic rollout frequency: Tasks with high ongoing success rates are sampled less often (e.g., if over 60% success, rollout count per task is reduced), concentrating resources on lower-performing tasks.
  • Dynamic trajectory length: Trajectory caps are set per-task based on historical episode statistics, enabling 10-step caps for simple click tasks and up to 50-step caps for complex tasks.
  • High-entropy step prioritization: Only the top 80% highest token-level entropy steps in agent trajectories are selected for policy updates, focusing training on critical, high-uncertainty decisions.
  • Truncated importance sampling: Mitigates distributional drift between the rollout policy $\pi^\text{Train}_\text{old}$ and trainer policy updates, stabilizing asynchronous, decoupled training.
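
A schematic of how these curation rules might fit together is sketched below. The 60% success threshold and the 10–50 step cap range come from the description above; the class name, history window sizes, low-success threshold for pool injection, and the 2× average-length cap heuristic are assumptions for illustration.

```python
import random
from collections import defaultdict, deque

class AdaptiveCurator:
    """Illustrative per-task scheduling and curation rules (not the released code)."""

    def __init__(self, base_rollouts=8, min_rollouts=2,
                 success_threshold=0.6, pool_size=64):
        self.success = defaultdict(lambda: deque(maxlen=50))   # recent outcomes per task
        self.lengths = defaultdict(lambda: deque(maxlen=50))   # recent episode lengths per task
        self.experience_pool = defaultdict(lambda: deque(maxlen=pool_size))
        self.base_rollouts = base_rollouts
        self.min_rollouts = min_rollouts
        self.success_threshold = success_threshold

    def record(self, task, trajectory, succeeded):
        self.success[task].append(1.0 if succeeded else 0.0)
        self.lengths[task].append(len(trajectory))
        if succeeded:
            self.experience_pool[task].append(trajectory)       # keep successes for replay

    def rollout_count(self, task):
        # Dynamic rollout frequency: sample well-solved tasks (>60% success) less often.
        hist = self.success[task]
        rate = sum(hist) / len(hist) if hist else 0.0
        return self.min_rollouts if rate > self.success_threshold else self.base_rollouts

    def step_cap(self, task):
        # Dynamic trajectory length: per-task cap from historical episode statistics,
        # bounded between 10 (simple click tasks) and 50 (complex tasks).
        hist = self.lengths[task]
        if not hist:
            return 50
        return min(50, max(10, int(2 * sum(hist) / len(hist))))

    def build_batch(self, online_trajectories, task):
        # Inject stored successes when the online success rate is low (threshold assumed).
        batch = list(online_trajectories)
        hist = self.success[task]
        rate = sum(hist) / len(hist) if hist else 0.0
        if rate < 0.2 and self.experience_pool[task]:
            batch.append(random.choice(self.experience_pool[task]))
        return batch
```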

Token-wise (action + thought) entropy for each step $t$:

H_t = \frac{1}{|r_t| + |a_t|} \sum_i H_{t,i}, \qquad H_{t,i} = -\sum_{v=1}^{V} p_{t,i,v} \log p_{t,i,v}

Only steps with the largest $H_t$ contribute to the gradient update, avoiding overfitting to “easy” steps.
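
A minimal sketch of this entropy-based step selection follows, assuming per-step logits are available for each step’s thought and action tokens; the shapes, vocabulary size, and `keep_fraction` default of 0.8 (matching the top-80% rule above) are illustrative.

```python
import torch

def step_entropy(logits):
    # Mean token-level entropy over one step's thought + action tokens.
    # logits: (num_tokens, vocab_size) for the tokens generated at this step.
    probs = torch.softmax(logits, dim=-1)
    token_entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)
    return token_entropy.mean()

def select_high_entropy_steps(step_logits, keep_fraction=0.8):
    # Keep only the highest-entropy (most uncertain) steps for the policy update.
    entropies = torch.stack([step_entropy(l) for l in step_logits])
    k = max(1, int(keep_fraction * len(step_logits)))
    keep = torch.topk(entropies, k).indices
    return sorted(keep.tolist())

# Example: 5 steps, each with a different number of generated tokens.
steps = [torch.randn(n, 32000) for n in (12, 7, 30, 5, 18)]
print(select_high_entropy_steps(steps))
```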

4. Policy Update Objective and RL Optimization Strategy

The policy is optimized using a step-wise GRPO variant adapted for asynchronous, group-sampled data:

  • Grouped advantage normalization: Groups of sampled actions are scored, rewards are normalized within each group, and the clipped PPO-style objective is regularized with a KL-divergence anchor to a reference policy (see the sketch after this list).
  • Reward composition: Structure and accuracy are jointly rewarded; action outputs are evaluated for format correctness and action execution accuracy (e.g., success of a GUI click or system command).
  • Step-wise learning: Unlike purely episode-based RL, DART-GUI-7B’s trainer updates on step-level transitions, directly aligning with the GUI agent’s compositional “thought-action” representations.
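
The following sketch illustrates grouped advantage normalization over a composite format-plus-accuracy reward, as described above; the reward weights and function names are assumptions, not the paper’s exact formulation.

```python
import torch

def composite_reward(format_ok, action_correct, w_format=0.1, w_accuracy=1.0):
    # Reward composition (weights are illustrative): output structure + execution accuracy.
    return w_format * float(format_ok) + w_accuracy * float(action_correct)

def grouped_advantages(rewards, eps=1e-8):
    # Normalize rewards within a group of actions sampled for the same state,
    # in the GRPO style: A_i = (r_i - mean(r)) / (std(r) + eps).
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 sampled actions for one GUI state, two of which execute correctly.
rewards = [composite_reward(True, c) for c in (True, False, False, True)]
print(grouped_advantages(rewards))
```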

The decoupled system design ensures that data sampling and gradient computation proceed asynchronously and near-continuously, without being gated by rollout environment latency.

5. Empirical Evaluation and Benchmark Results

On the OSWorld benchmark, DART-GUI-7B demonstrates robust improvement and generalization:

  • 42.13% task success rate (30-step cap), compared to 27.52% for UI-TARS-1.5-7B (100 steps): a 14.61 percentage point absolute gain and 7.34 percentage points above the prior open-source SOTA.
  • Subdomain gains: >31 percentage point improvement on OS tasks, and absolute gains of roughly 20–21 percentage points in application domains such as LibreOffice Writer and Thunderbird.
  • Scaling efficiency: Achieves performance previously seen only in significantly larger or closed-source models with more resource-intensive training.

Per-application and per-task success rates, as well as resource utilization, are compared in the paper’s figures and tables (e.g., Figure 1 and Table 1).

6. Open-Source Release and Community Reuse

The DART-GUI-7B project is fully open-sourced. The release includes:

  • The decoupled RL training framework (including all four asynchronous modules and curation logic).
  • Datasets: curated, high-value trajectories and filtered online samples supporting robust RL training.
  • Model checkpoints: full training progression from baseline UI-TARS-1.5-7B to final DART-GUI-7B.
  • Deployment infrastructure: scripts and instructions for Kubernetes- and vLLM-based cluster orchestration, enabling scalable rollout collection and efficient distributed training.

Full reproducibility and extensibility for future research in agentic RL for GUI automation are emphasized (Li et al., 28 Sep 2025).

7. Methodological Context and Implications

DART-GUI-7B’s framework separates environment simulation, rollout generation, data management, and trainer updates, each with its own pacing and resource pool, addressing longstanding bottlenecks in RL for GUI agents: slow environment I/O, synchronous barriers, and suboptimal credit assignment. Adaptive trajectory curation further addresses sample inefficiency and sparse success, which are especially critical in real-world, multi-turn GUI automation.

A plausible implication is that this training paradigm is applicable beyond GUI agents, offering a blueprint for RL in domains where long-horizon, sparse-reward, and high-resource-variance tasks predominate, provided modularization and meticulous data management are feasible.

Open-source release ensures that these methodological advances are readily available for both benchmarking and extension in the agentic RL community and related domains.
