UI-TARS-2: Autonomous GUI Agent Model

Updated 3 September 2025
  • UI-TARS-2 is a native GUI agent that unifies perception, reasoning, action, and memory through end-to-end reinforcement learning.
  • It employs a ReAct-style paradigm with explicit intermediate thought steps, hierarchical memory, and hybrid integration of desktop, mobile, and terminal interfaces.
  • Empirical results demonstrate state-of-the-art performance across diverse benchmarks, achieving robust multi-turn interactions in complex environments.

UI-TARS-2 is a native GUI agent model designed to achieve general, autonomous interaction with graphical user interfaces through end-to-end large-scale reinforcement learning. Building on the prior UI-TARS lineage, UI-TARS-2 unifies perception, reasoning (using explicit intermediate “thought” steps), action, and memory into a single policy capable of solving complex desktop, mobile, web, and game-based tasks using screenshots and multimodal feedback alone. It introduces systematic advancements in scalable data generation, multi-turn reinforcement learning, hybrid environment integration (including file system and terminal interfaces), and efficient sandboxed rollouts, resulting in state-of-the-art performance across leading GUI, system, and game benchmarks (Wang et al., 2 Sep 2025).

1. System Architecture and Behavioral Modeling

UI-TARS-2 operates under a ReAct-style paradigm where each agent-environment transition is defined as a triplet (tᵢ, aᵢ, oᵢ):

  • tᵢ: the agent’s internal reasoning (“thought”),
  • aᵢ: the emitted atomic or composite action (e.g., click, type, SDK call),
  • oᵢ: the resulting observation (screenshot and auxiliary signals).

Trajectories τ are thus sequences τ = {(t₀, a₀, o₀), …, (t_T, a_T, o_T)}. To manage context, UI-TARS-2 employs a hierarchical memory structure (sketched in code after the list below):

  • Working memory (𝓦ₜ): tracks immediate, short-term interaction history for reactive control;
  • Episodic memory (𝓔ₜ): encodes condensed long-term knowledge enabling information seeking and recovery across extended tasks.
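The sketch below illustrates these structures as plain Python dataclasses; the class names, window size, and condensation hook are expository assumptions rather than the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class Transition:
    """One ReAct-style step (t_i, a_i, o_i): thought, action, observation."""
    thought: str          # t_i: the agent's internal reasoning
    action: dict          # a_i: e.g. {"type": "click", "x": 120, "y": 340}
    observation: Any      # o_i: screenshot bytes plus auxiliary signals

@dataclass
class HierarchicalMemory:
    """Working memory keeps a short recent window; episodic memory keeps condensed summaries."""
    working: List[Transition] = field(default_factory=list)   # W_t: reactive, short-term
    episodic: List[str] = field(default_factory=list)         # E_t: condensed long-term knowledge
    window: int = 8   # hypothetical window size

    def append(self, step: Transition, summarize: Callable[[List[Transition]], str]) -> None:
        """Add a step; once the window overflows, condense evicted steps into episodic memory."""
        self.working.append(step)
        if len(self.working) > self.window:
            evicted = self.working[: len(self.working) - self.window]
            self.episodic.append(summarize(evicted))
            self.working = self.working[-self.window:]
```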

The distributed system architecture incorporates a unified sandbox platform for rollouts, supporting full cloud VM emulation (e.g., Windows, Android, Ubuntu) as well as browser-based sandboxes for game environments. The agent’s interface is strictly visual (screenshot-based) supplemented by GUI- and system-level SDKs, facilitating human-like, cross-modal interaction.

2. Training Methodology: Data Flywheel and Stabilized Reinforcement Learning

Training comprises three sequential phases:

  1. Continual Pre-Training (CT): The model is exposed to vast instructional, demonstration, and tutorial data, establishing base-level visual, textual, and interactive priors.
  2. Supervised Fine-Tuning (SFT): Human-annotated and high-quality synthetic trajectories are used to further refine the policy, now with explicit intermediate reasoning chains for each action and outcome.
  3. Reinforcement Learning (RL): Agents operate in the sandboxed environment using Proximal Policy Optimization (PPO) with bespoke enhancements (reward shaping, decoupled and length-adaptive generalized advantage estimation, value pretraining).

Central to scalability is the data flywheel: the agent continuously generates new trajectories through RL, which are validated (by a value model or functional check V(s)). Accepted trajectories are fed back to SFT, while lower-quality rollouts are looped into CT. This closed ecosystem ensures a continuously improving data distribution, addressing data scarcity and distribution shift.
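The routing logic of this loop can be sketched as follows; the callable interfaces, data pools, and acceptance threshold are illustrative assumptions rather than the system's actual components.

```python
from typing import Callable, Iterable, List, Tuple

def flywheel_iteration(
    rollout: Callable[[], Iterable[object]],   # generates trajectories with the current policy
    value_check: Callable[[object], float],    # V(s): value model or functional verifier
    sft_pool: List[object],                    # accepted trajectories, fed back to SFT
    ct_pool: List[object],                     # lower-quality rollouts, looped into CT
    threshold: float = 0.5,
) -> Tuple[int, int]:
    """One data-flywheel cycle: roll out, score each trajectory, and route it."""
    accepted = rejected = 0
    for tau in rollout():
        if value_check(tau) >= threshold:
            sft_pool.append(tau)
            accepted += 1
        else:
            ct_pool.append(tau)
            rejected += 1
    return accepted, rejected
```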

Formally, the RL update employs a PPO variant with the clipped surrogate objective
$$\mathcal{L}_{\mathrm{PPO}} = \mathbb{E}_t\left[ \min\left(r_t \hat{A}_t,\ \mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon)\,\hat{A}_t\right) \right],$$
where $r_t$ is the policy ratio, $\hat{A}_t$ the generalized advantage estimate, and $\epsilon$ the clipping threshold.
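In generic PyTorch form, the update looks roughly like the sketch below; the vanilla GAE recursion and default hyperparameters stand in for the paper's decoupled, length-adaptive variant.

```python
import torch

def gae(rewards: torch.Tensor, values: torch.Tensor,
        gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """Vanilla generalized advantage estimation over one trajectory.
    Bootstraps with 0 beyond the final step (episode termination assumed)."""
    advantages = torch.zeros_like(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages

def ppo_clipped_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                     advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate; negated because we minimize the loss."""
    ratio = torch.exp(logp_new - logp_old)                       # r_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))
```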

3. Hybrid Environment Integration

A distinguishing feature of UI-TARS-2 is its ability to operate in hybrid interface environments:

  • For desktop and mobile agents, the system directly integrates GUI APIs (e.g., PyAutoGUI for Windows, ADB for Android), allowing manipulation of UI elements and system resources.
  • File-level and terminal interfaces are exposed within the VM, enabling actions such as reading/writing files, invoking command-line utilities, or engaging with coding toolchains (GUI-SDK).

This hybridization significantly broadens the agent’s effective action space and makes it feasible to address system-level reasoning tasks (e.g., software engineering benchmarks, information retrieval, complex automation flows) that are unsolvable using GUI-only input.
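A minimal dispatcher over such a hybrid action space might look like the following sketch; the action schema and routing logic are assumptions for illustration, while PyAutoGUI and ADB are the interfaces named above.

```python
import subprocess
import pyautogui  # desktop GUI control (Windows example from the text)

def execute(action: dict) -> str:
    """Route a hybrid action to the appropriate interface: GUI, mobile (ADB), terminal, or file."""
    kind = action["type"]
    if kind == "click":                       # GUI-level action
        pyautogui.click(action["x"], action["y"])
        return "clicked"
    if kind == "type_text":
        pyautogui.typewrite(action["text"])
        return "typed"
    if kind == "adb_shell":                   # mobile action via Android Debug Bridge
        out = subprocess.run(["adb", "shell", action["cmd"]],
                             capture_output=True, text=True)
        return out.stdout
    if kind == "terminal":                    # system-level action inside the VM
        out = subprocess.run(action["cmd"], shell=True,
                             capture_output=True, text=True)
        return out.stdout
    if kind == "read_file":                   # file-level action
        with open(action["path"], "r", encoding="utf-8") as f:
            return f.read()
    raise ValueError(f"unknown action type: {kind}")
```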

4. Empirical Performance and Generalization

UI-TARS-2’s performance, as reported, establishes a new SOTA among open and closed agents on a broad spectrum of environments:

| Benchmark | UI-TARS-2 (%) | UI-TARS-1.5 (%) | Claude | OpenAI o3 | Notes |
|---|---|---|---|---|---|
| Online-Mind2Web | 88.2 | N/A | N/A | N/A | Extended action space with GUI-SDK |
| OSWorld | 47.5 | 42.5 | <47.5 | <47.5 | Desktop GUI scenario |
| WindowsAgentArena | 50.6 | 42.1 | <50.6 | <50.6 | Desktop multi-task suite |
| AndroidWorld | 73.3 | N/A | N/A | <73.3 | Mobile UI navigation |
| Game Suite (mean norm) | 59.8 | N/A | <59.8 | <59.8 | 60% of human parity, 15 games |

On information-seeking and software engineering benchmarks (e.g., Terminal Bench, SWE-Bench), the agent leverages the full hybrid action space to solve tasks that blend GUI operations with system calls, demonstrating cross-task robustness.

The mean normalized score of 59.8 on the 15-game suite corresponds to roughly 60% of human-level performance, placing UI-TARS-2 at the frontier of general agent capabilities for interactive environments.

5. Training Dynamics and Stability Analysis

The UI-TARS-2 training regimen is engineered for long-horizon, multi-turn RL stability:

  • Entropy Dynamics: Rather than decreasing monotonically, entropy in the action distribution occasionally rises as the agent explores new strategy spaces, reflecting the sustained need for exploration in visually ambiguous or combinatorially complex settings.
  • Reasoning Trajectories: As the agent becomes more competent, the mean number of tokens per “think” step drops, but spikes periodically as tasks increase in complexity.
  • Asynchronous Rollouts: High-throughput asynchronous agent-environment rollouts, supported by the unified sandbox, facilitate robust training signal and efficient scaling to large data regimes.
  • Value Model Pretraining: Bootstrapping with value-model initialization is shown to accelerate policy optimization and mitigate instability in long-horizon credit assignment (a generic sketch follows at the end of this section).

These dynamics are critical as prior system versions—especially approaches based on pure behavioral cloning—faced instability and stagnation in diverse, long-horizon scenarios.
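The value-pretraining point can be illustrated with a generic regression sketch; the function signature, tensor shapes, and hyperparameters are assumptions, as the exact recipe is not specified here.

```python
import torch
from torch import nn

def pretrain_value_head(value_head: nn.Module,
                        states: torch.Tensor,      # encoded states from fixed trajectories
                        returns: torch.Tensor,     # Monte Carlo returns for those states
                        epochs: int = 3, lr: float = 1e-4) -> None:
    """Fit the value head to returns before any policy-gradient step, so early
    advantage estimates are not dominated by an untrained critic."""
    opt = torch.optim.Adam(value_head.parameters(), lr=lr)
    for _ in range(epochs):
        pred = value_head(states).squeeze(-1)
        loss = nn.functional.mse_loss(pred, returns)
        opt.zero_grad()
        loss.backward()
        opt.step()
```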

6. Comparative Analysis with Contemporary Agents

Compared to UI-TARS-1.5, OpenAI’s computer-using agent, and Claude, UI-TARS-2 demonstrates:

  • Superior overall accuracy and robustness across desktop, mobile, and hybrid scenarios.
  • Gains attributable to both increased model parameter counts and methodological improvements (notably, the data flywheel and stabilized multi-turn RL).
  • Enhanced abilities in long-horizon planning, credit assignment, and tool-based reasoning due to the hybrid action space and multi-modal architecture.

A plausible implication is that the integration of hybrid environments and systematic RL-based curriculum advances generalization and adaptation, addressing common failure modes in previous GUI agents that were limited by GUI-only input or inflexible reasoning paths.

7. Limitations, Open Challenges, and Future Directions

Despite strong empirical advances, several open challenges remain:

  • Modest (<50–60%) absolute performance on fine-grained desktop grounding and complex motion-based actions—such as those revealed by UI-Vision (e.g., drag-and-drop)—suggests a need for even finer spatial resolution, multi-scale reasoning techniques, and more richly annotated training data (Nayak et al., 19 Mar 2025).
  • The so-called “hallucination gap” between the agent’s chain-of-thought reasoning and the actual action output has not been fully bridged; improved alignment techniques are called for.
  • Long-horizon credit assignment, curriculum construction over intermediate subgoals, retrieval-augmented context integration, and continual adaptation to changing UI distributions are cited as key research opportunities.
  • Scaling the active parameter count and integrating more advanced memory/attention mechanisms (e.g., MoE heads, explicit retrieval modules) may further close the gap to human-level mastery.

Future research is expected to refine the hybrid training paradigm by blending higher-capacity joint RL across more interfaces, scaling up model and data size, and enhancing reward modeling (e.g., with generative outcome predictors) to minimize reward hacking. Broader tool integration—including speech and haptic input—remains an open trajectory for the next generation of agents.


In summary, UI-TARS-2 consolidates a robust, multi-turn reinforcement learning paradigm with a unified, hybrid sandbox environment and a systematic data-generation pipeline. These design choices yield significant empirical advances on general GUI-based agent tasks and competitive results in multimodal, open-ended environments, and they chart a clear path for further research on highly generalizable, robust interactive agents (Wang et al., 2 Sep 2025).