UI-TARS: Autonomous Multi-Modal GUI Automation

Updated 7 May 2026

The UI-TARS system is a vision-centric autonomous GUI agent framework that integrates perception, reasoning, and native action execution, taking only screenshots and high-level instructions as input.
It employs advanced vision-language transformers, multi-turn reinforcement learning, and iterative training to achieve scalable and robust automation across desktop, web, and mobile environments.
The system features a unified action vocabulary and closed-loop feedback, attaining state-of-the-art performance on diverse GUI task benchmarks.

The UI-TARS system is a family of end-to-end, vision-centric agents and APIs for automating and interacting with graphical user interfaces (GUIs) across desktop, web, mobile, and hybrid environments. Grounded in advances in multimodal foundation models, reinforcement learning (RL), and large-scale iterative training, UI-TARS is designed to unify perception, reasoning, and native action execution in a closed loop that requires only screenshots and high-level user instructions as input. The system has gone through several generations, from the original UI-TARS with explicit “System-2” reasoning to the RUIG-powered UI-TARS Task Automation API and the more recent UI-TARS-2, which integrates scalable trajectory pipelines, multi-turn RL, and hybrid environment orchestration for state-of-the-art GUI agent performance (Zhang et al., 2023, Qin et al., 21 Jan 2025, Wang et al., 2 Sep 2025).

1. Overall System Architecture

UI-TARS follows a “native agent” paradigm in which perception, reasoning, action, and memory modules are unified in a closed feedback loop. For a single interaction cycle:

Input: Free-form natural-language instruction and a raw GUI screenshot.
Perception: A vision-language transformer (e.g., Swin Transformer or Qwen-2-VL) processes the screenshot, extracting dense feature representations and element-level grounding.
Reasoning: The agent generates explicit “thought” tokens, modeling intermediate deliberation (chain-of-thought) following System-2 reasoning principles.
Action: The agent outputs a GUI action (e.g., click, drag, type, tool API call), serialized in a unified action vocabulary, with arguments normalized by screen dimensions.
Environment: The action is executed; the next screenshot is captured, forming the next perception loop.
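
The interaction cycle above can be sketched as a simple loop. This is a minimal illustration, not the UI-TARS implementation: the names `Step`, `run_episode`, `ToyEnv`, and `scripted_policy` are hypothetical, the screenshot is stood in for by a string, and the "finished" action used to end the episode is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    thought: str            # explicit reasoning emitted before the action
    action: str             # e.g. "click", "type", "finished" (assumed names)
    args: Tuple = ()        # action arguments, e.g. normalized coordinates


def run_episode(instruction: str, policy: Callable, env, max_steps: int = 15) -> List[Step]:
    """Closed loop: observe a screenshot, think and act, execute, repeat."""
    memory: List[Step] = []                            # history of past steps
    for _ in range(max_steps):
        screenshot = env.screenshot()                  # perception input
        step = policy(instruction, screenshot, memory) # thought + action
        memory.append(step)
        if step.action == "finished":                  # assumed stop signal
            break
        env.execute(step)                              # changes the next screenshot
    return memory


class ToyEnv:
    """Stand-in environment: each click advances to the next 'screen'."""

    def __init__(self) -> None:
        self.clicks = 0

    def screenshot(self) -> str:
        return f"screen-{self.clicks}"

    def execute(self, step: Step) -> None:
        if step.action == "click":
            self.clicks += 1


def scripted_policy(instruction: str, screenshot: str, memory: List[Step]) -> Step:
    """Stand-in for the vision-language model: click until screen-2 appears."""
    if screenshot == "screen-2":
        return Step("Target screen reached.", "finished")
    return Step(f"Seeing {screenshot}; click to advance.", "click", (0.5, 0.5))


history = run_episode("open settings", scripted_policy, ToyEnv())
print([s.action for s in history])  # ['click', 'click', 'finished']
```

In a real deployment the policy would be the vision-language model and the environment a VM, browser, or emulator; only the loop structure carries over from this sketch.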

The architecture's modularity extends to large-scale sandboxes comprising thousands of x86/ARM VMs (Windows/Ubuntu, Android emulators), browser-based game harnesses, shared filesystems, and terminal integrations, which support complex workflows spanning GUIs, command lines, and beyond (Wang et al., 2 Sep 2025, Qin et al., 21 Jan 2025).

2. Perception and Pixel-to-Sequence Grounding

The foundational ability of UI-TARS systems to map language instructions to concrete GUI actions is achieved through advanced, large-scale perception pretraining:

Pretraining Corpus: ≈3 million screenshots are paired with element descriptions (type, position, function), dense and state-transition captions, QA pairs, and set-of-marks (SoM).
Vision-Language Backbone: Models such as Qwen-2-VL (7B/72B) are adapted to learn holistic screen understanding and fine-grained element recognition.
Pixel-to-Sequence Paradigm: In RUIG-style grounding (UI-TARS API), bounding box localization is formulated as an autoregressive token sequence:

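$y_{\text{seq}} = [\langle\text{predict\_bbox}\rangle, \langle x_{\text{min}}\rangle, t_{x_{\text{min}}}, \langle/x_{\text{min}}\rangle, \ldots, \langle y_{\text{max}}\rangle, t_{y_{\text{max}}}, \langle/y_{\text{max}}\rangle, \langle/\text{predict\_bbox}\rangle, \langle\text{eos}\rangle]$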

Each coordinate is tokenized and decoded as language, enabling alignment of visual and linguistic modality through sequence modeling (Zhang et al., 2023).
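
As a rough illustration of the pixel-to-sequence formulation, the sketch below encodes a bounding box into a tagged token sequence and decodes it back. It assumes that each coordinate is written as a single decimal-integer token and that the fields appear in the order $(x_\text{min}, y_\text{min}, x_\text{max}, y_\text{max})$; the actual tokenizer and the function names `encode_bbox` and `decode_bbox` are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed field order; the source only shows x_min first and y_max last.
FIELDS = ("x_min", "y_min", "x_max", "y_max")


def encode_bbox(bbox: Tuple[int, int, int, int]) -> List[str]:
    """Turn a pixel bounding box into a tagged token sequence."""
    tokens = ["<predict_bbox>"]
    for name, value in zip(FIELDS, bbox):
        # One decimal-integer token per coordinate (an assumption).
        tokens += [f"<{name}>", str(int(value)), f"</{name}>"]
    tokens += ["</predict_bbox>", "<eos>"]
    return tokens


def decode_bbox(tokens: List[str]) -> Tuple[int, ...]:
    """Recover (x_min, y_min, x_max, y_max) from a generated sequence."""
    text = " ".join(tokens)
    values = []
    for name in FIELDS:
        match = re.search(rf"<{name}>\s*(\d+)\s*</{name}>", text)
        if match is None:
            raise ValueError(f"missing coordinate field: {name}")
        values.append(int(match.group(1)))
    return tuple(values)


seq = encode_bbox((12, 34, 256, 300))
print(seq[:4])           # ['<predict_bbox>', '<x_min>', '12', '</x_min>']
print(decode_bbox(seq))  # (12, 34, 256, 300)
```

Because the box is ordinary text, a malformed generation surfaces as a parse error in `decode_bbox` rather than as a silent numeric failure.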

3. Reasoning and Unified Action Modeling

UI-TARS integrates reasoning directly into policy generation:

Explicit "Thoughts": Before each action, the agent produces an intermediate reasoning step, supporting decomposition, milestone recognition, trial-and-error, and reflection. Annotation pipelines such as ActRe and causal bootstrapped thoughts ensure coverage of diverse reasoning patterns.
Unified Action Space: Actions across desktop, mobile, and web (click, drag, scroll, type, wait, API/tool calls) are serialized as token sequences,

$\langle \text{ActionType}, \text{Arg}_1, ..., \text{Arg}_k \rangle$

with coordinates normalized for layout invariance.

Chain-of-Thought and ReAct: In UI-TARS-2, a Mixture-of-Experts (MoE) model with 23B active parameters fuses chain-of-thought reasoning with ReAct-style planning, conditioning on working and episodic memory (Wang et al., 2 Sep 2025, Qin et al., 21 Jan 2025).
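
The unified action space described above can be illustrated with a small serializer. This is a sketch under stated assumptions: the textual format `<ActionType, Arg_1, ..., Arg_k>`, the four-decimal precision, and the names `Action`, `normalize_point`, `serialize`, and `to_pixels` are all hypothetical and chosen only to show how normalized coordinates make one action portable across screen sizes.

```python
from dataclasses import dataclass
from typing import Tuple, Union

Arg = Union[float, int, str]


@dataclass(frozen=True)
class Action:
    action_type: str                # e.g. "click", "drag", "scroll", "type"
    args: Tuple[Arg, ...] = ()


def normalize_point(x_px: int, y_px: int, width: int, height: int) -> Tuple[float, float]:
    """Map pixel coordinates to [0, 1] so they are independent of resolution."""
    return (round(x_px / width, 4), round(y_px / height, 4))


def to_pixels(x_norm: float, y_norm: float, width: int, height: int) -> Tuple[int, int]:
    """Map normalized coordinates back to pixels on a concrete screen."""
    return (round(x_norm * width), round(y_norm * height))


def serialize(action: Action) -> str:
    """Render an action as a flat token string (assumed format)."""
    parts = [action.action_type]
    for arg in action.args:
        parts.append(f"{arg:.4f}" if isinstance(arg, float) else repr(arg))
    return "<" + ", ".join(parts) + ">"


# A click at the centre of a 1920x1080 desktop...
point = normalize_point(960, 540, 1920, 1080)
click = Action("click", point)
print(serialize(click))                   # <click, 0.5000, 0.5000>

# ...resolves to the centre of a 390x844 phone screen as well.
print(to_pixels(*point, 390, 844))        # (195, 422)
print(serialize(Action("type", ("hello",))))  # <type, 'hello'>
```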

4. Training Methodology: Iterative, Multi-Turn RL and Reflection

UI-TARS relies on a multi-stage, self-improving training regime:

Data Flywheel: Iterative cycles of data collection generate multi-step, multi-modal traces through self-play, human-in-the-loop annotation, and rejection sampling. Datasets are partitioned into continual pre-training, supervised fine-tuning, and RL pools. High-quality traces (V(s) = 1) seed SFT, while lower-quality samples are recycled into pre-training (Wang et al., 2 Sep 2025).
Reinforcement Learning: Multi-turn RL is central, casting each episode as a trajectory $\tau = \{ (t_0, a_0, o_0), \ldots, (t_T, a_T, o_T) \}$ of thoughts $t_i$, actions $a_i$, and observations $o_i$, and optimizing

$J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^T r_t\right]$

with reward signals reflecting correctness, trajectory efficiency, and format.

Algorithmic Enhancements: Techniques include PPO with entropy regularization, decoupled and length-adaptive GAE, value pretraining, "clip-higher" ratio for exploration, hybrid RL/domain interpolation, and Direct Preference Optimization (DPO) for error correction (Zhang et al., 2023, Wang et al., 2 Sep 2025).
Reflection Tuning: Error/correction pairs allow DPO-based tuning, systematically correcting action preferences and improving recovery from mistakes (Qin et al., 21 Jan 2025).
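
To make the RL components above more concrete, the following sketch computes the undiscounted episode return $\sum_t r_t$, standard generalized advantage estimates, and a PPO-style clipped surrogate with asymmetric bounds. It is illustrative only: the decoupled and length-adaptive GAE variants are not implemented, "clip-higher" is interpreted here as using a larger upper clipping bound than lower bound (an assumption), and the values of `eps_low`, `eps_high`, `gamma`, and `lam` are placeholders rather than reported hyperparameters.

```python
from typing import List, Sequence


def episode_return(rewards: Sequence[float]) -> float:
    """Undiscounted return sum_t r_t, the quantity inside the expectation J(theta)."""
    return sum(rewards)


def gae(rewards: Sequence[float], values: Sequence[float],
        gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Standard GAE. `values` has len(rewards) + 1 entries (last is the bootstrap)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # discounted sum of residuals
        advantages[t] = running
    return advantages


def clipped_surrogate(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """PPO clipped objective with a wider upper bound (assumed 'clip-higher' form)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# Sparse terminal reward: only the final step is rewarded.
rewards = [0.0, 0.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]   # untrained critic, zero bootstrap
print(episode_return(rewards))
print(gae(rewards, values, gamma=1.0, lam=1.0))  # credit reaches every step
print(gae(rewards, values, gamma=1.0, lam=0.0))  # credit stays on the last step
print(clipped_surrogate(ratio=1.5, advantage=1.0))
```

With `lam=1.0` the terminal reward is propagated back to every step, while `lam=0.0` credits only the final step, which is one way to see why the choice of $\lambda$ matters on long multi-turn sequences.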

5. API Contract and Integration Pipeline

UI-TARS exposes a generic UI grounding endpoint, primarily as a RESTful JSON API:

Endpoint: POST /v1/uitars/ground
Request: Includes base64/URL screenshot, free-form instruction, optional timeout.
Response: Returns predicted bounding box, coordinate token sequence, estimated IoU, and model version.
Integration: The API enables orchestrators or LLM planners to delegate grounding for each UI step, replaying actions via backend automation (e.g., PyAutoGUI).
Inference Steps:

Preprocessing: Image normalization and tokenization.
Extraction: Visual encoder and language decoder produce tokens autoregressively.
Decoding: Greedy or beam search reconstructs bounding box tokens.
Postprocessing: Conversion of tokens to $(x_\text{min}, y_\text{min}, x_\text{max}, y_\text{max})$ for action replay (Zhang et al., 2023).
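
A client-side view of this contract might look like the sketch below. Only the endpoint path comes from the description above; the JSON field names (`screenshot`, `instruction`, `timeout`, `bbox`), the helper names, and the example response are assumptions made for illustration, and no network call is performed.

```python
import base64
import json
from typing import Optional, Tuple

ENDPOINT = "/v1/uitars/ground"  # path as given in the text; host is deployment-specific


def build_request(png_bytes: bytes, instruction: str,
                  timeout_s: Optional[float] = None) -> str:
    """Build the JSON body for a grounding request (field names are hypothetical)."""
    payload = {
        "screenshot": base64.b64encode(png_bytes).decode("ascii"),
        "instruction": instruction,
    }
    if timeout_s is not None:
        payload["timeout"] = timeout_s
    return json.dumps(payload)


def click_point(response_json: str) -> Tuple[int, int]:
    """Convert the returned box (x_min, y_min, x_max, y_max) to its centre pixel."""
    response = json.loads(response_json)
    x_min, y_min, x_max, y_max = response["bbox"]  # hypothetical field name
    return ((x_min + x_max) // 2, (y_min + y_max) // 2)


body = build_request(b"\x89PNG", "Click the Save button", timeout_s=5)
print(sorted(json.loads(body)))  # ['instruction', 'screenshot', 'timeout']

# Made-up response, used only to exercise the post-processing step.
example_response = '{"bbox": [100, 200, 300, 260]}'
x, y = click_point(example_response)
print((x, y))  # (200, 230)

# An orchestrator could then replay the action with a backend such as
# PyAutoGUI, e.g. pyautogui.click(x, y); it is omitted here to keep the
# sketch free of third-party dependencies.
```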

6. Empirical Performance and Generalization

UI-TARS benchmarks demonstrate state-of-the-art results:

GUI Task Benchmarks: UI-TARS-2 achieves 88.2% on Online-Mind2Web, 47.5% on OSWorld, 73.3% on AndroidWorld, surpassing Claude-4, GPT-4o, and other commercial frameworks under screenshot-only conditions (Wang et al., 2 Sep 2025, Qin et al., 21 Jan 2025).
Game Environments: UI-TARS-2 reaches a mean normalized score of 59.8% of human-level performance across 15 games; on LMGame-Bench, it performs on par with OpenAI o3 on titles such as 2048 and Super Mario Bros.
Long-Horizon and Code Tasks: 29.6% on BrowseComp-en with SDK augmentation (vs 7.0% GUI-only), 68.7% on SWE-Bench.
Ablation Insights: PPO outperforms GRPO in low-variance reward settings; value pretraining and decoupled GAE stabilize training on long sequences; quantization (W4A8) offers a moderate tradeoff between speed and accuracy (Wang et al., 2 Sep 2025).

7. Key Contributions and Evolution Path

UI-TARS advances the field of automated GUI agents by providing:

A pure-vision, end-to-end agent that unifies perception, reasoning, action, and memory, requiring minimal human engineering.
A self-improving data pipeline, combining RL, SFT, continual pretraining, and automated reflection tuning with DPO.
Generalization to interactive, multimodal, and long-horizon agent tasks without reliance on domain heuristics or prompt chaining.
Empirical evidence of superior generalization and robustness compared to modular orchestrated frameworks (Claude, GPT-4o).
Readiness for integration as the executor core in LLM-driven automation pipelines, enabling high-precision, low-latency interaction with a diverse range of real-world applications and environments (Zhang et al., 2023, Wang et al., 2 Sep 2025, Qin et al., 21 Jan 2025).

Selected Performance Metrics for Major Benchmarks (scores in %)

Model	OSWorld (15 steps)	AndroidWorld	Online-Mind2Web
GPT-4o+Aria-UI	15.2	44.8	—
UI-TARS-72B-DPO	22.7 / 24.6	46.6	—
UI-TARS-2	47.5	73.3	88.2
Claude-4	14.9	—	—

A plausible implication is that UI-TARS establishes a new paradigm for autonomous GUI agents, paving the way for further research into unified multimodal action models, scalable RL on virtualized environments, and agentic reasoning grounded directly in pixel-level GUI inputs (Qin et al., 21 Jan 2025, Wang et al., 2 Sep 2025, Zhang et al., 2023).