UI-TARS: Pioneering Automated GUI Interaction with Native Agents (2501.12326v1)

Published 21 Jan 2025 in cs.AI, cs.CL, cs.CV, and cs.HC

Abstract: This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.

Summary

  • The paper introduces UI-TARS, a native GUI agent model that uses screenshots as input and achieves state-of-the-art performance on benchmarks like OSWorld and AndroidWorld.
  • UI-TARS integrates innovations such as enhanced GUI perception, unified action modeling across platforms, and System-2 reasoning for robust task execution.
  • The work proposes iterative training with reflective online traces for continuous learning and provides an open-source model, advancing the field of autonomous GUI agents.

The paper "UI-TARS: Pioneering Automated GUI Interaction with Native Agents" introduces UI-TARS, a native GUI agent model that uses screenshots as input and performs human-like interactions via keyboard and mouse operations.

The model achieves state-of-the-art (SOTA) performance across GUI agent benchmarks that evaluate perception, grounding, and GUI task execution. Specifically, on the OSWorld benchmark, UI-TARS scores $24.6$ with $50$ steps and $22.7$ with $15$ steps, outperforming Claude's $22.0$ and $14.9$, respectively. On AndroidWorld, UI-TARS achieves $46.6$, surpassing GPT-4o's $34.5$.

Key innovations in UI-TARS include:

  • Enhanced Perception: A large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning.
  • Unified Action Modeling: Actions are standardized into a unified space across platforms, achieving precise grounding and interaction through action traces.
  • System-2 Reasoning: Deliberate reasoning is incorporated into multi-step decision-making, using patterns like task decomposition, reflection, and milestone recognition.
  • Iterative Training with Reflective Online Traces: Addresses data bottlenecks by automatically collecting, filtering, and refining new interaction traces on virtual machines, enabling continuous learning with minimal human intervention.

The paper also analyzes the evolution path of GUI agents to guide further development in this domain, and UI-TARS is released as open source.

The paper examines the progression of GUI agents from rule-based systems to autonomous systems. The development is divided into stages based on the degree of human intervention, autonomy, flexibility, and generalization ability.

The development stages are:

  1. Rule-based Agents: Early agents, such as Robotic Process Automation (RPA) systems, replicate human actions in structured environments using predefined rules and APIs. These systems require explicit instructions and lack the ability to learn or adapt, limiting their scalability. Examples include DART, WoB, Roscript, and FLIN.
  2. Agent Framework: These systems leverage the reasoning capabilities of foundation models (e.g., GPT-4 and GPT-4o) to enhance task execution flexibility. Frameworks like AutoGPT and LangChain integrate external tools, APIs, and services, enabling dynamic workflows. Performance is enhanced through task-specific workflows, optimized prompts, and specialized modules for memory. The ReAct framework integrates reasoning with action outcomes to improve planning.
  3. Native Agent Model: Workflow knowledge is embedded directly within the agent's model through end-to-end learning. Tasks are learned and executed in a unified manner, integrating perception, reasoning, memory, and action. Native agents adapt to new tasks without manual prompts or predefined rules, offering holistic learning, reduced human engineering, and continuous self-improvement. Examples include Claude Computer-Use, Aguvis, ShowUI, and OS-Atlas.
  4. Active and Lifelong Agent: Represents a crucial next step in the evolution of GUI agents. In this paradigm, agents actively engage with their environment to propose tasks, execute them, and evaluate the outcomes. These agents can autonomously assign self-rewards based on the success of their actions, reinforcing positive behaviors and progressively refining their capabilities through continuous feedback loops.

The paper analyzes the core capabilities required for native agent models, focusing on perception, action, reasoning, and memory.

  • Perception: Involves the real-time interpretation of GUIs, adapting to interface changes, and understanding static screenshots. Methods include structured text-based approaches (using HTML structures), visual screenshot-based approaches (relying on screenshots), and comprehensive interface modeling (integrating text, visuals, and semantic outlines).
  • Action: Requires versatile, precise mechanisms adaptable to various GUI contexts. Key aspects include defining a unified action space that abstracts platform-specific actions into a common set of operations (e.g., click, type, scroll, drag); a sketch of such an action space follows this list. Challenges involve accurately determining coordinates for actions due to variability in GUI layouts and device aspect ratios.
  • Reasoning: Integrates cognitive functions, emulating both System 1 (fast, intuitive thinking) and System 2 (slow, deliberate, analytical thinking). System 1 reasoning involves quick responses based on pre-learned knowledge, while System 2 reasoning handles complex tasks using techniques like Chain-of-Thought (CoT) and ReAct, incorporating task decomposition, long-term consistency, milestone recognition, trial and error, and reflection.
  • Memory: Used to store explicit knowledge and historical experiences, including short-term memory (task-specific information and immediate context) and long-term memory (records of previous interactions, tasks, and background knowledge). Native agent models encode operational experience within their internal parameters, using techniques like In-Context Learning (ICL) or CoT reasoning to activate this internal memory.
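The unified action space mentioned in the Action item can be pictured as a small schema that collapses platform-specific operations onto shared primitives. The following is a minimal sketch under assumed names (`Action`, `PRIMITIVES`, and the field layout are illustrative, not the paper's actual schema):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative primitives for a cross-platform action space
# (the paper's actual action set may differ).
PRIMITIVES = {"click", "type", "scroll", "drag", "hotkey", "wait", "finished"}

@dataclass
class Action:
    """One platform-agnostic GUI action."""
    kind: str                                    # one of PRIMITIVES
    point: Optional[Tuple[int, int]] = None      # screen coordinates for click/drag start
    end_point: Optional[Tuple[int, int]] = None  # drag end point
    text: Optional[str] = None                   # content for "type" / keys for "hotkey"

def from_mobile_tap(x: int, y: int) -> Action:
    """Mobile 'tap' and desktop 'click' collapse onto the same primitive."""
    return Action(kind="click", point=(x, y))

# Example: a tap on Android and a click on Windows share one representation.
assert from_mobile_tap(120, 480) == Action(kind="click", point=(120, 480))
```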

The paper also summarizes the evaluation metrics and benchmarks for GUI agents, including:

  • Perception Evaluation: Assesses understanding of UI knowledge and environmental awareness using benchmarks like VisualWebBench, WebSRC, and ScreenQA, which focus on web understanding and mobile screen content understanding.
  • Grounding Evaluation: Assesses the ability to locate GUI elements precisely using benchmarks like ScreenSpot and ScreenSpot Pro, focusing on performance across various platforms and resolutions.
  • Offline Agent Capability Evaluation: Measures performance in static environments using benchmarks like AITW, Mind2Web, and AndroidControl, which provide task descriptions and require accurate action prediction. Metrics include the action-matching score; a sketch of such a metric follows this list.
  • Online Agent Capability Evaluation: Assesses performance in dynamic environments using benchmarks like WebArena, OSWorld, and AndroidWorld, where agents modify environmental states in real-time. Task-level metrics are used to determine task success.
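A step-level action-matching score typically counts a predicted step as correct only if its action type agrees with the reference and any coordinate argument lands on the target element. The sketch below shows one such metric under assumed conventions (the individual benchmarks define matching somewhat differently):

```python
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) of the target element

def step_matches(pred_type: str, ref_type: str,
                 pred_point: Optional[Tuple[int, int]] = None,
                 ref_box: Optional[Box] = None) -> bool:
    """Illustrative step-level match: the action type must agree, and a
    clicked point (if any) must land inside the reference element's box."""
    if pred_type != ref_type:
        return False
    if ref_box is None or pred_point is None:
        return True
    x, y = pred_point
    left, top, right, bottom = ref_box
    return left <= x <= right and top <= y <= bottom

def action_matching_score(steps) -> float:
    """Fraction of steps whose predicted action matches the reference."""
    hits = sum(step_matches(*s) for s in steps)
    return hits / max(len(steps), 1)

# Example: one correct click, one wrong action type.
print(action_matching_score([
    ("click", "click", (40, 60), (30, 50, 90, 80)),
    ("scroll", "click", None, None),
]))  # 0.5
```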

The paper details the architecture of UI-TARS, which iteratively receives observations, performs actions, and refines its reasoning through thoughts. A trajectory is written as $(\text{instruction}, (o_1, t_1, a_1), (o_2, t_2, a_2), \cdots, (o_n, t_n, a_n))$, where $o_i$ is the observation, $t_i$ is the reasoning thought, and $a_i$ is the action at step $i$. The model predicts the thought $t_n$ and action $a_n$ iteratively, conditioned on the instruction, the most recent $N$ steps, and the current observation:

$$P\big(t_n, a_n \mid \text{instruction}, (o_{n-i}, t_{n-i}, a_{n-i})_{i=1}^{N}, o_n\big)$$

where:

  • $P(t_n, a_n \mid \cdot)$ is the conditional probability of the thought $t_n$ and action $a_n$.
  • $\text{instruction}$ is the task instruction.
  • $o_i$ is the observation at time step $i$.
  • $t_i$ is the model's thought at time step $i$.
  • $a_i$ is the action taken at time step $i$.
  • $N$ is the number of previous observations kept in context.
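This formulation amounts to an iterative decision loop: at each step the model sees the instruction, the last $N$ (observation, thought, action) triples, and the current screenshot, then emits a new thought and action. A minimal sketch of that loop, with placeholder `model` and `env` interfaces (hypothetical, not the paper's code):

```python
from collections import deque

def run_episode(model, env, instruction: str, n_history: int = 5, max_steps: int = 50):
    """Iteratively sample (thought_n, action_n) conditioned on the instruction,
    the last N (observation, thought, action) triples, and the current observation."""
    history = deque(maxlen=n_history)        # keeps only the N most recent steps
    observation = env.reset(instruction)     # initial screenshot
    for _ in range(max_steps):
        # model.predict stands in for sampling from P(t_n, a_n | instruction, history, o_n)
        thought, action = model.predict(instruction, list(history), observation)
        next_observation, done = env.step(action)   # execute the keyboard/mouse action
        history.append((observation, thought, action))
        observation = next_observation
        if done:
            break
    return history
```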

The paper describes methods for enhancing GUI perception, including:

  • Screenshot Collection: Building a large-scale dataset of screenshots and metadata from websites, apps, and operating systems, using parsing tools to extract element type, depth, bounding box, and text content.
  • Element Description: Creating detailed descriptions for each GUI element, covering element type, visual description, position information, and element function.
  • Dense Captioning: Providing comprehensive descriptions of GUI screenshots, capturing elements, spatial relationships, and overall layout.
  • State Transition Captioning: Identifying and describing differences between consecutive screenshots to capture the effects of actions on the interface.
  • Question Answering (QA): Synthesizing diverse QA data to enhance the agent's capacity for visual reasoning.
  • Set-of-Mark (SoM): Enhancing SoM prompting ability by drawing visually distinct markers for parsed elements on the GUI screenshot based on their spatial coordinates.
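The SoM step can be pictured as drawing numbered markers over the parsed elements before the screenshot is shown to the model. A minimal sketch with Pillow, assuming elements arrive with pixel bounding boxes (illustrative, not the paper's pipeline):

```python
from PIL import Image, ImageDraw

def draw_som_marks(screenshot: Image.Image, boxes) -> Image.Image:
    """Overlay a numbered marker on each parsed element's bounding box.

    boxes: list of (left, top, right, bottom) tuples in pixel coordinates.
    """
    marked = screenshot.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (left, top, right, bottom) in enumerate(boxes, start=1):
        draw.rectangle((left, top, right, bottom), outline="red", width=2)
        draw.text((left + 2, top + 2), str(idx), fill="red")
    return marked

# Example usage with a blank canvas standing in for a real screenshot.
img = draw_som_marks(Image.new("RGB", (400, 300), "white"),
                     [(20, 20, 120, 60), (150, 100, 320, 150)])
img.save("som_example.png")
```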

The paper describes the Unified Action Modeling and Grounding process:

  • Unified Action Space: Designing a common action space that standardizes semantically equivalent actions across devices, such as "click" on Windows versus "tap" on mobile, enabling knowledge transfer across platforms.
  • Action Trace Collection: Relying on annotated datasets and integrating existing datasets, standardizing them into a unified action space format.
  • Improving Grounding Ability: Training UI-TARS to directly predict the coordinates of the elements, associating GUI elements with spatial coordinates and metadata.
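Since grounding is framed as directly predicting coordinates, a predicted point in the model's text output has to be mapped back onto the actual screenshot. A minimal sketch of that mapping, assuming a hypothetical `click(x, y)` output on a 0-1000 normalized grid (the real output format may differ):

```python
import re

def parse_grounding_output(text: str, width: int, height: int):
    """Parse a predicted point like 'click(502, 310)' expressed on a
    hypothetical 0-1000 normalized grid and map it to pixel coordinates."""
    match = re.search(r"\((\d+),\s*(\d+)\)", text)
    if match is None:
        raise ValueError(f"no coordinates found in: {text!r}")
    nx, ny = int(match.group(1)), int(match.group(2))
    return round(nx / 1000 * width), round(ny / 1000 * height)

# Example: map a normalized prediction onto a 1920x1080 screenshot.
print(parse_grounding_output("click(502, 310)", 1920, 1080))  # (964, 335)
```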

To infuse System-2 reasoning, the authors propose:

  • Reasoning Enrichment with GUI Tutorials: Leveraging publicly available tutorials that interweave text and images to demonstrate detailed user interactions across diverse software and web environments.
  • Reasoning Stimulation with Thought Augmentation: Augmenting action trace data by annotating "thoughts" that bridge the gap between perception and action, using two annotation stages (a sketch of the ActRe stage follows this list):
      • ActRe: Thoughts are generated iteratively by prompting a VLM with the previous context and the current target action.
      • Thought Bootstrapping: A bootstrapping approach that generates thoughts without prior knowledge of the ground-truth action.
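A minimal sketch of the ActRe-style annotation step, in which a VLM is shown the previous context plus the already-known next action and asked to write the thought that would justify it (the prompt wording and the `vlm.generate` interface are assumptions, not the paper's exact setup):

```python
def annotate_thought_actre(vlm, instruction: str, history, screenshot, target_action: str) -> str:
    """Ask a vision-language model to write the reasoning 'thought' that
    would lead from the current state to the already-known target action."""
    prompt = (
        f"Task: {instruction}\n"
        f"Previous steps: {history}\n"
        f"The next action taken was: {target_action}\n"
        "Write the brief reasoning a user would have before taking this action."
    )
    return vlm.generate(prompt, images=[screenshot])  # hypothetical VLM interface
```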

The paper explains how to learn from prior experience in long-term memory, including:

  • Online Trace Bootstrapping: Dynamically learning from interactions with real-world devices through semi-automated data collection, filtering, and refinement.
  • Reflection Tuning: Exposing the model to real-world errors made by itself with their corrections, enabling UI-TARS to learn how to recover from suboptimal decisions.
  • Agent Direct Preference Optimization (DPO): Leverages both corrected and erroneous actions by introducing a reference-based objective; UI-TARS is optimized to directly encode a preference for corrected actions over erroneous ones, making better use of the available data.
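The standard DPO objective over (corrected, erroneous) pairs increases the likelihood margin of the corrected action over the erroneous one relative to a frozen reference model. A minimal sketch of that loss in PyTorch, applied here to action log-probabilities (the paper's exact variant may differ):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta: float = 0.1):
    """Standard DPO loss: prefer the corrected action over the erroneous one,
    measured relative to a frozen reference policy.

    All inputs are summed log-probabilities of the full action sequence.
    """
    margin = (logp_good - ref_logp_good) - (logp_bad - ref_logp_bad)
    return -F.logsigmoid(beta * margin).mean()

# Example with toy log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-4.0, -3.5]), torch.tensor([-5.0, -6.0]),
                torch.tensor([-4.2, -3.9]), torch.tensor([-4.8, -5.5]))
print(loss.item())
```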

The UI-TARS model uses the Qwen-2-VL backbone and employs a three-phase training process:

  1. Continual Pre-training Phase: Using the full set of data for continual pre-training.
  2. Annealing Phase: Selecting high-quality subsets of data for annealing, promoting focused learning and better optimization. The model trained after this phase is UI-TARS-SFT.
  3. DPO Phase: Employing annotated reflective pairs from online bootstrapping data for DPO training, improving the model's ability to make context-aware decisions. The final model is UI-TARS-DPO.

To evaluate UI-TARS, the authors compared it with baselines, including commercial models such as GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, and Gemini-2.0 (Project Mariner), as well as academic models including CogAgent, OmniParser, InternVL, Aria-UI, Aguvis, OS-Atlas, UGround, ShowUI, SeeClick, and the Qwen-series models QwenVL-7B, Qwen2-VL (7B and 72B), UIX-Qwen2-7B, and Qwen-VL-Max.

The evaluations focused on:

  • Perception Capability Evaluation: Using VisualWebBench, WebSRC, and ScreenQA-short to measure the model's ability to understand web elements and screen content.
  • Grounding Capability Evaluation: Using ScreenSpot Pro, ScreenSpot, and ScreenSpot v2 to assess the ability to understand and localize elements in GUIs.
  • Offline Agent Capability Evaluation: Using Multimodal Mind2Web, AndroidControl, and GUI Odyssey to evaluate GUI agent capabilities in static environments.
  • Online Agent Capability Evaluation: Using OSWorld and AndroidWorld to evaluate performance in dynamic environments.
  • Comparing System-1 and System-2 Reasoning: Evaluating the effects of System-1 and System-2 reasoning on model performance.
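The System-1 versus System-2 comparison comes down to whether the model emits an action directly or writes a deliberate thought before the action. The sketch below parses both response styles, using an illustrative "Thought:/Action:" serialization that is an assumption, not UI-TARS's documented format:

```python
import re

def parse_agent_output(text: str):
    """Split a model response into (thought, action).

    System-1 style responses contain only an action; System-2 style responses
    prepend a 'Thought:' line before the 'Action:' line (illustrative format).
    """
    thought_match = re.search(r"Thought:\s*(.*)", text)
    action_match = re.search(r"Action:\s*(.*)", text)
    thought = thought_match.group(1).strip() if thought_match else None
    action = action_match.group(1).strip() if action_match else text.strip()
    return thought, action

print(parse_agent_output("click(502, 310)"))
print(parse_agent_output("Thought: The gear icon opens Settings.\nAction: click(502, 310)"))
```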

In summary, UI-TARS shows that the native GUI agent can integrate perception, action, reasoning, and memory into a scalable and adaptive framework, achieving strong performance on GUI tasks with minimal human oversight.
