Overcoming limitations of pure GUI-only operation

Design hybrid, GUI-centered interaction environments in which GUI agents interoperate with file systems, terminals, and external tools. Pure GUI manipulation alone is insufficient for realistic workflows such as data processing, software development, and system administration.

Background

The authors of the UI-TARS-2 technical report argue that pure GUI interaction is often inadequate for real-world tasks, many of which are more naturally handled through terminals, file systems, or external tools. They introduce a hybrid environment and a GUI-SDK to broaden the agent's capabilities, but they still list the broader challenge of moving beyond GUI-only operation as an open problem.

This inadequacy motivates their environment design choices (a shared file system, terminal access, and tool invocation), which are meant to enable richer workflows and stronger performance on tasks that require system-level capabilities.
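
To make the idea of a shared file system plus terminal access concrete, the following is a minimal sketch of a hybrid environment in which GUI, file, and terminal actions all act on the same workspace. It is an illustration only, not the UI-TARS-2 environment or its GUI-SDK: the class name HybridEnv, the action dictionary format, and the stubbed gui_click method are all assumptions made for this example.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class HybridEnv:
    """Hypothetical environment: GUI, file, and terminal actions share one workspace."""

    workspace: Path
    gui_log: list = field(default_factory=list)

    def gui_click(self, x: int, y: int) -> str:
        # Stub: a real environment would forward the click to a screen or browser.
        self.gui_log.append(("click", x, y))
        return f"clicked ({x}, {y})"

    def write_file(self, name: str, text: str) -> str:
        path = self.workspace / name
        path.write_text(text, encoding="utf-8")
        return str(path)

    def read_file(self, name: str) -> str:
        return (self.workspace / name).read_text(encoding="utf-8")

    def run_terminal(self, argv: list, timeout: float = 10.0) -> str:
        # The command runs with the shared workspace as its working directory,
        # so files written by other actions are visible to it.
        result = subprocess.run(
            argv, cwd=self.workspace, capture_output=True, text=True, timeout=timeout
        )
        return result.stdout

    def step(self, action: dict) -> str:
        # Dispatch one action (assumed dict format) to the matching channel.
        kind = action["type"]
        if kind == "click":
            return self.gui_click(action["x"], action["y"])
        if kind == "write_file":
            return self.write_file(action["name"], action["text"])
        if kind == "read_file":
            return self.read_file(action["name"])
        if kind == "terminal":
            return self.run_terminal(action["argv"])
        raise ValueError(f"unknown action type: {kind}")


def demo() -> tuple:
    """Write a file, process it from a terminal command, then issue a GUI click."""
    with tempfile.TemporaryDirectory() as tmp:
        env = HybridEnv(Path(tmp))
        env.step({"type": "write_file", "name": "data.txt", "text": "3\n4\n5\n"})
        script = "print(sum(int(line) for line in open('data.txt')))"
        out = env.step({"type": "terminal", "argv": [sys.executable, "-c", script]})
        env.step({"type": "click", "x": 10, "y": 20})
        return out.strip(), env.gui_log
```

In this sketch the data-processing step is a single terminal command over a file in the shared workspace, a task that would take many clicks and keystrokes through a GUI alone. A single step method keeps all three channels behind one action interface, which is one plausible way for an agent to mix them within an episode.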

References

While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability.

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning (2509.02544 - Wang et al., 2 Sep 2025) in Abstract (Page 1)