Overview of UFO: The Desktop AgentOS
The paper introduces UFO, a multi-agent AgentOS designed specifically for Windows desktop automation, aiming to move Computer-Using Agents (CUAs) from conceptual prototypes to practical, user-oriented solutions. The system leverages recent advances in multimodal LLMs to automate complex desktop workflows, emphasizing deep OS-level integration and system-wide orchestration.
The UFO architecture consists of a centralized HostAgent responsible for task decomposition and coordination, and a suite of application-specialized AppAgents. These AppAgents are equipped with native APIs, domain-specific knowledge, and a unified GUI–API action layer, which allows for more robust and efficient task execution. A hybrid control detection pipeline combines Windows UI Automation (UIA) with vision-based parsing to enhance interface recognition, while speculative multi-action planning reduces LLM overhead per step. A Picture-in-Picture (PiP) interface lets automation proceed in an isolated virtual desktop, enabling seamless multitasking between agents and users without interference.
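To make the division of labor concrete, below is a minimal, hypothetical sketch of the HostAgent-to-AppAgent orchestration: a central coordinator decomposes a request into subtasks and routes each one to an application-specialized agent. The class names, the `plan_subtasks` stub, and the registry layout are illustrative assumptions, not the paper's actual interfaces.

```python
# Minimal sketch of HostAgent -> AppAgent orchestration (names are assumptions).
from dataclasses import dataclass


@dataclass
class Subtask:
    app: str   # target application, e.g. "excel" or "outlook"
    goal: str  # natural-language goal handed to the AppAgent


class AppAgent:
    """Application-specialized agent that executes one subtask at a time."""

    def __init__(self, app: str):
        self.app = app

    def execute(self, subtask: Subtask) -> str:
        # In the real system this would run a ReAct-style observe/plan/act loop
        # against the live application; here we simply report the assignment.
        return f"[{self.app}] completed: {subtask.goal}"


class HostAgent:
    """Central coordinator: decomposes a request and routes subtasks to AppAgents."""

    def __init__(self, registry: dict[str, AppAgent]):
        self.registry = registry

    def plan_subtasks(self, request: str) -> list[Subtask]:
        # Placeholder for an LLM-backed planner; a fixed plan keeps the sketch runnable.
        return [
            Subtask(app="excel", goal="extract the quarterly totals"),
            Subtask(app="outlook", goal="email the totals to the team"),
        ]

    def run(self, request: str) -> list[str]:
        return [self.registry[s.app].execute(s) for s in self.plan_subtasks(request)]


if __name__ == "__main__":
    agents = {name: AppAgent(name) for name in ("excel", "outlook")}
    for line in HostAgent(agents).run("Summarize Q3 sales and send them to the team"):
        print(line)
```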
Key Contributions
- Deep OS Integration: UFO deeply embeds automation capabilities within the Windows OS, coordinating desktop applications through introspection, API access, and execution control. This integration aims to overcome the challenges faced by traditional CUAs, which often rely on superficial GUI interactions.
- Unified GUI–API Action Layer: By bridging traditional GUI actions and application-native API calls, UFO facilitates flexible and robust automation, reducing the fragility of automation scripts and enhancing execution efficiency.
- Hybrid Control Detection: The paper introduces a fusion pipeline that combines UIA metadata with vision-based detection to achieve reliable control grounding across standard and non-standard interfaces (a minimal fusion sketch follows this list).
- Continuous Knowledge Integration: A retrieval-augmented memory system integrates external documentation and historical execution logs, allowing agents to refine their behavior over time without retraining (a small retrieval sketch also follows this list).
- Speculative Multi-Action Execution: UFO predicts likely action sequences in a single inference and validates each step with lightweight control-state checks, significantly reducing per-step LLM overhead.
- Non-Disruptive UX: The PiP interface creates a virtual desktop where automation can proceed in parallel with user activity, avoiding the interference common in traditional CUAs.
- Comprehensive Evaluation: The evaluation of UFO across 20+ real-world Windows applications demonstrates its substantial improvements in success rate, execution efficiency, and usability over existing CUAs like Operator.
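As referenced in the Hybrid Control Detection item above, here is a rough sketch of how UIA-reported controls might be fused with vision-detected boxes: UIA entries are kept as-is, and a vision detection is added only when it does not overlap an existing UIA control. The `Control` record, the IoU threshold, and the fusion rule are assumptions for illustration, not the paper's exact pipeline.

```python
# Rough sketch of hybrid control detection: merge UIA controls with vision detections,
# dropping vision boxes that duplicate a control UIA already reported.
from dataclasses import dataclass


@dataclass
class Control:
    name: str
    box: tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
    source: str                              # "uia" or "vision"


def iou(a, b) -> float:
    """Intersection-over-union of two (l, t, r, b) boxes."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, right - left) * max(0.0, bottom - top)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def fuse_controls(uia: list[Control], vision: list[Control], thr: float = 0.5) -> list[Control]:
    """Keep every UIA control; add vision detections only where UIA saw nothing."""
    merged = list(uia)
    for v in vision:
        if all(iou(v.box, u.box) < thr for u in uia):
            merged.append(v)
    return merged


if __name__ == "__main__":
    uia = [Control("Save button", (10, 10, 60, 40), "uia")]
    vision = [
        Control("icon near (12, 12)", (12, 12, 58, 38), "vision"),    # duplicate of the UIA control
        Control("canvas widget", (100, 10, 180, 60), "vision"),       # custom-drawn, invisible to UIA
    ]
    for c in fuse_controls(uia, vision):
        print(c.source, c.name)
```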
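And for the Continuous Knowledge Integration item, a dependency-free sketch of the retrieval-augmented memory idea: documentation snippets and execution logs are stored as text records, and the most relevant ones are retrieved to ground the next prompt. A real system would likely use embedding-based retrieval; the keyword-overlap scoring and record format here are stand-ins.

```python
# Minimal sketch of retrieval-augmented agent memory over docs and execution logs.
from dataclasses import dataclass


@dataclass
class MemoryRecord:
    kind: str   # "doc" or "log"
    text: str


class AgentMemory:
    def __init__(self):
        self.records: list[MemoryRecord] = []

    def add(self, kind: str, text: str) -> None:
        self.records.append(MemoryRecord(kind, text))

    def retrieve(self, query: str, k: int = 2) -> list[MemoryRecord]:
        # Naive lexical-overlap scoring as a placeholder for embedding similarity.
        query_terms = set(query.lower().split())
        return sorted(
            self.records,
            key=lambda r: len(query_terms & set(r.text.lower().split())),
            reverse=True,
        )[:k]


if __name__ == "__main__":
    memory = AgentMemory()
    memory.add("doc", "To freeze the top row in Excel use View > Freeze Panes > Freeze Top Row.")
    memory.add("log", "Previous run: exporting to PDF required File > Export, not Save As.")
    for record in memory.retrieve("freeze the header row in Excel"):
        print(record.kind, "->", record.text)
```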
System Design and Components
The UFO system is fundamentally multi-agent, consisting of a HostAgent for centralized orchestration and specialized AppAgents tailored to individual Windows applications. The AppAgents employ a ReAct-style control loop, iteratively observing application states, formulating plans, and executing actions. This structured execution loop benefits from a dual perception layer, visual and semantic, that captures GUI screenshots and UIA metadata respectively to improve decision-making accuracy.
The introduction of the Puppeteer as a unified execution engine transforms action execution, dynamically selecting between GUI-level and API-based actions. This flexibility improves execution robustness and efficiency while maintaining modularity and extensibility.
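The following is a simplified, hypothetical sketch of one AppAgent step with a Puppeteer-style executor that routes an action to a registered native API call when one is available and falls back to a GUI interaction otherwise. The names (`Puppeteer`, `Action`, `api_registry`) and the stubbed planning step are assumptions; the real engine is driven by LLM output and a much richer action space.

```python
# Simplified sketch of an AppAgent step with a Puppeteer-style GUI/API dispatcher.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    name: str    # e.g. "save_workbook" or "click_button"
    target: str  # control identifier or API argument


class Puppeteer:
    """Unified execution engine: route each action to an API call or a GUI event."""

    def __init__(self, api_registry: dict[str, Callable[[str], str]]):
        self.api_registry = api_registry

    def execute(self, action: Action) -> str:
        api = self.api_registry.get(action.name)
        if api is not None:
            return api(action.target)                      # robust, one-shot API path
        return f"GUI: clicked control '{action.target}'"   # fallback GUI path


def app_agent_step(observation: str, puppeteer: Puppeteer) -> str:
    """One observe -> plan -> act iteration; the 'plan' here stands in for an LLM call."""
    planned = Action(name="save_workbook", target="report.xlsx")
    return puppeteer.execute(planned)


if __name__ == "__main__":
    apis = {"save_workbook": lambda path: f"API: workbook saved to {path}"}
    print(app_agent_step("screenshot + UIA tree of Excel", Puppeteer(apis)))
```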
The paper also discusses the practical implementation and engineering design considerations, emphasizing session-based execution, safety mechanisms, and extensibility through an agent registry that encapsulates third-party components as AppAgents.
Evaluation and Results
UFO was evaluated comprehensively across two established Windows-centric automation benchmarks, demonstrating significant improvements in success rates. The results show that even in cases where UIA alone fails on non-standard interfaces, the hybrid control detection strategy recovers task completions. Moreover, integrating API-based actions with GUI interactions improves robustness and efficiency, significantly reducing the average number of completion steps compared to GUI-only interaction.
The paper highlights an interesting case study where the speculative multi-action execution approach consolidates multiple steps into a single LLM call, showcasing potential latency reductions without compromising reliability.
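A hedged sketch of how such consolidation might look: one (stubbed) inference proposes a short batch of actions, each action is preceded by a cheap check that its expected control is still present in the current UIA snapshot, and the remainder of the batch is discarded on the first mismatch. All names and the validation rule are illustrative assumptions rather than the paper's implementation.

```python
# Sketch of speculative multi-action execution with lightweight per-step validation.
from dataclasses import dataclass


@dataclass
class PlannedAction:
    control: str  # control the action expects to find
    command: str  # what to do with it


def propose_batch(goal: str) -> list[PlannedAction]:
    # Stand-in for one LLM inference that returns several likely next steps at once.
    return [
        PlannedAction("File menu", "click"),
        PlannedAction("Export item", "click"),
        PlannedAction("PDF option", "select"),
    ]


def control_present(ui_state: set[str], control: str) -> bool:
    # Lightweight validation against the current UIA snapshot; no LLM involved.
    return control in ui_state


def run_speculative(goal: str, ui_states: list[set[str]]) -> list[str]:
    results = []
    for action, state in zip(propose_batch(goal), ui_states):
        if not control_present(state, action.control):
            results.append(f"abort: '{action.control}' missing, replanning needed")
            break
        results.append(f"{action.command} -> {action.control}")
    return results


if __name__ == "__main__":
    # Three successive UI snapshots; the third lacks the expected control.
    snapshots = [{"File menu"}, {"Export item"}, {"Save As option"}]
    for line in run_speculative("export the document as PDF", snapshots):
        print(line)
```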
Implications and Future Work
The implications of this research span both theoretical and practical domains. The integration of deep OS-level capabilities with advanced multimodal LLMs suggests a shift from surface-level GUI agents toward system-level abstractions for desktop automation. The enhancements presented not only improve robustness and execution accuracy but also raise the bar for scalability and user-aligned automation in CUAs.
Future work may involve further reducing execution latency and step counts toward human-level performance. Additionally, exploring cross-platform implementations could pave the way for a unified ecosystem of desktop automation across diverse operating systems.
Overall, UFO represents a substantial contribution to the field of desktop automation by embedding CUAs within the OS as robust, scalable, and practical solutions. This architectural advancement holds promise for the widespread adoption and application of intelligent, language-driven automation in everyday computing environments.