UFO: A UI-Focused Agent for Windows OS Interaction (2402.07939v5)

Published 8 Feb 2024 in cs.HC, cs.AI, and cs.CL

Abstract: We introduce UFO, an innovative UI-Focused agent to fulfill user requests tailored to applications on Windows OS, harnessing the capabilities of GPT-Vision. UFO employs a dual-agent framework to meticulously observe and analyze the graphical user interface (GUI) and control information of Windows applications. This enables the agent to seamlessly navigate and operate within individual applications and across them to fulfill user requests, even when spanning multiple applications. The framework incorporates a control interaction module, facilitating action grounding without human intervention and enabling fully automated execution. Consequently, UFO transforms arduous and time-consuming processes into simple tasks achievable solely through natural language commands. We conducted testing of UFO across 9 popular Windows applications, encompassing a variety of scenarios reflective of users' daily usage. The results, derived from both quantitative metrics and real-case studies, underscore the superior effectiveness of UFO in fulfilling user requests. To the best of our knowledge, UFO stands as the first UI agent specifically tailored for task completion within the Windows OS environment. The open-source code for UFO is available on https://github.com/microsoft/UFO.

Summary

The paper introduces UFO, a dual-agent framework that selects applications and executes GUI actions using GPT-Vision.
In tests across 9 Windows applications, it achieves an 86% success rate and completes tasks in fewer steps than the GPT-3.5 and GPT-4 baselines.
The design includes a safeguard mechanism that prompts the user for confirmation before sensitive actions, improving reliability and security.

Unveiling UFO: A UI-Focused Agent Tailored for Windows OS Interaction Using GPT-Vision

Introduction to UFO

The paper presents UFO, a UI-focused agent designed to navigate and operate applications on the Windows operating system. UFO addresses the need for a versatile agent that can understand and execute user requests across various applications without manual intervention. By building on GPT-Vision, UFO automates tasks within Windows environments and, to the best of the authors' knowledge, is the first UI agent tailored specifically to Windows OS.

Framework Design and Implementation

The dual-agent framework of UFO, consisting of an Application Selection Agent (AppAgent) and an Action Selection Agent (ActAgent), forms the core of its operation. The AppAgent is responsible for choosing the appropriate application based on user requests and formulating a global plan for task completion. The ActAgent, on the other hand, executes actions within the selected application based on a local plan and observations from the application's GUI. This process incorporates a control interaction module, translating actions derived from GPT-Vision into executable operations on the application controls.
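The division of labor between the two agents can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the UFO implementation: the names `Step`, `Session`, `select_app`, `run_session`, and `scripted_act_agent` are hypothetical, and both agents are replaced by plain functions so the sketch runs without a vision model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Step:
    control: str        # label of the UI control to act on
    operation: str      # e.g. "click" or "set_text"
    done: bool = False  # True once the action agent considers the task finished


@dataclass
class Session:
    request: str
    app: str
    global_plan: List[str]
    history: List[Step] = field(default_factory=list)


def select_app(request: str, open_apps: List[str]) -> Tuple[str, List[str]]:
    """Hypothetical AppAgent stand-in: pick an application and draft a global plan.

    A real agent would send the request and a desktop screenshot to a
    vision-language model; a simple name match keeps this sketch runnable.
    """
    app = next((a for a in open_apps if a.lower() in request.lower()), open_apps[0])
    return app, ["focus " + app, "perform the request", "confirm the result"]


def scripted_act_agent(session: Session, controls: List[str]) -> Step:
    """Toy ActAgent: click each visible control once, then report completion."""
    used = {s.control for s in session.history}
    remaining = [c for c in controls if c not in used]
    # The task is treated as finished when this click uses the last control.
    return Step(control=remaining[0], operation="click", done=len(remaining) == 1)


def run_session(
    request: str,
    open_apps: List[str],
    choose_action: Callable[[Session, List[str]], Step],
    list_controls: Callable[[str], List[str]],
    max_steps: int = 10,
) -> Session:
    """Select an app once, then let the action agent act until done or out of steps."""
    app, plan = select_app(request, open_apps)
    session = Session(request=request, app=app, global_plan=plan)
    for _ in range(max_steps):
        controls = list_controls(session.app)  # observe the current GUI
        step = choose_action(session, controls)
        session.history.append(step)
        if step.done:
            break
    return session
```

Calling `run_session("Save the open Word document", ["Outlook", "Word"], scripted_act_agent, lambda app: ["File", "Save"])` selects "Word" and records two click steps. The `max_steps` bound is an assumption of the sketch, added so a misbehaving agent cannot loop forever.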

One of the distinguishing features of UFO is its capability to handle user requests that span multiple applications. This is achieved through an application-switching mechanism within the framework, enhancing UFO's ability to perform complex and multifaceted tasks. Further, the system's extensibility allows for the customization of actions and control operations, catering to specific application needs and tasks.
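One way to picture this kind of extensibility is a registry that maps operation names chosen by the model to executable functions, so that new operations can be added without touching the dispatcher. The sketch below is an assumption about how such a layer could look, not UFO's actual API: `FakeControl`, `OPERATIONS`, `operation`, and `execute` are hypothetical names, and the control is an in-memory stand-in rather than a real Windows UI element.

```python
from typing import Callable, Dict


class FakeControl:
    """In-memory stand-in for a UI control; it only records what was done to it."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.text = ""
        self.clicks = 0


# Registry of supported operations, keyed by the name the agent would emit.
OPERATIONS: Dict[str, Callable[..., None]] = {}


def operation(name: str):
    """Decorator that registers a function as an executable operation."""
    def register(fn: Callable[..., None]) -> Callable[..., None]:
        OPERATIONS[name] = fn
        return fn
    return register


@operation("click")
def click(control: FakeControl) -> None:
    control.clicks += 1


@operation("set_text")
def set_text(control: FakeControl, text: str) -> None:
    control.text = text


def execute(control: FakeControl, op_name: str, **kwargs) -> None:
    """Ground a named action on a control, rejecting unknown operations."""
    if op_name not in OPERATIONS:
        raise ValueError("unsupported operation: " + op_name)
    OPERATIONS[op_name](control, **kwargs)
```

For example, `execute(FakeControl("Search"), "set_text", text="hello")` sets the control's text, and an application-specific operation can be added by decorating another function with `@operation("...")`. Rejecting unknown names keeps a malformed model output from reaching the application.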

Experimental Evaluation

UFO was evaluated on 9 widely used Windows applications, covering a range of scenarios that reflect everyday use. Compared with baselines including GPT-3.5 and GPT-4, UFO completes tasks in fewer steps and with a higher completion rate. Notably, UFO achieved an 86% success rate across the evaluated tasks, significantly outperforming the baselines. This highlights UFO's effectiveness and efficiency in fulfilling user requests within the Windows OS environment.

UFO's design also includes a safeguard mechanism that prompts the user for confirmation before sensitive actions, adding a layer of security and trustworthiness to automated tasks. In the evaluation, UFO recorded a safeguard rate of 85.7%, which supports the system's reliability.
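A confirmation gate of this kind can be sketched in a few lines. The version below is illustrative only: the summary does not say how UFO decides that an action is sensitive, so the keyword list `SENSITIVE_KEYWORDS` is a stand-in assumption, and `is_sensitive` and `guarded_execute` are hypothetical names.

```python
from typing import Callable

# Assumed trigger words; a real agent might instead let the model flag the action.
SENSITIVE_KEYWORDS = ("delete", "send", "overwrite", "install")


def is_sensitive(description: str) -> bool:
    """Return True if the action description contains an assumed trigger word."""
    text = description.lower()
    return any(word in text for word in SENSITIVE_KEYWORDS)


def guarded_execute(
    description: str,
    action: Callable[[], None],
    confirm: Callable[[str], bool],
) -> str:
    """Run the action, asking for confirmation first if it looks sensitive."""
    if is_sensitive(description) and not confirm(description):
        return "skipped"
    action()
    return "executed"
```

In an interactive session, `confirm` could wrap `input()` and return `True` only on an explicit "yes". Passing it in as a parameter keeps the gate testable without a human in the loop.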

Implications and Future Directions

The development of UFO paves the way for significant advancements in UI automation and interaction within Windows environments. This research extends the capabilities of generative AI models, proposing a practical application in automating daily tasks through natural language commands. The initial success of UFO suggests vast potential for further development, including expanding support for diverse applications, enhancing control interactions for more complex tasks, and improving adaptability to unfamiliar UIs.

Furthermore, the open-source availability of UFO invites contributions from the wider research community, facilitating continuous improvement and customization. As generative AI and LLMs continue to evolve, tools like UFO are likely to become increasingly important in bridging the gap between AI capabilities and practical applications in various domains.

Conclusion

UFO represents a significant leap forward in the use of generative AI for UI-focused tasks within the Windows operating system. By combining the capabilities of GPT-Vision with a dedicated dual-agent framework, UFO offers a viable solution to automate complex tasks across multiple applications seamlessly. The promising outcomes from the experimental evaluation, along with its open-source model, set a solid foundation for future research and developments in AI-driven UI interactions. As this field progresses, UFO is poised to play a pivotal role in transforming how users interact with Windows OS, streamlining tasks, and enhancing productivity through advanced AI automation.