UFO: A UI-Focused Agent for Windows OS Interaction (2402.07939v5)

Published 8 Feb 2024 in cs.HC, cs.AI, and cs.CL

Abstract: We introduce UFO, an innovative UI-Focused agent to fulfill user requests tailored to applications on Windows OS, harnessing the capabilities of GPT-Vision. UFO employs a dual-agent framework to meticulously observe and analyze the graphical user interface (GUI) and control information of Windows applications. This enables the agent to seamlessly navigate and operate within individual applications and across them to fulfill user requests, even when spanning multiple applications. The framework incorporates a control interaction module, facilitating action grounding without human intervention and enabling fully automated execution. Consequently, UFO transforms arduous and time-consuming processes into simple tasks achievable solely through natural language commands. We conducted testing of UFO across 9 popular Windows applications, encompassing a variety of scenarios reflective of users' daily usage. The results, derived from both quantitative metrics and real-case studies, underscore the superior effectiveness of UFO in fulfilling user requests. To the best of our knowledge, UFO stands as the first UI agent specifically tailored for task completion within the Windows OS environment. The open-source code for UFO is available on https://github.com/microsoft/UFO.

Summary

  • The paper introduces UFO, a dual-agent framework that selects applications and executes GUI actions using GPT-Vision.
  • It outperforms GPT-3.5 and GPT-4 by completing tasks in fewer steps, achieving an 86% success rate across 9 Windows applications.
  • The design includes a safeguard mechanism prompting user confirmation for sensitive actions, ensuring reliability and security.

Unveiling UFO: A UI-Focused Agent Tailored for Windows OS Interaction Using GPT-Vision

Introduction to UFO

The paper presents UFO, a UI-focused agent designed to interact with and navigate applications on the Windows operating system. UFO addresses the need for a versatile agent capable of understanding and executing user requests across a range of applications without manual intervention. By leveraging GPT-Vision, UFO introduces a novel approach to automating tasks within Windows environments, positioning it as the first UI agent built specifically for the Windows OS.

Framework Design and Implementation

The dual-agent framework of UFO, consisting of an Application Selection Agent (AppAgent) and an Action Selection Agent (ActAgent), forms the core of its operation. The AppAgent is responsible for choosing the appropriate application based on user requests and formulating a global plan for task completion. The ActAgent, on the other hand, executes actions within the selected application based on a local plan and observations from the application's GUI. This process incorporates a control interaction module, translating actions derived from GPT-Vision into executable operations on the application controls.
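To make this division of labor concrete, the following Python sketch outlines how such a dual-agent loop could be wired together. All class and function names here are illustrative assumptions for exposition, not UFO's actual code or API: in the real system, application selection and action grounding are GPT-Vision calls over screenshots and control metadata, and execution goes through the control interaction module (pywinauto-style automation).

from dataclasses import dataclass

# Illustrative sketch of a dual-agent loop in the spirit of UFO's design.
# All names below are hypothetical; UFO's real agents query GPT-Vision with
# annotated screenshots and control lists rather than the stubs shown here.

@dataclass
class Action:
    control_id: str      # identifier of the target GUI control
    operation: str       # e.g. "click" or "set_text"
    argument: str = ""   # optional payload, such as text to type

class AppSelectionAgent:
    """Picks the application and drafts a global plan for the request."""
    def select(self, request: str, open_windows: list[str]) -> tuple[str, list[str]]:
        app = open_windows[0]                      # placeholder decision
        plan = [f"Carry out '{request}' inside {app}"]
        return app, plan

class ActionSelectionAgent:
    """Grounds one plan step to a concrete control action from GUI observations."""
    def next_action(self, step: str, controls: list[str]) -> Action:
        return Action(control_id=controls[0], operation="click")

def execute(action: Action) -> None:
    """Stand-in for the control interaction module executing on real controls."""
    print(f"{action.operation} on {action.control_id} {action.argument}".strip())

def run(request: str, open_windows: list[str], controls: list[str]) -> None:
    app_agent, act_agent = AppSelectionAgent(), ActionSelectionAgent()
    _, plan = app_agent.select(request, open_windows)
    for step in plan:                              # local plans refine each step
        execute(act_agent.next_action(step, controls))

run("Reply to the latest email", ["Outlook", "Word"], ["btn_reply", "txt_body"])

The point of the sketch is the control flow: global planning happens once per request at the application level, while action grounding and execution repeat step by step inside the chosen application.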

One of the distinguishing features of UFO is its capability to handle user requests that span multiple applications. This is achieved through an application-switching mechanism within the framework, enhancing UFO's ability to perform complex and multifaceted tasks. Further, the system's extensibility allows for the customization of actions and control operations, catering to specific application needs and tasks.
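As a rough illustration of that extensibility, an application-specific action could be exposed to the action selector through a small registry. The names below are hypothetical and do not reflect UFO's actual plugin interface; they only show the shape of per-application customization.

from typing import Callable

# Hypothetical registry of application-specific actions; illustrative only,
# not UFO's real extension mechanism.
CUSTOM_ACTIONS: dict[str, dict[str, Callable[..., str]]] = {}

def register_action(app: str, name: str):
    """Associate a named operation with a particular application."""
    def decorator(fn: Callable[..., str]) -> Callable[..., str]:
        CUSTOM_ACTIONS.setdefault(app, {})[name] = fn
        return fn
    return decorator

@register_action("Excel", "insert_table")
def insert_table(rows: int, cols: int) -> str:
    return f"Inserted a {rows}x{cols} table"

# The action selector could then offer "insert_table" only when Excel is active:
print(CUSTOM_ACTIONS["Excel"]["insert_table"](3, 4))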

Experimental Evaluation

UFO was rigorously tested across 9 widely used Windows applications, covering a diverse range of scenarios reflective of everyday usage. A comparison with GPT-3.5 and GPT-4 baselines shows that UFO completes tasks successfully in fewer steps and with a higher completion rate. Notably, UFO achieved an 86% success rate across the evaluated tasks, significantly outperforming the baselines and highlighting its effectiveness and efficiency in fulfilling user requests within the Windows OS environment.

The safety measures integrated into UFO, demonstrated through a safeguard rate of 85.7%, further underscore the system's reliability. UFO's design includes a safeguard mechanism that prompts user confirmation for sensitive actions, ensuring security and trustworthiness while automating tasks.
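A minimal sketch of such a confirmation gate is shown below; the operation set, prompt, and function names are assumptions made for illustration, not the paper's implementation.

# Illustrative confirmation gate for sensitive actions, in the spirit of
# UFO's safeguard mechanism; the operation set and prompt are assumptions.
SENSITIVE_OPERATIONS = {"delete_file", "send_email", "close_without_saving"}

def confirm_if_sensitive(operation: str, description: str) -> bool:
    """Ask the user before executing an operation flagged as sensitive."""
    if operation not in SENSITIVE_OPERATIONS:
        return True
    answer = input(f"About to {description}. Proceed? [y/N] ")
    return answer.strip().lower() == "y"

if confirm_if_sensitive("send_email", "send the drafted reply to all recipients"):
    print("Action executed.")
else:
    print("Action cancelled by the user.")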

Implications and Future Directions

The development of UFO paves the way for significant advancements in UI automation and interaction within Windows environments. This research extends the capabilities of generative AI models, proposing a practical application in automating daily tasks through natural language commands. The initial success of UFO suggests vast potential for further development, including expanding support for diverse applications, enhancing control interactions for more complex tasks, and improving adaptability to unfamiliar UIs.

Furthermore, the open-source availability of UFO invites contributions from the wider research community, facilitating continuous improvement and customization. As generative AI and LLMs continue to evolve, tools like UFO will become increasingly integral in bridging the gap between AI capabilities and practical applications in various domains.

Conclusion

UFO represents a significant step forward in the use of generative AI for UI-focused tasks within the Windows operating system. By combining the capabilities of GPT-Vision with a dedicated dual-agent framework, UFO offers a viable approach to automating complex tasks seamlessly across multiple applications. The promising experimental results, along with the open-source release, set a solid foundation for future research and development in AI-driven UI interaction. As this field progresses, UFO is poised to play a pivotal role in transforming how users interact with Windows OS, streamlining tasks and enhancing productivity through advanced AI automation.
