UFO: A UI-Focused Agent for Windows OS Interaction (2402.07939v5)
Abstract: We introduce UFO, a UI-Focused agent that fulfills user requests in applications on Windows OS by harnessing the capabilities of GPT-Vision. UFO employs a dual-agent framework to observe and analyze the graphical user interface (GUI) and control information of Windows applications. This enables the agent to navigate and operate within individual applications and across them, so it can fulfill requests that span multiple applications. The framework incorporates a control interaction module that grounds actions without human intervention, enabling fully automated execution. Consequently, UFO turns arduous, time-consuming processes into tasks achievable solely through natural language commands. We tested UFO across 9 popular Windows applications, covering a variety of scenarios that reflect users' daily usage. The results, derived from both quantitative metrics and real-case studies, underscore the superior effectiveness of UFO in fulfilling user requests. To the best of our knowledge, UFO is the first UI agent specifically tailored for task completion within the Windows OS environment. The open-source code for UFO is available at https://github.com/microsoft/UFO.
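To make the dual-agent idea and the notion of action grounding concrete, here is a minimal Python sketch. It is not UFO's implementation: the abstract does not say how responsibilities are divided between the two agents, so the split used here (one agent picks the application, the other picks a control inside it) is an assumption, as are all names (`Control`, `select_app`, `select_control`, `run_request`). Simple keyword matching stands in for the GPT-Vision calls, and plain Python callables stand in for real Windows controls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Control:
    """Hypothetical, simplified stand-in for a UI control of an application."""
    label: str
    action: Callable[[], str]  # what happens when the control is operated


def select_app(request: str, apps: Dict[str, List[Control]]) -> Optional[str]:
    """First agent (assumed role): pick the application the request refers to.

    A real system would query a vision-language model; keyword matching
    is used here only so the sketch runs on its own.
    """
    for name in apps:
        if name.lower() in request.lower():
            return name
    return None


def select_control(request: str, controls: List[Control]) -> Optional[Control]:
    """Second agent (assumed role): pick a control inside the chosen application."""
    for control in controls:
        if control.label.lower() in request.lower():
            return control
    return None


def run_request(request: str, apps: Dict[str, List[Control]]) -> List[str]:
    """Ground the selected control to a concrete action and execute it."""
    app = select_app(request, apps)
    if app is None:
        return []
    control = select_control(request, apps[app])
    if control is None:
        return []
    # Action grounding: the chosen label is mapped to an executable control.
    return [control.action()]


# Toy environment: one application exposing one control.
apps = {"Notepad": [Control("Save", lambda: "saved")]}
results = run_request("Press Save in Notepad", apps)
```

In this toy setup, `run_request` returns an empty list when no application or control matches, which mirrors the case where a request cannot be grounded to any available control.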