ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation (2312.13108v2)

Published 20 Dec 2023 in cs.CV

Abstract: Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity. Existing works leveraging LLMs or LLM-based AI agents have shown capabilities in automating tasks on Android and Web platforms. However, these tasks are primarily aimed at simple device usage and entertainment operations. This paper presents a novel benchmark, AssistGUI, to evaluate whether models are capable of manipulating the mouse and keyboard on the Windows platform in response to user-requested tasks. We carefully collected a set of 100 tasks from nine widely-used software applications, such as After Effects and MS Word, each accompanied by the necessary project files for better evaluation. Moreover, we propose an advanced Actor-Critic Embodied Agent framework, which incorporates a sophisticated GUI parser driven by an LLM-agent and an enhanced reasoning mechanism adept at handling lengthy procedural tasks. Our experimental results reveal that our GUI Parser and Reasoning mechanism outshine existing methods in performance. Nevertheless, the potential remains substantial, with the best model attaining only a 46% success rate on our benchmark. We conclude with a thorough analysis of the current methods' limitations, setting the stage for future breakthroughs in this domain.
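
The abstract describes two technical pieces: a benchmark of application-specific tasks bundled with project files, and an Actor-Critic agent loop built from an LLM-driven GUI parser, an actor that issues mouse/keyboard actions, and a critic that reviews progress. The Python sketch below is a minimal illustration of how such pieces could fit together; every name in it (GUITask, Action, run_episode, and the parser/actor/critic callables) is hypothetical and is not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical task record: each benchmark task names a target application,
# a natural-language instruction, and the project files needed for evaluation.
@dataclass
class GUITask:
    app: str                      # e.g. "After Effects", "MS Word"
    instruction: str              # the user-requested task
    project_files: List[str]      # files opened before the episode starts

# Hypothetical low-level GUI action produced by the actor.
@dataclass
class Action:
    kind: str                     # "click", "type", "drag", ...
    target: Dict                  # parsed GUI element or screen coordinates
    text: str = ""                # keystrokes for "type" actions

def run_episode(
    task: GUITask,
    capture_screen: Callable[[], bytes],                               # screenshot source (stub)
    parse_gui: Callable[[bytes], List[Dict]],                          # LLM-driven GUI parser (assumed)
    plan_action: Callable[[str, List[Dict], List[Action]], Action],    # "actor" step (assumed)
    review: Callable[[str, List[Dict], List[Action]], str],            # "critic": "done" / "retry" / "continue"
    execute: Callable[[Action], None],                                 # mouse/keyboard driver (assumed)
    max_steps: int = 30,
) -> bool:
    """Sketch of one actor-critic episode on a desktop GUI task."""
    history: List[Action] = []
    for _ in range(max_steps):
        elements = parse_gui(capture_screen())      # structured view of the current screen
        action = plan_action(task.instruction, elements, history)
        execute(action)                             # apply the proposed mouse/keyboard action
        history.append(action)
        verdict = review(task.instruction, elements, history)
        if verdict == "done":
            return True                             # critic judges the task complete
        if verdict == "retry":
            history.pop()                           # discard the last step and re-plan
    return False                                    # step budget exhausted
```

Under this sketch, the benchmark's success rate would simply be the fraction of the 100 tasks for which run_episode returns True, which is how a figure such as the reported 46% for the best model would be computed.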

Authors (13)
  1. Difei Gao (32 papers)
  2. Lei Ji (33 papers)
  3. Zechen Bai (17 papers)
  4. Mingyu Ouyang (4 papers)
  5. Peiran Li (19 papers)
  6. Dongxing Mao (8 papers)
  7. Qinchen Wu (4 papers)
  8. Weichen Zhang (14 papers)
  9. Peiyi Wang (48 papers)
  10. Xiangwu Guo (2 papers)
  11. Hengxu Wang (2 papers)
  12. Luowei Zhou (31 papers)
  13. Mike Zheng Shou (165 papers)
Citations (16)