
You Only Look at Screens: Multimodal Chain-of-Action Agents (2309.11436v4)

Published 20 Sep 2023 in cs.CL, cs.AI, and cs.HC

Abstract: Autonomous graphical user interface (GUI) agents aim to facilitate task automation by interacting with the user interface without manual intervention. Recent studies have investigated eliciting the capabilities of LLMs for effective engagement in diverse environments. To align with the input-output requirement of LLMs, most existing approaches are developed under a sandbox setting where they rely on external tools and application-specific APIs to parse the environment into textual elements and interpret the predicted actions. Consequently, those approaches often grapple with inference inefficiency and error propagation risks. To mitigate the challenges, we introduce Auto-GUI, a multimodal solution that directly interacts with the interface, bypassing the need for environment parsing or reliance on application-dependent APIs. Moreover, we propose a chain-of-action technique -- leveraging a series of intermediate previous action histories and future action plans -- to help the agent decide what action to execute. We evaluate our approach on a new device-control benchmark AITW with 30K unique instructions, spanning multi-step tasks such as application operation, web searching, and web shopping. Experimental results show that Auto-GUI achieves state-of-the-art performance with an action type prediction accuracy of 90% and an overall action success rate of 74%. Code is publicly available at https://github.com/cooelf/Auto-GUI.
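
The chain-of-action idea described in the abstract (conditioning each action prediction on the history of already-executed actions plus a running plan of future steps) can be pictured with a small sketch. The code below is an illustrative assumption rather than the Auto-GUI implementation: the names Action, ChainOfActionContext, and to_prompt are hypothetical, and the actual model fuses this textual context with encoded screen features rather than building a plain prompt string.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    """One GUI action; this action vocabulary is illustrative, not the paper's schema."""
    action_type: str    # e.g. "click", "scroll", "type"
    argument: str = ""  # e.g. target description or text to type


@dataclass
class ChainOfActionContext:
    """Rolling context the agent conditions on at every decoding step."""
    goal: str
    previous_actions: List[Action] = field(default_factory=list)
    future_plan: List[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Serialize goal + executed-action history + planned future steps into one
        text block; a multimodal model would combine this with screen features."""
        history = "\n".join(
            f"{i + 1}. {a.action_type}({a.argument})"
            for i, a in enumerate(self.previous_actions)
        ) or "(none)"
        plan = "\n".join(f"- {step}" for step in self.future_plan) or "(none)"
        return (
            f"Goal: {self.goal}\n"
            f"Previous actions:\n{history}\n"
            f"Planned future actions:\n{plan}\n"
            "Next action:"
        )


# Example: a web-shopping style task in the spirit of the AITW benchmark.
ctx = ChainOfActionContext(
    goal="Search for wireless headphones and open the first result",
    future_plan=["open the browser", "type the query", "tap the first result"],
)
ctx.previous_actions.append(Action("click", "search bar"))
print(ctx.to_prompt())
```

After each executed step, the new action is appended to the history and the plan is refreshed, so the serialized context grows into the chain that the next-action decoder conditions on.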

Authors (2)
  1. Zhuosheng Zhang (125 papers)
  2. Aston Zhang (48 papers)
Citations (66)