Android in the Zoo: Chain-of-Action-Thought for GUI Agents (2403.02713v2)
Abstract: Large language models (LLMs) have led to a surge of autonomous GUI agents for smartphones, which complete tasks triggered by natural language by predicting a sequence of API actions. Although such tasks depend heavily on past actions and visual observations, existing studies typically exploit little of the semantic information carried by intermediate screenshots and screen operations. To address this, this work presents Chain-of-Action-Thought (dubbed CoAT), which takes into account the description of previous actions, the current screen, and, more importantly, action thinking: which actions should be performed and what outcomes the chosen action leads to. We demonstrate that, in a zero-shot setting on three off-the-shelf LMMs, CoAT significantly improves action prediction compared to previously proposed context modeling. To further facilitate research in this line, we construct a dataset, Android-In-The-Zoo (AitZ), which contains 18,643 screen-action pairs together with chain-of-action-thought annotations. Experiments show that fine-tuning a 1B model (i.e., AUTO-UI-base) on our AitZ dataset achieves performance on par with CogAgent-Chat-18B.
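To make the CoAT idea concrete, below is a minimal, illustrative sketch of how such a prompt might be assembled for a zero-shot query to an off-the-shelf LMM. The field names, prompt wording, and the `query_lmm` stub are hypothetical placeholders, not the paper's exact prompt format or evaluation code.

```python
# Hypothetical sketch of a Chain-of-Action-Thought (CoAT) style prompt builder.
# It combines the four ingredients named in the abstract: previous actions,
# the current screen, action thinking, and the expected outcome of the action.
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoATContext:
    goal: str                # natural-language task instruction
    screen_description: str  # textual description of the current screenshot
    previous_actions: List[str] = field(default_factory=list)  # past action descriptions


def build_coat_prompt(ctx: CoATContext) -> str:
    """Compose a CoAT-style prompt asking the model to reason before acting."""
    history = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(ctx.previous_actions)) or "None"
    return (
        f"Task: {ctx.goal}\n"
        f"Current screen: {ctx.screen_description}\n"
        f"Previous actions:\n{history}\n\n"
        "First, think about which action should be performed next and why "
        "(action thinking). Then output the next action (e.g. CLICK, TYPE, "
        "SCROLL, PRESS_BACK) and the outcome you expect it to lead to."
    )


def query_lmm(prompt: str, screenshot_path: str) -> str:
    """Placeholder standing in for a call to a multimodal model
    (e.g. GPT-4V or Qwen-VL); replace with a real client."""
    raise NotImplementedError


if __name__ == "__main__":
    ctx = CoATContext(
        goal="Turn on airplane mode",
        screen_description="Android Settings screen with a 'Network & internet' entry visible",
        previous_actions=["CLICK the Settings icon on the home screen"],
    )
    print(build_coat_prompt(ctx))
```

In a zero-shot setting, the generated prompt would be sent alongside the current screenshot to the chosen LMM, and the predicted action parsed from its response; the exact prompt phrasing and action vocabulary used in the paper may differ.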
- Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.
- Screenai: A vision-language model for ui and infographics understanding.
- Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Mobile app tasks with iterative feedback (motif): Addressing task feasibility in interactive visual environments. arXiv preprint arXiv:2104.08560.
- Mind2web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070.
- An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
- Cogagent: A visual language model for gui agents. arXiv preprint arXiv:2312.08914.
- Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations.
- Applying learning-from-observation to household service robots: three common-sense formulation. arXiv preprint arXiv:2304.09966.
- Explore, select, derive, and recall: Augmenting llm with human-like memory for mobile task automation. arXiv preprint arXiv:2312.03003.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
- Interactive task learning from gui-grounded natural language instructions and demonstrations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 215–223.
- Mapping natural language instructions to mobile ui action sequences. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8198–8210.
- Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607.
- Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485.
- Learning design semantics for mobile apps. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology, pages 569–579.
- Physically assistive robots: A systematic review of mobile and manipulator robots that physically assist people with disabilities. Annual Review of Control, Robotics, and Autonomous Systems, 7.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.
- Android in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088.
- Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402.
- World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR.
- Meta-gui: Towards multi-modal conversational agents on mobile gui. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6699–6712.
- Towards better semantic understanding of mobile interfaces. arXiv preprint arXiv:2210.02663.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212.
- Ugif: Ui grounded instruction following. arXiv preprint arXiv:2211.07615.
- Enabling conversational interaction with mobile ui using large language models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–17.
- Screen2words: Automatic mobile ui summarization with multimodal learning. In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 498–510.
- Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.
- Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Empowering llm to use smartphone for intelligent task automation. arXiv preprint arXiv:2308.15272.
- Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation. arXiv preprint arXiv:2311.07562.
- Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771.
- Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381.
- What matters in training a gpt4-style language model with multimodal inputs? arXiv preprint arXiv:2307.02469.
- Zhuosheng Zhang and Aston Zhang. 2023. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436.
- Screen recognition: Creating accessibility metadata for mobile applications from pixels. Association for Computing Machinery, New York, NY, USA.
- Gpt-4v (ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614.
- Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.