VisionTasker: Mobile Task Automation Using Vision Based UI Understanding and LLM Task Planning (2312.11190v2)

Abstract: Mobile task automation is an emerging field that leverages AI to streamline and optimize the execution of routine tasks on mobile devices, thereby enhancing efficiency and productivity. Traditional methods, such as Programming By Demonstration (PBD), are limited due to their dependence on predefined tasks and susceptibility to app updates. Recent advancements have utilized the view hierarchy to collect UI information and employed large language models (LLMs) to enhance task automation. However, view hierarchies raise accessibility concerns and often suffer from problems such as missing object descriptions or misaligned structures. This paper introduces VisionTasker, a two-stage framework combining vision-based UI understanding and LLM task planning, for mobile task automation in a step-by-step manner. First, VisionTasker converts a UI screenshot into natural language interpretations using a vision-based UI understanding approach, eliminating the need for view hierarchies. Second, it adopts a step-by-step task planning method, presenting one interface at a time to the LLM. The LLM then identifies relevant elements within the interface and determines the next action, enhancing accuracy and practicality. Extensive experiments show that VisionTasker outperforms previous methods, providing effective UI representations across four datasets. Additionally, in automating 147 real-world tasks on an Android smartphone, VisionTasker demonstrates advantages over humans on tasks that are unfamiliar to them, and shows significant improvements when integrated with the PBD mechanism. VisionTasker is open-source and available at https://github.com/AkimotoAyako/VisionTasker.
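The abstract describes a two-stage loop: a vision-based parser converts each screenshot into a natural-language description of the interface, and an LLM, shown one interface at a time together with the action history, selects the next action. Below is a minimal, hypothetical Python sketch of that loop; the helper names (`describe_screen`, `plan_next_action`, `call_llm`, `capture_screenshot`, `perform_action`) are illustrative placeholders, not the authors' actual API.

```python
# Illustrative sketch of a two-stage screenshot-to-action loop.
# All helpers are stubs; a real system would plug in a widget detector,
# an OCR model, an LLM API, and a device controller (e.g. adb).
from dataclasses import dataclass


@dataclass
class Action:
    kind: str          # e.g. "tap", "input", "scroll", "finish"
    target: str = ""   # natural-language description of the target element


def describe_screen(screenshot_png: bytes) -> str:
    """Stage 1 (vision-based UI understanding): detect widgets, run OCR,
    and render the screen as a natural-language interpretation. Stubbed."""
    return "A settings screen with buttons: 'Wi-Fi', 'Bluetooth', 'Display'."


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call; always replies with a fixed action."""
    return "tap Wi-Fi"


def plan_next_action(task: str, screen_text: str, history: list[str]) -> Action:
    """Stage 2 (step-by-step planning): show the LLM only the current screen
    plus the action history, and ask for the single next action."""
    prompt = (
        f"Task: {task}\n"
        f"Previous actions: {history}\n"
        f"Current screen: {screen_text}\n"
        "Reply with the next action as 'tap <element>', "
        "'input <element> <text>', or 'finish'."
    )
    reply = call_llm(prompt)
    kind, _, rest = reply.partition(" ")
    return Action(kind=kind, target=rest)


def capture_screenshot() -> bytes:
    return b""  # stub: grab the current screen image from the device


def perform_action(action: Action) -> None:
    print(f"executing: {action.kind} {action.target}")  # stub: drive the device


def automate(task: str, max_steps: int = 20) -> None:
    """Run the perceive-plan-act loop until the LLM says 'finish'."""
    history: list[str] = []
    for _ in range(max_steps):
        screen_text = describe_screen(capture_screenshot())
        action = plan_next_action(task, screen_text, history)
        if action.kind == "finish":
            break
        perform_action(action)
        history.append(f"{action.kind} {action.target}")


if __name__ == "__main__":
    automate("Turn on Wi-Fi")
```

Presenting one interface per step, rather than the full task trace at once, keeps each prompt small and lets the planner ground its decision in exactly what is currently visible on screen.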
Authors: Yunpeng Song, Yiheng Bian, Yongtao Tang, Zhongmin Cai, Guiyu Ma