VisionTasker: Mobile Task Automation Using Vision Based UI Understanding and LLM Task Planning (2312.11190v2)

Abstract: Mobile task automation is an emerging field that leverages AI to streamline and optimize the execution of routine tasks on mobile devices, thereby enhancing efficiency and productivity. Traditional methods, such as Programming By Demonstration (PBD), are limited due to their dependence on predefined tasks and susceptibility to app updates. Recent advancements have utilized the view hierarchy to collect UI information and employed large language models (LLMs) to enhance task automation. However, view hierarchies raise accessibility concerns and often suffer from problems such as missing object descriptions or misaligned structures. This paper introduces VisionTasker, a two-stage framework combining vision-based UI understanding and LLM task planning, for mobile task automation in a step-by-step manner. First, VisionTasker converts a UI screenshot into natural language interpretations using a vision-based UI understanding approach, eliminating the need for view hierarchies. Second, it adopts a step-by-step task planning method, presenting one interface at a time to the LLM. The LLM then identifies relevant elements within the interface and determines the next action, enhancing accuracy and practicality. Extensive experiments show that VisionTasker outperforms previous methods, providing effective UI representations across four datasets. Additionally, in automating 147 real-world tasks on an Android smartphone, VisionTasker demonstrates advantages over humans on tasks that are unfamiliar to them, and shows significant improvements when integrated with the PBD mechanism. VisionTasker is open-source and available at https://github.com/AkimotoAyako/VisionTasker.
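The abstract describes a two-stage loop: a vision-based parser converts each screenshot into a natural-language description of the interface, and an LLM, shown one interface at a time together with the action history, selects the next action. Below is a minimal, hypothetical Python sketch of that loop; the helper names (`describe_screen`, `plan_next_action`, `call_llm`, `capture_screenshot`, `perform_action`) are illustrative placeholders, not the authors' actual API.

```python
# Illustrative sketch of a two-stage screenshot-to-action loop.
# All helpers are stubs; a real system would plug in a widget detector,
# an OCR model, an LLM API, and a device controller (e.g. adb).
from dataclasses import dataclass


@dataclass
class Action:
    kind: str          # e.g. "tap", "input", "scroll", "finish"
    target: str = ""   # natural-language description of the target element


def describe_screen(screenshot_png: bytes) -> str:
    """Stage 1 (vision-based UI understanding): detect widgets, run OCR,
    and render the screen as a natural-language interpretation. Stubbed."""
    return "A settings screen with buttons: 'Wi-Fi', 'Bluetooth', 'Display'."


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call; always replies with a fixed action."""
    return "tap Wi-Fi"


def plan_next_action(task: str, screen_text: str, history: list[str]) -> Action:
    """Stage 2 (step-by-step planning): show the LLM only the current screen
    plus the action history, and ask for the single next action."""
    prompt = (
        f"Task: {task}\n"
        f"Previous actions: {history}\n"
        f"Current screen: {screen_text}\n"
        "Reply with the next action as 'tap <element>', "
        "'input <element> <text>', or 'finish'."
    )
    reply = call_llm(prompt)
    kind, _, rest = reply.partition(" ")
    return Action(kind=kind, target=rest)


def capture_screenshot() -> bytes:
    return b""  # stub: grab the current screen image from the device


def perform_action(action: Action) -> None:
    print(f"executing: {action.kind} {action.target}")  # stub: drive the device


def automate(task: str, max_steps: int = 20) -> None:
    """Run the perceive-plan-act loop until the LLM says 'finish'."""
    history: list[str] = []
    for _ in range(max_steps):
        screen_text = describe_screen(capture_screenshot())
        action = plan_next_action(task, screen_text, history)
        if action.kind == "finish":
            break
        perform_action(action)
        history.append(f"{action.kind} {action.target}")


if __name__ == "__main__":
    automate("Turn on Wi-Fi")
```

Presenting one interface per step, rather than the full task trace at once, keeps each prompt small and lets the planner ground its decision in exactly what is currently visible on screen.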
Authors: Yunpeng Song, Yiheng Bian, Yongtao Tang, Zhongmin Cai, Guiyu Ma