
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception (2401.16158v2)

Published 29 Jan 2024 in cs.CL and cs.CV

Abstract: Mobile device agent based on Multimodal LLMs (MLLM) is becoming a popular application. In this paper, we introduce Mobile-Agent, an autonomous multi-modal mobile device agent. Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within the app's front-end interface. Based on the perceived vision context, it then autonomously plans and decomposes the complex operation task, and navigates the mobile Apps through operations step by step. Different from previous solutions that rely on XML files of Apps or mobile system metadata, Mobile-Agent allows for greater adaptability across diverse mobile operating environments in a vision-centric way, thereby eliminating the necessity for system-specific customizations. To assess the performance of Mobile-Agent, we introduced Mobile-Eval, a benchmark for evaluating mobile device operations. Based on Mobile-Eval, we conducted a comprehensive evaluation of Mobile-Agent. The experimental results indicate that Mobile-Agent achieved remarkable accuracy and completion rates. Even with challenging instructions, such as multi-app operations, Mobile-Agent can still complete the requirements. Code and model will be open-sourced at https://github.com/X-PLUG/MobileAgent.

Introduction

The pursuit of autonomy and adaptability in mobile device agents has gained momentum with the advent of Multimodal LLMs (MLLMs). However, integrating visual perception into these agents remains challenging. Even state-of-the-art MLLMs such as GPT-4V struggle to connect semantic understanding with precise visual localization, especially in mobile device operation contexts. Consequently, prior solutions have relied heavily on device-specific system files such as app XML layouts or mobile system metadata, which are often inaccessible in practice. This underlines a critical gap in realizing truly adaptable, system-agnostic mobile agents.

Mobile-Agent Architecture

To close this gap, the authors develop Mobile-Agent, a framework for an autonomous mobile device agent driven entirely by visual perception. Its architecture pivots on a perception module that pairs an object detection model with Optical Character Recognition (OCR), allowing the agent to parse an app's front-end interface from screenshots alone, with no access to backend system files. Coupled with the MLLM core, these models give the agent precise localization of both icons and text, enabling accurate interaction with mobile user interfaces.
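The summary above does not give the concrete interfaces of these models, so the following is a minimal sketch of a screenshot-only perception step under stated assumptions: easyocr serves as an illustrative OCR backend, and detect_icons stands in for an open-set icon detector (the paper's references include Grounding DINO). None of this is the authors' released code.

```python
# Minimal perception sketch (illustrative only, not the authors' implementation).
# Assumes easyocr for text localization; detect_icons is a placeholder for an
# open-set detector such as Grounding DINO.
from dataclasses import dataclass
from typing import List, Tuple

import easyocr  # assumed OCR backend

@dataclass
class UIElement:
    kind: str                       # "text" or "icon"
    label: str                      # recognized text or icon description
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in screenshot pixels

    @property
    def center(self) -> Tuple[int, int]:
        x1, y1, x2, y2 = self.box
        return (x1 + x2) // 2, (y1 + y2) // 2  # tap target for the executor

def detect_icons(screenshot_path: str, queries: List[str]):
    """Placeholder icon detector: returns (label, box) pairs; empty stub here."""
    return []

def perceive(screenshot_path: str, icon_queries: List[str]) -> List[UIElement]:
    """Localize text and icons in a single screenshot, with no system metadata."""
    elements: List[UIElement] = []

    # Text elements: OCR yields both the string and its bounding box.
    reader = easyocr.Reader(["en"], gpu=False)
    for box, text, conf in reader.readtext(screenshot_path):
        if conf < 0.5:
            continue
        xs = [int(p[0]) for p in box]
        ys = [int(p[1]) for p in box]
        elements.append(UIElement("text", text, (min(xs), min(ys), max(xs), max(ys))))

    # Icon elements: an open-set detector would be queried here.
    for label, box in detect_icons(screenshot_path, icon_queries):
        elements.append(UIElement("icon", label, box))
    return elements
```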

Mobile-Agent's operation space covers fundamental actions, including opening apps and clicking textual or icon elements, along with control operations such as navigating back, exiting, and self-termination upon task completion. The framework pairs this with a self-planning step that interprets the current screenshot together with the user's instruction and the operation history to decide the next action. A complementary self-reflection mechanism lets the agent review its actions, correct erroneous or invalid steps, and carry complex multi-step tasks through to completion.
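To make the planning and reflection cycle concrete, here is a compact sketch of how such a loop could be wired together; the Action names, the query_mllm call, and the stub device hooks are assumptions for illustration and do not reproduce the released Mobile-Agent code.

```python
# Illustrative agent loop (assumed structure, not the released Mobile-Agent code).
# capture_screenshot, query_mllm, and execute are stand-in stubs for the real
# device hook, the MLLM call, and the action executor respectively.
from enum import Enum
from typing import List, Tuple

class Action(Enum):
    OPEN_APP = "open_app"
    CLICK_TEXT = "click_text"
    CLICK_ICON = "click_icon"
    BACK = "back"
    EXIT = "exit"
    STOP = "stop"  # self-termination once the instruction is judged complete

def capture_screenshot() -> str:
    return "screen.png"  # stub: would pull a screenshot from the device

def query_mllm(instruction: str, history: List[Tuple[str, str]]) -> Tuple[Action, str]:
    return Action.STOP, ""  # stub: the MLLM would return the next action + argument

def execute(action: Action, argument: str) -> str:
    return "ok"  # stub: would tap/type on the device and report the outcome

def run_agent(instruction: str, max_steps: int = 20) -> List[Tuple[str, str]]:
    """Plan step by step, keeping a history the planner can reflect on."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        _screenshot = capture_screenshot()

        # Self-planning: decide the next operation from the instruction,
        # the current screen, and what has been done so far.
        action, argument = query_mllm(instruction, history)
        if action == Action.STOP:
            break

        # Self-reflection: record whether the operation had the intended effect,
        # so an invalid step can be retried differently on the next iteration.
        outcome = execute(action, argument)
        history.append((f"{action.value}({argument})", outcome))
    return history
```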

Performance Evaluation

Mobile-Agent was evaluated with Mobile-Eval, a benchmark built for this purpose that spans 10 popular apps and instructions of varying complexity. Across the benchmark, the agent achieved high success and completion rates and operated with precision even on multi-step and multi-app instructions. Its task execution approached human-level reference performance, underscoring its potential for practical mobile device interaction.
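The summary does not spell out Mobile-Eval's exact scoring, so the snippet below only illustrates how success and completion rates of the kind reported might be aggregated over per-task results; the field names and formulas are assumptions, not the benchmark's definitions.

```python
# Toy aggregation of benchmark results (illustrative; field names are assumed,
# not Mobile-Eval's exact scoring definitions).
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TaskResult:
    required_steps: int   # steps a human reference trajectory needs
    completed_steps: int  # correct steps the agent actually executed
    succeeded: bool       # whether the instruction was fully satisfied

def summarize(results: List[TaskResult]) -> Dict[str, float]:
    n = len(results)
    return {
        # fraction of instructions finished end to end
        "success_rate": sum(r.succeeded for r in results) / n,
        # how much of each task's required operation sequence was completed
        "completion_rate": sum(
            min(r.completed_steps, r.required_steps) / r.required_steps
            for r in results
        ) / n,
    }

if __name__ == "__main__":
    demo = [TaskResult(5, 5, True), TaskResult(8, 6, False), TaskResult(3, 3, True)]
    print(summarize(demo))
```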

Discussion and Future Directions

By situating Mobile-Agent within the larger context of LLM-based agents, this work delineates a leap forward in enabling LLMs to handle mobile devices adeptly. Unlike previous agents that augment GPT-4V's capability through device system metadata, Mobile-Agent maintains a pure vision-centric approach, ensuring greater portability and efficiency across different operating systems and environments.

The autonomous operational dexterity of Mobile-Agent positions it as a significant contribution to the domain, with Mobile-Eval serving as a testament to the feasibility of such autonomously guided agents in complex mobile navigation and task execution. The open sourcing of Mobile-Agent’s code and model presents an opportunity for community-wide enhancement and expansion, setting the stage for further innovation and application in mobile agent technology.

Authors (8)
  1. Junyang Wang
  2. Haiyang Xu
  3. Jiabo Ye
  4. Ming Yan
  5. Weizhou Shen
  6. Ji Zhang
  7. Fei Huang
  8. Jitao Sang