
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception (2401.16158v2)

Published 29 Jan 2024 in cs.CL and cs.CV

Abstract: Mobile device agent based on Multimodal LLMs (MLLM) is becoming a popular application. In this paper, we introduce Mobile-Agent, an autonomous multi-modal mobile device agent. Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within the app's front-end interface. Based on the perceived vision context, it then autonomously plans and decomposes the complex operation task, and navigates the mobile Apps through operations step by step. Different from previous solutions that rely on XML files of Apps or mobile system metadata, Mobile-Agent allows for greater adaptability across diverse mobile operating environments in a vision-centric way, thereby eliminating the necessity for system-specific customizations. To assess the performance of Mobile-Agent, we introduced Mobile-Eval, a benchmark for evaluating mobile device operations. Based on Mobile-Eval, we conducted a comprehensive evaluation of Mobile-Agent. The experimental results indicate that Mobile-Agent achieved remarkable accuracy and completion rates. Even with challenging instructions, such as multi-app operations, Mobile-Agent can still complete the requirements. Code and model will be open-sourced at https://github.com/X-PLUG/MobileAgent.

Introduction

The pursuit of autonomy and adaptability in mobile device agents has gained momentum with the advent of Multimodal LLMs (MLLMs). However, integrating visual perception into these agents remains challenging. Even state-of-the-art MLLMs such as GPT-4V struggle to connect semantic understanding with precise visual localization, especially in mobile device operation contexts. Consequently, prior solutions have relied heavily on device-specific system files such as app XML layouts or mobile system metadata, which are often inaccessible in practice. This underlines a critical gap in realizing truly adaptable, system-agnostic mobile agents.

Mobile-Agent Architecture

To close this gap, the authors develop Mobile-Agent, a framework for an autonomous mobile device agent driven entirely by visual perception. Its architecture pivots on a perception module that pairs an object detection model with Optical Character Recognition (OCR), allowing the agent to parse an app's front-end interface from screenshots alone, with no access to backend system files. Coupled with the MLLM core, these models give the agent precise localization of both icons and text, enabling accurate interaction with mobile user interfaces.
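The summary above does not give the concrete interfaces of these models, so the following is a minimal sketch of a screenshot-only perception step under stated assumptions: easyocr serves as an illustrative OCR backend, and detect_icons stands in for an open-set icon detector (the paper's references include Grounding DINO). None of this is the authors' released code.

```python
# Minimal perception sketch (illustrative only, not the authors' implementation).
# Assumes easyocr for text localization; detect_icons is a placeholder for an
# open-set detector such as Grounding DINO.
from dataclasses import dataclass
from typing import List, Tuple

import easyocr  # assumed OCR backend

@dataclass
class UIElement:
    kind: str                       # "text" or "icon"
    label: str                      # recognized text or icon description
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in screenshot pixels

    @property
    def center(self) -> Tuple[int, int]:
        x1, y1, x2, y2 = self.box
        return (x1 + x2) // 2, (y1 + y2) // 2  # tap target for the executor

def detect_icons(screenshot_path: str, queries: List[str]):
    """Placeholder icon detector: returns (label, box) pairs; empty stub here."""
    return []

def perceive(screenshot_path: str, icon_queries: List[str]) -> List[UIElement]:
    """Localize text and icons in a single screenshot, with no system metadata."""
    elements: List[UIElement] = []

    # Text elements: OCR yields both the string and its bounding box.
    reader = easyocr.Reader(["en"], gpu=False)
    for box, text, conf in reader.readtext(screenshot_path):
        if conf < 0.5:
            continue
        xs = [int(p[0]) for p in box]
        ys = [int(p[1]) for p in box]
        elements.append(UIElement("text", text, (min(xs), min(ys), max(xs), max(ys))))

    # Icon elements: an open-set detector would be queried here.
    for label, box in detect_icons(screenshot_path, icon_queries):
        elements.append(UIElement("icon", label, box))
    return elements
```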

Mobile-Agent's operation space covers fundamental actions, including opening apps and clicking textual or icon elements, along with control operations such as navigating back, exiting, and self-termination upon task completion. The framework pairs this with a self-planning step that interprets the current screenshot together with the user's instruction and the operation history to decide the next action. A complementary self-reflection mechanism lets the agent review its actions, correct erroneous or invalid steps, and carry complex multi-step tasks through to completion.
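To make the planning and reflection cycle concrete, here is a compact sketch of how such a loop could be wired together; the Action names, the query_mllm call, and the stub device hooks are assumptions for illustration and do not reproduce the released Mobile-Agent code.

```python
# Illustrative agent loop (assumed structure, not the released Mobile-Agent code).
# capture_screenshot, query_mllm, and execute are stand-in stubs for the real
# device hook, the MLLM call, and the action executor respectively.
from enum import Enum
from typing import List, Tuple

class Action(Enum):
    OPEN_APP = "open_app"
    CLICK_TEXT = "click_text"
    CLICK_ICON = "click_icon"
    BACK = "back"
    EXIT = "exit"
    STOP = "stop"  # self-termination once the instruction is judged complete

def capture_screenshot() -> str:
    return "screen.png"  # stub: would pull a screenshot from the device

def query_mllm(instruction: str, history: List[Tuple[str, str]]) -> Tuple[Action, str]:
    return Action.STOP, ""  # stub: the MLLM would return the next action + argument

def execute(action: Action, argument: str) -> str:
    return "ok"  # stub: would tap/type on the device and report the outcome

def run_agent(instruction: str, max_steps: int = 20) -> List[Tuple[str, str]]:
    """Plan step by step, keeping a history the planner can reflect on."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        _screenshot = capture_screenshot()

        # Self-planning: decide the next operation from the instruction,
        # the current screen, and what has been done so far.
        action, argument = query_mllm(instruction, history)
        if action == Action.STOP:
            break

        # Self-reflection: record whether the operation had the intended effect,
        # so an invalid step can be retried differently on the next iteration.
        outcome = execute(action, argument)
        history.append((f"{action.value}({argument})", outcome))
    return history
```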

Performance Evaluation

Mobile-Agent was evaluated with Mobile-Eval, a benchmark built for this purpose that spans 10 popular apps and instructions of varying complexity. Across the benchmark, the agent achieved high success and completion rates and operated with precision even on multi-step and multi-app instructions. Its task execution approached human-level reference performance, underscoring its potential for practical mobile device interaction.
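The summary does not spell out Mobile-Eval's exact scoring, so the snippet below only illustrates how success and completion rates of the kind reported might be aggregated over per-task results; the field names and formulas are assumptions, not the benchmark's definitions.

```python
# Toy aggregation of benchmark results (illustrative; field names are assumed,
# not Mobile-Eval's exact scoring definitions).
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TaskResult:
    required_steps: int   # steps a human reference trajectory needs
    completed_steps: int  # correct steps the agent actually executed
    succeeded: bool       # whether the instruction was fully satisfied

def summarize(results: List[TaskResult]) -> Dict[str, float]:
    n = len(results)
    return {
        # fraction of instructions finished end to end
        "success_rate": sum(r.succeeded for r in results) / n,
        # how much of each task's required operation sequence was completed
        "completion_rate": sum(
            min(r.completed_steps, r.required_steps) / r.required_steps
            for r in results
        ) / n,
    }

if __name__ == "__main__":
    demo = [TaskResult(5, 5, True), TaskResult(8, 6, False), TaskResult(3, 3, True)]
    print(summarize(demo))
```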

Discussion and Future Directions

By situating Mobile-Agent within the larger context of LLM-based agents, this work delineates a leap forward in enabling LLMs to handle mobile devices adeptly. Unlike previous agents that augment GPT-4V's capability through device system metadata, Mobile-Agent maintains a pure vision-centric approach, ensuring greater portability and efficiency across different operating systems and environments.

The autonomous operational dexterity of Mobile-Agent positions it as a significant contribution to the domain, with Mobile-Eval serving as a testament to the feasibility of such autonomously guided agents in complex mobile navigation and task execution. The open sourcing of Mobile-Agent’s code and model presents an opportunity for community-wide enhancement and expansion, setting the stage for further innovation and application in mobile agent technology.

Authors (8)
  1. Junyang Wang
  2. Haiyang Xu
  3. Jiabo Ye
  4. Ming Yan
  5. Weizhou Shen
  6. Ji Zhang
  7. Fei Huang
  8. Jitao Sang