The paper introduces AppAgentX, an evolutionary framework designed to enhance the efficiency and intelligence of GUI agents interacting with smartphone interfaces. The core challenge addressed is the inefficiency of LLM-based agents, which rely on step-by-step reasoning even for routine tasks, in contrast to traditional rule-based systems, which are efficient but rigid. AppAgentX aims to bridge this gap by enabling agents to learn from past interactions, evolving high-level actions that streamline repetitive operations.
The key components of AppAgentX include a memory mechanism, an evolutionary mechanism, and an action execution strategy.
The memory mechanism records the agent's operational history as a series of page transitions, where each UI page is represented as a "page node." These nodes contain attributes such as:
- Page Description: Textual description of the UI page.
- Element List: JSON list of detected elements with screen position, OCR results, etc., obtained using OmniParser (Lu et al., 2024).
- Other Properties: Screenshots, IDs, timestamps, etc.
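For concreteness, a page node could be represented roughly as follows; this is a minimal sketch, and the class and field names are illustrative assumptions rather than the paper's exact schema.

```python
from dataclasses import dataclass, field


@dataclass
class PageNode:
    """Illustrative page node in the agent's memory graph (field names assumed)."""
    page_id: str                                          # unique identifier for the UI page
    description: str                                      # LLM-generated textual description
    elements: list[dict] = field(default_factory=list)    # parsed elements (position, OCR text, ...)
    screenshot_path: str = ""                             # stored screenshot of the page
    timestamp: float = 0.0                                # when the page was observed
```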
"Element nodes" correspond to specific UI elements and include:
- Element Description: Textual description of the element's functionality.
- Element Visual Embeddings: Identifiers of element screenshots in a vector database, with visual features extracted using a pre-trained ResNet-50 model (He et al., 2015); a minimal extraction sketch follows this list.
- Interaction Details: The supported basic actions (e.g., tapping), their arguments, and other interaction metadata.
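The visual embeddings themselves can be produced with a standard feature extractor. Below is a minimal sketch, assuming the input is a cropped element screenshot, that uses torchvision's pre-trained ResNet-50 with the classification head removed to obtain a 2048-dimensional vector suitable for storage in the vector database.

```python
import torch
from PIL import Image
from torchvision import models
from torchvision.models import ResNet50_Weights

# Pre-trained ResNet-50 with the final classification layer removed,
# leaving the global-average-pooled 2048-d feature vector as output.
weights = ResNet50_Weights.DEFAULT
backbone = torch.nn.Sequential(*list(models.resnet50(weights=weights).children())[:-1])
backbone.eval()
preprocess = weights.transforms()  # standard ImageNet resize/crop/normalize


def element_embedding(crop: Image.Image) -> torch.Tensor:
    """Embed a UI-element screenshot crop into a 2048-d vector."""
    x = preprocess(crop).unsqueeze(0)      # shape (1, 3, 224, 224)
    with torch.no_grad():
        feat = backbone(x)                 # shape (1, 2048, 1, 1)
    return feat.flatten(1).squeeze(0)      # shape (2048,)
```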
The agent's trajectory is decomposed into overlapping triples (source page, action, target page), which are processed by the LLM to generate functional descriptions for page and element nodes. Overlapping descriptions from different triples are merged by the LLM to create unified descriptions for each page node.
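As a small illustration of this decomposition, a trajectory of alternating pages and actions can be split into overlapping triples as follows; the list-based representation is an assumption for illustration.

```python
def to_triples(pages: list, actions: list) -> list[tuple]:
    """Split a trajectory p0 -a0-> p1 -a1-> p2 ... into overlapping
    (source_page, action, target_page) triples for LLM summarization."""
    assert len(pages) == len(actions) + 1, "trajectory must alternate page/action/page"
    return [(pages[i], actions[i], pages[i + 1]) for i in range(len(actions))]


# Example: a 3-step trajectory yields 3 triples; each intermediate page appears
# once as a target and once as a source, which is why the triples overlap.
triples = to_triples(["home", "search", "results", "detail"],
                     ["tap_search", "type_query", "tap_result"])
```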
The evolutionary mechanism improves accuracy and efficiency by identifying repetitive patterns in action sequences. Shortcut nodes, representing high-level actions, replace inefficient low-level operations. The LLM determines whether a task contains optimizable patterns and generates descriptions for shortcut nodes, specifying invocation scenarios and the action sequences they replace. A high-level action $\tilde{a}$ abstracts a sequence of low-level actions from the basic action set $\mathcal{A}_{\mathrm{basic}}$. The expanded action space is defined as:
$\mathcal{A}_{\mathrm{evolved}} = \mathcal{A}_{\mathrm{basic}} \cup \{\tilde{a}\}$
- $\mathcal{A}_{\mathrm{basic}}$ represents the set of atomic actions available to the agent.
- $\mathcal{A}_{\mathrm{evolved}}$ represents the enriched action space that includes both the basic, low-level actions and the newly composed high-level actions.
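The sketch below illustrates this composition: a shortcut bundles an ordered sequence of low-level actions together with an LLM-generated description of when to invoke it, and is added to the basic action set to form the evolved action space. The names and structures here are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shortcut:
    """A high-level action composed of an ordered sequence of low-level actions."""
    name: str
    description: str          # invocation scenario, as generated by the LLM
    steps: tuple[str, ...]    # the low-level action sequence it replaces


BASIC_ACTIONS = {"tap", "long_press", "swipe", "type_text", "back"}

# A hypothetical shortcut replacing a repetitive three-step pattern.
send_message = Shortcut(
    name="send_message",
    description="Open the compose view, enter the text, and press send.",
    steps=("tap", "type_text", "tap"),
)

# A_evolved = A_basic ∪ {ã}
EVOLVED_ACTIONS = BASIC_ACTIONS | {send_message.name}
```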
The action execution process involves matching parsed elements to stored element nodes via visual embeddings. If a matched element node is associated with a shortcut node, the LLM determines whether the high-level action can be executed, based on the shortcut node's description and the task context. An action execution template is then generated, including the sequence of low-level actions and their function arguments. If no high-level action is applicable, the agent defaults to the basic action space.
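A minimal sketch of this execution step, assuming precomputed embeddings and hypothetical helper names: parsed elements are matched to stored element nodes by cosine similarity, and a matched node linked to a shortcut triggers an LLM check before the high-level action is used; otherwise the agent falls back to basic actions.

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def match_element(query_emb, stored, threshold=0.85):
    """Return the stored element node id whose embedding best matches the query,
    or None if nothing is similar enough.  `stored` maps node_id -> embedding."""
    best_id, best_score = None, -1.0
    for node_id, emb in stored.items():
        score = cosine_sim(query_emb, emb)
        if score > best_score:
            best_id, best_score = node_id, score
    return best_id if best_score >= threshold else None


def choose_action(query_emb, stored, shortcuts, llm_decide):
    """If the matched element is linked to a shortcut and the LLM approves,
    use the high-level action; otherwise fall back to basic actions."""
    node_id = match_element(query_emb, stored)
    if node_id is not None and node_id in shortcuts:
        shortcut = shortcuts[node_id]
        if llm_decide(shortcut.description):   # LLM checks the task context
            return ("shortcut", shortcut.steps)
    return ("basic", None)
```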
Experiments were conducted on several benchmarks including AppAgent, DroidTask, and Mobile-Bench. Evaluation metrics included average steps per task, success rate (SR), task time, step time, and LLM token consumption. The implementation utilized GPT-4o as the LLM, LangGraph as the agent platform, Neo4j and Pinecone for memory mechanisms, and ResNet-50 for feature matching.
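For reference, a heavily simplified sketch of persisting an element node with the stack above: the graph structure goes to Neo4j and the visual embedding to Pinecone. The connection details, index name, and Cypher schema are assumptions, not the paper's actual configuration.

```python
from neo4j import GraphDatabase
from pinecone import Pinecone

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
index = Pinecone(api_key="YOUR_API_KEY").Index("element-embeddings")  # assumed index name


def store_element(node_id: str, description: str, embedding: list[float]) -> None:
    # Graph side: create or update the element node and its description.
    with driver.session() as session:
        session.run(
            "MERGE (e:Element {id: $id}) SET e.description = $desc",
            id=node_id, desc=description,
        )
    # Vector side: upsert the visual embedding keyed by the same node id.
    index.upsert(vectors=[{"id": node_id, "values": embedding}])
```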
Results on the AppAgent benchmark showed that incorporating the memory mechanism significantly improves the task SR. The baseline model without memory achieved an SR of only 16.9%, while the chain-structured memory increased the SR to 70.8%. The evolutionary mechanism further reduced the average steps from 9.1 to 5.7, step execution time from 23 to 16 seconds, and token consumption from 9.26k to 4.94k.
Analysis of task difficulty categorized tasks into short, medium, and long based on ground-truth operation steps. AppAgentX demonstrated a clear advantage in task execution time as task complexity increased, with statistically significant improvements (p < 0.05) verified by one-tailed paired t-tests.
Comparisons with other frameworks on Mobile-Bench and DroidTask showed that AppAgentX outperforms AppAgent in both task execution time and accuracy. AppAgentX also outperformed Mobile-Agent-v2 in terms of efficiency across different foundational LLMs (GPT-4, Claude 3.5 Sonnet, and Gemini 1.5 Pro).
Qualitative analysis demonstrated successful completion of tasks across different applications (Gmail and Apple Music) using high-level actions, reducing the reliance on per-step reasoning by the LLM.
The paper identifies limitations in screen content recognition and processing, particularly the detection of interactive elements. Future research should focus on integrating large-scale pre-trained visual models and joint image-text modeling to enhance the adaptability and generalization of GUI Agents.