
AppAgent v2: Advanced Agent for Flexible Mobile Interactions (2408.11824v3)

Published 5 Aug 2024 in cs.HC and cs.AI

Abstract: With the advancement of Multimodal LLMs (MLLMs), LLM-driven visual agents are increasingly impacting software interfaces, particularly graphical user interfaces. This work introduces a novel LLM-based multimodal agent framework for mobile devices that navigates the device by emulating human-like interactions. The agent constructs a flexible action space that enhances adaptability across applications and includes parser, text, and vision descriptions. It operates in two main phases: exploration and deployment. During the exploration phase, the functionalities of user interface elements are documented, through either agent-driven or manual exploration, in a customized structured knowledge base. In the deployment phase, retrieval-augmented generation (RAG) enables efficient retrieval from and updates to this knowledge base, allowing the agent to perform tasks effectively and accurately. These tasks include complex, multi-step operations across various applications, demonstrating the framework's adaptability and precision in handling customized task workflows. Experimental results across various benchmarks demonstrate the framework's superior performance, confirming its effectiveness in real-world scenarios. Our code will be open-sourced soon.
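The two phases described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `KnowledgeBase`, `ElementDoc`, `record`, and `retrieve` are hypothetical, and a simple bag-of-words cosine similarity stands in for whatever retriever the framework actually uses. It only shows the general shape of the idea: exploration writes element descriptions into a store, and deployment retrieves the most relevant ones for a task and can update them.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def _tokens(text: str) -> list[str]:
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class ElementDoc:
    """Hypothetical record describing what one UI element does."""
    element_id: str
    description: str


@dataclass
class KnowledgeBase:
    """Hypothetical store filled during exploration, queried during deployment."""
    docs: dict[str, ElementDoc] = field(default_factory=dict)

    def record(self, element_id: str, description: str) -> None:
        # Exploration phase (or a later update): add or overwrite an entry.
        self.docs[element_id] = ElementDoc(element_id, description)

    def retrieve(self, query: str, k: int = 1) -> list[ElementDoc]:
        # Deployment phase: rank entries by similarity to the task text.
        q = Counter(_tokens(query))
        scored = [
            (_cosine(q, Counter(_tokens(d.description))), d)
            for d in self.docs.values()
        ]
        scored = [pair for pair in scored if pair[0] > 0.0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:k]]


# Exploration: document two (made-up) UI elements.
kb = KnowledgeBase()
kb.record("btn_send", "Sends the typed message")
kb.record("btn_attach", "Opens the picker to attach a photo")

# Deployment: look up the element most relevant to the current step.
best = kb.retrieve("attach photo", k=1)
```

In a real agent the retrieved descriptions would be placed in the model's prompt alongside the screenshot and parsed UI data; here the retrieval step is shown in isolation.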

Authors (8)
  1. Yanda Li (11 papers)
  2. Chi Zhang (566 papers)
  3. Wanqi Yang (16 papers)
  4. Bin Fu (74 papers)
  5. Pei Cheng (11 papers)
  6. Xin Chen (456 papers)
  7. Ling Chen (144 papers)
  8. Yunchao Wei (151 papers)
Citations (2)