AppAgent: Multimodal Agents as Smartphone Users (2312.13771v2)

Published 21 Dec 2023 in cs.CV

Abstract: Recent advancements in LLMs have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps. Central to our agent's functionality is its innovative learning method. The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent refers to for executing complex tasks across different applications. To demonstrate the practicality of our agent, we conducted extensive testing over 50 tasks in 10 different applications, including social media, email, maps, shopping, and sophisticated image editing tools. The results affirm our agent's proficiency in handling a diverse array of high-level tasks.

Introduction

The integration of AI into daily life has taken a new turn with intelligent agents that can operate smartphone applications much as humans do. Leveraging advances in LLMs, which have greatly expanded AI's ability to understand and generate human language, the paper presents a framework for a multimodal agent that operates directly through a smartphone's graphical user interface (GUI), engaging in typical user actions such as tapping and swiping.
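
To make the interaction model concrete, the sketch below shows what such a simplified, GUI-level action space can look like when driven through Android's `adb shell input` commands; the function names and the particular set of primitives are illustrative assumptions, not the paper's exact API.

```python
# Illustrative sketch (not the paper's exact API): a simplified action space
# that drives a connected Android device through `adb shell input` commands.
import subprocess

def adb_input(*args: str) -> None:
    """Send a single `adb shell input ...` command to the connected device."""
    subprocess.run(["adb", "shell", "input", *args], check=True)

def tap(x: int, y: int) -> None:
    """Tap the screen at pixel coordinates (x, y)."""
    adb_input("tap", str(x), str(y))

def long_press(x: int, y: int, duration_ms: int = 1000) -> None:
    """A long press is a swipe that starts and ends at the same point."""
    adb_input("swipe", str(x), str(y), str(x), str(y), str(duration_ms))

def swipe(x1: int, y1: int, x2: int, y2: int, duration_ms: int = 300) -> None:
    """Swipe from (x1, y1) to (x2, y2)."""
    adb_input("swipe", str(x1), str(y1), str(x2), str(y2), str(duration_ms))

def type_text(text: str) -> None:
    """Type text into the focused field (spaces escaped as %s for adb)."""
    adb_input("text", text.replace(" ", "%s"))
```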

Methodological Insights

The framework comprises two phases: exploration and deployment. During the exploration phase, the agent learns app functionality either autonomously, through trial and error, or by observing human demonstrations. Information from these interactions is distilled into documents that enrich the agent's knowledge base. In autonomous exploration, the agent focuses on elements relevant to operating the app and ignores unrelated content such as advertisements.
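
The sketch below illustrates one way such exploration-time documentation could be accumulated: try an action on a UI element, ask an LLM to describe the observed effect, and merge the result into that element's document. The helper callables and the JSON document layout are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch: accumulate a per-element knowledge base during
# autonomous exploration. The screenshot/act/describe callables are
# hypothetical stand-ins for screen capture, the action space, and an
# LLM call; the JSON document layout is an assumption.
import json
from pathlib import Path
from typing import Callable

def explore_element(
    element_id: str,
    action: str,
    screenshot: Callable[[], bytes],                    # capture current screen
    act: Callable[[str, str], None],                    # apply `action` to the element
    describe: Callable[[bytes, bytes, str, str], str],  # LLM: summarize the effect
    doc_dir: Path = Path("app_docs"),
) -> None:
    """Try one action on one UI element and record the observed effect."""
    doc_dir.mkdir(exist_ok=True)

    before = screenshot()        # screen state before acting
    act(element_id, action)      # e.g. tap / long_press / swipe
    after = screenshot()         # screen state after acting

    # Ask the LLM to compare before/after and describe what the action did.
    description = describe(before, after, element_id, action)

    # Merge the observation into that element's document in the knowledge base.
    doc_path = doc_dir / f"{element_id}.json"
    doc = json.loads(doc_path.read_text()) if doc_path.exists() else {}
    doc[action] = description
    doc_path.write_text(json.dumps(doc, indent=2))
```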

In the deployment phase, the agent draws on this knowledge to perform complex tasks. It interprets screenshots of the current app state and consults its knowledge base to make informed decisions and execute appropriate actions. The agent works through tasks step by step: it assesses the current screen, reasons about possible actions, takes the necessary step, and summarizes what it has done so that the information is retained in memory.
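
A compact sketch of how that observe-think-act-summarize loop might be organized is shown below; the callables, the decision format, and the step cap are illustrative assumptions rather than the paper's actual prompts or code.

```python
# Illustrative sketch of the deployment loop: observe, think, act, summarize.
# All callables are hypothetical stand-ins; the structure mirrors the
# step-by-step process described above, not the paper's exact prompts.
from typing import Callable

def run_task(
    task: str,
    screenshot: Callable[[], bytes],
    lookup_docs: Callable[[bytes], str],                 # fetch docs for visible elements
    llm_decide: Callable[[str, bytes, str, str], dict],  # returns {"action", "args", "done"}
    execute: Callable[[dict], None],
    llm_summarize: Callable[[str, dict], str],           # compress history for memory
    max_steps: int = 20,
) -> None:
    memory = ""  # running summary of what has been done so far
    for _ in range(max_steps):
        screen = screenshot()                              # 1. observe the current app state
        docs = lookup_docs(screen)                         # 2. pull relevant element docs
        decision = llm_decide(task, screen, docs, memory)  # 3. think: choose the next action
        if decision.get("done"):
            break
        execute(decision)                                  # 4. act on the device
        memory = llm_summarize(memory, decision)           # 5. summarize for memory retention
```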

Experimental Evaluation

The efficacy of the agent was tested on 50 tasks across 10 different smartphone applications, demonstrating its proficiency in diverse applications such as social media, email, and image editing. Design choices within the framework were assessed through specific metrics like success rate, reward scores based on proximity to task completion, and the average number of steps to complete tasks. The findings showed that the custom-developed action space and the documents generated from observing human demonstrations greatly enhanced the agent's performance compared to the raw action API.
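
As a rough illustration of how such metrics aggregate over per-task records (the record fields here are assumed for the sketch, not taken from the paper's evaluation code):

```python
# Illustrative sketch: aggregate evaluation metrics from per-task records.
from dataclasses import dataclass

@dataclass
class TaskRecord:
    success: bool   # did the agent complete the task?
    reward: float   # score reflecting proximity to task completion
    steps: int      # number of actions taken

def summarize(records: list[TaskRecord]) -> dict[str, float]:
    n = len(records)
    return {
        "success_rate": sum(r.success for r in records) / n,
        "avg_reward": sum(r.reward for r in records) / n,
        "avg_steps": sum(r.steps for r in records) / n,
    }

# Example with three hypothetical task outcomes.
print(summarize([TaskRecord(True, 3.0, 5), TaskRecord(False, 1.0, 8), TaskRecord(True, 3.0, 4)]))
```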

Vision Capabilities and Case Study

The agent's capability to interpret and manipulate visual elements was examined through a case study involving Adobe Lightroom, an image-editing application. The tasks involved fixing images with visual issues such as low contrast or overexposure. User studies ranked the editing results, and methods that used generated documents, especially those derived from observing human demonstrations, yielded results comparable to manually crafted documentation.

Conclusion and Future Directions

This multimodal agent framework marks a significant step toward AI that interacts with smartphone applications in a human-like and accessible way, without requiring back-end system access. The agent's learning method, combining autonomous interaction with observation of human demonstrations, enables rapid adaptation to new apps. Supporting more advanced controls such as multi-touch gestures is a potential direction for future work to address current limitations and broaden the agent's applicability.

Authors (8)
  1. Zhao Yang (75 papers)
  2. Jiaxuan Liu (11 papers)
  3. Yucheng Han (9 papers)
  4. Xin Chen (456 papers)
  5. Zebiao Huang (2 papers)
  6. Bin Fu (74 papers)
  7. Gang Yu (114 papers)
  8. Chi Zhang (566 papers)
Citations (109)