Foundations and Recent Trends in Multimodal Mobile Agents: A Survey (2411.02006v1)

Published 4 Nov 2024 in cs.AI

Abstract: Mobile agents are essential for automating tasks in complex and dynamic mobile environments. As foundation models evolve, the demands for agents that can adapt in real-time and process multimodal data have grown. This survey provides a comprehensive review of mobile agent technologies, focusing on recent advancements that enhance real-time adaptability and multimodal interaction. Recent evaluation benchmarks have been developed to better capture the static and interactive environments of mobile tasks, offering more accurate assessments of agents' performance. We then categorize these advancements into two main approaches: prompt-based methods, which utilize LLMs for instruction-based task execution, and training-based methods, which fine-tune multimodal models for mobile-specific applications. Additionally, we explore complementary technologies that augment agent performance. By discussing key challenges and outlining future research directions, this survey offers valuable insights for advancing mobile agent technologies. A comprehensive resource list is available at https://github.com/aialt/awesome-mobile-agents

Overview of "Foundations and Recent Trends in Multimodal Mobile Agents: A Survey"

The evolving landscape of multimodal mobile agents, as presented in this comprehensive survey, reflects pivotal advancements in mobile agent technologies. The survey covers a breadth of foundational models and recent trends, underscoring the increased demand for agents that exhibit real-time adaptability and efficient processing of multimodal data. This essay provides a detailed summary and analysis of the research findings, highlighting critical aspects and suggesting potential directions for future inquiry.

Key Technological Advancements

The field of mobile agent research has witnessed transformative developments, primarily categorized into prompt-based and training-based methods. Prompt-based methods employ LLMs for instruction-based task execution: representative systems such as AppAgent, and benchmarks such as OmniAct, highlight the capabilities of LLMs like GPT-4 in executing complex tasks through instruction prompting and chain-of-thought (CoT) reasoning. However, scalability and robustness continue to pose challenges.
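To make the prompt-based paradigm concrete, the sketch below shows a single agent step: the visible UI state and action history are serialized into a chain-of-thought prompt, and an LLM is asked to emit one action. The UIElement structure, prompt format, and action vocabulary (CLICK, TYPE) are illustrative assumptions, not the actual interfaces of OmniAct or AppAgent.

```python
# Minimal sketch of one prompt-based agent step (names and prompt format are illustrative).
from dataclasses import dataclass


@dataclass
class UIElement:
    element_id: str
    text: str
    bounds: tuple  # (left, top, right, bottom) in screen pixels


def build_prompt(goal: str, elements: list[UIElement], history: list[str]) -> str:
    """Serialize the task goal, visible UI elements, and prior actions into a CoT prompt."""
    ui_desc = "\n".join(f"[{e.element_id}] {e.text} at {e.bounds}" for e in elements)
    past = "\n".join(history) or "(none)"
    return (
        f"Goal: {goal}\n"
        f"Visible UI elements:\n{ui_desc}\n"
        f"Previous actions:\n{past}\n"
        "Think step by step, then answer with exactly one action on the last line, "
        "e.g. CLICK(<element_id>) or TYPE(<element_id>, \"<text>\")."
    )


def next_action(llm, goal: str, elements: list[UIElement], history: list[str]) -> str:
    """Query the LLM (any text-completion callable) and return the chosen action string."""
    response = llm(build_prompt(goal, elements, history))
    # Keep only the final line, where the prompt asks the model to place the action.
    return response.strip().splitlines()[-1]
```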

Conversely, training-based methods focus on the fine-tuning of multimodal models tailored for mobile-specific applications. Examples include LLaVA and its counterparts, which integrate visual and textual inputs to enhance task execution, especially in interface navigation. These paradigms illustrate a significant shift from static rule-based systems to dynamic, adaptable frameworks.
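As a rough illustration of what training-based pipelines consume, the snippet below packs one (screenshot, instruction) pair and its target action into a conversation-style supervised fine-tuning record of the kind commonly used for vision-language instruction tuning. The field names and the action string are assumptions, not the data format of LLaVA or any specific surveyed model.

```python
# Sketch of a supervised fine-tuning sample for a LLaVA-style mobile agent
# (field names and the action vocabulary are assumptions, not the survey's spec).
import json


def make_sft_sample(screenshot_path: str, instruction: str, action: str) -> dict:
    """Pack one (screenshot, instruction) -> action pair in a conversation format
    commonly used for vision-language instruction tuning."""
    return {
        "image": screenshot_path,
        "conversations": [
            {"from": "human",
             "value": f"<image>\nInstruction: {instruction}\nWhat is the next UI action?"},
            {"from": "gpt", "value": action},
        ],
    }


if __name__ == "__main__":
    sample = make_sft_sample(
        "screens/settings_0001.png",        # hypothetical screenshot path
        "Turn on airplane mode",
        'CLICK("Airplane mode toggle")',
    )
    print(json.dumps(sample, indent=2))
```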

Evaluation Benchmarks

Evaluating mobile agents remains complex, particularly in capturing the dynamic and interactive nature of mobile tasks. Recent benchmarks like AndroidEnv and Mobile-Env provide novel environments to assess agent performance in realistic conditions, measuring adaptability beyond task completion metrics. These platforms address the limitations inherent in traditional static datasets and offer a comprehensive view of agent capabilities in interactive environments.
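Such interactive evaluation typically reduces to an episode loop: reset the environment to a task, let the agent act until it succeeds or exhausts its step budget, and report the success rate. The sketch below assumes a gym-style reset()/step() interface; the actual APIs of AndroidEnv and Mobile-Env differ in their details.

```python
# Sketch of an episode-level evaluation loop over an interactive mobile environment.
# The reset()/step() interface follows the usual gym convention and is an assumption,
# not the exact API of AndroidEnv or Mobile-Env.

def evaluate(agent, env, tasks, max_steps: int = 30) -> float:
    """Return the fraction of tasks the agent completes within the step budget."""
    successes = 0
    for task in tasks:
        observation = env.reset(task)           # screenshot / UI tree for the new task
        for _ in range(max_steps):
            action = agent.act(observation, task)
            observation, reward, done = env.step(action)
            if done:
                successes += reward > 0         # reward > 0 taken to mean task success
                break
    return successes / len(tasks)
```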

Components of Mobile Agents

The survey explores four core components underpinning mobile agents: perception, planning, action, and memory. These elements work in synchrony to enable agents to perceive, plan, and execute tasks in dynamic environments. The perception process, for instance, now benefits from multimodal integration, overcoming limitations of earlier methods that struggled with excessive irrelevant information.

Effective planning, categorized into dynamic and static strategies, remains crucial for mobile agents adapting to environments with fluctuating inputs. Actions executed through GUI interactions, API calls, and agent collaboration demonstrate the agent's ability to mimic human behavior across diverse tasks. Moreover, short-term and long-term memory mechanisms enhance task execution by allowing agents to retain task-relevant information; one way these components could fit together is sketched below.
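The following sketch wires the four components into a single agent step. The class and method names, and the assumption that the planner returns an object exposing a next_action attribute, are illustrative rather than drawn from any surveyed system.

```python
# Illustrative wiring of the four components the survey identifies:
# perception, planning, action, and memory (all interfaces are assumptions).
from dataclasses import dataclass, field


@dataclass
class Memory:
    short_term: list = field(default_factory=list)   # recent observations/actions for this task
    long_term: dict = field(default_factory=dict)    # reusable knowledge across tasks


class MobileAgent:
    def __init__(self, perceiver, planner, actuator):
        self.perceiver = perceiver      # e.g. screenshot + UI-tree encoder (multimodal perception)
        self.planner = planner          # dynamic planner that can replan as inputs change
        self.actuator = actuator        # executes GUI taps/typing or API calls
        self.memory = Memory()

    def step(self, raw_screen, goal: str):
        state = self.perceiver(raw_screen)                          # perception
        plan = self.planner(goal, state, self.memory)               # planning (returns .next_action)
        result = self.actuator(plan.next_action)                    # action
        self.memory.short_term.append((plan.next_action, result))   # memory update
        return result
```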

Implications and Future Directions

The surveyed technologies present several implications for the future of mobile agents. The necessity for enhanced security and privacy mechanisms is critical, given the risks associated with open environments. Moreover, improving the adaptability of mobile agents to dynamic settings and fostering multi-agent collaboration are integral areas for continued research.

Future work should explore innovative strategies to bolster agent behavior in rapidly changing environments, employing privacy-preserving techniques to secure sensitive data. Additionally, advancing multi-agent frameworks could enable more efficient task coordination and execution, propelling the practical applicability of mobile agents.

Conclusion

This survey embodies a significant scholarly contribution to the understanding of multimodal mobile agents. The discourse on benchmarks, core components, and methodologies not only sheds light on the current technological landscape but also sets the stage for future innovations. The continuous evolution of mobile agent technologies will undoubtedly reshape the domain, with implications for both practical applications and theoretical development in artificial intelligence research.

References (96)
  1. Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853.
  2. Galen Andrew and Jianfeng Gao. 2007. Scalable training of L1-regularized log-linear models. In Proceedings of the 24th International Conference on Machine Learning, pages 33–40.
  3. Stance detection with bidirectional conditional encoding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 876–885, Austin, Texas. Association for Computational Linguistics.
  4. Screenai: A vision-language model for ui and infographics understanding. arXiv preprint arXiv:2402.04615.
  5. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning. arXiv preprint arXiv:2406.11896.
  6. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.
  7. Mobile app tasks with iterative feedback (motif): Addressing task feasibility in interactive visual environments. arXiv preprint arXiv:2104.08560.
  8. Amex: Android multi-annotation expo dataset for mobile gui agents. arXiv preprint arXiv:2407.17490.
  9. Gui-world: A dataset for gui-oriented multimodal llm-based agents. arXiv preprint arXiv:2406.10819.
  10. Webvln: Vision-and-language navigation on websites. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 1165–1173.
  11. Wei Chen and Zhiyuan Li. 2024. Octopus v2: On-device language model for super agent. arXiv preprint arXiv:2404.01744.
  12. Octo-planner: On-device language model for planner-action agents. arXiv preprint arXiv:2406.18082.
  13. Websrc: A dataset for web-based structural reading comprehension. arXiv preprint arXiv:2101.09465.
  14. Seeclick: Harnessing gui grounding for advanced visual gui agents. ArXiv preprint, abs/2401.10935.
  15. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://vicuna.lmsys.org.
  16. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th annual ACM symposium on user interface software and technology, pages 845–854.
  17. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36.
  18. Training a vision language model as smartphone assistant. arXiv preprint arXiv:2404.08755.
  19. Assistgui: Task-oriented desktop graphical user interface automation. ArXiv preprint, abs/2312.13108.
  20. Assistgpt: A general multi-modal assistant that can plan, execute, inspect, and learn. arXiv preprint arXiv:2306.08640.
  21. Noise reduction and targeted exploration in imitation learning for Abstract Meaning Representation parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1–11, Berlin, Germany. Association for Computational Linguistics.
  22. Pptc benchmark: Evaluating large language models for powerpoint task completion. arXiv preprint arXiv:2311.01767.
  23. A real-world webagent with planning, long context understanding, and program synthesis. ArXiv preprint, abs/2307.12856.
  24. Mary Harper. 2014. Learning from 26 languages: Program management and science in the babel program. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, page 1, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.
  25. Webvoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919.
  26. Cogagent: A visual language model for gui agents. ArXiv preprint, abs/2312.08914.
  27. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290.
  28. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. arXiv preprint arXiv:2402.17553.
  29. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. ArXiv preprint, abs/2401.13649.
  30. Large language models are zero-shot reasoners. ArXiv preprint, abs/2205.11916.
  31. Proceedings of the 1st Workshop on Meta Learning and Its Applications to Natural Language Processing. Association for Computational Linguistics, Online.
  32. Benchmarking mobile device control agents across diverse configurations. arXiv preprint arXiv:2404.16660.
  33. Gang Li and Yang Li. 2022. Spotlight: Mobile ui understanding using vision-language models with a focus. arXiv preprint arXiv:2209.14927.
  34. Appagent v2: Advanced agent for flexible mobile interactions. arXiv preprint arXiv:2408.11824.
  35. Mapping natural language instructions to mobile ui action sequences. arXiv preprint arXiv:2005.03776.
  36. Widget captioning: Generating natural language description for mobile user interface elements. arXiv preprint arXiv:2010.04295.
  37. Vut: Versatile ui transformer for multi-modal multi-task user interface modeling. arXiv preprint arXiv:2112.05692.
  38. Reinforcement learning on web interfaces using workflow-guided exploration. arXiv preprint arXiv:1802.08802.
  39. Visual instruction tuning. ArXiv preprint, abs/2304.08485.
  40. Agentbench: Evaluating llms as agents. ArXiv preprint, abs/2308.03688.
  41. From skepticism to acceptance: Simulating the attitude dynamics toward fake news. arXiv preprint arXiv:2403.09498.
  42. Chatting with gpt-3 for zero-shot human-like mobile automated gui testing. arXiv preprint arXiv:2305.09434.
  43. Gui odyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. arXiv preprint arXiv:2406.08451.
  44. Coco-agent: A comprehensive cognitive mllm agent for smartphone gui automation. In Findings of the Association for Computational Linguistics ACL 2024, pages 9097–9110.
  45. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
  46. Screenagent: A vision language model-driven computer control agent. arXiv preprint arXiv:2402.07945.
  47. Show your work: Scratchpads for intermediate computation with language models. In Deep Learning for Code Workshop.
  48. OpenAI. 2023. ChatGPT. https://openai.com/blog/chatgpt/.
  49. OpenAI. 2023. Gpt-4 technical report.
  50. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789.
  51. Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015. Yara parser: A fast and accurate dependency parser. Computing Research Repository, arXiv:1503.06733. Version 2.
  52. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573.
  53. Androidinthewild: A large-scale dataset for android device control. Advances in Neural Information Processing Systems, 36.
  54. Androidinthewild: A large-scale dataset for android device control. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  55. Weblinks: Augmenting web browsers with enhanced link services. In Proceedings of the 3rd Workshop on Human Factors in Hypertext, pages 1–5.
  56. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
  57. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR.
  58. Appbuddy: Learning to accomplish tasks in mobile apps via reinforcement learning. In Canadian AI.
  59. Navigating interfaces with ai for enhanced user interaction. ArXiv preprint, abs/2312.11190.
  60. Facilitating multi-role and multi-behavior collaboration of large language models for online job seeking and recruiting. arXiv preprint arXiv:2405.18113.
  61. Harnessing multi-role capabilities of large language models for open-domain question answering. In Proceedings of the ACM on Web Conference 2024, pages 4372–4382.
  62. META-GUI: Towards multi-modal conversational agents on mobile GUI. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6699–6712, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  63. Towards better semantic understanding of mobile interfaces. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5636–5650, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
  64. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  65. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288.
  66. Androidenv: A reinforcement learning platform for android. arXiv preprint arXiv:2105.13231.
  67. Ugif: Ui grounded instruction following. arXiv preprint arXiv:2211.07615.
  68. Screen2words: Automatic mobile ui summarization with multimodal learning. In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 498–510.
  69. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. arXiv preprint arXiv:2406.01014.
  70. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158.
  71. Mobileagentbench: An efficient and user-friendly benchmark for mobile llm agents. arXiv preprint arXiv:2406.08184.
  72. MOTIF: Contextualized images for complex words to improve human reading. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2468–2477, Marseille, France. European Language Resources Association.
  73. Chain of thought prompting elicits reasoning in large language models. ArXiv preprint, abs/2201.11903.
  74. Empowering llm to use smartphone for intelligent task automation. ArXiv preprint, abs/2308.15272.
  75. Autodroid: Llm-powered task automation in android. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, pages 543–557.
  76. Droidbot-gpt: Gpt-powered ui automation for android. arXiv preprint arXiv:2304.07061.
  77. Webui: A dataset for enhancing visual ui understanding with web semantics. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–14.
  78. Mobilevlm: A vision-language model for better intra-and inter-ui understanding. arXiv preprint arXiv:2409.14818.
  79. Uied: a hybrid tool for gui element detection. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1655–1659.
  80. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972.
  81. Understanding the weakness of large language model agents within a complex android environment. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6061–6072.
  82. Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation. ArXiv preprint, abs/2311.07562.
  83. Appagent: Multimodal agents as smartphone users. ArXiv preprint, abs/2312.13771.
  84. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757.
  85. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
  86. Ferret-ui: Grounded mobile ui understanding with multimodal llms.
  87. Zhuosheng Zhang and Aston Zhang. 2023. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436.
  88. Ufo: A ui-focused agent for windows os interaction. arXiv preprint arXiv:2402.07939.
  89. Mobile-env: A universal platform for training and evaluation of mobile interaction. arXiv preprint arXiv:2305.08144.
  90. Android in the zoo: Chain-of-action-thought for gui agents. arXiv preprint arXiv:2403.02713.
  91. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. ArXiv preprint, abs/2303.16199.
  92. Screen recognition: Creating accessibility metadata for mobile applications from pixels. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15.
  93. Responsible task automation: Empowering large language models as responsible task automators. arXiv preprint arXiv:2306.01242.
  94. Zhuosheng Zhang and Aston Zhang. 2023. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436.
  95. Automatic chain of thought prompting in large language models. In The Eleventh International Conference on Learning Representations.
  96. Webarena: A realistic web environment for building autonomous agents. ArXiv preprint, abs/2307.13854.
Authors (7)
  1. Biao Wu (101 papers)
  2. Yanda Li (11 papers)
  3. Meng Fang (100 papers)
  4. Zirui Song (21 papers)
  5. Zhiwei Zhang (75 papers)
  6. Yunchao Wei (151 papers)
  7. Ling Chen (144 papers)