
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents (2407.17490v1)

Published 3 Jul 2024 in cs.HC, cs.AI, and cs.MM

Abstract: AI agents have drawn increasing attention, largely for their ability to perceive environments, understand tasks, and autonomously achieve goals. To advance research on AI agents in mobile scenarios, we introduce the Android Multi-annotation EXpo (AMEX), a comprehensive, large-scale dataset designed for generalist mobile GUI-control agents. The dataset supports training and evaluating agents that complete complex tasks by directly interacting with the graphical user interface (GUI) on mobile devices. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, annotated at multiple levels. Unlike existing mobile device-control datasets, e.g., MoTIF, AitW, etc., AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions with stepwise GUI-action chains, each instruction averaging 13 steps. We develop this dataset from a more instructive and detailed perspective, complementing the general settings of existing datasets. Additionally, we develop a baseline model, SPHINX Agent, and compare its performance with state-of-the-art agents trained on other datasets. To facilitate further research, we open-source our dataset, models, and relevant evaluation tools. The project is available at https://yuxiangchai.github.io/AMEX/
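To make the three-level annotation hierarchy concrete, here is a minimal sketch of how one AMEX-style episode record might be represented. The class and field names below are illustrative assumptions, not the dataset's published schema; the open-sourced release at https://yuxiangchai.github.io/AMEX/ defines the actual format.

```python
# A minimal sketch of one AMEX-style episode record. All class and field
# names are assumptions for illustration, not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ElementAnnotation:
    """Levels 1-2: an interactive element, grounded and described."""
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in screenshot pixels
    functionality: str               # natural-language description of what it does

@dataclass
class Step:
    """One link in a stepwise GUI-action chain."""
    screenshot: str                  # path to the high-resolution screen capture
    action: str                      # e.g. "tap", "type", "scroll" (assumed action vocabulary)
    target: ElementAnnotation        # the element the action operates on

@dataclass
class Episode:
    """Level 3: a complex natural-language instruction with its action chain."""
    app: str                         # one of the ~110 annotated applications
    instruction: str                 # the complex task to accomplish
    steps: List[Step] = field(default_factory=list)  # AMEX instructions average ~13 steps
```

Representing each step as a (screenshot, action, target element) triple mirrors how the abstract describes the annotations: element grounding and functionality descriptions at the screen level, composed into action chains at the instruction level.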

References (35)
  1. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  2. Mobile app tasks with iterative feedback (motif): Addressing task feasibility in interactive visual environments. arXiv preprint arXiv:2104.08560, 2021.
  3. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935, 2024.
  4. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
  5. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2024.
  6. Tinghe Ding. Mobileagent: enhancing mobile control via human-machine interaction and sop integration. arXiv preprint arXiv:2401.04124, 2024.
  7. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935, 2024.
  8. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
  9. Understanding html with large language models. arXiv preprint arXiv:2210.03945, 2022.
  10. Cogagent: A visual language model for gui agents. arXiv preprint arXiv:2312.08914, 2023.
  11. Language is not all you need: Aligning perception with language models. Advances in Neural Information Processing Systems, 36, 2024.
  12. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  13. Multimodal foundation models: From specialists to general-purpose assistants. Foundations and Trends® in Computer Graphics and Vision, 16(1-2):1–214, 2024.
  14. Mapping natural language instructions to mobile ui action sequences. arXiv preprint arXiv:2005.03776, 2020.
  15. Reinforcement learning on web interfaces using workflow-guided exploration. arXiv preprint arXiv:1802.08802, 2018.
  16. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
  17. Comprehensive cognitive llm agent for smartphone gui automation. arXiv preprint arXiv:2402.11941, 2024.
  18. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  19. Android in the wild: A large-scale dataset for android device control, 2023.
  20. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR, 2017.
  21. InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities, 2023.
  22. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  23. Ugif: Ui grounded instruction following. arXiv preprint arXiv:2211.07615, 2022.
  24. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
  25. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16133–16142, 2023.
  26. Os-copilot: Towards generalist computer agents with self-improvement. arXiv preprint arXiv:2402.07456, 2024.
  27. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441, 2023.
  28. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023.
  29. Ferret-ui: Grounded mobile ui understanding with multimodal llms. arXiv preprint arXiv:2404.05719, 2024.
  30. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436, 2023.
  31. Ufo: A ui-focused agent for windows os interaction. arXiv preprint arXiv:2402.07939, 2024.
  32. Android in the zoo: Chain-of-action-thought for gui agents, 2024.
  33. Screen recognition: Creating accessibility metadata for mobile applications from pixels. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2021.
  34. Gpt-4v(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024.
  35. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.
Authors (9)
  1. Yuxiang Chai (7 papers)
  2. Siyuan Huang (123 papers)
  3. Yazhe Niu (16 papers)
  4. Han Xiao (104 papers)
  5. Liang Liu (237 papers)
  6. Dingyu Zhang (1 paper)
  7. Peng Gao (401 papers)
  8. Shuai Ren (19 papers)
  9. Hongsheng Li (340 papers)
Citations (10)