
AgentStudio: A Toolkit for Building General Virtual Agents (2403.17918v2)

Published 26 Mar 2024 in cs.AI

Abstract: General virtual agents need to handle multimodal observations, master complex action spaces, and self-improve in dynamic, open-domain environments. However, existing environments are often domain-specific and require complex setups, which limits agent development and evaluation in real-world settings. As a result, current evaluations lack in-depth analyses that decompose fundamental agent capabilities. We introduce AgentStudio, a trinity of environments, tools, and benchmarks to address these issues. AgentStudio provides a lightweight, interactive environment with highly generic observation and action spaces, e.g., video observations and GUI/API actions. It integrates tools for creating online benchmark tasks, annotating GUI elements, and labeling actions in videos. Based on our environment and tools, we curate an online task suite that benchmarks both GUI interactions and function calling with efficient auto-evaluation. We also reorganize existing datasets and collect new ones using our tools to establish three datasets: GroundUI, IDMBench, and CriticBench. These datasets evaluate fundamental agent abilities, including GUI grounding, learning from videos, and success detection, pointing to the desiderata for robust, general, and open-ended virtual agents.
