
AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant (2410.18603v1)

Published 24 Oct 2024 in cs.AI and cs.RO
Abstract: Digital agents capable of automating complex computer tasks have attracted considerable attention due to their immense potential to enhance human-computer interaction. However, existing agent methods exhibit deficiencies in their generalization and specialization capabilities, especially in handling open-ended computer tasks in real-world environments. Inspired by the rich functionality of the App store, we present AgentStore, a scalable platform designed to dynamically integrate heterogeneous agents for automating computer tasks. AgentStore empowers users to integrate third-party agents, allowing the system to continuously enrich its capabilities and adapt to rapidly evolving operating systems. Additionally, we propose a novel core MetaAgent with the AgentToken strategy to efficiently manage diverse agents and utilize their specialized and generalist abilities for both domain-specific and system-wide tasks. Extensive experiments on three challenging benchmarks demonstrate that AgentStore surpasses the limitations of previous systems with narrow capabilities, particularly achieving a significant improvement from 11.21% to 23.85% on the OSWorld benchmark, more than doubling the previous results. Comprehensive quantitative and qualitative results further demonstrate AgentStore's ability to enhance agent systems in both generalization and specialization, underscoring its potential for developing the specialized generalist computer assistant. All our codes will be made publicly available in https://chengyou-jia.github.io/AgentStore-Home.

Insights on AgentStore: A Scalable Platform for Heterogeneous Agent Integration

The authors present AgentStore, a platform that integrates heterogeneous agents to automate complex tasks across operating systems. On the OSWorld benchmark, AgentStore achieves a success rate of 23.85%, compared with the previous best of 11.21%, which is roughly a 2.1-fold improvement. The work is motivated by the limitations of existing agent methods, which struggle with both generalization and specialization when confronted with open-ended tasks in real-world computing environments. The design draws its inspiration from the App Store model of integrating diverse functionality into one cohesive system.

Key Components and Methodology

AgentStore's architecture comprises three central components: AgentPool, AgentEnroll, and MetaAgent. AgentPool houses feature-specific agents, AgentEnroll provides a standardized protocol for incorporating new agents into the system, and MetaAgent serves as the hub for task management, using a novel AgentToken strategy to coordinate the enrolled agents efficiently.
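As a rough illustration of how a pool and an enrollment step could fit together, the sketch below models AgentPool as a registry and AgentEnroll as a validating `enroll` method. It is a minimal sketch under stated assumptions, not the authors' protocol: the `AgentSpec` fields, the `covering` helper, and the agent names `vlc_agent` and `sheet_agent` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentSpec:
    """Hypothetical enrollment record; field names are illustrative only."""
    name: str
    description: str            # short statement of what the agent is good at
    applications: List[str]     # applications or domains the agent covers
    run: Callable[[str], str]   # entry point that executes a task


@dataclass
class AgentPool:
    """Registry of enrolled agents, keyed by name."""
    agents: Dict[str, AgentSpec] = field(default_factory=dict)

    def enroll(self, spec: AgentSpec) -> None:
        # Stand-in for an enrollment protocol: validate, then register.
        if not spec.name or not spec.description:
            raise ValueError("an agent needs a name and a description")
        if spec.name in self.agents:
            raise ValueError(f"agent {spec.name!r} is already enrolled")
        self.agents[spec.name] = spec

    def covering(self, application: str) -> List[str]:
        # Names of agents that declare support for the given application.
        return [a.name for a in self.agents.values()
                if application in a.applications]


pool = AgentPool()
pool.enroll(AgentSpec("vlc_agent", "adjusts media player settings", ["vlc"],
                      lambda task: f"vlc_agent handled: {task}"))
pool.enroll(AgentSpec("sheet_agent", "edits spreadsheets", ["calc"],
                      lambda task: f"sheet_agent handled: {task}"))
```

Keeping the record declarative (a name, a description, and a callable) means a third-party agent can be added without touching the agents already in the pool.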

  1. AgentToken Strategy: This strategy is what lets MetaAgent route tasks dynamically to the appropriate agent from an expanding catalog. AgentToken assignments enable MetaAgent to discern which agent is most suitable for a given task, or how multiple agents might collaborate on it. Because routing is expressed as predicting a token, MetaAgent can select the required agent with high accuracy while avoiding both retraining and lengthy contexts.
  2. Training with Self-Instruct: The authors propose an automated self-instruct mechanism to generate training data for fine-tuning AgentTokens, thereby reducing reliance on pre-collected datasets. The process uses BERTScore to refine the generated outputs for quality and diversity, which helps AgentStore scale efficiently as new agents are added.
  3. Practical Implementation: The application of AgentStore within OSWorld demonstrates its ability to execute tasks that range from specialized operations, such as modifying VLC recording settings, to more integrated procedures encompassing multi-agent collaboration.
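The routing idea in item 1 can be sketched as scoring a task representation against one vector per enrolled agent and picking the highest-scoring agents. This is only an illustration under loud assumptions: the summary does not give the actual formulation, so the dot-product scoring, the `top_k` parameter, the two-dimensional vectors, and the agent names are all hypothetical stand-ins for learned token embeddings and a real model's hidden state.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    # Numerically stable softmax over raw scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_agents(task_vec: Sequence[float],
                agent_tokens: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    # Score each agent token by its dot product with the task vector
    # (an assumed scoring rule), then normalize to probabilities.
    names = list(agent_tokens)
    scores = [sum(t * a for t, a in zip(task_vec, agent_tokens[n])) for n in names]
    probs = softmax(scores)
    return sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)


def route(task_vec: Sequence[float],
          agent_tokens: Dict[str, Sequence[float]],
          top_k: int = 1) -> List[str]:
    # top_k=1 selects a single agent; top_k>1 is a hypothetical way to
    # nominate several agents for a collaborative task.
    return [name for name, _ in rank_agents(task_vec, agent_tokens)[:top_k]]


# Toy vectors standing in for learned embeddings.
tokens = {"vlc_agent": [0.9, 0.1], "sheet_agent": [0.1, 0.9]}
task = [1.0, 0.0]
single = route(task, tokens)            # best single agent
pair = route(task, tokens, top_k=2)     # two candidates, best first

# Enrolling a new agent only adds one vector; existing entries are untouched.
tokens["web_agent"] = [2.0, 0.0]
after_enroll = route(task, tokens)
```

In this toy setup, adding an agent changes only the lookup table, which mirrors the point above that the catalog can grow without retraining the router from scratch or listing every agent in the prompt.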

Implications and Future Directions

AgentStore's scalable integration of agents suggests significant implications for developing "specialized generalists," AI systems that capably perform specific tasks while remaining adaptable to broader challenges. This flexibility is critical as operating systems and associated applications continue to evolve, demanding agents capable of addressing novel and increasingly intricate tasks.

The concept of dynamically integrating diverse agents opens avenues for future exploration in AI, particularly in enhancing the robustness and comprehensiveness of digital assistants. This could include expanding AgentStore to incorporate even more heterogeneous agents, potentially improving its ability to handle complex, multi-step, and cross-application tasks. Moreover, the implementation of the AgentToken strategy in a wider variety of AI applications might offer new insights into efficient agent interaction models.

Overall, the authors contribute a forward-thinking approach to addressing the limitations in current digital agents, offering a scalable framework that leverages the specialized capabilities of individual agents while maintaining general applicability across tasks. AgentStore stands as a promising development towards realizing more capable and versatile AI-driven automation systems.

Authors (8)
  1. Chengyou Jia
  2. Minnan Luo
  3. Zhuohang Dang
  4. Qiushi Sun
  5. Fangzhi Xu
  6. Junlin Hu
  7. Tianbao Xie
  8. Zhiyong Wu