Hierarchical Auto-Organizing System for Open-Ended Multi-Agent Navigation (2403.08282v2)
Abstract: The dynamic, unpredictable open-world setting of Minecraft makes navigating its complex environments a significant challenge for multi-agent systems. Agents must interact with the environment and coordinate their actions with other agents to achieve common objectives. However, traditional approaches often struggle to manage inter-agent communication and task distribution efficiently, both of which are crucial for effective multi-agent navigation. Furthermore, processing and integrating multi-modal information (such as visual, textual, and auditory data) is essential for agents to fully comprehend their goals and navigate the environment successfully. To address these issues, we design the HAS framework, which auto-organizes groups of LLM-based agents to complete navigation tasks. Our hierarchical auto-organizing navigation system is characterized by 1) a hierarchical structure for multi-agent organization, ensuring centralized planning with decentralized execution; 2) an auto-organizing intra-communication mechanism, enabling dynamic group adjustment under subtasks; and 3) a multi-modal information platform, facilitating multi-modal perception so that a single system can perform all three navigation tasks. To assess organizational behavior, we design a series of navigation tasks in the Minecraft environment, including searching and exploring. We aim to develop embodied organizations that push the boundaries of embodied AI, moving it toward a more human-like organizational structure.
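The "centralized planning, decentralized execution" pattern described above can be illustrated with a minimal sketch: a manager agent decomposes a goal into subtasks and auto-organizes worker agents into groups by skill, after which each group would act independently. All names here (`Agent`, `Manager`, the skill labels) are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    """A worker agent with a set of skills it can execute."""
    name: str
    skills: set

@dataclass
class Manager:
    """Centralized planner: decomposes a goal and auto-organizes groups."""
    agents: list

    def plan(self, goal: str) -> list:
        # Toy decomposition: treat each comma-separated term as a subtask.
        return [s.strip() for s in goal.split(",")]

    def organize(self, subtasks: list) -> dict:
        # Assign each agent to the first subtask matching one of its skills;
        # the resulting groups then execute their subtasks in a decentralized way.
        groups = {t: [] for t in subtasks}
        for agent in self.agents:
            for t in subtasks:
                if t in agent.skills:
                    groups[t].append(agent.name)
                    break
        return groups

manager = Manager([
    Agent("alpha", {"search"}),
    Agent("beta", {"explore"}),
    Agent("gamma", {"search"}),
])
groups = manager.organize(manager.plan("search, explore"))
print(groups)  # {'search': ['alpha', 'gamma'], 'explore': ['beta']}
```

Dynamic group adjustment would amount to re-running `organize` whenever the subtask list changes mid-episode.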
Authors: Zhonghan Zhao, Kewei Chen, Dongxu Guo, Wenhao Chai, Tian Ye, Yanting Zhang, Gaoang Wang