Luban: Building Open-Ended Creative Agents via Autonomous Embodied Verification (2405.15414v1)
Abstract: Building open-ended agents has long been an ultimate goal of AI research, and creative agents are even more enticing. Existing LLM agents excel at long-horizon tasks with well-defined goals (e.g., "mine diamonds" in Minecraft). However, they struggle with creative tasks that have open goals and abstract criteria: because they cannot bridge the gap between the two, they lack the feedback needed for self-improvement while solving the task. In this work, we introduce autonomous embodied verification techniques that let agents fill this gap, laying the groundwork for creative tasks. Specifically, we propose the Luban agent, which targets creative building tasks in Minecraft and is equipped with two-level autonomous embodied verification inspired by human design practices: (1) visual verification of 3D structural speculations, which come from agent-synthesized CAD modeling programs; (2) pragmatic verification of the creation, performed by generating and verifying environment-relevant functionality programs based on the abstract criteria. Extensive multi-dimensional human studies and Elo ratings show that Luban completes diverse creative building tasks in our proposed benchmark and outperforms other baselines ($33\%$ to $100\%$) in both visualization and pragmatism. Additional demos on a real-world robotic arm show the creative potential of Luban in the physical world.
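The abstract reports Elo ratings without stating how they are computed. As a rough illustration only, the standard Elo update for pairwise comparisons can be sketched as below. The K-factor of 32, the initial rating of 1500, and the names `expected_score`, `update_elo`, and `rate_from_comparisons` are assumptions made for this sketch, not settings or code taken from the paper.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings after one comparison.

    score_a is 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    k=32 is an assumed K-factor, not a value from the paper.
    """
    expected_a = expected_score(rating_a, rating_b)
    expected_b = 1.0 - expected_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b


def rate_from_comparisons(
    comparisons: Iterable[Tuple[str, str, float]],
    initial: float = 1500.0,  # assumed starting rating
    k: float = 32.0,
) -> Dict[str, float]:
    """Fold a sequence of (name_a, name_b, score_a) judgments into ratings."""
    ratings: Dict[str, float] = {}
    for name_a, name_b, score_a in comparisons:
        rating_a = ratings.setdefault(name_a, initial)
        rating_b = ratings.setdefault(name_b, initial)
        ratings[name_a], ratings[name_b] = update_elo(rating_a, rating_b, score_a, k)
    return ratings
```

In a human study, each judgment that one agent's build is preferred over another's would contribute one `(name_a, name_b, score_a)` tuple. The update is zero-sum, so the total rating across the two compared agents is unchanged by each comparison.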