ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting (2410.17856v2)
Abstract: Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning. A common solution is to build hierarchical agents, where VLMs serve as high-level reasoners that break tasks down into executable sub-tasks, typically specified in language. However, language cannot effectively convey detailed spatial information. We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions from concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2. Our method unlocks the potential of VLMs, enabling them to tackle complex tasks that demand spatial reasoning. Experiments in Minecraft show that our approach enables agents to accomplish previously unattainable tasks, with a $\mathbf{76}\%$ absolute improvement in open-world interaction performance. Code and demos are available on the project page: https://craftjarvis.github.io/ROCKET-1.
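The abstract describes the core interface only at a high level: the policy consumes past RGB observations concatenated channel-wise with object-segmentation masks (produced online by a tracker such as SAM-2) and predicts low-level actions. The sketch below illustrates that input interface in PyTorch; the module names, layer sizes, temporal backbone, and action space are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch of visual-temporal context prompting as an input interface:
# RGB frames are concatenated with binary segmentation masks along the channel
# dimension, encoded per frame, aggregated over time, and mapped to action logits.
# All shapes and modules here are assumptions for illustration only.
import torch
import torch.nn as nn

class VisualTemporalPolicy(nn.Module):
    def __init__(self, num_actions: int = 8, hidden: int = 256):
        super().__init__()
        # 3 RGB channels + 1 mask channel per frame
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Temporal context over past frames (a GRU stands in for the real backbone)
        self.temporal = nn.GRU(64, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, num_actions)

    def forward(self, frames: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W); masks: (B, T, 1, H, W) from an object tracker (e.g. SAM-2)
        b, t = frames.shape[:2]
        x = torch.cat([frames, masks], dim=2)             # concatenate along channels
        feats = self.encoder(x.flatten(0, 1)).view(b, t, -1)
        out, _ = self.temporal(feats)
        return self.action_head(out[:, -1])               # action logits for the latest step


if __name__ == "__main__":
    policy = VisualTemporalPolicy()
    obs = torch.rand(1, 16, 3, 128, 128)   # 16 past RGB frames
    seg = torch.rand(1, 16, 1, 128, 128)   # matching segmentation masks
    print(policy(obs, seg).shape)          # -> torch.Size([1, 8])
```

In this reading, the segmentation mask replaces a language sub-goal: the high-level VLM points at an object in a past observation, and the tracker keeps that mask aligned with the current frame so the low-level policy always sees where to interact.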
Authors: Shaofei Cai, Zihao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, Yitao Liang