ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting (2410.17856v2)

Published 23 Oct 2024 in cs.CV and cs.AI

Abstract: Vision-Language Models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning. A common solution is building hierarchical agents, where VLMs serve as high-level reasoners that break down tasks into executable sub-tasks, typically specified using language. However, language suffers from the inability to communicate detailed spatial information. We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2. Our method unlocks the potential of VLMs, enabling them to tackle complex tasks that demand spatial reasoning. Experiments in Minecraft show that our approach enables agents to achieve previously unattainable tasks, with a $\mathbf{76}\%$ absolute improvement in open-world interaction performance. Codes and demos are now available on the project page: https://craftjarvis.github.io/ROCKET-1.

Authors (7)
  1. Shaofei Cai (17 papers)
  2. Zihao Wang (216 papers)
  3. Kewei Lian (3 papers)
  4. Zhancun Mu (6 papers)
  5. Xiaojian Ma (52 papers)
  6. Anji Liu (35 papers)
  7. Yitao Liang (53 papers)

Summary

An Expert Overview of "ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting"

The paper "ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting" addresses significant challenges in adapting Vision-LLMs (VLMs) to embodied decision-making within open-world environments. The authors propose an innovative approach called visual-temporal context prompting, aimed at enhancing the capability of VLMs to smoothly align low-level observations with high-level decision-making processes. This approach is designed to improve spatial understanding and interaction in complex tasks, particularly in dynamic environments like Minecraft.

Key Contributions

  1. Visual-Temporal Context Prompting: The paper introduces a novel communication protocol integrating visual and temporal cues to bridge the spatial information gap often encountered in language prompts. Unlike language instructions, which struggle to convey detailed spatial relationships, this method uses object segmentation to guide interactions, leveraging past and present observations.
  2. ROCKET-1 Policy Model: ROCKET-1 is a low-level policy that predicts actions from visual observations concatenated with segmentation masks, supported by real-time object tracking. Rather than reasoning on its own, the policy grounds the VLM's high-level intent: the segmented objects tell it where to act, enabling precise, spatially aware action prediction. A minimal interface sketch appears after this list.
  3. Backward Trajectory Relabeling: To generate training data efficiently, segmentation masks are propagated backward in time from the frames where an interaction occurs, so earlier steps in a trajectory are relabeled with the object the agent eventually engaged. SAM-2, a state-of-the-art segmentation model, supplies the object tracking that keeps these masks accurate even in partially observable environments. The relabeling procedure is sketched below.
  4. Hierarchical Agent Architecture: The architecture pairs a high-level VLM reasoner with the ROCKET-1 policy: the VLM selects objects and interaction types, while ROCKET-1 executes the corresponding low-level actions. This division lets the agent inherit the broad reasoning abilities of the VLM while retaining precise environment control; an illustrative control loop is included with the sketches below.
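
To make the prompting protocol and the ROCKET-1 interface concrete, here is a minimal PyTorch-style sketch of a policy that consumes visual-temporal context prompts. All layer sizes, the action-space size, and the interaction vocabulary are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn


class VisualTemporalPolicy(nn.Module):
    """Toy policy conditioned on per-frame object masks and an interaction type."""

    def __init__(self, num_interactions=8, num_actions=121, d_model=256):
        super().__init__()
        # RGB frame (3 channels) and binary object mask (1 channel) are
        # concatenated channel-wise, so the mask directly marks the region
        # the high-level reasoner wants the policy to engage with.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        # Interaction type (e.g. approach, break, place) as an embedding.
        self.interaction_emb = nn.Embedding(num_interactions, d_model)
        # Causal transformer over the frame sequence supplies the
        # *temporal* half of the visual-temporal context.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, num_actions)

    def forward(self, frames, masks, interaction_ids):
        # frames: (B, T, 3, H, W); masks: (B, T, 1, H, W); ids: (B, T)
        B, T = frames.shape[:2]
        x = torch.cat([frames, masks], dim=2).flatten(0, 1)  # (B*T, 4, H, W)
        tokens = self.encoder(x).view(B, T, -1)
        tokens = tokens + self.interaction_emb(interaction_ids)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.temporal(tokens, mask=causal)
        return self.action_head(hidden)  # (B, T, num_actions) action logits


if __name__ == "__main__":
    policy = VisualTemporalPolicy()
    logits = policy(
        torch.rand(2, 8, 3, 128, 128),        # 8-frame RGB clip
        torch.zeros(2, 8, 1, 128, 128),       # target-object masks from the tracker
        torch.zeros(2, 8, dtype=torch.long),  # interaction-type ids
    )
    print(logits.shape)  # torch.Size([2, 8, 121])
```

At inference time, the mask channel would come from a video segmentation model (SAM-2 in the paper) and the interaction id from the high-level reasoner.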
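
Backward trajectory relabeling (item 3) can be pictured with the following sketch: whenever a trajectory contains an interaction event, the interacted object's mask is propagated backwards over the preceding steps to form (frame, mask, interaction, action) training tuples. The `Step` container and the `track_object_backward` callable are assumed stand-ins for the paper's data format and for SAM-2, respectively.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

import numpy as np


@dataclass
class Step:
    frame: np.ndarray           # (H, W, 3) observation
    action: int                 # low-level action taken at this step
    interacted_object: object   # non-None only when an interaction event fires


def relabel_trajectory(
    steps: List[Step],
    track_object_backward: Callable[[List[np.ndarray], object], List[np.ndarray]],
    interaction_id: int,
) -> List[Tuple[np.ndarray, np.ndarray, int, int]]:
    """Hindsight-style relabeling: each interaction event labels the steps leading to it."""
    samples = []
    t = len(steps) - 1
    while t >= 0:
        if steps[t].interacted_object is None:
            t -= 1
            continue
        # The event's window reaches back to (and excludes) the previous event.
        start = t
        while start > 0 and steps[start - 1].interacted_object is None:
            start -= 1
        frames = [s.frame for s in steps[start : t + 1]]
        # The tracker is seeded on the interacted object in the *last* frame of
        # the window and propagates its mask backwards (SAM-2's role in the paper).
        masks = track_object_backward(frames, steps[t].interacted_object)
        for k, mask in zip(range(start, t + 1), masks):
            samples.append((steps[k].frame, mask, interaction_id, steps[k].action))
        t = start - 1
    return samples


def dummy_tracker(frames, obj):
    # Full-frame masks, just to show the calling convention.
    return [np.ones(f.shape[:2], dtype=bool) for f in frames]
```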
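
Finally, the hierarchical architecture of item 4 can be read as a simple control loop. The `reasoner`, `tracker`, `policy`, and `env` objects below are assumed interfaces used only for illustration; in the paper they correspond to a VLM, SAM-2, ROCKET-1, and Minecraft, respectively.

```python
def run_episode(env, reasoner, tracker, policy, max_steps=1000):
    """Illustrative hierarchical loop: the VLM plans, the tracker segments, the policy acts."""
    obs = env.reset()
    plan = reasoner.propose(obs)             # -> (target_region, interaction_id)
    tracker.seed(obs, plan.target_region)    # start tracking the chosen object
    history = []
    for _ in range(max_steps):
        mask = tracker.update(obs)           # per-step mask of the target object
        history.append((obs, mask))
        action = policy.act(history, plan.interaction_id)
        obs, done = env.step(action)
        if done:
            break
        if tracker.lost():                   # target out of view: ask the reasoner again
            plan = reasoner.propose(obs)
            tracker.seed(obs, plan.target_region)
            history.clear()
    return obs
```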

Experimental Validation

The paper evaluates the proposed method on a custom Minecraft Interaction Benchmark comprising tasks that emphasize object interaction and spatial awareness. ROCKET-1, in conjunction with high-level reasoners, outperforms existing baselines in both short-horizon and long-horizon tasks. Notably, the authors report a 76% absolute improvement in open-world interaction performance, with the largest gains on spatially demanding tasks, highlighting the efficacy of visual-temporal context prompting.

Theoretical and Practical Implications

Theoretically, this research introduces a compelling means to enhance VLMs' interaction capabilities, paving the way for more sophisticated reasoning in open-world scenarios. Practically, it provides insights into developing AI systems that can manage and operate within dynamic and partially observable environments, such as autonomous robotics and virtual agents.

Future Developments

Future work could extend this approach to even more complex environments or refine the model's ability to generalize across unseen tasks. Improvements in large-scale deployment of such models could lead to substantial progress in AI-driven exploration and interaction systems.

Overall, this work presents a robust framework for overcoming the spatial communication challenges in embodied AI systems, leveraging advanced segmentation and prompting techniques to enhance VLMs' effectiveness. This approach may serve as a foundation for developing more generalizable and efficient interaction policies in open-world computational environments.
