NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation (2403.08355v1)

Published 13 Mar 2024 in cs.RO and cs.CV

Abstract: Enabling home-assistant robots to perceive and manipulate a diverse range of 3D objects based on human language instructions is a pivotal challenge. Prior research has predominantly focused on simplistic, task-oriented instructions, e.g., "Slide the top drawer open". However, many real-world tasks demand intricate multi-step reasoning and, without fine-grained human instructions, become extremely difficult for robot manipulation. To address these challenges, we introduce a comprehensive benchmark, NrVLM, comprising 15 distinct manipulation tasks and over 4500 episodes meticulously annotated with fine-grained language instructions. We split each long-horizon task into several steps, each paired with a natural language instruction. Moreover, we propose a novel learning framework that completes the manipulation task step by step according to the fine-grained instructions. Specifically, we first identify the instruction to execute, taking into account visual observations and the end-effector's current state. Subsequently, our approach facilitates explicit learning through action-prompts and perception-prompts to promote manipulation-aware cross-modality alignment. Leveraging both visual observations and linguistic guidance, our model outputs a sequence of actionable predictions for manipulation, including contact points and end-effector poses. We evaluate our method and baselines using the proposed benchmark NrVLM. The experimental results demonstrate the effectiveness of our approach. For additional details, please refer to https://sites.google.com/view/naturalvlm.
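
The abstract describes a step-by-step control loop: select the next fine-grained instruction given the visual observation and the end-effector's current state, then predict a contact point and end-effector pose. The sketch below illustrates that loop in Python; the function names, the centroid-based dummy policy, and the placeholder observations are illustrative assumptions, not the authors' actual model.

```python
# Minimal sketch of the step-by-step execution loop outlined in the abstract.
# All names here (select_instruction, predict_action, ...) are hypothetical
# placeholders, not the paper's implementation.
import numpy as np

def select_instruction(instructions, done_flags):
    """Return the first fine-grained instruction whose step is not yet completed."""
    for instr, done in zip(instructions, done_flags):
        if not done:
            return instr
    return None  # all steps finished

def predict_action(point_cloud, ee_state, instruction):
    """Stand-in for the learned policy. A real model would fuse visual and
    language features (e.g., via action- and perception-prompts); here we
    simply return the point-cloud centroid as a contact point and a dummy pose."""
    contact_point = point_cloud.mean(axis=0)                   # centroid of observed points
    ee_pose = np.concatenate([contact_point, [0.0, 0.0, 0.0, 1.0]])  # xyz + identity quaternion
    return contact_point, ee_pose

if __name__ == "__main__":
    instructions = [
        "Move the gripper above the drawer handle",
        "Close the gripper on the handle",
        "Pull the drawer open",
    ]
    done = [False] * len(instructions)
    point_cloud = np.random.rand(1024, 3)  # placeholder visual observation
    ee_state = np.zeros(7)                 # placeholder end-effector pose

    while (instr := select_instruction(instructions, done)) is not None:
        contact, pose = predict_action(point_cloud, ee_state, instr)
        print(f"Executing: {instr!r} -> contact point {contact.round(2)}")
        done[instructions.index(instr)] = True
```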

Authors (5)
  1. Ran Xu (89 papers)
  2. Yan Shen (30 papers)
  3. Xiaoqi Li (77 papers)
  4. Ruihai Wu (28 papers)
  5. Hao Dong (175 papers)
Citations (5)