MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting (2403.03174v3)

Published 5 Mar 2024 in cs.RO and cs.AI

Abstract: Open-world generalization requires robotic systems to have a profound understanding of the physical world and the user command to solve diverse and complex tasks. While the recent advancement in vision-language models (VLMs) has offered unprecedented opportunities to solve open-world problems, how to leverage their capabilities to control robots remains a grand challenge. In this paper, we introduce Marking Open-world Keypoint Affordances (MOKA), an approach that employs VLMs to solve robotic manipulation tasks specified by free-form language instructions. Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world. By prompting the pre-trained VLM, our approach utilizes the VLM's commonsense knowledge and concept understanding acquired from broad data sources to predict affordances and generate motions. To facilitate the VLM's reasoning in zero-shot and few-shot manners, we propose a visual prompting technique that annotates marks on images, converting affordance reasoning into a series of visual question-answering problems that are solvable by the VLM. We further explore methods to enhance performance with robot experiences collected by MOKA through in-context learning and policy distillation. We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.

MOKA: Bridging Vision-Language Models and Robotic Manipulation through Mark-Based Visual Prompting

Overview

Vision-language models (VLMs) offer a compelling opportunity to address the challenge of open-world generalization in robotic manipulation. Incorporating these models into robotics could drastically extend the range of tasks a robot can perform when instructed through simple, free-form language. This paper introduces Marking Open-world Keypoint Affordances (MOKA), an approach that leverages pre-trained VLMs to predict affordances and generate the corresponding motions for a robot to execute tasks described in natural language.

Methodology

MOKA aligns the predictions of a VLM with robotic actions through a compact, interpretable point-based affordance representation. By prompting the VLM with a free-form language description and an RGB image annotated with marks, the method turns task specification into a series of visual question-answering problems the VLM can address, enabling zero-shot generalization to new tasks.
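
To make the representation concrete, the sketch below shows one plausible way to encode such a point-based affordance in Python. The field names (grasp, function, and target points plus free-space waypoints) and the depth-based lifting step are illustrative assumptions based on the description above, not the authors' exact schema.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

import numpy as np

Point2D = Tuple[int, int]                 # pixel coordinates (x, y) in the RGB observation
Point3D = Tuple[float, float, float]      # position in the robot's workspace

@dataclass
class PointAffordance:
    """Compact, interpretable affordance for one sub-task, filled in by the VLM."""
    grasp_point: Optional[Point2D]        # where the gripper should grasp (None if no grasp is needed)
    function_point: Point2D               # part of the grasped object/tool that contacts the target
    target_point: Point2D                 # where the interaction should take place
    waypoints: List[Point2D] = field(default_factory=list)  # free-space points the motion passes through

def lift_to_3d(point: Point2D, depth: np.ndarray,
               deproject: Callable[[Point2D, float], Point3D]) -> Point3D:
    """Map a predicted 2D keypoint to a 3D position using the depth image and a
    camera deprojection routine supplied by the robot setup (placeholder here)."""
    x, y = point
    return deproject((x, y), float(depth[y, x]))
```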

Hierarchical Prompting Strategy

The framework employs a hierarchical approach: high-level task decomposition followed by low-level affordance reasoning. At the high level, the VLM decomposes the task into feasible sub-tasks based on the initial observation and the language instruction. For each sub-task, it then predicts the set of keypoints and waypoints needed for motion execution, following the structured affordance representation defined by the authors.
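
The two-level structure can be pictured as a pair of prompting calls: one that asks the VLM to decompose the instruction into sub-tasks, and one that asks it to fill in the affordance for each sub-task. The `vlm.ask(image, prompt)` interface, the JSON answer format, and the `camera`/`robot` objects below are hypothetical stand-ins, not the paper's actual API.

```python
import json

def decompose_task(vlm, image, instruction: str) -> list:
    """High level: split a free-form instruction into an ordered list of sub-tasks."""
    prompt = (
        f"Instruction: {instruction}\n"
        "Break this task into an ordered list of sub-tasks the robot arm can perform, "
        "and answer as a JSON list of objects with a 'description' field."
    )
    return json.loads(vlm.ask(image, prompt))

def reason_affordance(vlm, marked_image, subtask: dict) -> dict:
    """Low level: select keypoints and waypoints among the marks drawn on the image."""
    prompt = (
        f"Sub-task: {subtask['description']}\n"
        "The image is annotated with numbered candidate marks. Choose the marks for the "
        "grasp point, function point, target point, and any intermediate waypoints. "
        "Answer as JSON: {'grasp': ..., 'function': ..., 'target': ..., 'waypoints': [...]}."
    )
    return json.loads(vlm.ask(marked_image, prompt))

def run_episode(vlm, robot, camera, mark_fn, instruction: str):
    """Loop over sub-tasks, querying the VLM once per sub-task on a freshly marked image.
    `camera.capture`, `mark_fn` (see the next section), and `robot.execute` are placeholders."""
    image = camera.capture()
    for subtask in decompose_task(vlm, image, instruction):
        marked = mark_fn(camera.capture())       # image with numbered candidate marks drawn on it
        affordance = reason_affordance(vlm, marked, subtask)
        robot.execute(affordance)                # convert the selected points into a motion and run it
```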

Mark-Based Visual Prompting

A crucial component of MOKA is its mark-based visual prompting technique, which annotates visual marks on the image observation to direct the VLM toward the visual cues relevant for affordance reasoning. This shifts the problem from directly predicting continuous coordinates to selecting among a set of discrete candidates, a format that plays to the strengths of current VLMs.
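
The following is a minimal sketch of how such marks might be produced and consumed, assuming object masks come from an off-the-shelf segmentation model (e.g., SAM); the sampling density, colors, and font choices are arbitrary illustration, not the paper's exact pipeline. Because the VLM only names a mark index, its answer is a discrete choice that maps back to exact pixel coordinates.

```python
import cv2
import numpy as np

def sample_candidates(mask: np.ndarray, k: int = 5, seed: int = 0) -> np.ndarray:
    """Sample up to k candidate keypoints uniformly from a binary object mask of shape (H, W)."""
    ys, xs = np.nonzero(mask)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(xs), size=min(k, len(xs)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)            # (k, 2) array of (x, y) pixels

def annotate_marks(image: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Draw numbered marks on a copy of the image so the VLM can answer by index."""
    out = image.copy()
    for i, (x, y) in enumerate(candidates):
        cv2.circle(out, (int(x), int(y)), 6, (0, 0, 255), thickness=-1)
        cv2.putText(out, str(i), (int(x) + 8, int(y) - 8),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 2)
    return out

def mark_index_to_pixel(candidates: np.ndarray, index: int) -> tuple:
    """Map the VLM's chosen mark index back to a pixel coordinate."""
    x, y = candidates[index]
    return int(x), int(y)
```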

Evaluation and Results

MOKA was assessed across various manipulation tasks involving tool use, object rearrangement, and interaction with deformable bodies, showcasing robust performance across different instructions, object arrangements, and task environments. The approach demonstrates remarkable capability in zero-shot settings and shows further improvement when using in-context learning or policy distillation from collected task successes.
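
As one way to picture the in-context learning variant, successful rollouts could be stored and replayed as few-shot examples in later prompts; the episode fields below are assumptions for illustration, not the paper's logging format, and annotated images from those episodes could likewise be attached as visual examples.

```python
def build_few_shot_prompt(base_prompt: str, successes: list, limit: int = 3) -> str:
    """Prepend up to `limit` prior successes (sub-task text plus the marks the VLM chose)
    to a new query, so the VLM can imitate its own earlier, verified answers."""
    examples = [
        f"Example sub-task: {ep['description']}\nSelected marks: {ep['selected_marks']}"
        for ep in successes[-limit:]
    ]
    return "\n\n".join(examples + [base_prompt])
```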

Implications and Future Directions

This research underscores the potential of leveraging VLMs for robotic manipulation and paves the way for future work in this area. The success of MOKA suggests a scalable strategy for extending robotic capabilities to a broader spectrum of tasks without extensive task-specific programming or training. Furthermore, MOKA's ability to generate data for policy distillation points to a promising direction for combining model-based and learning-based approaches in robotics.

Theoretical and Practical Contributions

  • Introduces a point-based affordance representation that effectively translates VLM predictions into robotic actions.
  • Proposes a mark-based visual prompting method that enhances VLMs' applicability to robotic manipulation tasks, especially in an open-vocabulary setting.
  • Demonstrates the utility of pre-trained VLMs in solving diverse manipulation tasks specified by free-form language, with strong performance across the evaluated tasks.

Future Work

While MOKA marks a significant step forward, the exploration of more complex manipulation tasks, including bimanual coordination and tasks requiring delicate force control, remains open. Further development of VLMs and advancements in visual prompting strategies are critical for bridging remaining gaps between language understanding and physical interaction in robotics.

Conclusion

MOKA offers a promising approach towards enabling robots to understand and execute a wide range of manipulation tasks conveyed through natural language, leveraging the vast knowledge encapsulated in VLMs. This work not only presents a methodological advancement in robotic manipulation but also provides insight into the potential synergies between the fields of natural language processing, computer vision, and robotics.

Authors
  1. Fangchen Liu
  2. Kuan Fang
  3. Pieter Abbeel
  4. Sergey Levine