SAGE: Bridging Semantic and Actionable Parts for GEneralizable Manipulation of Articulated Objects (2312.01307v2)

Published 3 Dec 2023 in cs.RO and cs.CV

Abstract: To interact with daily-life articulated objects of diverse structures and functionalities, understanding the object parts plays a central role in both user instruction comprehension and task execution. However, the possible discordance between the semantic meaning and physics functionalities of the parts poses a challenge for designing a general system. To address this problem, we propose SAGE, a novel framework that bridges semantic and actionable parts of articulated objects to achieve generalizable manipulation under natural language instructions. More concretely, given an articulated object, we first observe all the semantic parts on it, conditioned on which an instruction interpreter proposes possible action programs that concretize the natural language instruction. Then, a part-grounding module maps the semantic parts into so-called Generalizable Actionable Parts (GAParts), which inherently carry information about part motion. End-effector trajectories are predicted on the GAParts, which, together with the action program, form an executable policy. Additionally, an interactive feedback module is incorporated to respond to failures, which closes the loop and increases the robustness of the overall framework. Key to the success of our framework is the joint proposal and knowledge fusion between a large vision-language model (VLM) and a small domain-specific model for both context comprehension and part perception, with the former providing general intuitions and the latter serving as expert facts. Both simulation and real-robot experiments show our effectiveness in handling a large variety of articulated objects with diverse language-instructed goals.

Summary

  • The paper introduces SAGE, a framework that integrates natural language interpretation with robotic manipulation to achieve generalizable object interactions.
  • It employs large language models and visual context parsing to translate instructions into semantic action programs and ground them to physical parts.
  • Experimental results demonstrate that SAGE outperforms baselines in robustness and adaptability across diverse object categories and tasks.

Overview of SAGE

SAGE is a framework that enables robotic manipulation of articulated objects under the guidance of language instructions. The central challenge it addresses is the real-world variability and complexity of object structures and functionalities, combined with the diverse goals dictated by language-based tasks. To navigate these complexities, SAGE fuses the semantic interpretation of objects with the physical execution of tasks, enabling robots to carry out a wide array of manipulations across different object categories as indicated by natural language commands.

Semantic and Actionable Parts Bridging

At the core of SAGE is its capacity to interpret language instructions not simply as directives but as concrete, actionable programs. An LLM-based instruction interpreter processes the natural language command and translates it into a series of semantic actions tied to specific parts of the object. For example, the instruction "Turn on the blender" is translated into an action program that targets the semantic part functioning as the "button" and specifies the physical motion needed to activate it. Scene understanding is strengthened by a visual context parser that generates descriptions both rich in content and accurate with respect to interaction-related facts. This fusion of a semantically rich, generalist vision-language model (VLM) with small domain-specific expert models yields a more effective translation from instruction to action.
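
The paper does not spell out a concrete interface for these action programs, but a minimal Python sketch can make the idea tangible. Everything here is assumed for illustration: the dataclasses ActionStep and ActionProgram and the function interpret_instruction are hypothetical names, and the keyword matching merely stands in for the LLM-based interpreter described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionStep:
    semantic_part: str      # part named in the instruction's own terms, e.g. "button"
    motion: str             # motion primitive to apply, e.g. "press", "pull", "rotate"
    magnitude: float = 0.0  # optional amount, e.g. pull distance in meters


@dataclass
class ActionProgram:
    instruction: str
    steps: List[ActionStep] = field(default_factory=list)


def interpret_instruction(instruction: str, observed_parts: List[str]) -> ActionProgram:
    """Toy stand-in for an LLM-based interpreter: map an instruction to an
    action program over the semantic parts observed in the scene."""
    text = instruction.lower()
    program = ActionProgram(instruction=instruction)
    if "turn on" in text and "button" in observed_parts:
        program.steps.append(ActionStep("button", "press"))
    elif "open" in text and "door" in observed_parts:
        program.steps.append(ActionStep("door", "pull", magnitude=0.3))
    return program


if __name__ == "__main__":
    program = interpret_instruction("Turn on the blender", ["button", "lid", "body"])
    for step in program.steps:
        print(f"{step.motion} the {step.semantic_part}")
```

The key point the sketch conveys is that the interpreter's output is structured over semantic part names rather than raw text, which is what allows the later grounding stage to attach each step to a physical part.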

Part Grounding and Actionable Movements

Following the parsing of instructions, the framework grounds the semantic parts to their physical counterparts, termed Generalizable Actionable Parts (GAParts). These parts generalize across object categories and inherently carry information about part motion. End-effector trajectories are then predicted on the GAParts and combined with the action program to form an executable policy. An interactive feedback module is integrated to manage failures by re-evaluating and adjusting actions in response to environmental uncertainties or execution errors, which substantially increases the robustness and adaptability of the manipulation.
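
The grounding and feedback loop can likewise be sketched in Python under stated assumptions: GAPart, SEMANTIC_TO_GAPART, and the planner/executor stubs below are hypothetical placeholders, not the authors' implementation. In SAGE the grounding is performed by a learned GAPart perception model and trajectories are predicted on the detected parts; the retry loop here only mirrors the described behavior of re-evaluating and adjusting when a step fails.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class GAPart:
    """A detected Generalizable Actionable Part with a simple motion model."""
    category: str          # e.g. "slider_button", "hinge_door"
    joint_type: str        # "prismatic" or "revolute"
    pose: List[float] = field(default_factory=lambda: [0.0] * 7)  # position + quaternion placeholder


# Illustrative mapping from instruction-level semantic parts to GAPart classes.
SEMANTIC_TO_GAPART = {
    "button": "slider_button",
    "door": "hinge_door",
    "lid": "hinge_lid",
}


def ground_to_gapart(semantic_part: str, detections: Dict[str, GAPart]) -> Optional[GAPart]:
    """Map a semantic part name onto a detected GAPart, if one is present."""
    target = SEMANTIC_TO_GAPART.get(semantic_part)
    return detections.get(target) if target else None


def plan_trajectory(part: GAPart, motion: str) -> List[List[float]]:
    """Stub planner: a real system would use the part's joint type and pose
    to generate end-effector waypoints for the requested motion."""
    return [part.pose, part.pose]


def execute_step(trajectory: List[List[float]]) -> bool:
    """Stub executor standing in for the robot controller; reports success."""
    return len(trajectory) > 0


def run_with_feedback(steps: List[Tuple[str, str]],
                      detections: Dict[str, GAPart],
                      max_retries: int = 2) -> bool:
    """Closed-loop execution: on failure, re-ground the part and retry."""
    for semantic_part, motion in steps:
        for _ in range(max_retries + 1):
            part = ground_to_gapart(semantic_part, detections)
            if part is None:
                return False                          # nothing matching to act on
            if execute_step(plan_trajectory(part, motion)):
                break                                 # step succeeded, move on
        else:
            return False                              # retries exhausted
    return True


if __name__ == "__main__":
    scene = {"slider_button": GAPart("slider_button", "prismatic")}
    print(run_with_feedback([("button", "press")], scene))
```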

Experimental Validation and Contribution

The effectiveness of SAGE is demonstrated through extensive experiments, conducted both in simulated environments and on real robots, showing the framework's ability to handle a large variety of objects and respond to a diverse set of language instructions. Notably, the framework outperformed baseline methods on challenging tasks and demonstrated stronger generalization beyond specific object categories and tasks. The contributions of this work are highlighted as follows:

  • Seamless integration of semantic understanding with actionable parts for robot manipulation.
  • Joint use of general-purpose and domain-specific models to provide detailed scene and part interpretations for manipulation.
  • Broad generalizability demonstrated across multiple object types and language instructions.
  • A new benchmark for evaluating language-guided manipulation of articulated objects in realistic scenarios.

The authors note that additional details and demonstrations are available on a dedicated project webpage.