SAGE: Bridging Semantic and Actionable Parts for GEneralizable Manipulation of Articulated Objects (2312.01307v2)
Abstract: To interact with daily-life articulated objects of diverse structures and functionalities, understanding the object parts plays a central role in both user instruction comprehension and task execution. However, the possible discordance between the semantic meaning and the physical functionality of a part poses a challenge for designing a general system. To address this problem, we propose SAGE, a novel framework that bridges the semantic and actionable parts of articulated objects to achieve generalizable manipulation under natural language instructions. More concretely, given an articulated object, we first observe all of its semantic parts; conditioned on these, an instruction interpreter proposes possible action programs that concretize the natural language instruction. Then, a part-grounding module maps the semantic parts to so-called Generalizable Actionable Parts (GAParts), which inherently carry part-motion information. End-effector trajectories are predicted on the GAParts and, together with the action program, form an executable policy. Additionally, an interactive feedback module responds to failures, closing the loop and increasing the robustness of the overall framework. Key to the success of our framework is the joint proposal and knowledge fusion between a large vision-language model (VLM) and a small domain-specific model for both context comprehension and part perception, with the former providing general intuitions and the latter supplying expert facts. Both simulation and real-robot experiments demonstrate the effectiveness of our framework in handling a large variety of articulated objects with diverse language-instructed goals.
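To make the described pipeline concrete, below is a minimal Python sketch of the stages the abstract names: semantic-part observation, instruction interpretation, GAPart grounding, trajectory prediction, and closed-loop failure feedback. Every class, function, and value in it is a hypothetical placeholder invented for illustration, not the authors' actual interface or model; the stub bodies only stand in for the learned components.

```python
# Hypothetical sketch of the SAGE control flow described in the abstract.
# All names below are illustrative placeholders, not the authors' API.

from dataclasses import dataclass
from typing import List


@dataclass
class SemanticPart:
    """A part carrying a language-level label, e.g. 'handle' or 'lid'."""
    label: str


@dataclass
class GAPart:
    """A Generalizable Actionable Part; its kind implies the part motion."""
    kind: str  # e.g. 'hinge_handle', 'slider_drawer' (hypothetical kinds)
    source: SemanticPart


def observe_semantic_parts(rgbd) -> List[SemanticPart]:
    # Stage 1 (assumed): fuse proposals from a large VLM (general intuitions)
    # with a small domain-specific perception model (expert facts).
    return [SemanticPart("handle"), SemanticPart("door")]


def propose_action_program(instruction: str,
                           parts: List[SemanticPart]) -> List[str]:
    # Stage 2 (assumed): the instruction interpreter concretizes the
    # natural-language goal into an ordered program over observed parts.
    return [f"grasp:{parts[0].label}", "pull:open"]


def ground_to_gaparts(parts: List[SemanticPart]) -> List[GAPart]:
    # Stage 3 (assumed): map semantic parts onto GAParts, which carry
    # actionable motion information (hinge vs. slider, axis, range).
    return [GAPart(kind="hinge_handle", source=p) for p in parts]


def predict_trajectory(gapart: GAPart, step: str) -> List[tuple]:
    # Stage 4 (assumed): predict end-effector waypoints on the GAPart.
    return [(0.0, 0.0, 0.0), (0.0, 0.0, 0.1)]


def execute(waypoints: List[tuple]) -> bool:
    # Placeholder executor; a real system would command the robot and
    # report success or failure from its observations.
    return True


def run_sage(instruction: str, rgbd=None, max_retries: int = 2) -> bool:
    """Closed-loop driver: on failure, the interactive feedback module
    re-observes and re-plans (the loop closure the abstract describes)."""
    for _ in range(max_retries + 1):
        parts = observe_semantic_parts(rgbd)
        program = propose_action_program(instruction, parts)
        gaparts = ground_to_gaparts(parts)
        if all(execute(predict_trajectory(gaparts[0], step))
               for step in program):
            return True
        # Feedback (assumed): diagnose the failure and retry with updated
        # perception and an updated program; here we simply loop.
    return False


if __name__ == "__main__":
    print(run_sage("open the cabinet door"))
```

The retry loop is the only control structure the abstract commits to (the feedback module "closes the loop"); everything else is a linear hand-off between the named modules.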