A Survey of Embodied Learning for Object-Centric Robotic Manipulation (2408.11537v1)
Abstract: Embodied learning for object-centric robotic manipulation is a rapidly developing and challenging area in embodied AI. It is crucial for advancing next-generation intelligent robots and has garnered significant interest recently. Unlike data-driven machine learning methods, embodied learning focuses on robot learning through physical interaction with the environment and perceptual feedback, making it especially suitable for robotic manipulation. In this paper, we provide a comprehensive survey of the latest advancements in this field and categorize the existing work into three main branches: 1) Embodied perceptual learning, which aims to predict object pose and affordance through various data representations; 2) Embodied policy learning, which focuses on generating optimal robotic decisions using methods such as reinforcement learning and imitation learning; 3) Embodied task-oriented learning, designed to optimize the robot's performance based on the characteristics of different tasks in object grasping and manipulation. In addition, we offer an overview and discussion of public datasets, evaluation metrics, representative applications, current challenges, and potential future research directions. A project associated with this survey has been established at https://github.com/RayYoh/OCRM_survey.
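To make the policy-learning branch mentioned above concrete, below is a minimal sketch of behavioral-cloning-style imitation learning for visuomotor manipulation: a small convolutional policy is regressed onto expert observation-action pairs. The network architecture, tensor shapes, hyperparameters, and synthetic demonstration data are illustrative assumptions for this sketch only, not the implementation of any specific work covered by the survey.

```python
# Minimal behavioral-cloning sketch (illustrative assumptions: 64x64 RGB
# observations, 7-DoF end-effector actions, synthetic stand-in demonstrations).
import torch
import torch.nn as nn


class VisuomotorPolicy(nn.Module):
    """Maps an RGB observation to a continuous robot action."""

    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened feature size from a dummy forward pass.
        with torch.no_grad():
            feat_dim = self.encoder(torch.zeros(1, 3, 64, 64)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, action_dim)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(obs))


def train_bc(policy: nn.Module, demos, epochs: int = 10, lr: float = 1e-3):
    """Behavioral cloning: regress expert actions from paired observations."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for obs, act in demos:  # obs: (B, 3, 64, 64), act: (B, 7)
            loss = loss_fn(policy(obs), act)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy


if __name__ == "__main__":
    # Synthetic demonstrations stand in for real teleoperated data.
    demos = [(torch.randn(8, 3, 64, 64), torch.randn(8, 7)) for _ in range(4)]
    policy = train_bc(VisuomotorPolicy(), demos)
    print(policy(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 7])
```

In practice, the surveyed imitation-learning methods differ mainly in the observation encoder, the action representation, and how demonstrations are collected; this sketch only shows the shared supervised-regression core.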