ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos (2404.15709v3)
Abstract: In this work, we aim to learn a unified vision-based policy for multi-fingered robot hands to manipulate a variety of objects in diverse poses. Although prior work has shown the benefits of using human videos for policy learning, performance gains have been limited by noise in the estimated trajectories. Moreover, reliance on privileged object information, such as ground-truth object states, further limits applicability in realistic scenarios. To address these limitations, we propose ViViDex, a new framework for improving vision-based policy learning from human videos. ViViDex first uses reinforcement learning with trajectory-guided rewards to train a state-based policy for each video, obtaining trajectories that are both visually natural and physically plausible. We then roll out successful episodes from the state-based policies and train a unified visual policy without using any privileged information. To further enhance the visual point cloud representation, we propose a coordinate transformation, and we compare behavior cloning with a diffusion policy for visual policy training. Experiments both in simulation and on a real robot demonstrate that ViViDex outperforms state-of-the-art approaches on three dexterous manipulation tasks.
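The coordinate transformation mentioned in the abstract amounts to expressing the observed point cloud in a local frame (e.g. centered on the hand) rather than in world coordinates, which makes the representation invariant to where the scene happens to sit in the workspace. A minimal sketch is given below; the function name, the choice of frame, and the toy values are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def to_local_frame(points, frame_pos, frame_rot):
    """Express world-frame points in a local coordinate frame.

    points:    (N, 3) world-frame point-cloud coordinates
    frame_pos: (3,) origin of the local frame (e.g. a hand keypoint)
    frame_rot: (3, 3) rotation matrix of the local frame in world coords
    """
    # p_local = R^T (p - t), applied to each row vector at once
    return (points - frame_pos) @ frame_rot

# Toy example: frame shifted by +1 on x and rotated 90 degrees about z.
theta = np.pi / 2
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])
cloud = np.array([[2.0, 0.0, 0.0]])
local = to_local_frame(cloud, np.array([1.0, 0.0, 0.0]), rot)
```

In this toy case the point one unit ahead of the frame origin along world x maps to (0, -1, 0) in the rotated local frame.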