General Flow as Foundation Affordance for Scalable Robot Learning (2401.11439v2)
Abstract: We address the challenge of acquiring real-world manipulation skills with a scalable framework. We argue that identifying a prediction target capable of leveraging large-scale datasets is crucial for efficient and universal learning, and we therefore propose 3D flow, the future trajectories of 3D points on objects of interest, as an ideal prediction target. To exploit scalable data resources, we turn to human videos and develop, for the first time, a language-conditioned 3D flow prediction model trained directly on large-scale RGBD human video datasets. The predicted flow offers actionable guidance, enabling zero-shot skill transfer in real-world scenarios. We deploy our method with a policy based on closed-loop flow prediction. Remarkably, without any in-domain finetuning, our method achieves an 81% success rate in zero-shot human-to-robot skill transfer, covering 18 tasks in 6 scenes. Our framework offers the following benefits: (1) scalability, leveraging cross-embodiment data resources; (2) wide applicability, covering multiple object categories including rigid, articulated, and soft bodies; and (3) stable skill transfer, providing actionable guidance with a small inference domain gap. Code, data, and supplementary materials are available at https://general-flow.github.io
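To make the "actionable guidance" idea concrete, here is a minimal sketch of how predicted 3D flow can be turned into a rigid end-effector delta in a closed loop. This is not the paper's actual pipeline: `flow_to_action` and `kabsch` are hypothetical helpers, the flow here is a hand-constructed array rather than the output of the learned language-conditioned model, and only the first predicted step is executed before re-predicting, which is the closed-loop pattern the abstract describes.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) aligning src -> dst (Kabsch algorithm)."""
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: force det(R) = +1 so R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t

def flow_to_action(points, flow):
    """Fit a rigid delta from one step of predicted 3D flow.

    points: (N, 3) current 3D points on the object of interest.
    flow:   (N, T, 3) predicted future positions over horizon T.
    Only the first predicted step is used; the loop re-predicts afterwards.
    """
    return kabsch(points, flow[:, 0, :])

# Toy example: a pure +z translation of 10 cm encoded as a one-step flow.
pts = np.random.rand(32, 3)
flow = (pts + np.array([0.0, 0.0, 0.1]))[:, None, :]  # shape (32, 1, 3)
R, t = flow_to_action(pts, flow)  # R ~ identity, t ~ [0, 0, 0.1]
```

In a real deployment the flow tensor would come from the learned predictor at every control step, so tracking errors are corrected continuously rather than executed open-loop.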