Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training (2402.14407v4)
Abstract: Learning a generalist embodied agent capable of completing multiple tasks poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets. In contrast, vast quantities of human videos exist, capturing intricate tasks and interactions with the physical world. Promising prospects arise from utilizing actionless human videos for pre-training and transferring the learned knowledge to robot policy learning through a limited number of robot demonstrations. However, this transfer remains challenging due to the domain gap between humans and robots. Moreover, it is difficult to extract useful information representing the dynamic world from human videos, because of their noisy and multimodal structure. In this paper, we introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion model to combine generative pre-training on human videos with policy fine-tuning on a small number of action-labeled robot videos. We start by compressing both human and robot videos into unified video tokens. In the pre-training stage, we employ a discrete diffusion model with a mask-and-replace diffusion strategy to predict future video tokens in the latent space. In the fine-tuning stage, we harness the imagined future videos to guide low-level action learning with a limited set of robot data. Experiments demonstrate that our method generates high-fidelity future videos for planning, and that the fine-tuned policies outperform previous state-of-the-art approaches. Our project website is available at https://video-diff.github.io/.
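To make the pre-training objective concrete, below is a minimal sketch of a mask-and-replace corruption step for discrete diffusion over VQ video tokens, in the spirit of VQ-Diffusion / D3PM. The codebook size, linear schedule, and names (`corrupt`, `MASK_ID`, etc.) are illustrative assumptions, not the paper's exact implementation; the denoising network would be trained to invert this process and recover the original future-frame tokens.

```python
# Minimal sketch (assumed, not the paper's code) of the mask-and-replace
# forward corruption used in discrete diffusion over video tokens.
import torch

VOCAB_SIZE = 1024          # size of the VQ codebook (assumed)
MASK_ID = VOCAB_SIZE       # extra [MASK] token appended to the vocabulary
NUM_STEPS = 100            # number of diffusion steps (assumed)

def corrupt(tokens: torch.Tensor, t: int) -> torch.Tensor:
    """Apply t steps of mask-and-replace corruption to a batch of token ids.

    Each token is independently masked with probability gamma_t, replaced by
    a uniformly random codebook token with probability beta_t, and kept
    otherwise. A simple linear schedule is assumed for illustration.
    """
    gamma_t = t / NUM_STEPS            # masking probability grows with t
    beta_t = 0.1 * t / NUM_STEPS       # small uniform-replacement probability
    u = torch.rand_like(tokens, dtype=torch.float)
    out = tokens.clone()
    replace = u < beta_t
    out[replace] = torch.randint(0, VOCAB_SIZE, (int(replace.sum()),))
    out[u >= 1.0 - gamma_t] = MASK_ID  # masking takes precedence at high t
    return out

# Example: corrupt a batch of 2 "videos", each flattened to 16 latent tokens.
x0 = torch.randint(0, VOCAB_SIZE, (2, 16))
xt = corrupt(x0, t=80)  # heavily corrupted; the denoiser learns to invert this
```

At inference time, generation starts from fully masked future tokens and iteratively denoises them; the resulting imagined future video then conditions the low-level action head during fine-tuning.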
Authors: Haoran He, Chenjia Bai, Ling Pan, Weinan Zhang, Bin Zhao, Xuelong Li