RoboDreamer: Learning Compositional World Models for Robot Imagination (2404.12377v1)

Published 18 Apr 2024 in cs.RO

Abstract: Text-to-video models have demonstrated substantial potential in robotic decision-making, enabling the imagination of realistic plans of future actions as well as accurate environment simulation. However, one major issue in such models is generalization -- models are limited to synthesizing videos subject to language instructions similar to those seen at training time. This is heavily limiting in decision-making, where we seek a powerful world model to synthesize plans of unseen combinations of objects and actions in order to solve previously unseen tasks in new environments. To resolve this issue, we introduce RoboDreamer, an innovative approach for learning a compositional world model by factorizing the video generation. We leverage the natural compositionality of language to parse instructions into a set of lower-level primitives, which we condition a set of models on to generate videos. We illustrate how this factorization naturally enables compositional generalization, by allowing us to formulate a new natural language instruction as a combination of previously seen components. We further show how such a factorization enables us to add additional multimodal goals, allowing us to specify a video we wish to generate given both natural language instructions and a goal image. Our approach can successfully synthesize video plans on unseen goals in the RT-X, enables successful robot execution in simulation, and substantially outperforms monolithic baseline approaches to video generation.
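The abstract describes factorizing the text condition into lower-level language primitives and composing a set of conditioned generative models at sampling time. Below is a minimal sketch of one way such composition could work, following composable-diffusion-style averaging of conditional noise estimates with classifier-free guidance; it is not the authors' implementation, and the names parse_primitives, denoiser, text_encoder, and combined_noise are hypothetical placeholders.

```python
# Sketch (assumed mechanism, not the RoboDreamer codebase): parse an instruction
# into low-level primitives, condition a denoiser on each primitive separately,
# and compose the resulting noise predictions before the sampler update.
import torch


def parse_primitives(instruction: str) -> list[str]:
    """Hypothetical parser: split an instruction into lower-level phrases,
    e.g. "pick up the red block and place it on the shelf" ->
    ["pick up the red block", "place it on the shelf"]."""
    return [p.strip() for p in instruction.split(" and ")]


def combined_noise(denoiser, x_t, t, primitives, null_embed, text_encoder, w=3.0):
    """Average the per-primitive conditional noise estimates and apply
    classifier-free guidance against an unconditional estimate."""
    eps_uncond = denoiser(x_t, t, null_embed)
    eps_conds = [denoiser(x_t, t, text_encoder(p)) for p in primitives]
    eps_cond = torch.stack(eps_conds).mean(dim=0)    # compose the primitives
    return eps_uncond + w * (eps_cond - eps_uncond)  # guided noise estimate


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    latent = torch.randn(1, 4, 8, 32, 32)            # (batch, channels, frames, H, W)
    text_encoder = lambda s: torch.randn(1, 16)      # dummy text embedding
    null_embed = torch.zeros(1, 16)                  # "no condition" embedding
    denoiser = lambda x, t, c: torch.randn_like(x)   # dummy noise predictor
    prims = parse_primitives("pick up the red block and place it on the shelf")
    eps = combined_noise(denoiser, latent, torch.tensor([10]), prims,
                         null_embed, text_encoder)
    print(eps.shape)
```

Composing per-primitive predictions in this way is what would let previously unseen combinations of objects and actions be rendered by reusing models trained on each primitive in isolation, which is the compositional generalization the abstract argues monolithic text-to-video models lack.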

Authors (6)
  1. Siyuan Zhou (27 papers)
  2. Yilun Du (113 papers)
  3. Jiaben Chen (12 papers)
  4. Yandong Li (38 papers)
  5. Dit-Yan Yeung (78 papers)
  6. Chuang Gan (196 papers)
Citations (8)
