Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement (2304.14391v4)
Abstract: Language is compositional; an instruction can express multiple relation constraints to hold among objects in a scene that a robot is tasked to rearrange. Our focus in this work is an instructable scene-rearranging framework that generalizes to longer instructions and to spatial concept compositions never seen at training time. We propose to represent language-instructed spatial concepts with energy functions over relative object arrangements. A language parser maps instructions to corresponding energy functions, and an open-vocabulary visual-language model grounds their arguments to relevant objects in the scene. We generate goal scene configurations by gradient descent on the sum of energy functions, one per language predicate in the instruction. Local vision-based policies then relocate objects to the inferred goal locations. We test our model on established instruction-guided manipulation benchmarks, as well as benchmarks of compositional instructions we introduce. We show our model can execute highly compositional instructions zero-shot in simulation and in the real world. It outperforms language-to-action reactive policies and LLM planners by a large margin, especially for long instructions that involve compositions of multiple spatial concepts. Simulation and real-world robot execution videos, as well as our code and datasets, are publicly available on our website: https://ebmplanner.github.io.
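The goal-generation step described in the abstract (gradient descent on a sum of energy functions, one per predicate) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the energies `left_of` and `above` are hand-written toy penalties rather than the learned energy functions of the paper, objects are reduced to 2D points, and gradients are approximated by finite differences instead of automatic differentiation.

```python
import numpy as np

def left_of(pos, a, b, margin=0.2):
    # Toy energy (hypothetical): zero once object a is at least
    # `margin` to the left of object b along the x-axis.
    gap = pos[a, 0] - pos[b, 0] + margin
    return max(gap, 0.0) ** 2

def above(pos, a, b, margin=0.2):
    # Toy energy (hypothetical): zero once object a has a y-coordinate
    # at least `margin` larger than object b.
    gap = pos[b, 1] - pos[a, 1] + margin
    return max(gap, 0.0) ** 2

def total_energy(pos, predicates):
    # One energy term per predicate; the terms are simply summed.
    return sum(fn(pos, a, b) for fn, a, b in predicates)

def numerical_grad(f, pos, eps=1e-5):
    # Central finite differences, used here only to keep the sketch
    # dependency-free.
    grad = np.zeros_like(pos)
    for idx in np.ndindex(pos.shape):
        step = np.zeros_like(pos)
        step[idx] = eps
        grad[idx] = (f(pos + step) - f(pos - step)) / (2 * eps)
    return grad

def plan_goal(pos, predicates, lr=0.1, steps=500):
    # Plain gradient descent on the summed energy, starting from the
    # current object positions; the input array is not modified.
    pos = pos.astype(float).copy()
    f = lambda p: total_energy(p, predicates)
    for _ in range(steps):
        pos -= lr * numerical_grad(f, pos)
    return pos

# Three objects as (x, y) points. The instruction is assumed to have
# been parsed into two predicates: "object 0 left of object 1" and
# "object 0 above object 2".
start = np.array([[0.5, 0.0], [0.0, 0.0], [0.2, 0.5]])
predicates = [(left_of, 0, 1), (above, 0, 2)]
goal = plan_goal(start, predicates)
```

Because the terms are summed, adding another predicate to the instruction only appends one more entry to `predicates`; the optimization loop itself does not change. This is one way to read the abstract's claim about generalizing to longer instructions.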