Planning as In-Painting: A Diffusion-Based Embodied Task Planning Framework for Environments under Uncertainty (2312.01097v1)

Published 2 Dec 2023 in cs.CV, cs.LG, and cs.RO

Abstract: Task planning for embodied AI has been one of the most challenging problems where the community does not meet a consensus in terms of formulation. In this paper, we aim to tackle this problem with a unified framework consisting of an end-to-end trainable method and a planning algorithm. Particularly, we propose a task-agnostic method named 'planning as in-painting'. In this method, we use a Denoising Diffusion Model (DDM) for plan generation, conditioned on both language instructions and perceptual inputs under partially observable environments. Partial observation often leads to the model hallucinating the planning. Therefore, our diffusion-based method jointly models both state trajectory and goal estimation to improve the reliability of the generated plan, given the limited available information at each step. To better leverage newly discovered information along the plan execution for a higher success rate, we propose an on-the-fly planning algorithm to collaborate with the diffusion-based planner. The proposed framework achieves promising performances in various embodied AI tasks, including vision-language navigation, object manipulation, and task planning in a photorealistic virtual environment. The code is available at: https://github.com/joeyy5588/planning-as-inpainting.


Summary

  • The paper introduces a 'planning as in-painting' approach that fills in the missing steps of a plan from partial observations, improving planning under uncertainty.
  • It uses a diffusion model to jointly predict the goal state and the intermediate state trajectory, outperforming reinforcement-learning and generative-policy baselines across simulated benchmarks.
  • An on-the-fly planning algorithm continually updates the plan as new information is revealed, balancing exploration and exploitation to significantly boost success rates in partially observable environments.

Diffusion Models for Embodied AI Planning

Introduction to Embodied AI Planning

Embodied AI concerns intelligent agents that perceive, navigate, and manipulate their environment, spanning areas such as robotics and vision-language navigation. Unlike purely virtual agents, embodied agents must cope with the unpredictability and partial observability of real-world settings. Historically, such agents have been trained either by imitating expert demonstrations or through trial-and-error methods such as reinforcement learning (RL). Both approaches have limitations, particularly in flexibility and in handling uncertainty efficiently.

A Novel Approach to Embodied Task Planning

To improve the capabilities of embodied agents in uncertain environments, the paper turns to diffusion models. A diffusion model is a generative model that learns to recover structured data from noise through iterative denoising. This property is well suited to planning under uncertainty: generating a plan can be framed as denoising a sequence of states that leads from an initial state to a goal.
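
As background, the sketch below illustrates the core denoising-diffusion mechanic in the general DDPM style: a closed-form forward process adds Gaussian noise, and a learned noise predictor drives iterative denoising. The toy network, noise schedule, and two-dimensional data are illustrative assumptions, not the paper's architecture.

```python
# Minimal DDPM-style sketch: forward noising in closed form, plus one reverse
# (denoising) step driven by a learned noise predictor. Purely illustrative.
import torch
import torch.nn as nn

T = 100                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products

noise_model = nn.Sequential(               # toy noise predictor eps_theta(x_t, t)
    nn.Linear(2 + 1, 64), nn.ReLU(), nn.Linear(64, 2)
)

def q_sample(x0, t):
    """Forward process: draw x_t ~ q(x_t | x_0) in closed form."""
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps, eps

def p_sample_step(x_t, t):
    """One reverse step: subtract the predicted noise, then add fresh noise."""
    t_feat = torch.full((x_t.shape[0], 1), float(t) / T)
    eps_hat = noise_model(torch.cat([x_t, t_feat], dim=-1))
    coef = betas[t] / (1 - alpha_bars[t]).sqrt()
    mean = (x_t - coef * eps_hat) / alphas[t].sqrt()
    if t == 0:
        return mean
    return mean + betas[t].sqrt() * torch.randn_like(x_t)

# Training pairs (x_t, eps) come from q_sample and the loss is ||eps - eps_hat||^2;
# sampling runs p_sample_step from t = T-1 down to 0, starting from pure noise.
```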

The paper's central contribution is a method called "planning as in-painting," which applies diffusion models to embodied task planning. Just as image in-painting fills in missing pixels, the planner fills in the missing steps of an action plan when only partial information about the environment is available. Conditioned on perceptual inputs and language instructions, the model jointly predicts the state trajectory and an estimate of the goal state. This joint prediction improves the reliability of the plan: the agent is not merely reacting to its current observation but is also guided by an explicit estimate of where it ultimately needs to go.
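
The following is a minimal sketch of what inpainting-style conditioning can look like for plan generation, under the assumption that the plan is a fixed-length tensor whose observed entries (e.g., the current state) are clamped after every denoising step. The function names, tensor layout, and `denoise_step` interface are hypothetical and not taken from the authors' code.

```python
# Schematic sketch (not the authors' implementation) of inpainting-style
# conditioning: known entries of the plan tensor are re-imposed after every
# denoising step, while the model fills in the rest, including the goal estimate.
import torch

def plan_as_inpainting(denoise_step, known_plan, known_mask,
                       instruction_emb, obs_emb, T=100):
    """
    known_plan: (H, D) tensor with observed entries filled in, zeros elsewhere.
    known_mask: (H, D) binary tensor, 1 where an entry is observed.
    Returns a completed (H, D) plan: state trajectory plus goal estimate.
    """
    x = torch.randn_like(known_plan)              # start from pure noise
    for t in reversed(range(T)):
        # One reverse diffusion step conditioned on language and perception
        # (denoise_step is assumed to wrap a trained conditional model).
        x = denoise_step(x, t, instruction_emb, obs_emb)
        # "In-paint": overwrite observed entries with their known values so the
        # generated plan always agrees with what the agent has actually seen.
        x = known_mask * known_plan + (1 - known_mask) * x
    return x
```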

Complex and Realistic Task Experiments

The effectiveness of "planning as in-painting" was evaluated across a range of tasks, from navigation in a grid world to manipulation with a robotic arm and, finally, task planning in a photorealistic virtual environment. In these simulated benchmarks, the model consistently outperformed RL-based baselines and other generative policy methods, demonstrating its versatility across different planning challenges.

On-the-Fly Planning Algorithm

To exploit information revealed during plan execution, the framework introduces an on-the-fly planning algorithm. The agent continually re-plans as new observations arrive, balancing exploration of unknown parts of the environment against exploitation of the current plan. Empirically, this strategy substantially improves success rates in environments where the agent does not have full observability.
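
A hedged sketch of such a replanning loop is shown below: the agent re-generates its plan whenever a new observation arrives and switches between information-gathering and plan-following depending on how confident the goal estimate is. The environment and planner interfaces (`env.step`, `env.exploratory_action`, a planner that returns a confidence score) are simplified assumptions for illustration only.

```python
def run_episode(env, planner, instruction, max_steps=50, conf_threshold=0.5):
    """Replan after every step using the latest observation (interfaces assumed)."""
    obs = env.reset()
    done = False
    for _ in range(max_steps):
        # Re-generate the whole plan conditioned on the newest partial observation.
        plan, goal_confidence = planner(instruction, obs)
        if goal_confidence < conf_threshold:
            # Exploration: the goal estimate is still uncertain, so take an
            # information-gathering action (hypothetical environment helper).
            action = env.exploratory_action()
        else:
            # Exploitation: follow the first step of the current plan.
            action = plan[0]
        obs, done = env.step(action)   # assumed to return (observation, done)
        if done:
            break
    return done
```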

Looking Forward: Limitations and Potential

While the framework shows promise, limitations remain. Performance is sensitive to the complexity and phrasing of natural-language instructions, pointing to a need for more robust language understanding. The current implementation also plans in two-dimensional spaces; extending it to three-dimensional planning could unlock further capabilities. Finally, on-the-fly planning is computationally intensive and would benefit from optimization before real-world deployment.

In summary, "planning as in-painting" marks a step forward in embodied task planning and opens new avenues for research. By leveraging the flexibility of diffusion models, the framework shows strong potential for building agents that can navigate and reason in complex, partially observable, and dynamic environments.