- The paper introduces HiP, a compositional framework integrating language, vision, and action models for hierarchical planning.
- It employs a three-tier methodology that decomposes goals into subgoals, generates visual trajectories, and infers actions, with iterative refinement keeping the three levels consistent.
- Experimental results demonstrate HiP’s superior performance and adaptability on long-horizon manipulation tasks in novel environments.
Compositional Foundation Models for Hierarchical Planning
This paper introduces Compositional Foundation Models for Hierarchical Planning (HiP), an approach that leverages foundation models to tackle hierarchical reasoning in decision-making tasks, particularly in novel environments with long-horizon goals. HiP composes separately pre-trained expert foundation models across language, vision, and action modalities to solve tasks that require hierarchical planning, without needing extensive paired training data spanning the three modalities.
Research Background and Motivation
To make decisions in unfamiliar settings, agents must integrate reasoning at multiple levels: abstract task planning, visual reasoning, and visuomotor control. Large-scale models in natural language processing and vision require correspondingly large datasets, and paired datasets spanning language, vision, and action are especially expensive and impractical to procure. The central question is therefore how foundation models trained separately on language, vision, and action data can be composed to jointly solve long-horizon tasks efficiently.
Methodology
HiP adopts a three-tier hierarchical framework (a minimal sketch follows this list):
- Task Planning: A large language model (LLM) decomposes the given high-level goal into a sequence of subgoals.
- Visual Planning: A video diffusion model generates a plausible trajectory of observations toward each subgoal, conditioned on the subgoal and the current observation.
- Action Planning: A pre-trained inverse dynamics model infers the actions that realize consecutive frames of the generated trajectory.
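Below is a minimal sketch of how the three levels compose. The callables `propose_subgoals`, `sample_video`, and `infer_action` are hypothetical stand-ins for the paper's LLM, video diffusion model, and inverse dynamics model; their names and signatures are assumptions for illustration, not the paper's actual API.

```python
"""Minimal sketch of HiP's three-level hierarchy (illustrative, not the
paper's implementation). Each callable is a hypothetical stand-in."""
from typing import Callable, List, Sequence

import numpy as np

Frame = np.ndarray   # a single image observation
Action = np.ndarray  # a low-level motor command


def hip_plan(
    goal: str,
    obs: Frame,
    propose_subgoals: Callable[[str, Frame], List[str]],    # LLM task planner
    sample_video: Callable[[str, Frame], Sequence[Frame]],  # video diffusion model
    infer_action: Callable[[Frame, Frame], Action],         # inverse dynamics model
) -> List[Action]:
    """Decompose the goal, imagine each subgoal visually, and read off actions."""
    actions: List[Action] = []
    for subgoal in propose_subgoals(goal, obs):
        video = sample_video(subgoal, obs)
        # The inverse dynamics model maps each pair of consecutive imagined
        # frames to the action that transitions between them.
        for prev, nxt in zip(video[:-1], video[1:]):
            actions.append(infer_action(prev, nxt))
        # Continue planning from the last imagined frame (open-loop here;
        # a real agent would act and re-observe between subgoals).
        obs = video[-1]
    return actions
```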
To keep the three levels consistent with one another, HiP applies an iterative refinement procedure: feedback from each downstream model is used, via likelihood optimization, to steer what the upstream model proposes, so that the decisions of the LLM, the visual model, and the action model remain mutually coherent. One plausible realization is sketched below.
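Assuming nothing beyond the description above, such feedback can be realized as sample-and-rank: draw several candidates from the upstream model and keep the one to which a downstream scoring function assigns the highest likelihood. Both callables below are hypothetical placeholders.

```python
# Sample-and-rank refinement: a hedged sketch of likelihood-based feedback
# between adjacent levels of the hierarchy. Both callables are hypothetical.
from typing import Callable, List, TypeVar

T = TypeVar("T")


def refine(
    sample_candidate: Callable[[], T],        # upstream sampler, e.g. an LLM subgoal proposal
    downstream_loglik: Callable[[T], float],  # log-likelihood under the next model down
    num_candidates: int = 8,
) -> T:
    """Return the upstream proposal most consistent with the downstream model."""
    candidates: List[T] = [sample_candidate() for _ in range(num_candidates)]
    return max(candidates, key=downstream_loglik)
```

Applied at each interface (LLM to video model, video model to action model), the same ranking lets downstream feedback shape upstream proposals without jointly training any of the models.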
Experimental Evaluation
HiP is evaluated on three long-horizon manipulation domains: paint-block, object-arrange, and kitchen-tasks. Across all three, HiP consistently outperformed several baseline methods, indicating robustness and adaptability to unseen scenarios. It performed notably well on novel task compositions, underscoring the value of hierarchical planning learned without task-specific paired data.
Implications and Future Directions
The HiP framework points toward adaptive, scalable decision-making systems built by composing independently trained foundation models, and its iterative feedback mechanism offers a practical way to maintain consistency in multi-modal model compositions. Future work could scale up the constituent models, incorporate newer advances in diffusion models, or apply greater computational resources to make such systems more capable and efficient.
More speculatively, future extensions could incorporate other sensory modalities such as audio or haptics, further broadening the applicability of such systems in complex, real-world scenarios.
Conclusion
HiP makes a significant contribution to multilevel planning and decision-making, showcasing the potential of compositional architectures in AI. As foundation models continue to develop, HiP serves as a paradigm for using these modular components for efficient hierarchical planning while sidestepping the need for extensive paired training datasets.