- The paper introduces the TIP framework that combines LLMs and diffusion models to generate coherent, action-guiding multimodal plans.
- It employs bidirectional bridges to align textual and visual outputs, enhancing temporal coherence and informativeness.
- Empirical results on the WikiPlan and RecipePlan datasets show TIP winning over 60% of human-preference comparisons against baselines, with automatic metrics corroborating these gains.
Multimodal Procedural Planning via Dual Text-Image Prompting: An Expert Overview
The paper "Multimodal Procedural Planning via Dual Text-Image Prompting" investigates how multimodal models can generate coherent, actionable text-image plans for procedural tasks, a step toward multimodal AI systems that help humans execute tasks. The researchers introduce and formalize the Multimodal Procedural Planning (MPP) task: given a high-level goal, generate a paired sequence of textual steps and images that is more informative and accurate for guiding task completion than unimodal plans.
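To make the task's input-output contract concrete, here is a minimal sketch of how an MPP instance could be represented; the class names and fields are illustrative assumptions, not an API defined in the paper.

```python
from dataclasses import dataclass
from typing import List

# Illustrative types only; the paper defines MPP abstractly, not via this API.
@dataclass
class PlanStep:
    text: str        # natural-language instruction for one step
    image_path: str  # path to the generated image grounding that step

@dataclass
class MultimodalPlan:
    goal: str              # high-level task prompt, e.g. "How to make a paper boat"
    steps: List[PlanStep]  # ordered text-image pairs forming the procedural plan
```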
Central to the paper is the proposed Text-Image Prompting (TIP) framework, a dual-modality prompting approach. TIP integrates LLMs with diffusion-based text-to-image models to address three core challenges identified in the paper: informativeness, temporal coherence, and plan accuracy. By combining the zero-shot reasoning and language-comprehension abilities of LLMs with the image-generation capabilities of diffusion models, TIP aims to produce well-aligned multimodal plans.
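At a high level, the flow is: prompt an LLM for a step-by-step textual plan, then generate one image per step with a text-to-image model. The snippet below is a simplified sketch of that flow, assuming placeholder `llm` and `t2i` callables (prompt in, completion or image out); it omits TIP's bridging components, which are described further below.

```python
from typing import List, Tuple

def generate_text_plan(llm, goal: str, num_steps: int = 5) -> List[str]:
    """Zero-shot prompt an LLM for a step-by-step textual plan.

    `llm` is a placeholder callable (prompt -> completion text); any
    instruction-following LLM client could stand in here.
    """
    prompt = (
        f"Task: {goal}\n"
        f"Write a numbered plan with {num_steps} concise steps."
    )
    completion = llm(prompt)
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    # Keep only numbered lines ("1. Do X") and strip the numbering.
    return [line.split(".", 1)[1].strip()
            for line in lines
            if line[0].isdigit() and "." in line]

def generate_plan(llm, t2i, goal: str) -> List[Tuple[str, object]]:
    """Sketch of the overall flow: text plan first, then one image per step.

    `t2i` stands in for a diffusion text-to-image model (prompt -> image).
    """
    steps = generate_text_plan(llm, goal)
    return [(step, t2i(step)) for step in steps]
```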
The authors also curate two datasets, WikiPlan and RecipePlan, as benchmarks for the MPP task. They cover a diverse range of tasks, from cooking to general how-to instructions, each paired with corresponding multimodal references. Empirical evaluations show that TIP markedly improves on both human-preference and automatic evaluation scores over existing baselines, including Text-Davinci paired with Stable Diffusion as well as other state-of-the-art unimodal and multimodal baselines.
In human evaluations, TIP wins more than 60% of comparisons against baselines on textual and visual informativeness, temporal coherence, and overall plan accuracy. Automatic metrics align with these findings, indicating that TIP is robust across the tasks in both testbeds and appreciably outperforms even baselines that use task references.
A distinct feature of TIP is its two bridging components. The Text-to-Image Bridge (T2I-B) enables text-grounded image generation by refining the textual inputs given to the T2I model so they stay aligned with the task semantics. Conversely, the Image-to-Text Bridge (I2T-B) uses captions of the generated images to revise the text plan so it better reflects the visual context, giving the multimodal plan bidirectional contextual grounding.
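A rough sketch of how the two bridges could look in code is given below. The prompt templates and function names are assumptions for illustration, not the paper's exact prompts; `llm` is again a placeholder text-completion callable, and the image caption is assumed to come from an external captioning model.

```python
def text_to_image_bridge(llm, goal: str, step: str) -> str:
    """T2I-B sketch: ask the LLM to turn a plan step into a visually grounded
    prompt for the diffusion model, keeping it tied to the overall task."""
    prompt = (
        f"Task: {goal}\nStep: {step}\n"
        "Rewrite this step as a short, concrete scene description "
        "suitable for a text-to-image model."
    )
    return llm(prompt).strip()

def image_to_text_bridge(llm, goal: str, step: str, caption: str) -> str:
    """I2T-B sketch: revise the step text so it stays consistent with what the
    generated image actually shows, using the image's caption as a proxy."""
    prompt = (
        f"Task: {goal}\nOriginal step: {step}\n"
        f"Image caption: {caption}\n"
        "Revise the step so it matches the image while remaining actionable."
    )
    return llm(prompt).strip()
```

In this sketch, each bridge is just an extra LLM call that conditions one modality on the other; the key design choice is that grounding flows in both directions rather than treating image generation as a one-way afterthought.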
The implications of this research are twofold. Practically, TIP shows how AI systems can assist with instruction-based tasks, enriching human-computer interaction with more informative guidance. Theoretically, it opens up exploration of LLM capabilities in multimodal contexts and encourages further work on combining language and vision models for procedural reasoning and planning.
The research leaves room for future work, particularly in fine-tuning and adapting LLMs and T2I models so that their capabilities integrate more seamlessly within existing and emerging architectures.
Current metrics also fall short of fully capturing the efficacy of such multimodal planning methods, underscoring the need for better evaluation methodologies for multimodal models. As AI continues to advance, this work lays a structured foundation for further exploration of the collaborative potential of language and vision models in empowering AI agents across diverse, real-world tasks.