VidAssist: Leveraging LLMs for Goal-Oriented Planning in Instructional Videos
The paper "Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos" presents VidAssist, a framework that addresses zero- and few-shot goal-oriented planning in instructional videos. Its core contribution is coupling LLMs with a search-based algorithm to generate and optimize action plans.
Problem Definition
Goal-oriented planning in instructional videos involves forecasting a sequence of actions to transition from a current state to a predefined goal based on visual observations. The task is instantiated in two setups: Visual Planning for Assistance (VPA) and Procedural Planning (PP). In VPA, the system must predict future action steps given an untrimmed video history and a goal described in natural language. In PP, the system must generate the intermediate action steps between an initial-state image and a goal-state image.
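The two setups share one contract (predict a fixed-length sequence of action steps) but differ in their inputs. A minimal sketch of that distinction, using illustrative data structures of our own (not the paper's actual interfaces):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VPATask:
    """Visual Planning for Assistance: untrimmed video history + text goal."""
    video_history: List[str]  # e.g., descriptions of observed video segments
    goal: str                 # natural-language goal, e.g., "make a latte"
    horizon: int              # number of future action steps to predict

@dataclass
class PPTask:
    """Procedural Planning: initial-state image and goal-state image."""
    start_image: str          # placeholder reference to the initial-state image
    goal_image: str           # placeholder reference to the goal-state image
    horizon: int              # number of intermediate action steps to generate
```

In both cases the planner's output is a list of `horizon` action steps; only the conditioning information changes.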
Methodological Framework
VidAssist employs a three-step process: Propose, Assess, and Search.
Propose:
At each step, VidAssist leverages LLMs to generate multiple possible subsequent actions based on the current observation and goal. These actions are sampled to account for the uncertainty inherent in procedural tasks.
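The propose step can be sketched as repeated sampling from the LLM at nonzero temperature, so that several plausible next actions surface. The `llm_sample` callable and the prompt format below are illustrative assumptions, not the paper's exact prompting scheme:

```python
from collections import Counter
from typing import Callable, List

def propose_actions(
    llm_sample: Callable[[str], str],  # hypothetical: prompt -> one sampled action
    goal: str,
    history: List[str],
    num_samples: int = 10,
) -> List[str]:
    """Sample several candidate next actions and return them by frequency.

    Repeated sampling captures the uncertainty of procedural tasks:
    multiple different next steps may be plausible at any point.
    """
    prompt = (
        f"Goal: {goal}\n"
        f"Steps so far: {', '.join(history) if history else '(none)'}\n"
        f"Next step:"
    )
    samples = [llm_sample(prompt).strip().lower() for _ in range(num_samples)]
    # Deduplicate; more frequently sampled proposals come first.
    return [action for action, _ in Counter(samples).most_common()]
```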
Assess:
The proposed actions are evaluated using a composite of value functions:
- Text Generation Score: Evaluates the likelihood of an action description generated by the LLM.
- Text Mapping Score: Measures the confidence of mapping a free-form LLM output to an admissible action.
- Partial Plan Evaluation: Uses LLMs to assess the coherence and viability of the predicted action steps toward achieving the goal.
- Few-shot Task Graph: Utilizes transition probabilities derived from few-shot examples to guide the action selection.
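The four signals above can be combined into a single score per candidate action. A minimal sketch, assuming all component scores are normalized to [0, 1] and using equal weights purely for illustration (the paper's actual combination may be weighted differently):

```python
from dataclasses import dataclass

@dataclass
class ValueScores:
    """Component scores for one candidate action (assumed in [0, 1])."""
    generation: float  # likelihood of the action description under the LLM
    mapping: float     # confidence of mapping free-form text to an admissible action
    plan_eval: float   # LLM assessment of the partial plan's coherence
    task_graph: float  # transition probability from the few-shot task graph

def composite_value(s: ValueScores,
                    weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted combination of the four value functions (higher is better)."""
    components = (s.generation, s.mapping, s.plan_eval, s.task_graph)
    return sum(w * c for w, c in zip(weights, components))
```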
Search:
A breadth-first search (BFS) algorithm is employed to identify the optimal action plan based on the assessed scores. Low-scoring actions are pruned dynamically for efficiency, ensuring that the search space remains manageable.
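The search procedure above amounts to level-by-level expansion with pruning, which behaves like a beam search. A simplified sketch, with `propose` and `value` as abstract callables standing in for the propose and assess steps:

```python
from typing import Callable, List, Tuple

def search_plan(
    propose: Callable[[List[str]], List[str]],  # partial plan -> candidate next actions
    value: Callable[[List[str]], float],        # partial plan -> score (higher is better)
    horizon: int,
    beam_width: int = 3,
) -> List[str]:
    """BFS-style search over partial plans with dynamic pruning.

    At each depth, every surviving partial plan is expanded with its
    proposed next actions; low-scoring expansions are pruned so that only
    the top `beam_width` plans advance. Returns the best full-horizon plan.
    """
    frontier: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(horizon):
        expansions = [
            (value(plan + [action]), plan + [action])
            for _, plan in frontier
            for action in propose(plan)
        ]
        # Prune low-scoring partial plans to keep the search space manageable.
        frontier = sorted(expansions, key=lambda x: x[0], reverse=True)[:beam_width]
    return max(frontier, key=lambda x: x[0])[1]
```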
Experimental Results
The framework was evaluated on COIN and CrossTask datasets, with tasks ranging in action prediction horizons from 1 to 4 steps. Noteworthy results include:
- In the zero-shot setup, VidAssist surpassed the LLM baseline by 12.9% and 6.6% success rate (SR) on COIN and CrossTask datasets, respectively, for a planning horizon of 3 future steps.
- The few-shot model extended this lead, outperforming supervised state-of-the-art methods by up to 7.7% SR on COIN for a planning horizon of 4 steps.
These results underscore the efficacy of VidAssist's deliberate planning mechanism, particularly its ability to generalize from limited annotated data, which is a significant improvement over standard LLM-based techniques.
Implications and Future Work
VidAssist's approach of integrating LLMs as both knowledge bases and assessment tools offers promising implications for the development of intelligent planning systems, particularly in contexts where annotated data is scarce or expensive to obtain. The interdisciplinary nature of this framework, combining advancements in LLMs with deliberate search methods, highlights its potential applicability beyond instructional videos to other domains requiring complex procedural planning.
Future developments may aim to enhance the visual understanding components of VidAssist, given that failures often arise from incorrect visual input processing. Additionally, leveraging more powerful LLMs and refining search algorithms and value functions could further improve the robustness and performance of the system. Extending the application domain to real-world embodied AI systems, such as personal assistants and robots, will be a natural progression for this research.
Conclusion
VidAssist demonstrates a significant step forward in harnessing the potential of LLMs for goal-oriented planning in instructional videos. Through the innovative propose-assess-search methodology, it addresses key challenges in zero- and few-shot learning scenarios, setting a new standard for future research in intelligent procedural planning.