VidAssist: Leveraging LLMs for Goal-Oriented Planning in Instructional Videos
The paper "Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos" presents VidAssist, a framework that addresses zero- and few-shot goal-oriented planning in instructional videos. Its core contribution is coupling LLMs with a search-based algorithm to generate and optimize action plans.
Problem Definition
Goal-oriented planning in instructional videos involves forecasting a sequence of actions to transition from a current state to a predefined goal based on visual observations. The task is instantiated in two setups: Visual Planning for Assistance (VPA) and Procedural Planning (PP). In VPA, the system must predict future action steps given an untrimmed video history and a goal described in natural language. In PP, the system must generate the intermediate action steps between an initial-state image and a goal-state image.
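The two setups share one contract (predict a fixed-length sequence of action steps) but differ in their inputs. A minimal sketch of that distinction, using illustrative data structures of our own (not the paper's actual interfaces):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VPATask:
    """Visual Planning for Assistance: untrimmed video history + text goal."""
    video_history: List[str]  # e.g., descriptions of observed video segments
    goal: str                 # natural-language goal, e.g., "make a latte"
    horizon: int              # number of future action steps to predict

@dataclass
class PPTask:
    """Procedural Planning: initial-state image and goal-state image."""
    start_image: str          # placeholder reference to the initial-state image
    goal_image: str           # placeholder reference to the goal-state image
    horizon: int              # number of intermediate action steps to generate
```

In both cases the planner's output is a list of `horizon` action steps; only the conditioning information changes.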
Methodological Framework
VidAssist employs a three-step process: Propose, Assess, and Search.
Propose:
At each step, VidAssist leverages LLMs to generate multiple possible subsequent actions based on the current observation and goal. These actions are sampled to account for the uncertainty inherent in procedural tasks.
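The propose step can be sketched as repeated sampling from the LLM at nonzero temperature, so that several plausible next actions surface. The `llm_sample` callable and the prompt format below are illustrative assumptions, not the paper's exact prompting scheme:

```python
from collections import Counter
from typing import Callable, List

def propose_actions(
    llm_sample: Callable[[str], str],  # hypothetical: prompt -> one sampled action
    goal: str,
    history: List[str],
    num_samples: int = 10,
) -> List[str]:
    """Sample several candidate next actions and return them by frequency.

    Repeated sampling captures the uncertainty of procedural tasks:
    multiple different next steps may be plausible at any point.
    """
    prompt = (
        f"Goal: {goal}\n"
        f"Steps so far: {', '.join(history) if history else '(none)'}\n"
        f"Next step:"
    )
    samples = [llm_sample(prompt).strip().lower() for _ in range(num_samples)]
    # Deduplicate; more frequently sampled proposals come first.
    return [action for action, _ in Counter(samples).most_common()]
```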
Assess:
The proposed actions are evaluated using a composite of value functions:
- Text Generation Score: Evaluates the likelihood of an action description generated by the LLM.
- Text Mapping Score: Measures the confidence of mapping a free-form LLM output to an admissible action.
- Partial Plan Evaluation: Uses LLMs to assess the coherence and viability of the predicted action steps toward achieving the goal.
- Few-shot Task Graph: Utilizes transition probabilities derived from few-shot examples to guide the action selection.
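The four signals above can be combined into a single score per candidate action. A minimal sketch, assuming all component scores are normalized to [0, 1] and using equal weights purely for illustration (the paper's actual combination may be weighted differently):

```python
from dataclasses import dataclass

@dataclass
class ValueScores:
    """Component scores for one candidate action (assumed in [0, 1])."""
    generation: float  # likelihood of the action description under the LLM
    mapping: float     # confidence of mapping free-form text to an admissible action
    plan_eval: float   # LLM assessment of the partial plan's coherence
    task_graph: float  # transition probability from the few-shot task graph

def composite_value(s: ValueScores,
                    weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted combination of the four value functions (higher is better)."""
    components = (s.generation, s.mapping, s.plan_eval, s.task_graph)
    return sum(w * c for w, c in zip(weights, components))
```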
Search:
A breadth-first search (BFS) algorithm is employed to identify the optimal action plan based on the assessed scores. Low-scoring actions are pruned dynamically for efficiency, ensuring that the search space remains manageable.
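The search procedure above amounts to level-by-level expansion with pruning, which behaves like a beam search. A simplified sketch, with `propose` and `value` as abstract callables standing in for the propose and assess steps:

```python
from typing import Callable, List, Tuple

def search_plan(
    propose: Callable[[List[str]], List[str]],  # partial plan -> candidate next actions
    value: Callable[[List[str]], float],        # partial plan -> score (higher is better)
    horizon: int,
    beam_width: int = 3,
) -> List[str]:
    """BFS-style search over partial plans with dynamic pruning.

    At each depth, every surviving partial plan is expanded with its
    proposed next actions; low-scoring expansions are pruned so that only
    the top `beam_width` plans advance. Returns the best full-horizon plan.
    """
    frontier: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(horizon):
        expansions = [
            (value(plan + [action]), plan + [action])
            for _, plan in frontier
            for action in propose(plan)
        ]
        # Prune low-scoring partial plans to keep the search space manageable.
        frontier = sorted(expansions, key=lambda x: x[0], reverse=True)[:beam_width]
    return max(frontier, key=lambda x: x[0])[1]
```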
Experimental Results
The framework was evaluated on COIN and CrossTask datasets, with tasks ranging in action prediction horizons from 1 to 4 steps. Noteworthy results include:
- In the zero-shot setup, VidAssist surpassed the LLM baseline by 12.9% and 6.6% success rate (SR) on COIN and CrossTask datasets, respectively, for a planning horizon of 3 future steps.
- The few-shot model extended this lead, outperforming supervised state-of-the-art methods by up to 7.7% SR on COIN for a planning horizon of 4 steps.
These results underscore the efficacy of VidAssist's deliberate planning mechanism, particularly its ability to generalize from limited annotated data, which is a significant improvement over standard LLM-based techniques.
Implications and Future Work
VidAssist's approach of integrating LLMs as both knowledge bases and assessment tools offers promising implications for the development of intelligent planning systems, particularly in contexts where annotated data is scarce or expensive to obtain. The interdisciplinary nature of this framework, combining advancements in LLMs with deliberate search methods, highlights its potential applicability beyond instructional videos to other domains requiring complex procedural planning.
Future developments may aim to enhance the visual understanding components of VidAssist, given that failures often arise from incorrect visual input processing. Additionally, leveraging more powerful LLMs and refining search algorithms and value functions could further improve the robustness and performance of the system. Extending the application domain to real-world embodied AI systems, such as personal assistants and robots, will be a natural progression for this research.
Conclusion
VidAssist demonstrates a significant step forward in harnessing the potential of LLMs for goal-oriented planning in instructional videos. Through the innovative propose-assess-search methodology, it addresses key challenges in zero- and few-shot learning scenarios, setting a new standard for future research in intelligent procedural planning.