Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos (2403.02782v2)

Published 5 Mar 2024 in cs.CV

Abstract: In this paper, we explore the capability of an agent to construct a logical sequence of action steps, thereby assembling a strategic procedural plan. This plan is crucial for navigating from an initial visual observation to a target visual outcome, as depicted in real-life instructional videos. Existing works have attained partial success by extensively leveraging various sources of information available in the datasets, such as heavy intermediate visual observations, procedural names, or natural language step-by-step instructions, for features or supervision signals. However, the task remains formidable due to the implicit causal constraints in the sequencing of steps and the variability inherent in multiple feasible plans. To tackle these intricacies that previous efforts have overlooked, we propose to enhance the capabilities of the agent by infusing it with procedural knowledge. This knowledge, sourced from training procedure plans and structured as a directed weighted graph, equips the agent to better navigate the complexities of step sequencing and its potential variations. We coin our approach KEPP, a novel Knowledge-Enhanced Procedure Planning system, which harnesses a probabilistic procedural knowledge graph extracted from training data, effectively acting as a comprehensive textbook for the training domain. Experimental evaluations across three widely-used datasets under settings of varying complexity reveal that KEPP attains superior, state-of-the-art results while requiring only minimal supervision.

Authors (6)
  1. Kumaranage Ravindu Yasas Nagasinghe (2 papers)
  2. Honglu Zhou (21 papers)
  3. Malitha Gunawardhana (11 papers)
  4. Martin Renqiang Min (44 papers)
  5. Daniel Harari (13 papers)
  6. Muhammad Haris Khan (68 papers)
Citations (2)

Summary

Knowledge-Enhanced Procedure Planning of Instructional Videos

The paper "Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos" presents an innovative approach to the task of procedure planning in instructional videos. This task involves generating a sequence of steps that transition an initial visual state to a desired goal, echoing the procedural narratives commonly found in instructional content.

Problem Context and Challenges

Instructional videos, available extensively across the internet, serve as educational tools for learning new skills. However, leveraging such videos to teach autonomous agents remains a significant challenge due to the inherent complexities of task decomposition, the variability of feasible step sequences, and the implicit causal constraints among procedural steps. Existing methods have achieved partial success by using intermediate visual observations and procedural task names as features or as supervision signals. Yet they rarely account for the multiple viable plans that can satisfy the same instructional context, which complicates building a robust procedural model.

Proposed Approach: Knowledge-Enhanced Procedure Planning (KEPP)

To address these challenges, this paper introduces a novel framework termed Knowledge-Enhanced Procedure Planning (KEPP). The core idea is to arm the planning model with a rich procedural knowledge repository, formulated as a Probabilistic Procedural Knowledge Graph (P2KG) based on training data. This graph serves as a comprehensive 'textbook' that guides the agent in synthesizing coherent procedural plans.
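
A minimal sketch of how such a probabilistic procedural knowledge graph could be assembled from training plans, using only step-transition counts, is shown below; the plain-dictionary representation and the toy step labels are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def build_p2kg(training_plans):
    """Build a directed, weighted graph where graph[a][b] is the empirical
    probability that step b follows step a in the training procedure plans."""
    transition_counts = defaultdict(lambda: defaultdict(int))
    for plan in training_plans:
        for prev_step, next_step in zip(plan, plan[1:]):
            transition_counts[prev_step][next_step] += 1

    graph = {}
    for prev_step, nexts in transition_counts.items():
        total = sum(nexts.values())
        graph[prev_step] = {nxt: cnt / total for nxt, cnt in nexts.items()}
    return graph

# Toy usage with hypothetical step labels:
plans = [
    ["crack eggs", "whisk eggs", "pour batter"],
    ["crack eggs", "add milk", "whisk eggs", "pour batter"],
]
p2kg = build_p2kg(plans)
print(p2kg["crack eggs"])  # {'whisk eggs': 0.5, 'add milk': 0.5}
```

At planning time, a graph of this kind can be queried for likely step paths between a predicted first and last step, which is what supplies the 'textbook' knowledge to the planner.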

Methodology and Implementation

The proposed KEPP system operates in a multi-step framework:

  1. Problem Decomposition: KEPP breaks down the procedure planning task into subcomponents: predicting initial and final steps from the visual states and subsequently generating intermediate steps based on procedural knowledge from P2KG.
  2. Probabilistic Procedural Knowledge Graph (P2KG): This graph encodes procedural steps and their transition probabilities across tasks, facilitating navigation through complex step sequences. It captures the variation and probability distributions of multiple feasible step paths.
  3. Conditioned Projected Diffusion Model: Both the step-prediction model and the procedure-planning model employ a projected diffusion model in which the conditioning information is projected back onto the sample after every denoising step, so the guidance remains constant throughout denoising. This lets the models predict the action sequence by leveraging both visual cues and the knowledge retrieved from the P2KG (see the sketch after this list).
  4. Evaluation: KEPP's efficacy is validated through experiments on the CrossTask, COIN, and NIV datasets under settings of varying complexity, where it achieves state-of-the-art results while requiring only minimal supervision.
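
The conditioning mechanism in step 3 can be sketched roughly as follows, in the spirit of projected diffusion planners: a knowledge path is retrieved from the P2KG, stacked with the visual observations as fixed condition rows, and only the action-sequence rows are denoised. The greedy path retrieval, tensor layout, and model signature are illustrative assumptions, not KEPP's actual implementation.

```python
import torch

def retrieve_knowledge_path(p2kg, first_step, horizon):
    """Illustrative heuristic: greedily follow the most probable outgoing edge
    in the P2KG to propose a plausible step path of length `horizon`."""
    path, step = [first_step], first_step
    for _ in range(horizon - 1):
        nexts = p2kg.get(step)
        if not nexts:
            break
        step = max(nexts, key=nexts.get)
        path.append(step)
    return path

def sample_plan(model, start_obs, goal_obs, knowledge_emb, horizon, num_steps=50):
    """Conditioned *projected* diffusion sampling sketch: the condition rows
    (visual observations plus a P2KG-derived knowledge embedding) are kept
    fixed at every step, and only the action rows are denoised."""
    dim = start_obs.shape[-1]  # assumes all condition embeddings share one dimension
    condition = torch.stack([start_obs, goal_obs, knowledge_emb])  # fixed guidance rows
    actions = torch.randn(horizon, dim)                            # noisy action rows

    for t in reversed(range(num_steps)):
        x = torch.cat([condition, actions], dim=0)   # re-impose ("project") the conditions
        actions = model(x, t)[condition.shape[0]:]   # keep only the refined action rows
    return actions  # decoded into discrete action-step labels downstream
```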

Insights and Future Directions

KEPP has significant practical implications for AI applications in robotics and for automated systems that interpret instructional content. By grounding planning in a probabilistic, graph-based representation of procedural knowledge, KEPP improves accuracy and reduces errors in sequential planning tasks, which is essential for real-world applications such as robotic assistance.

Theoretically, the framework points to the potential of incorporating multimodal learning inputs and demonstrates the viability of structured knowledge graphs for modeling and solving highly complex planning tasks. Future work could extend the framework by integrating large language models (LLMs) for richer contextual processing, further improving the accuracy and efficiency of procedural understanding in AI systems.
