Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos (2403.02782v2)
Abstract: In this paper, we explore the capability of an agent to construct a logical sequence of action steps, thereby assembling a strategic procedural plan. This plan is crucial for navigating from an initial visual observation to a target visual outcome, as depicted in real-life instructional videos. Existing works have attained partial success by extensively leveraging various sources of information available in the datasets, such as heavy intermediate visual observations, procedural names, or natural language step-by-step instructions, for features or supervision signals. However, the task remains formidable due to the implicit causal constraints in the sequencing of steps and the variability inherent in multiple feasible plans. To tackle these intricacies that previous efforts have overlooked, we propose to enhance the capabilities of the agent by infusing it with procedural knowledge. This knowledge, sourced from training procedure plans and structured as a directed weighted graph, equips the agent to better navigate the complexities of step sequencing and its potential variations. We coin our approach KEPP, a novel Knowledge-Enhanced Procedure Planning system, which harnesses a probabilistic procedural knowledge graph extracted from training data, effectively acting as a comprehensive textbook for the training domain. Experimental evaluations across three widely-used datasets under settings of varying complexity reveal that KEPP attains superior, state-of-the-art results while requiring only minimal supervision.
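The abstract describes the knowledge source as a probabilistic, directed, weighted graph extracted from training procedure plans. The paper's exact construction is not given here, but one minimal way to realize such a graph is to count first-order step transitions across training plans and normalize them into conditional probabilities. The sketch below is illustrative only; the function name, plan encoding, and example step labels are assumptions, not the authors' implementation.

```python
from collections import defaultdict

def build_knowledge_graph(plans):
    """Sketch: probabilistic procedural knowledge graph from training plans.

    Each plan is an ordered list of action-step labels. The edge weight
    (u -> v) is the empirical probability that step v directly follows
    step u somewhere in the training data.
    """
    # Count how often each step is immediately followed by each successor.
    counts = defaultdict(lambda: defaultdict(int))
    for plan in plans:
        for u, v in zip(plan, plan[1:]):
            counts[u][v] += 1

    # Normalize successor counts into transition probabilities per node.
    graph = {}
    for u, successors in counts.items():
        total = sum(successors.values())
        graph[u] = {v: c / total for v, c in successors.items()}
    return graph

# Hypothetical training plans (step labels are made up for illustration).
plans = [
    ["crack egg", "whisk", "fry"],
    ["crack egg", "whisk", "season", "fry"],
]
graph = build_knowledge_graph(plans)
# graph["whisk"] -> {"fry": 0.5, "season": 0.5}
```

A planner can then score or re-rank candidate step sequences by the product of edge probabilities along the sequence, which is one way a transition graph like this could act as the "textbook" the abstract refers to.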
Authors: Kumaranage Ravindu Yasas Nagasinghe, Honglu Zhou, Malitha Gunawardhana, Martin Renqiang Min, Daniel Harari, Muhammad Haris Khan