
SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos (2403.01599v1)

Published 3 Mar 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: We study the problem of procedure planning in instructional videos, which aims to make a goal-oriented sequence of action steps given partial visual state observations. The motivation of this problem is to learn a structured and plannable state and action space. Recent works have succeeded in sequence modeling of steps with only sequence-level annotations accessible during training, but they overlooked the roles of states in the procedures. In this work, we point out that State CHangEs MAtter (SCHEMA) for procedure planning in instructional videos. We aim to establish a more structured state space by investigating the causal relations between steps and states in procedures. Specifically, we explicitly represent each step as state changes and track the state changes in procedures. For step representation, we leverage the commonsense knowledge in LLMs to describe the state changes of steps via our designed chain-of-thought prompting. For state change tracking, we align visual state observations with language state descriptions via cross-modal contrastive learning, and explicitly model the intermediate states of the procedure using LLM-generated state descriptions. Experiments on CrossTask, COIN, and NIV benchmark datasets demonstrate that our proposed SCHEMA model achieves state-of-the-art performance and obtains explainable visualizations.
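To make the two main ideas in the abstract concrete, here is a minimal, hypothetical sketch (not the authors' released code) of (a) a symmetric InfoNCE-style contrastive loss that aligns visual state embeddings with LLM-generated state descriptions, and (b) an illustrative chain-of-thought-style prompt for eliciting the before/after states of a step. Function names, the prompt wording, and the temperature value are assumptions for illustration only.

```python
# Hypothetical sketch: cross-modal contrastive alignment of visual states
# with language state descriptions, as the abstract describes.
import torch
import torch.nn.functional as F

def state_contrastive_loss(visual_emb: torch.Tensor,
                           text_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """visual_emb, text_emb: (batch, dim) embeddings of the same procedure states."""
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: match each visual state to its description and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Illustrative chain-of-thought-style prompt for describing a step as state changes
# (the paper's exact prompt may differ).
STATE_CHANGE_PROMPT = (
    "Task: {task}\nStep: {step}\n"
    "Describe the state of the relevant objects before this step, "
    "then describe their state after this step."
)
```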

Authors (5)
  1. Yulei Niu (32 papers)
  2. Wenliang Guo (1 paper)
  3. Long Chen (395 papers)
  4. Xudong Lin (37 papers)
  5. Shih-Fu Chang (131 papers)
Citations (6)
