
MULTISCRIPT: Multimodal Script Learning for Supporting Open Domain Everyday Tasks (2310.04965v2)

Published 8 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Automatically generating scripts (i.e., sequences of key steps described in text) from video demonstrations, and reasoning about the subsequent steps, are crucial for modern AI virtual assistants that guide humans through everyday tasks, especially unfamiliar ones. However, current methods for generative script learning rely heavily on well-structured preceding steps described in text and/or images, or are limited to a certain domain, resulting in a disparity with real-world user scenarios. To address these limitations, we present a new benchmark challenge, MultiScript, with two new tasks on task-oriented multimodal script learning: (1) multimodal script generation, and (2) subsequent step prediction. For both tasks, the input consists of a target task name and a video illustrating what has been done to complete the target task; the expected output is (1) a sequence of structured step descriptions in text based on the demonstration video, and (2) a single text description of the subsequent step, respectively. Built from WikiHow, MultiScript covers multimodal scripts in videos and text descriptions for over 6,655 human everyday tasks across 19 diverse domains. To establish baseline performance on MultiScript, we propose two knowledge-guided multimodal generative frameworks that incorporate task-related knowledge prompted from LLMs such as Vicuna. Experimental results show that our proposed approaches significantly improve over competitive baselines.
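
To make the benchmark setup concrete, here is a minimal Python sketch of the two task formats and the knowledge-prompting idea described in the abstract. Every field name, the example task, and the prompt wording are assumptions made for illustration; they are not the benchmark's actual schema.

```python
# Minimal sketch of the two MultiScript task formats and the
# knowledge-guided setup. All names and wording below are
# illustrative assumptions, not the benchmark's actual schema.

# Shared input for both tasks: a target task name plus a video
# showing what has been done so far toward completing the task.
example_input = {
    "task_name": "Replace a bicycle tire",       # hypothetical task
    "demonstration_video": "demo_0001.mp4",      # hypothetical file
}

# Task 1 (multimodal script generation): the expected output is an
# ordered sequence of structured step descriptions grounded in the
# demonstration video.
script_generation_target = [
    "Release the brake and remove the wheel from the frame.",
    "Lever the old tire off the rim and pull out the inner tube.",
    "Seat the new tire on the rim and partially inflate the tube.",
]

# Task 2 (subsequent step prediction): the expected output is a
# single text description of the next step after the point shown
# in the video.
subsequent_step_target = "Remount the wheel and re-engage the brake."

# Knowledge-guided baselines: prompt an LLM such as Vicuna for
# task-related knowledge and condition the generator on it alongside
# the video. The prompt wording here is a guess at the general idea.
knowledge_prompt = (
    "List the typical steps and tools involved in the task: "
    f"{example_input['task_name']}."
)
print(knowledge_prompt)
```

The sketch only shows the data shapes; in the proposed frameworks, the elicited knowledge is fed to a multimodal generator together with the video representation.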

Authors (5)
  1. Jingyuan Qi
  2. Minqian Liu
  3. Ying Shen
  4. Zhiyang Xu
  5. Lifu Huang
Citations (2)
