Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals (2110.08486v4)
Abstract: The ability to sequence unordered events is essential for comprehending and reasoning about real-world task procedures, which often requires a thorough understanding of temporal common sense and multimodal information, since these procedures are frequently communicated through a combination of text and images. This capability underpins applications such as sequential task planning and multi-source instruction summarization. While humans can reason about and sequence unordered multimodal procedural instructions, whether current machine learning models possess this essential capability remains an open question. In this work, we benchmark models' ability to reason over and sequence unordered multimodal instructions by curating datasets from popular online instructional manuals and collecting comprehensive human annotations. We find that models not only perform significantly worse than humans but also appear incapable of efficiently utilizing the multimodal information. To improve machine performance on multimodal event sequencing, we propose sequentiality-aware pretraining techniques that exploit the sequential alignment properties of both text and images, yielding significant improvements of more than 5%.
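To make the sequencing benchmark concrete, below is a minimal sketch of how an ordering prediction is commonly scored in sentence- and event-ordering work. Kendall's tau is a standard metric for such tasks, though the abstract itself does not name the metrics used; the recipe steps are hypothetical, and in the multimodal setting each step would pair text with an image, which the metric does not need to see.

```python
def kendall_tau(pred, gold):
    """Kendall's tau between a predicted and a gold ordering of the same steps."""
    n = len(gold)
    pos = {step: i for i, step in enumerate(pred)}  # step -> predicted position
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # The gold order says gold[i] precedes gold[j];
            # count whether the prediction agrees.
            if pos[gold[i]] < pos[gold[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical instruction steps (text only, for illustration).
gold = ["mix the batter", "preheat the oven", "bake 30 min", "serve"]
pred = ["preheat the oven", "mix the batter", "bake 30 min", "serve"]
print(kendall_tau(pred, gold))  # ~0.67: one inverted pair out of six
```

A score of 1.0 means the prediction reproduces the gold order exactly, 0.0 means the pairwise orderings agree no better than chance, and -1.0 means the order is fully reversed, which is why this family of metrics is a natural fit for comparing model and human sequencing performance.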