
Visual Goal-Step Inference using wikiHow (2104.05845v2)

Published 12 Apr 2021 in cs.CV, cs.AI, cs.CL, cs.LG, and cs.MM

Abstract: Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets like HowTo100m, increasing the VGSI accuracy by 15-20%. Our task will facilitate multimodal reasoning about procedural events.

Citations (38)

Summary

  • The paper presents a novel VGSI task where models select a step image based on a given textual goal.
  • It leverages a dataset of 772,277 wikiHow images to expose challenges in multimodal learning, with accuracy gaps of up to 40% between model and human performance.
  • Transfer learning experiments demonstrate a 15-20% accuracy boost on out-of-domain tasks, underscoring the potential of visual goal-step representations.

Visual Goal-Step Inference using wikiHow

The paper "Visual Goal-Step Inference using wikiHow" addresses a significant challenge in artificial intelligence: enabling systems to reason about human procedural activities through multimodal understanding. The authors introduce the Visual Goal-Step Inference (VGSI) task, a novel framework in which a model is presented with a textual goal and must select one image from a set of four that plausibly represents a step towards achieving that goal. This task is designed using a large dataset curated from wikiHow, comprising 772,277 images illustrating various human actions and procedures.

Objective and Dataset

The principal objective of VGSI is to extend the capabilities of AI systems in understanding complex human events, emphasizing the transition from text-based to multimodal representations. While prior work in NLP predominantly focused on goal-step inference using textual data, VGSI incorporates visual data to enhance reasoning about procedural events. Using wikiHow as a resource, the dataset reflects diverse everyday tasks with hierarchical relationships among goals, methods, and steps, each typically accompanied by relevant images.
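
The goal-method-step hierarchy can be pictured with a small data model like the sketch below. The class and field names are assumptions made for illustration; the released dataset may organize articles differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    text: str                         # step headline, e.g. "Whisk the eggs and milk"
    image_path: Optional[str] = None  # most wikiHow steps include an illustration

@dataclass
class Method:
    name: str                         # one way of achieving the goal
    steps: List[Step] = field(default_factory=list)

@dataclass
class Article:
    goal: str                         # the wikiHow title, e.g. "How to Make French Toast"
    methods: List[Method] = field(default_factory=list)

def goal_image_pairs(article: Article):
    """Flatten the goal -> method -> step hierarchy into (goal, image) pairs;
    4-way VGSI questions can then be built by sampling three negative images."""
    return [(article.goal, s.image_path)
            for m in article.methods for s in m.steps if s.image_path]
```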

Multimodal Learning Challenge

Multimodal learning introduces intricate challenges, particularly when tasks require contextual understanding beyond standard image captioning. VGSI is structured to capture such complexities, and the evaluation highlights the difficulty faced by state-of-the-art models such as DeViSE, Similarity Networks, Triplet Networks, and LXMERT. Human performance serves as the benchmark, revealing accuracy gaps of up to 40% between machine and human understanding, depending on how the negative candidate images are sampled. This underlines the inherent complexity and demands of multimodal reasoning for AI models.
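
As an illustration of how similarity- and triplet-style baselines approach the task, the sketch below scores candidates by cosine similarity in a shared goal-image embedding space and shows a margin-based triplet objective. It assumes precomputed embeddings from unspecified text and image encoders and is not the authors' exact architecture.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def choose_step(goal_emb, candidate_embs):
    """Similarity-style inference: embed the goal and each candidate image into a
    shared space (encoders not shown), then pick the closest image."""
    return int(np.argmax([cosine(goal_emb, c) for c in candidate_embs]))

def triplet_loss(goal_emb, true_step_emb, negative_emb, margin=0.2):
    """Triplet-network-style training signal: the true step image should sit
    closer to the goal than a sampled negative image, by at least `margin`."""
    d_pos = 1.0 - cosine(goal_emb, true_step_emb)
    d_neg = 1.0 - cosine(goal_emb, negative_emb)
    return max(0.0, d_pos - d_neg + margin)
```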

Transfer Learning

Understanding goal-step relations visually has practical implications, particularly in transfer learning scenarios. The paper demonstrates that models trained on the wikiHow dataset can significantly improve VGSI accuracy on out-of-domain datasets such as COIN and HowTo100m, achieving an accuracy boost of 15-20%. This suggests that multimodal representations learned from wikiHow can be generalized across different domains, highlighting the potential for AI systems to adapt learned knowledge to novel contexts with minimal data.
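
A rough sketch of the transfer comparison is given below, assuming 4-way examples built from COIN or HowTo100m video keyframes and two hypothetical scoring functions, one from a model pretrained on the wikiHow data and one trained without it. The function and argument names are assumptions; the sketch mirrors the reported 15-20% gap rather than reproducing the paper's evaluation code.

```python
def compare_transfer(eval_examples, score_wikihow, score_scratch):
    """Measure the benefit of wikiHow pretraining on an out-of-domain VGSI set.

    `eval_examples` are (goal, images, label) tuples built from COIN or
    HowTo100m keyframes; each scoring function maps (goal_text, image) -> float.
    """
    def acc(score_fn):
        hits = 0
        for goal, images, label in eval_examples:
            scores = [score_fn(goal, img) for img in images]
            hits += int(max(range(len(scores)), key=scores.__getitem__) == label)
        return hits / len(eval_examples)

    # The paper reports roughly a 15-20 point gap between these two accuracies.
    return acc(score_wikihow), acc(score_scratch)
```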

Future Implications and Theoretical Contribution

The research holds notable implications for developing vision-enabled dialogue systems and enhancing interaction models. By integrating visual understanding of procedural events, systems can more reliably predict intermediate steps and provide contextual recommendations, thereby improving task-oriented interactions. From a theoretical standpoint, the VGSI task challenges conventional image-text models to incorporate deeper reasoning layers, possibly prompting future advancements in AI's interpretative capabilities.

Conclusion

This paper presents a comprehensive framework for multimodal reasoning that bridges the gap between textual descriptions and visual representation of procedural events, using the rich resource of wikiHow. The VGSI task not only serves as a rigorous testbed for current models but also offers a template for future AI systems capable of nuanced understanding of human activities. The capacity for learned multimodal representations to transcend domains positions VGSI as a stepping stone towards more adaptable and context-aware AI technologies.
