- The paper investigates a novel weakly supervised approach that learns sequential tasks from instructional videos and their narrations by exploiting components shared across tasks.
- The authors introduce a component model, trained within a weakly supervised framework on the new CrossTask dataset, that significantly outperforms monolithic per-task models for task step recognition.
- The research demonstrates improved recognition accuracy and generalization to novel tasks by exploiting temporal constraints and component sharing, reducing the need for extensive video annotations.
Cross-task Weakly Supervised Learning from Instructional Videos
The paper "Cross-task weakly supervised learning from instructional videos" investigates a novel approach for learning visual models of sequential tasks using weak supervision derived from instructional narrations and ordered lists of steps, rather than relying on strong supervision such as precise temporal annotations. The paper focuses on leveraging components shared across tasks to improve visual learning, exemplified by the reuse of components like "pour" and "egg" across various cooking tasks.
The authors introduce a component model that recognizes the steps of different tasks, and develop a weakly supervised learning framework that can discover these components under the temporal constraints provided by narrations and step lists. To overcome the limitations of existing data, the research contributes a new dataset, named CrossTask, which enables systematic analysis of cross-task sharing.
The proposed methodology consolidates shared task components to enhance model learning: linking steps that share a component across tasks, such as the "pour" in "pour egg" from making pancakes and in the pouring steps of other tasks, facilitates better visual model development. This component model significantly outperforms traditional approaches that rely on monolithic classifiers trained independently for each task step.
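The sharing idea can be sketched in code. The snippet below is a minimal, hypothetical illustration (not the paper's actual architecture): each step is scored as the sum of responses from per-component linear classifiers, so a "pour" classifier trained in one task is reused by every step, in any task, that contains "pour".

```python
import numpy as np

class ComponentModel:
    """Toy component model: one linear classifier per component,
    shared across all tasks (illustrative sketch only)."""

    def __init__(self, components, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        # A single weight vector per component, shared by every task.
        self.weights = {c: rng.normal(scale=0.01, size=feat_dim)
                        for c in components}

    def score_step(self, step_components, frame_features):
        # A step's score per frame is the sum of its components' responses.
        return sum(frame_features @ self.weights[c] for c in step_components)

# Usage: the step "pour egg" decomposes into the shared components
# "pour" and "egg"; any other step containing "pour" reuses that classifier.
model = ComponentModel(["pour", "egg", "milk"], feat_dim=64)
frames = np.random.default_rng(1).normal(size=(10, 64))  # 10 frames
scores = model.score_step(["pour", "egg"], frames)
print(scores.shape)  # one score per frame: (10,)
```

In a monolithic model, "pour egg" would get its own classifier seen only in one task; here, every occurrence of "pour" in any task contributes training signal to the same weight vector.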
Key contributions of the paper include a dataset designed for studying cross-task sharing, a demonstration that component models can parse previously unseen tasks, and improvements in weakly supervised learning enabled by component sharing. Empirical results show that exploiting temporal constraints together with component models improves both recognition accuracy and generalization to novel tasks for which no direct training data is available.
The implications of this research are broad, affecting both practical applications and theoretical foundations: it proposes a structure that reduces the need for extensive video annotations and yields systems that naturally transfer learning across distinct but related tasks. Future work could apply these methods to more diverse domains where task variations still share common components, improving model efficiency and reducing the burden of detailed supervision.