Neural Task Graphs: Generalizing to Unseen Tasks from a Single Video Demonstration (1807.03480v2)

Published 10 Jul 2018 in cs.CV, cs.AI, cs.LG, and cs.RO

Abstract: Our goal is to generate a policy to complete an unseen task given just a single video demonstration of the task in a given domain. We hypothesize that to successfully generalize to unseen complex tasks from a single video demonstration, it is necessary to explicitly incorporate the compositional structure of the tasks into the model. To this end, we propose Neural Task Graph (NTG) Networks, which use a conjugate task graph as the intermediate representation to modularize both the video demonstration and the derived policy. We empirically show NTG achieves inter-task generalization on two complex tasks: Block Stacking in BulletPhysics and Object Collection in AI2-THOR. NTG improves data efficiency with visual input as well as achieving strong generalization without the need for dense hierarchical supervision. We further show that similar performance trends hold when applied to real-world data. We show that NTG can effectively predict task structure on the JIGSAWS surgical dataset and generalize to unseen tasks.

Citations (135)

Summary

An Overview of Neural Task Graphs: Generalizing to Unseen Tasks from a Single Video Demonstration

The paper "Neural Task Graphs: Generalizing to Unseen Tasks from a Single Video Demonstration" presents a methodology for advancing one-shot visual imitation learning. The approach allows an agent to generalize to an unseen task observed through a single video demonstration, without requiring dense hierarchical supervision, a property that is crucial for deploying such models in dynamic real-world environments. The research was conducted by a team from Stanford University: De-An Huang, Suraj Nair, Danfei Xu, Yuke Zhu, Animesh Garg, Li Fei-Fei, Silvio Savarese, and Juan Carlos Niebles.

Core Contributions

The essence of this work is the introduction of Neural Task Graph (NTG) Networks. These networks rely on a novel intermediate representation, the Conjugate Task Graph (CTG), which captures task structure by representing actions as nodes and states as edges. This inversion avoids the state-space explosion that arises when building conventional task graphs, in which states are the nodes. The NTG framework consists of a graph generator and an execution engine, both of which exploit the compositional structure of tasks to improve data efficiency and to generalize to new, unseen tasks; a minimal sketch of the CTG as a data structure follows.
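
The sketch below illustrates the actions-as-nodes, states-as-edges idea, assuming learned state embeddings label the edges. The names (`ConjugateTaskGraph`, `add_transition`, `successors`) and the list-valued embeddings are illustrative assumptions, not taken from the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class ConjugateTaskGraph:
    """Actions as nodes; each edge records the state observed between
    two consecutive actions (e.g. a learned visual embedding)."""
    actions: set = field(default_factory=set)   # node set
    edges: dict = field(default_factory=dict)   # (a_i, a_j) -> state embedding

    def add_transition(self, prev_action, next_action, state_embedding):
        """Record that `next_action` followed `prev_action` under the
        given intermediate state."""
        self.actions.update({prev_action, next_action})
        self.edges[(prev_action, next_action)] = state_embedding

    def successors(self, action):
        """All actions the graph allows to follow `action`."""
        return [a_j for (a_i, a_j) in self.edges if a_i == action]

# Encode one demonstrated ordering from a block-stacking task.
ctg = ConjugateTaskGraph()
ctg.add_transition("pick(A)", "place(A,B)", state_embedding=[0.1, 0.9])
ctg.add_transition("place(A,B)", "pick(C)", state_embedding=[0.7, 0.2])
print(ctg.successors("pick(A)"))  # ['place(A,B)']
```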

Empirical Validation

The empirical evaluation focuses on two complex domains: Block Stacking in BulletPhysics and Object Collection in AI2-THOR. The results demonstrate the data-efficiency gains NTG achieves when imitating directly from video input. More importantly, NTG generalizes strongly to tasks with unseen configurations, outperforming baselines that lack a structured task representation.

Methodological Insights

The NTG system's effectiveness lies in its structural decomposition. The demo interpreter uses a seq2seq model to extract the action sequence from the video demonstration, which sets the initial edges of the CTG. The Graph Completion Network then adds further edges by learning graph state transitions, capturing valid orderings the demonstration did not show. The execution engine combines node localization with edge classification to execute the task graph as a policy, adapting to changing or unseen conditions; a sketch of this control loop appears below.
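
The following is a hedged sketch of the execution engine's control loop, building on the `ConjugateTaskGraph` sketch above: localize the current node, then score each outgoing edge against the current observation to choose the next action. `localize_node` and `score_edge` are placeholder stand-ins for the paper's learned modules, not its actual API.

```python
import random

def localize_node(ctg, history):
    """Placeholder for node localization: here, simply the last executed
    action. The paper learns this from visual observations."""
    return history[-1] if history else "<start>"

def score_edge(state_embedding, observation):
    """Placeholder for edge classification: compatibility between an
    edge's stored state and the current observation (random here)."""
    return random.random()

def next_action(ctg, history, observation):
    """One execution step: pick the outgoing edge that best matches the
    current observation and return the action it leads to."""
    current = localize_node(ctg, history)
    candidates = [(a_j, ctg.edges[(current, a_j)])
                  for a_j in ctg.successors(current)]
    if not candidates:
        return None  # no outgoing edges: the task graph is exhausted
    return max(candidates, key=lambda c: score_edge(c[1], observation))[0]

# Usage with the `ctg` built above; the observation would normally be a
# visual embedding of the current frame.
print(next_action(ctg, history=["pick(A)"], observation=[0.5, 0.5]))
```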

Implications and Future Directions

The implications of this work are twofold: practical, in deployment to environments that demand rapid adaptation with minimal supervision, and theoretical, in enriching task-representation models for AI systems. It opens pathways for future work in AI robotics, particularly areas that require detailed task management and flexibility, such as manufacturing and autonomous systems. Moreover, the applicability of NTG networks to real-world data, evidenced by their results on surgical tasks in the JIGSAWS dataset, suggests promising extensions to domains demanding precision and complex task execution.

Conclusion

Overall, the paper advances one-shot visual imitation learning by embedding task compositionality into policy derivation via the CTG intermediate representation. By combining action sequencing with flexible state understanding, NTG delivers a marked improvement in task generalization. Challenges remain in scaling such systems and in capturing nuanced task specifics, but NTG networks offer a robust framework well suited to autonomous interaction in diverse and dynamic environments.
