An Essay on Time-Contrastive Networks: Self-Supervised Learning from Video
Overview
In the paper "Time-Contrastive Networks: Self-Supervised Learning from Video," the authors introduce a self-supervised learning framework for developing representations from unlabeled video datasets and demonstrate its application in robotic imitation tasks. Specifically, the paper explores techniques to allow robots to learn behaviors and mimic human actions without explicit supervision by leveraging multi-view videos and metric learning losses. The paper focuses on two robotic imitation settings: imitating object interactions from human videos and mimicking human poses.
Methodology
The core innovation in the proposed framework is the Time-Contrastive Network (TCN). TCN trains an embedding with a metric learning (triplet) loss: simultaneous frames from different viewpoints of the same scene serve as positive pairs, while temporally nearby frames from the same sequence serve as negatives. This forces the learned representation to capture the essence of the interaction in a way that is invariant to viewpoint and robust to nuisance factors such as occlusion, motion blur, lighting changes, and background. The objective, in short, is to learn what is common between images that look different (the same moment seen from different cameras) and what is different between images that look similar (nearby moments in time).
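Concretely, the embedding network is optimized with a triplet-style metric learning objective. The following is a minimal PyTorch sketch of such a loss, assuming batched embedding tensors; the helper name and the margin value are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss sketch: pull the anchor toward the positive (same moment,
    different view) and away from the negative (nearby moment, same view)
    by at least `margin`. Inputs are (batch, embedding_dim) tensors.
    The margin value here is illustrative, not the paper's setting."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)  # squared distance to positive
    d_neg = (anchor - negative).pow(2).sum(dim=1)  # squared distance to negative
    return F.relu(d_pos - d_neg + margin).mean()
```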
Two main learning paradigms are studied:
- Multi-view Time-Contrastive (TC) learning: This approach uses synchronized videos captured from multiple viewpoints; frames recorded at the same instant by different cameras are pulled together in the embedding space, while temporal neighbors from the same camera are pushed apart.
- Single-view Time-Contrastive learning: This approach uses a single viewpoint; positives are drawn from a small temporal window around the anchor frame, and negatives from outside a larger margin window in the same sequence. (A sampling sketch for both paradigms follows this list.)
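To make the two sampling schemes concrete, the sketch below shows how triplets might be drawn under each paradigm. Frame indices stand in for images, and the window and margin sizes (`pos_window`, `neg_margin`) are hypothetical values, not the paper's settings.

```python
import random

def sample_multiview_triplet(frames_view1, frames_view2, neg_margin=10):
    """Multi-view TC sampling (sketch): anchor and positive are the same
    timestep seen from two time-synchronized cameras; the negative is a
    different frame from the anchor's own view, outside a small margin
    window. Assumes the two frame lists are aligned and long enough."""
    t = random.randrange(len(frames_view1))
    neg_candidates = [i for i in range(len(frames_view1)) if abs(i - t) > neg_margin]
    t_neg = random.choice(neg_candidates)
    return frames_view1[t], frames_view2[t], frames_view1[t_neg]

def sample_singleview_triplet(frames, pos_window=5, neg_margin=15):
    """Single-view TC sampling (sketch): the positive comes from a small
    window around the anchor, the negative from outside a larger margin
    window in the same sequence."""
    t = random.randrange(len(frames))
    pos_candidates = [i for i in range(len(frames)) if 0 < abs(i - t) <= pos_window]
    neg_candidates = [i for i in range(len(frames)) if abs(i - t) > neg_margin]
    return frames[t], frames[random.choice(pos_candidates)], frames[random.choice(neg_candidates)]
```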
Experimental Setup
The experiments conducted assess three main aspects:
- Capturing Visual Representation of Object Interaction: The effectiveness of the TCN embedding at distinguishing task-relevant attributes of a pouring task is evaluated via classification error on hand-labeled attributes and events, and via alignment error between paired pouring video sequences (a sketch of this metric follows this list).
- Learning Object Manipulation Skills: The TCN-derived reward functions are used to train robotic agents with reinforcement learning, both in simulation (a dish-rack placement task) and on a real robot (pouring beads); a sketch of such a reward follows this list.
- Direct Human Pose Imitation: The TCN embedding enables continuous, real-time imitation of human poses via a learned regression from images to the robot's own joints, without explicit pose-correspondence annotations.
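For the alignment evaluation mentioned above, one simple way to score how well two synchronized videos line up in the learned embedding space is nearest-neighbor matching. The NumPy sketch below illustrates the idea; the exact normalization and function name are assumptions rather than the paper's precise definition.

```python
import numpy as np

def alignment_error(emb_a, emb_b):
    """Alignment error (sketch): for each frame embedding in video A, find
    its nearest neighbor in video B and compare the matched index to the
    true synchronized index, normalized by sequence length.
    emb_a, emb_b: (num_frames, embedding_dim) arrays of equal length."""
    errors = []
    for t, e in enumerate(emb_a):
        dists = np.linalg.norm(emb_b - e, axis=1)  # distance to every frame of B
        t_match = int(np.argmin(dists))            # best-matching frame in B
        errors.append(abs(t_match - t) / len(emb_b))
    return float(np.mean(errors))
```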
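For the reinforcement learning experiments mentioned above, the per-timestep reward is shaped from the distance between the robot's current embedding and the human demonstration's embedding at the corresponding point in the sequence. A minimal sketch follows, combining a squared-distance term with a Huber-style term; the weights `alpha`, `beta`, and `gamma` are illustrative, not the paper's tuned values.

```python
import numpy as np

def tcn_reward(robot_emb, demo_emb, alpha=1.0, beta=1.0, gamma=1e-3):
    """Imitation reward (sketch): penalize the distance between the robot's
    current TCN embedding and the demonstration embedding at the same
    timestep, combining a squared-distance term with a Huber-style term.
    The weights alpha, beta, gamma are illustrative, not from the paper."""
    sq_dist = float(np.sum((robot_emb - demo_emb) ** 2))
    return -alpha * sq_dist - beta * np.sqrt(gamma + sq_dist)
```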
Results and Implications
The paper demonstrates that the TCN model outperforms baseline methods (including models pretrained on ImageNet and the Shuffle & Learn self-supervised baseline) in classification accuracy on task-relevant attributes and in alignment of video sequences. Particularly noteworthy is that multi-view TCN models achieve significantly lower errors than their single-view counterparts.
The practical implications of the research are substantial:
- Robust Robotic Learning: By training on natural, unlabeled video data, the TCN approach presents a scalable alternative to traditional supervised learning methods that require extensively labeled datasets. This capability is particularly advantageous for robotic systems that must learn a wide range of tasks in dynamic, real-world environments.
- Human-Robot Interaction: The research opens avenues for developing robotic systems capable of learning complex human-interactive tasks by observing human demonstrations. This ability could notably enhance the development of assistive robotics and collaborative robots operating alongside humans.
Future Directions
The research suggests promising future directions:
- Multi-Modal Learning: Extending the TCN framework to incorporate multi-modal data (e.g., audio, tactile sensing) could further enhance the robustness and applicability of learned representations.
- Task-Agnostic Embeddings: Investigating the construction of universal embeddings capable of handling multiple tasks concurrently could leverage larger multi-task datasets and reduce the need for task-specific models.
- Automated Data Capture: With robots becoming more prevalent, automated multi-viewpoint data capture could be integrated into robotic systems, enabling continual learning and adaptation.
Conclusion
The paper "Time-Contrastive Networks: Self-Supervised Learning from Video" presents a novel and efficient self-supervised approach to developing representations from video data that can generalize across various robotic tasks. By leveraging multi-view correspondence and temporal coherency, the TCN framework facilitates the learning of robust, viewpoint-invariant embeddings. The demonstrated applications in object interaction and pose imitation underscore the potential for scalable, autonomous robotic learning systems capable of adapting to complex human environments.