
Video Object Segmentation using Teacher-Student Adaptation in a Human Robot Interaction (HRI) Setting (1810.07733v4)

Published 17 Oct 2018 in cs.CV

Abstract: Video object segmentation is an essential task in robot manipulation to facilitate grasping and learning affordances. Incremental learning is important for robotics in unstructured environments, since the total number of objects and their variations can be intractable. Inspired by the children learning process, human robot interaction (HRI) can be utilized to teach robots about the world guided by humans similar to how children learn from a parent or a teacher. A human teacher can show potential objects of interest to the robot, which is able to self adapt to the teaching signal without providing manual segmentation labels. We propose a novel teacher-student learning paradigm to teach robots about their surrounding environment. A two-stream motion and appearance "teacher" network provides pseudo-labels to adapt an appearance "student" network. The student network is able to segment the newly learned objects in other scenes, whether they are static or in motion. We also introduce a carefully designed dataset that serves the proposed HRI setup, denoted as (I)nteractive (V)ideo (O)bject (S)egmentation. Our IVOS dataset contains teaching videos of different objects, and manipulation tasks. Unlike previous datasets, IVOS provides manipulation tasks sequences with segmentation annotation along with the waypoints for the robot trajectories. It also provides segmentation annotation for the different transformations such as translation, scale, planar rotation, and out-of-plane rotation. Our proposed adaptation method outperforms the state-of-the-art on DAVIS and FBMS with 6.8% and 1.2% in F-measure respectively. It improves over the baseline on IVOS dataset with 46.1% and 25.9% in mIoU.

Citations (67)

Summary

  • The paper introduces a teacher-student model that uses a two-stream network to generate pseudo-labels, enabling incremental learning in video object segmentation.
  • The approach is validated on DAVIS, FBMS, and the new IVOS dataset, improving F-measure by 6.8% and 1.2% on DAVIS and FBMS and mIoU by 46.1% and 25.9% over the baseline on IVOS.
  • The paper demonstrates practical implications for HRI by enhancing robotic object recognition and adaptability in unstructured, dynamic environments.

Overview of Video Object Segmentation in Human-Robot Interaction

The paper "Video Object Segmentation using Teacher-Student Adaptation in a Human Robot Interaction (HRI) Setting" presents a novel approach to video object segmentation (VOS), specifically within the domain of human-robot interaction (HRI). Utilizing a teacher-student learning paradigm, the authors propose a system where a robot is incrementally trained by a human teacher to segment objects without the need for manual annotations. This is achieved through a two-stream network providing pseudo-labels to guide the adaptation of a single-stream network, with implications for improving robotics within unstructured environments.

Methodology

At the core of this research is the teacher-student adaptation model. The teacher is a two-stream motion and appearance network that generates pseudo-labels from motion cues; these pseudo-labels supervise the adaptation of a student network that operates on appearance cues alone. The incremental learning mechanism, inspired by how children learn from adults, lets the robot segment newly taught objects in other scenes, whether those objects are static or in motion. A sketch of this adaptation loop is given below.
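The following is a minimal sketch of such a pseudo-label adaptation loop, not the authors' implementation: `TinySegNet`, the averaging fusion of the two teacher streams, and the confidence threshold are illustrative assumptions.

```python
# Minimal sketch of teacher-student adaptation via pseudo-labels.
# TinySegNet and the fusion/thresholding choices are placeholders,
# not the paper's actual architectures or hyperparameters.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinySegNet(nn.Module):
    """Toy fully-convolutional segmentation head (stand-in network)."""
    def __init__(self, in_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1),  # one foreground logit per pixel
        )

    def forward(self, x):
        return self.body(x)


def adapt_student(teacher_rgb, teacher_flow, student, frames, flows,
                  steps=50, conf_thresh=0.8, lr=1e-4):
    """Adapt an appearance-only student from frozen two-stream teacher outputs.

    frames: (N, 3, H, W) RGB frames of the teaching video.
    flows:  (N, 2, H, W) optical-flow fields (channel layout is an assumption).
    """
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    with torch.no_grad():
        # Fuse appearance and motion streams (simple averaging here;
        # the paper's fusion may differ).
        logits = 0.5 * (teacher_rgb(frames) + teacher_flow(flows))
        probs = torch.sigmoid(logits)
        pseudo = (probs > 0.5).float()
        # Keep only confidently labeled pixels as supervision.
        mask = (probs > conf_thresh) | (probs < 1 - conf_thresh)
    if not mask.any():
        return student  # no confident pixels to learn from

    for _ in range(steps):
        optimizer.zero_grad()
        student_logits = student(frames)
        loss = F.binary_cross_entropy_with_logits(
            student_logits[mask], pseudo[mask])
        loss.backward()
        optimizer.step()
    return student
```

With the placeholder networks above, `adapt_student(TinySegNet(3), TinySegNet(2), TinySegNet(3), frames, flows)` would adapt the student on a single teaching video; at test time only the appearance student is needed, so it can segment the taught objects even when they are static.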

The authors also introduce a new dataset, Interactive Video Object Segmentation (IVOS), comprising teaching videos of objects undergoing different transformations (translation, scale, planar rotation, and out-of-plane rotation) as well as manipulation-task sequences annotated with segmentation masks and waypoints for the robot trajectories. This design makes the dataset well suited to evaluating the incremental learning capabilities of the adaptation method.
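As a rough illustration of how such data might be consumed, the loader below pairs teaching-video frames with their segmentation masks, grouped by transformation. The directory layout and file naming are assumptions made for this sketch, not IVOS's published format.

```python
# Illustrative loader for an IVOS-style layout; paths and naming are
# hypothetical, not the dataset's actual structure.
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class IVOSLikeDataset(Dataset):
    """Pairs teaching-video frames with segmentation masks for one transformation."""

    def __init__(self, root, transformation="translation"):
        root = Path(root)
        self.frames = sorted((root / transformation / "frames").glob("*.png"))
        self.masks = sorted((root / transformation / "masks").glob("*.png"))
        assert len(self.frames) == len(self.masks), "frame/mask count mismatch"

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, idx):
        frame = Image.open(self.frames[idx]).convert("RGB")
        mask = Image.open(self.masks[idx]).convert("L")
        return frame, mask
```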

Results

The proposed method outperforms current benchmarks, improving state-of-the-art results on the DAVIS and FBMS datasets by 6.8% and 1.2% in F-measure, respectively. On the newly introduced IVOS dataset, the adapted student surpasses the baseline by 46.1% and 25.9% in mIoU, marking an advancement in VOS capabilities for HRI applications.
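For reference, the reported numbers correspond to standard segmentation metrics. The sketch below computes generic region-level IoU (averaged over frames to give mIoU) and F-measure for binary masks; it is illustrative only, and the benchmarks' official tools (DAVIS, for instance, defines its F-measure on mask boundaries) should be used for comparable scores.

```python
# Generic region-level IoU and F-measure for binary masks; a sketch of how
# such scores are computed, not the benchmarks' official evaluation code.
import numpy as np


def iou(pred, gt):
    """Intersection over union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0


def f_measure(pred, gt):
    """Harmonic mean of per-pixel precision and recall."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)


def mean_iou(preds, gts):
    """mIoU: average IoU over a sequence of mask pairs."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))
```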

Implications for Robotics

The implications of this research extend into practical applications of robotics, particularly in scenarios requiring adaptive object recognition and manipulation tasks often encountered in unstructured environments. It highlights a shift towards human-centered AI, focusing on learning through interaction without an extensive need for pre-existing labels.

Future Research Directions

The paper opens avenues for further exploration of adaptive learning and human-instructed robotic training. Future research could integrate semantic understanding and trajectory learning from visual cues, further improving robot autonomy and effectiveness in dynamic environments.

This paper signifies an important development in video object segmentation by embracing human-guided learning frameworks. By enabling robots to learn and adapt using motion-based pseudo-labels, it provides a scalable and efficient path to enhancing cognitive and manipulative capabilities in robotic systems.
