MaskRNN: Instance Level Video Object Segmentation (1803.11187v1)

Published 29 Mar 2018 in cs.CV

Abstract: Instance level video object segmentation is an important technique for video editing and compression. To capture the temporal coherence, in this paper, we develop MaskRNN, a recurrent neural net approach which fuses in each frame the output of two deep nets for each object instance -- a binary segmentation net providing a mask and a localization net providing a bounding box. Due to the recurrent component and the localization component, our method is able to take advantage of long-term temporal structures of the video data as well as rejecting outliers. We validate the proposed algorithm on three challenging benchmark datasets, the DAVIS-2016 dataset, the DAVIS-2017 dataset, and the Segtrack v2 dataset, achieving state-of-the-art performance on all of them.

Authors (3)
  1. Yuan-Ting Hu (12 papers)
  2. Jia-Bin Huang (106 papers)
  3. Alexander G. Schwing (62 papers)
Citations (171)

Summary

  • The paper introduces MaskRNN, a novel framework for instance-level video object segmentation that combines a recurrent neural net with per-instance binary segmentation and object localization nets, integrating long-term temporal information in a bottom-up approach.
  • MaskRNN achieves state-of-the-art results on challenging datasets like DAVIS-2017 and Segtrack v2, demonstrating substantial performance improvements through its integrated components.
  • MaskRNN's findings have practical implications for video editing and theoretical significance for future RNN research, suggesting potential extensions to real-time systems.

Overview of MaskRNN: Instance Level Video Object Segmentation

The paper introduces MaskRNN, an innovative framework for instance-level video object segmentation that leverages the strengths of recurrent neural networks (RNNs) in conjunction with binary segmentation and object localization. The authors focus on enhancing temporal coherence in video data processing, aiming to improve applications like video editing and compression.

Key Methodological Contributions

MaskRNN distinguishes itself by addressing the challenges posed by deforming shapes, rapid movements, and occlusion among multiple objects in videos. Unlike traditional methods that rely heavily on geometric constraints and rigid scene assumptions, MaskRNN employs a bottom-up approach, tracking and segmenting each object instance with two deep networks: a binary segmentation net that produces a mask and a localization net that regresses a bounding box. Importantly, MaskRNN is the first framework of its kind to incorporate long-term temporal information through a recurrent component, conditioning each frame's prediction on the preceding video frames.
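To make the per-frame fusion concrete, here is a minimal PyTorch sketch of one recurrent step for a single object instance, under stated assumptions: `seg_net` and `loc_net` are hypothetical stand-in modules (the paper's actual networks are far deeper), and the previous mask is passed in directly rather than warped by optical flow as in the paper.

```python
import torch
import torch.nn as nn

class MaskRNNStep(nn.Module):
    """One recurrent step: a segmentation net predicts a mask and a
    localization net predicts a box; mask probability is kept only
    inside the predicted box to reject outliers."""
    def __init__(self, seg_net: nn.Module, loc_net: nn.Module):
        super().__init__()
        self.seg_net = seg_net   # maps (B, 6, H, W) -> (B, 1, H, W) logits
        self.loc_net = loc_net   # maps (B, 6, H, W) -> (B, 4) box x1,y1,x2,y2

    def forward(self, frame, flow, prev_mask):
        # Recurrent input: the previous frame's mask (warped by optical
        # flow in the paper; passed through unwarped here for brevity).
        x = torch.cat([frame, flow, prev_mask], dim=1)
        mask = torch.sigmoid(self.seg_net(x))
        box = self.loc_net(x)
        # Zero out mask probability outside the predicted bounding box.
        B, _, H, W = mask.shape
        ys = torch.arange(H, device=mask.device).view(1, 1, H, 1)
        xs = torch.arange(W, device=mask.device).view(1, 1, 1, W)
        x1, y1, x2, y2 = (box[:, i].view(B, 1, 1, 1) for i in range(4))
        inside = (xs >= x1) & (xs <= x2) & (ys >= y1) & (ys <= y2)
        return mask * inside.float(), box

# Hypothetical placeholder nets, shape-compatible only:
# seg_net = nn.Conv2d(6, 1, 3, padding=1)
# loc_net = nn.Sequential(nn.Conv2d(6, 4, 1),
#                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
```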

The binary segmentation network combines appearance cues from the video frames with motion information extracted from optical flow. This dual-stream approach provides a more refined segmentation output by integrating motion boundaries, aiding in the accurate detection of moving objects against cluttered backgrounds. Meanwhile, the object localization network strengthens the segmentation by employing bounding box regression. This technique limits segmentation outputs to realistic bounds and mitigates outlier effects.
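The following is a hedged sketch of that dual-stream design, assuming the appearance stream sees the RGB frame and the motion stream sees the optical-flow magnitude, each alongside the warped previous mask; the tiny convolutional backbones are placeholders, not the paper's networks.

```python
import torch
import torch.nn as nn

class TwoStreamSegNet(nn.Module):
    """Sketch of a dual-stream binary segmentation net: appearance and
    motion streams are fused by summing their per-pixel mask logits."""
    def __init__(self):
        super().__init__()
        # Appearance stream: RGB (3 ch) + warped previous mask (1 ch).
        self.appearance = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1))
        # Motion stream: flow magnitude (1 ch) + warped previous mask (1 ch).
        self.motion = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, rgb, flow_mag, warped_prev_mask):
        a = self.appearance(torch.cat([rgb, warped_prev_mask], dim=1))
        m = self.motion(torch.cat([flow_mag, warped_prev_mask], dim=1))
        return a + m  # fused mask logits, (B, 1, H, W)
```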

Experimental Evaluation

The evaluation of MaskRNN highlights its superior performance across three challenging datasets (DAVIS-2016, DAVIS-2017, and Segtrack v2), achieving state-of-the-art results on each. As the ablation study demonstrates, the performance improvements are attributable to components such as optical-flow-based mask warping and the object localization network. These enhancements improve region similarity (intersection over union, IoU), contour precision and recall, and temporal stability.
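For reference, the region-similarity score reported on the DAVIS benchmarks is the Jaccard index J, i.e. the intersection over union of predicted and ground-truth masks; a minimal NumPy version:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Jaccard index J: |pred AND gt| / |pred OR gt| over boolean masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # convention: two empty masks match perfectly
    return float(np.logical_and(pred, gt).sum() / union)
```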

Quantitatively, MaskRNN outperforms competing methods by substantial margins, improving IoU scores by 5.6% on DAVIS-2017 and 4.6% on Segtrack v2 relative to the strongest semi-supervised approaches. These results affirm the effectiveness of coupling segmentation masks with location priors, a feature absent in earlier frameworks such as OSVOS and FusionSeg.

Implications and Future Directions

The findings of this paper have both practical and theoretical implications. Practically, MaskRNN's ability to maintain accurate object segmentation in complex scenes represents a significant advance for content creation and video processing, potentially reducing the manual effort involved in video editing. Theoretically, the integration of long-term temporal dependencies into video segmentation could pave the way for further exploration of RNN architectures and their applications in dynamic video environments.

Future research could explore further optimizations of RNNs for video object segmentation, potentially incorporating advanced motion prediction models to improve bounding box accuracy. There is also room to investigate cross-domain applicability, such as integrating MaskRNN into real-time surveillance or autonomous navigation systems, where instance-level segmentation with temporal coherence is crucial.

In conclusion, MaskRNN constitutes a valuable contribution to the domain of video object segmentation. Its novel combination of deep learning techniques exemplifies the promising direction of integrating temporal dependencies and spatial localization for improved object segmentation accuracy.