- The paper introduces a novel space-time memory network that leverages intermediate predictions for efficient semi-supervised video object segmentation without online learning.
- It achieves competitive performance with a YouTube-VOS overall score of 79.4 and Jaccard (J) scores of 88.7 and 79.2 on DAVIS 2016 and 2017 respectively, demonstrating strong generalization and robustness.
- The STM framework propagates segmentation masks across video frames at 0.16 seconds per frame (roughly 6 fps), fast enough for near-real-time video applications.
Video Object Segmentation using Space-Time Memory Networks
The paper introduces a noteworthy approach to semi-supervised video object segmentation: Space-Time Memory Networks (STM). The method uses a memory network to exploit the rich cues available in intermediate predictions, storing past frames together with their (predicted or given) masks as key-value memory and reading from that memory to segment each new frame at the pixel level. Because the spatio-temporal embedding of the video carries the object-specific information, the approach works without the online learning (per-video fine-tuning at test time) that many prior methods require.
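As a rough illustration of this mask-propagation loop, the sketch below shows how intermediate predictions could be appended to the memory as a video is processed frame by frame. The four module arguments (`memory_encode`, `query_encode`, `memory_read`, `decode`) are hypothetical placeholders standing in for the paper's encoders, read block, and decoder, not the authors' actual implementation.

```python
import torch

def segment_video(frames, first_mask, memory_encode, query_encode,
                  memory_read, decode, memory_every=5):
    """Illustrative STM-style inference loop (not the authors' code).

    frames:     list of image tensors, each (B, 3, H, W)
    first_mask: ground-truth mask for frame 0, (B, 1, H, W)
    memory_encode, query_encode, memory_read, decode are hypothetical
    stand-ins for the memory encoder, query encoder, space-time memory
    read block, and decoder described in the paper.
    """
    # Initialize the memory from the annotated first frame and its mask.
    # Keys/values carry a time axis: (B, C, T, H, W) with T = 1 here.
    mem_keys, mem_vals = memory_encode(frames[0], first_mask)
    masks = [first_mask]

    for t, frame in enumerate(frames[1:], start=1):
        # Encode the current frame as a query (its mask is unknown).
        q_key, q_val = query_encode(frame)
        # Dense space-time matching against the memory, then decode.
        read_out = memory_read(q_key, q_val, mem_keys, mem_vals)
        mask = decode(read_out)
        masks.append(mask)
        # Append the intermediate prediction to the memory at a fixed
        # interval (the paper uses every 5 frames; it also always keeps
        # the first and previous frames, which this sketch simplifies).
        if t % memory_every == 0:
            k, v = memory_encode(frame, mask)
            mem_keys = torch.cat([mem_keys, k], dim=2)  # stack along T
            mem_vals = torch.cat([mem_vals, v], dim=2)
    return masks
```

The key design point this loop captures is that the memory grows with the model's own intermediate outputs, so later frames can match against many past appearances of the object rather than only the first annotated frame.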
Framework Overview
The proposed STM framework consists of three components: query and memory encoders, a space-time memory read block, and a decoder. The encoders embed video frames (and, for memory frames, their object masks) into key-value pairs; keys are used to match corresponding pixels across time, while values carry the detailed appearance and mask information needed for segmentation. The space-time memory read operation, the central feature of the framework, densely compares every query pixel against every memory pixel across all stored frames, propagating segmentation masks forward through the video. This operation amounts to spatio-temporal attention, and it is what makes the segmentation robust to occlusions, appearance changes, and error accumulation.
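To make the read operation concrete, here is a minimal sketch of the kind of space-time attention the paper describes: every query location is compared against every memory location over all stored frames, a softmax over the memory yields matching weights, and the retrieved memory value is concatenated with the query value. The shapes follow the paper's description (keys are lower-dimensional than values), but the code is an illustrative reimplementation under those assumptions, not the authors' release.

```python
import torch
import torch.nn.functional as F

def space_time_memory_read(q_key, q_val, m_key, m_val):
    """Sketch of an STM-style read: dense query-to-memory matching.

    q_key: (B, Ck, H, W)      query key for the current frame
    q_val: (B, Cv, H, W)      query value for the current frame
    m_key: (B, Ck, T, H, W)   memory keys over T stored frames
    m_val: (B, Cv, T, H, W)   memory values over T stored frames
    Returns (B, 2*Cv, H, W): retrieved memory value concatenated with
    the query value, following the paper's description of the read block.
    """
    B, Ck, H, W = q_key.shape
    T, Cv = m_key.shape[2], m_val.shape[1]

    q = q_key.view(B, Ck, H * W)          # (B, Ck, HW)
    k = m_key.view(B, Ck, T * H * W)      # (B, Ck, THW)
    v = m_val.view(B, Cv, T * H * W)      # (B, Cv, THW)

    # Similarity of every query pixel with every memory pixel, then a
    # softmax over all T*H*W space-time memory locations.
    sim = torch.einsum('bcq,bck->bqk', q, k)        # (B, HW, THW)
    attn = F.softmax(sim, dim=2)

    # Soft retrieval: attention-weighted sum of memory values.
    read = torch.einsum('bqk,bck->bcq', attn, v)    # (B, Cv, HW)
    read = read.view(B, Cv, H, W)

    return torch.cat([read, q_val], dim=1)          # (B, 2*Cv, H, W)

# Smoke test with random tensors: 2 memory frames, a 24x24 feature grid.
q_k, q_v = torch.randn(1, 16, 24, 24), torch.randn(1, 64, 24, 24)
m_k, m_v = torch.randn(1, 16, 2, 24, 24), torch.randn(1, 64, 2, 24, 24)
assert space_time_memory_read(q_k, q_v, m_k, m_v).shape == (1, 128, 24, 24)
```

Because the softmax runs over all T*H*W memory locations at once, the matching is spatio-temporal rather than frame-by-frame, which is what lets the model recover an object after occlusion by attending to an older memory frame.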
Key Experimental Results
The results on benchmark datasets highlight the effectiveness of the proposed method. STM achieves an overall score of 79.4 on the YouTube-VOS validation set, with notable gains on unseen object categories, reflecting the model's ability to generalize. On the DAVIS 2016 and 2017 validation sets, Jaccard (J) scores of 88.7 and 79.2 respectively outperform prior methods. Notably, the method runs at 0.16 seconds per frame, far faster than approaches that rely on online fine-tuning.
Implications and Future Directions
This approach removes the reliance on online learning (per-video fine-tuning at test time) found in traditional methods, a significant practical advantage. Its robust handling of large appearance changes and occlusions points to applications in diverse video-related tasks such as video editing, tracking, and augmented reality. The ability of STM networks to interpret and leverage temporal information also carries valuable implications for near-real-time processing and interactive applications.
Looking forward, extending this framework into other domains appears promising. The space-time memory concept could benefit tasks such as interactive image segmentation, video inpainting, or even complex tasks like video question answering. Improved memory management strategies and training on larger datasets could further amplify the model's capabilities, pushing the boundaries of what is achievable in video segmentation.
In summary, the introduction of Space-Time Memory Networks presents a well-rounded and efficient approach to video object segmentation. Its success in benchmarks and potential for various applications underscore its value to the computer vision community, paving the way for future exploration and innovation.