Overview of Video Object Segmentation with Joint Re-identification and Attention-Aware Mask Propagation
The paper "Video Object Segmentation with Joint Re-identification and Attention-Aware Mask Propagation" addresses video object segmentation, a task that is particularly challenging when multiple instances occlude one another and vary significantly in scale and pose. The authors propose DyeNet, a deep recurrent framework that jointly segments and tracks objects over time while re-identifying objects that reappear after occlusion.
Methodological Contributions
- Unified Framework: DyeNet combines temporal mask propagation with re-identification in a single, end-to-end trainable network. Prior methods typically handle these two tasks separately, which leads to less coherent segmentation results.
- Re-identification Module: A novel component of DyeNet is the Re-ID module with dynamic template expansion. Rather than relying solely on the initial templates, the method adds newly retrieved, high-confidence instances to the template set, allowing it to re-detect previously occluded objects whose appearance has changed substantially since they were last visible.
- Attention-based Mask Propagation: DyeNet also introduces an attention mechanism in the recurrent mask propagation module that suppresses distraction from non-target regions. By focusing propagation on the relevant object, the network produces cleaner masks as segmentation proceeds iteratively through the sequence.
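The interplay of the two modules above can be illustrated with a toy sketch. This is not the paper's actual network: the function names (`reidentify`, `attention_propagate`), the cosine-similarity matching, and the 0.8/0.5 thresholds are illustrative assumptions; the real DyeNet learns these components end to end. The sketch only shows the two mechanisms schematically: accepted re-identifications expand the template set, and a soft attention map derived from the previous mask suppresses non-target responses before the new mask is thresholded.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def reidentify(candidate_feats, templates, threshold=0.8):
    """Toy Re-ID with dynamic template expansion: a candidate whose best
    similarity to any template exceeds `threshold` (illustrative value) is
    accepted and appended to the template set."""
    accepted = []
    for feat in candidate_feats:
        best = max(cosine_sim(feat, t) for t in templates)
        if best >= threshold:
            accepted.append(feat)
            templates.append(feat)  # expand templates with the new instance
    return accepted, templates

def attention_propagate(prev_mask, feature_map):
    """Toy attention-aware propagation: weight the feature response by a
    soft attention map derived from the previous frame's mask, so that
    non-target regions are suppressed before thresholding."""
    attention = prev_mask / (prev_mask.max() + 1e-8)  # soft attention from prior mask
    response = feature_map * attention
    return (response > 0.5).astype(np.uint8)

# Usage: one template, two candidates; only the similar one is accepted.
templates = [np.array([1.0, 0.0])]
candidates = [np.array([0.9, 0.1]), np.array([0.0, 1.0])]
accepted, templates = reidentify(candidates, templates)

# Propagation: uniform features, but attention keeps only the prior region.
prev_mask = np.array([[1.0, 0.0], [0.0, 0.0]])
feature_map = np.full((2, 2), 0.9)
new_mask = attention_propagate(prev_mask, feature_map)
```

In this toy run, only the first candidate matches the template, so the template set grows from one to two entries, and the propagated mask keeps only the region the previous mask attended to, even though the raw feature response is uniform.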
Empirical Evaluation
DyeNet's efficacy is demonstrated on standard benchmarks: it achieves a global mean of 68.2 for region similarity (Jaccard index J) and boundary accuracy (F-measure) on the test-dev partition of the DAVIS 2017 dataset. This surpasses prior leading methods, notably VS-ReID, which reaches a global mean of 66.1 on the same partition.
Implications and Future Considerations
The significance of DyeNet lies in its unified treatment of video object segmentation: a single model architecture that handles occlusions and appearance variability. Practically, this can benefit applications such as autonomous driving, surveillance, and augmented reality, where reliable object segmentation is paramount.
Given DyeNet's end-to-end learnability and strong accuracy without heavy reliance on online fine-tuning, future work could focus on further reducing computational cost and improving real-time applicability. The attention mechanism also opens avenues for finer discrimination between foreground and background, potentially useful in cluttered scenes where objects are hard to separate.
In conclusion, DyeNet represents a considerable advance in video object segmentation, offering a robust solution to occlusion and appearance variability. Its architecture sets a new benchmark for segmentation accuracy while remaining efficient, suggesting significant practical potential in real-world deployments.