- The paper introduces a differentiable mask-matching layer that embeds a projected gradient descent algorithm with Dykstra’s projection to optimize mask matching in video segmentation.
- It achieves state-of-the-art results on benchmarks like DAVIS 2017 and generalizes robustly to SegTrack v2, all without the computationally heavy online learning step used by many competing methods.
- The model utilizes a Mask R-CNN backbone for extracting mask proposals, enabling efficient end-to-end training and effectively handling large deformations and appearance changes.
DMM-Net: A Differentiable Approach to Video Object Segmentation
The paper "DMM-Net: Differentiable Mask-Matching Network for Video Object Segmentation" introduces a novel approach to the video object segmentation problem via a Differentiable Mask-Matching Network (DMM-Net). The method operates in the semi-supervised setting, where the ground-truth object masks in the first frame are given. What distinguishes DMM-Net from prevailing approaches is its differentiable mask-matching component, which integrates a linear assignment problem solver directly into a neural network.
The core innovation is the differentiable matching layer itself. This layer uses projected gradient descent, with Dykstra's projection handling the constraint set, to solve an otherwise non-differentiable mask-matching problem between object templates and proposals. The construction permits end-to-end backpropagation, so the network can learn the cost matrix of the assignment problem. The authors prove that, under several assumptions, the approximate solution converges to the optimal one, making the layer a reliable substitute for the non-differentiable Hungarian algorithm.
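The projected-gradient-with-Dykstra idea can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the specific constraint sets (each template row summing to 1 over proposals, entries nonnegative), the step size, and the iteration counts are all assumptions made for the example.

```python
import numpy as np

def dykstra_project(Y, n_iters=50):
    """Project Y onto {X : X >= 0, each row sums to 1} with Dykstra's algorithm,
    alternating between the row-sum affine set and the nonnegative orthant."""
    X = Y.copy()
    p = np.zeros_like(Y)  # correction term for the affine constraint
    q = np.zeros_like(Y)  # correction term for the nonnegativity constraint
    for _ in range(n_iters):
        # Project X + p onto the affine set {row sums = 1}.
        Z = X + p
        Z = Z - (Z.sum(axis=1, keepdims=True) - 1.0) / Z.shape[1]
        p = X + p - Z
        # Project Z + q onto the nonnegative orthant (elementwise clipping).
        X = np.maximum(Z + q, 0.0)
        q = Z + q - X
    return X

def pgd_matching(C, n_steps=100, lr=0.1):
    """Minimize <C, X> over a relaxed assignment polytope by projected
    gradient descent; the gradient of the linear objective is just C."""
    m, n = C.shape
    X = np.full((m, n), 1.0 / n)  # uniform initialization
    for _ in range(n_steps):
        X = dykstra_project(X - lr * C)
    return X
```

Because every operation here is a gradient step or a projection built from sums, subtractions, and clipping, the whole loop is differentiable almost everywhere, which is what lets gradients flow back into the learned cost matrix.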
Experimentally, DMM-Net is evaluated on large-scale video object segmentation datasets such as YouTube-VOS and DAVIS 2017, achieving competitive performance. Notably, it reaches state-of-the-art results on DAVIS 2017 without the computationally expensive online learning step that is a significant runtime bottleneck in many other methods. Furthermore, when evaluated on SegTrack v2 without any dataset-specific fine-tuning, DMM-Net still performs competently, underscoring its generalization capability.
Key to this model's success is its reliance on the Mask R-CNN backbone for extracting mask proposals, which are then matched using the differentiable layer. The extracted proposals allow the network to handle complex scenarios such as large deformations and significant appearance changes, which are common challenges in video segmentation tasks.
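As a rough illustration of how template masks might be scored against Mask R-CNN proposals before matching, the sketch below builds a cost matrix from negated mask IoU. This IoU-only cost is a simplification assumed for the example; the paper's actual cost may combine additional terms such as appearance features.

```python
import numpy as np

def mask_iou_cost(templates, proposals):
    """Cost matrix between template masks and proposal masks.

    templates: (T, H, W) binary arrays, one per tracked object template.
    proposals: (P, H, W) binary arrays, one per Mask R-CNN proposal.
    Returns a (T, P) matrix of 1 - IoU, so a perfect overlap costs 0.
    """
    t_flat = templates.reshape(len(templates), -1).astype(bool)
    p_flat = proposals.reshape(len(proposals), -1).astype(bool)
    # Broadcast to compare every template against every proposal.
    inter = (t_flat[:, None] & p_flat[None]).sum(-1).astype(float)
    union = (t_flat[:, None] | p_flat[None]).sum(-1).astype(float)
    iou = inter / np.maximum(union, 1.0)  # guard against empty unions
    return 1.0 - iou
```

A matrix like this would then be fed to the differentiable matching layer, which assigns each template to its lowest-cost proposal under the relaxed assignment constraints.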
Implications of this research are broad and multifaceted:
- Practical Implications: By avoiding heuristic approaches in mask proposal assignments, DMM-Net can more effectively adapt to unseen data, promising improvements in fields reliant on precise segmentation such as autonomous driving and surveillance.
- Theoretical Contributions: The differentiable matching layer opens new avenues for embedding optimization problems traditionally considered non-differentiable within neural network training. The approach could transfer to other domains where matching or assignment problems are prevalent, potentially advancing areas like multi-object tracking or resource allocation.
- Future Directions: The paper hints at integrating the differentiable matching framework with other network architectures to boost performance further. Additionally, exploring longer temporal window matching or embedding the layer in even more complex network architectures presents future research opportunities.
This research exemplifies the ongoing confluence of optimization algorithms with deep learning, highlighting the transformative potential of differentiable programming in machine learning models. Consequently, DMM-Net stands as a significant milestone towards more efficient and effective video object segmentation, presenting a method that balances theoretical elegance with practical performance.