Efficient Mask Propagation for Video Semantic Segmentation
The paper "Mask Propagation for Efficient Video Semantic Segmentation" addresses the critical challenge in video semantic segmentation (VSS) - reducing computational cost without sacrificing accuracy. VSS aims to categorize each pixel within a video sequence, requiring the ability to process a voluminous amount of data compared to image segmentation, often leading to increased computational demands.
Proposed Framework: MPVSS
The paper introduces MPVSS, a framework that combines query-based segmentation with learned flow prediction to propagate masks efficiently across video frames.
- Segmentation on Key Frames: The authors employ Mask2Former, a query-based image segmentation model, to produce accurate mask predictions on a sparse set of key frames. These key-frame predictions then serve as references for the intervening non-key frames.
- Query-Based Flow Estimation: Instead of traditional optical flow, which estimates dense per-pixel motion, the paper introduces query-based flow estimation: a dedicated flow map is learned for each segment-level mask prediction from the key frame. Modeling motion at the segment level lets the model capture how each visual element is displaced between frames (a sketch of such a head follows this list).
- Efficient Mask Propagation: Using these segment-specific flow maps, MPVSS warps the key-frame mask predictions to the non-key frames (see the warping sketch below). This exploits the temporal redundancy of video and avoids running the expensive segmentation model on every frame.
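To make the query-based flow idea concrete, here is a minimal PyTorch sketch of one plausible design, not the paper's exact architecture: each segment query predicts the kernels of a tiny dynamic head that turns shared motion features (computed from the key and current frames) into that query's own flow map. All module and parameter names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryFlowHead(nn.Module):
    """Per-query flow prediction (illustrative sketch, assumed design).

    Each of the N segment queries yields 1x1 dynamic-convolution kernels
    that map shared motion features to a 2-channel (dx, dy) flow map,
    so every mask prediction gets its own segment-level flow field.
    """

    def __init__(self, query_dim: int = 256, feat_dim: int = 256):
        super().__init__()
        # Fuse key-frame and current-frame features into motion features.
        self.motion = nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=3, padding=1)
        # Project each query to the weights of its own 1x1 flow predictor.
        self.to_kernels = nn.Linear(query_dim, 2 * feat_dim)

    def forward(self, queries, key_feats, cur_feats):
        # queries:   [N, Cq]    segment queries from the key frame
        # key_feats: [Cf, H, W] backbone features of the key frame
        # cur_feats: [Cf, H, W] backbone features of the current frame
        pair = torch.cat([key_feats, cur_feats], dim=0).unsqueeze(0)
        m = self.motion(pair)[0]                               # [Cf, H, W]
        k = self.to_kernels(queries).view(-1, 2, m.shape[0])   # [N, 2, Cf]
        # Apply each query's 1x1 kernels to the shared motion features.
        return torch.einsum("noc,chw->nohw", k, m)             # [N, 2, H, W]
```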
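Given per-query flow maps, propagation itself reduces to bilinear warping of the key-frame mask logits, for example with torch.nn.functional.grid_sample. The sketch below assumes flows are in pixel units with (dx, dy) channel ordering; the function is illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def warp_masks(key_masks: torch.Tensor, flows: torch.Tensor) -> torch.Tensor:
    """Warp per-query mask logits from the key frame to the current frame.

    key_masks: [N, H, W]    mask logits for N queries on the key frame
    flows:     [N, 2, H, W] per-query (dx, dy) displacements, in pixels
    returns:   [N, H, W]    propagated mask logits for the current frame
    """
    n, h, w = key_masks.shape
    dev = key_masks.device
    # Base sampling grid in the normalized [-1, 1] coordinates grid_sample uses.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=dev),
        torch.linspace(-1, 1, w, device=dev),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).expand(n, h, w, 2)   # [N, H, W, 2]
    # Convert pixel displacements to normalized offsets and shift the grid.
    scale = torch.tensor([2.0 / max(w - 1, 1), 2.0 / max(h - 1, 1)], device=dev)
    grid = base + flows.permute(0, 2, 3, 1) * scale
    # Bilinearly sample each query's key-frame mask at the displaced positions.
    out = F.grid_sample(
        key_masks.unsqueeze(1), grid,
        mode="bilinear", padding_mode="border", align_corners=True,
    )
    return out.squeeze(1)
```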
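Putting the pieces together, the loop below shows where the savings come from: the heavy query-based segmenter runs only on key frames, while non-key frames reuse the key frame's queries and masks via the (hypothetical) flow head and warping function above. The segmenter and backbone interfaces are assumptions for illustration.

```python
def segment_video(frames, segmenter, flow_head, light_backbone, interval=5):
    """Sketch of MPVSS-style inference (interfaces are assumed).

    `segmenter` is an expensive query-based model (e.g. Mask2Former) that
    returns (queries, per-query mask logits, features); `light_backbone`
    extracts cheap features for non-key frames.
    """
    outputs, key = [], None
    for t, frame in enumerate(frames):
        if t % interval == 0:
            queries, masks, feats = segmenter(frame)      # expensive path
            key = (queries, masks, feats)
        else:
            queries, masks, feats = key
            cur_feats = light_backbone(frame)             # cheap path
            flows = flow_head(queries, feats, cur_feats)  # [N, 2, H, W]
            masks = warp_masks(masks, flows)              # propagate masks
        outputs.append(masks)
    return outputs
```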
Experimental Results and Performance
The efficacy of MPVSS is demonstrated through experiments on the standard VSPW and Cityscapes benchmarks. Key results include:
- On the VSPW dataset, MPVSS with a Swin-L backbone outperforms the state-of-the-art MRCFA by 4.0% in mean Intersection over Union (mIoU) while using only 26% of its FLOPs.
- On Cityscapes, MPVSS reduces computation by up to 4x compared to the per-frame Mask2Former baseline, at the cost of at most a 2% drop in mIoU.
Together, these results demonstrate a favorable accuracy-efficiency trade-off: competitive segmentation accuracy at a fraction of the computational cost.
Theoretical Contributions and Implications
The main conceptual contribution is the shift from dense optical flow to query-based, segment-level flow estimation for VSS. Modeling motion per segment rather than per pixel improves mask propagation accuracy and suggests similar strategies for other video analysis tasks such as object tracking or action recognition.
The work suggests several directions for further exploration:
- Generalization: Investigating whether the proposed framework can generalize effectively to diverse video types beyond the tested benchmarks.
- Real-Time Applications: Assessing the real-world computational gains and limitations when deploying MPVSS for real-time video processing tasks.
- Extension to Complex Scenes: Exploring enhancements in handling highly dynamic scenes or scenes with significant occlusions and camera movements.
Conclusion
MPVSS offers a promising answer to the computational challenges of video semantic segmentation. By exploiting temporal redundancy through segment-aware flow maps, it sets a precedent for future work on efficient video segmentation, and the results underscore its potential to improve the efficiency and scalability of deep learning models for video analysis.