- The paper proposes an attention-based temporal aggregation module (TAM) that enhances temporal coherence in video object matting by aggregating features across frames.
- Experimental results on the new VideoMatting108 dataset demonstrate significant improvements in temporal coherence and overall matting quality compared to existing methods.
- The method plugs into existing image matting networks, offering a practical upgrade path for video editing pipelines and suggesting new directions for other spatio-temporal tasks.
Attention-Guided Temporally Coherent Video Object Matting
The paper "Attention-guided Temporally Coherent Video Object Matting" presents a deep learning strategy aimed at enhancing the video object matting process, ensuring temporal coherence in the resulting video matting quality. Video object matting involves calculating alpha mattes across video frames to isolate and define foreground objects, crucial for seamless video editing applications such as compositing. This task is challenging due to the need to maintain spatial consistency and minimize temporal artifacts like flicker.
Key Contributions
The central contribution of this research is an attention-based temporal aggregation module (TAM). The module aggregates features across consecutive video frames to improve temporal coherence and robustness against motion, appearance variation, and occlusion. Because temporal correlations are computed with attention over learned features rather than local color, the aggregation tolerates appearance changes and keeps the predicted mattes consistent from frame to frame.
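The paper's exact TAM architecture is more involved, but the core mechanism, attention-weighted aggregation of neighboring-frame features, can be sketched as follows. This is a minimal single-head PyTorch sketch; the 1x1 projections and the residual fusion step are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class TemporalAggregation(nn.Module):
    """Illustrative sketch of attention-based temporal feature aggregation.

    Queries come from the current frame; keys/values come from neighboring
    frames. Correlations live in feature space, not raw color space, so the
    aggregation is robust to appearance changes between frames.
    """
    def __init__(self, channels):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, 1)
        self.to_k = nn.Conv2d(channels, channels, 1)
        self.to_v = nn.Conv2d(channels, channels, 1)

    def forward(self, cur, neighbors):
        # cur: (B, C, H, W); neighbors: (B, T, C, H, W) from nearby frames.
        B, T, C, H, W = neighbors.shape
        q = self.to_q(cur).flatten(2)                        # (B, C, HW)
        k = self.to_k(neighbors.flatten(0, 1)).view(B, T, C, H * W) \
                .permute(0, 2, 1, 3).reshape(B, C, T * H * W)
        v = self.to_v(neighbors.flatten(0, 1)).view(B, T, C, H * W) \
                .permute(0, 2, 1, 3).reshape(B, C, T * H * W)
        # Every current-frame location attends to all neighbor locations.
        attn = torch.softmax(q.transpose(1, 2) @ k / C ** 0.5, dim=-1)  # (B, HW, T*HW)
        agg = (attn @ v.transpose(1, 2)).transpose(1, 2)     # (B, C, HW)
        # Fuse aggregated temporal context back into the current features.
        return cur + agg.reshape(B, C, H, W)
```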
The TAM is designed to plug into state-of-the-art image matting networks, effectively turning them into video matting networks. This reuses the strengths of existing CNN architectures for video while adding little computational overhead.
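As a hedged illustration of this integration, the sketch below splits a hypothetical image matting network at its bottleneck and inserts the aggregation module from the previous sketch. Here `encoder` and `decoder` are placeholder interfaces, not the paper's actual components:

```python
import torch
import torch.nn as nn

class VideoMattingNet(nn.Module):
    """Sketch: wrap a per-frame image matting backbone with temporal aggregation.

    `encoder` and `decoder` stand in for any image matting network split at
    its bottleneck; `TemporalAggregation` is the module sketched above.
    """
    def __init__(self, encoder, decoder, channels):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.tam = TemporalAggregation(channels)

    def forward(self, frames, trimaps):
        # frames, trimaps: lists of (B, C, H, W) tensors; center frame is the target.
        feats = [self.encoder(torch.cat([f, t], dim=1))
                 for f, t in zip(frames, trimaps)]
        mid = len(feats) // 2
        neighbors = torch.stack(feats[:mid] + feats[mid + 1:], dim=1)  # (B, T-1, C, H, W)
        fused = self.tam(feats[mid], neighbors)
        return self.decoder(fused)  # alpha matte for the center frame
```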
Additionally, the paper addresses trimap generation, an integral part of video matting. Trimaps delineate the foreground, background, and unknown regions of an image or video frame. The authors adopt an STM-based video object segmentation approach that propagates user-annotated keyframe trimaps to the remaining frames, yielding efficient and accurate trimaps across entire sequences.
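The propagation pattern can be sketched as follows; `stm_model` and its `encode_memory`/`read` methods are hypothetical stand-ins for a space-time memory segmentation network, not the paper's actual API:

```python
def propagate_trimaps(frames, keyframe_ids, keyframe_trimaps, stm_model):
    """Sketch of memory-based (STM-style) trimap propagation.

    User-annotated keyframes populate the memory; every other frame is
    classified into the three trimap regions by reading from that memory.
    """
    # Encode each annotated keyframe (frame + trimap) into the memory.
    memory = [stm_model.encode_memory(frames[i], keyframe_trimaps[k])
              for k, i in enumerate(keyframe_ids)]
    trimaps = {}
    for i, frame in enumerate(frames):
        if i in keyframe_ids:
            trimaps[i] = keyframe_trimaps[keyframe_ids.index(i)]
            continue
        logits = stm_model.read(frame, memory)  # (3, H, W) class scores
        trimaps[i] = logits.argmax(dim=0)       # 0 = bg, 1 = unknown, 2 = fg
    return trimaps
```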
Dataset and Results
To train and evaluate the method, the authors constructed a large-scale dataset, VideoMatting108, comprising 108 foreground video clips with carefully annotated per-frame alpha mattes and covering a diverse range of objects and motions. Experiments on this dataset show that the proposed method produces high-quality alpha mattes across challenging video scenarios, with substantial improvements on temporal coherence metrics.
Quantitative evaluations back these claims, showing notably lower error on both spatial metrics such as SAD (sum of absolute differences) and temporal metrics such as dtSSD (the mismatch in frame-to-frame matte change), compared with traditional methods and recent state-of-the-art techniques. The focus on temporal coherence, through both the TAM design and dedicated temporal loss terms, makes the method robust to the dynamics common in practical video processing.
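For concreteness, here is a sketch of the two metrics under their common definitions in the matting literature; the paper's exact normalization and scaling may differ:

```python
import numpy as np

def sad(pred, gt):
    """Sum of absolute differences between predicted and ground-truth mattes.

    pred, gt: (T, H, W) alpha sequences in [0, 1]. Lower is better.
    """
    return np.abs(pred - gt).sum()

def dtssd(pred, gt):
    """Temporal-coherence error: mismatch between the *temporal gradients*
    of the predicted and ground-truth mattes (lower is better). A matte can
    have low SAD yet still flicker; dtSSD penalizes exactly that flicker.
    """
    dp = pred[1:] - pred[:-1]   # frame-to-frame change of the prediction
    dg = gt[1:] - gt[:-1]       # frame-to-frame change of the ground truth
    per_frame = np.sqrt(((dp - dg) ** 2).mean(axis=(1, 2)))
    return per_frame.mean()
```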
Implications and Future Work
This research has notable practical and theoretical implications. Practically, it offers a scalable and effective solution for video editing and post-production workflows. Theoretically, it suggests that other image processing and computer vision tasks can be lifted to video by enforcing feature-level temporal consistency.
Future work could apply the temporal aggregation approach more broadly, for example optimizing it for real-time use or adapting it to other spatio-temporal tasks. Weakly supervised training is another promising direction, since it would reduce the dependence on dense ground-truth annotations, one of the remaining practical limitations.
In summary, the paper advances video matting by combining attention mechanisms and temporal feature aggregation in a cohesive, effective network architecture, laying groundwork for future video processing techniques.