
Video Salient Object Detection via Fully Convolutional Networks (1702.00871v3)

Published 2 Feb 2017 in cs.CV

Abstract: This paper proposes a deep learning model to efficiently detect salient regions in videos. It addresses two important issues: (1) deep video saliency model training with the absence of sufficiently large and pixel-wise annotated video data, and (2) fast video saliency training and detection. The proposed deep video saliency network consists of two modules, for capturing the spatial and temporal saliency information, respectively. The dynamic saliency model, explicitly incorporating saliency estimates from the static saliency model, directly produces spatiotemporal saliency inference without time-consuming optical flow computation. We further propose a novel data augmentation technique that simulates video training data from existing annotated image datasets, which enables our network to learn diverse saliency information and prevents overfitting with the limited number of training videos. Leveraging our synthetic video data (150K video sequences) and real videos, our deep video saliency model successfully learns both spatial and temporal saliency cues, thus producing accurate spatiotemporal saliency estimate. We advance the state-of-the-art on the DAVIS dataset (MAE of .06) and the FBMS dataset (MAE of .07), and do so with much improved speed (2fps with all steps).

Authors (3)
  1. Wenguan Wang (103 papers)
  2. Jianbing Shen (96 papers)
  3. Ling Shao (244 papers)
Citations (556)

Summary

  • The paper integrates static and dynamic saliency modules to generate pixel-wise saliency maps from complex video data.
  • It employs synthetic video data generation from annotated image datasets to overcome the scarcity of labeled video data.
  • The approach achieves competitive performance with MAE of 0.06–0.07 and processes at 2 fps without expensive optical flow.

Video Salient Object Detection via Fully Convolutional Networks

The paper presents a deep learning model for video salient object detection built on fully convolutional networks (FCNs), with a specific focus on overcoming limited annotated video data and high computational cost. By integrating spatial and temporal saliency detection modules, the work offers a comprehensive approach to video saliency, reflecting the premise that dynamic scenes require both appearance and motion analysis for effective salient object detection.

Core Contributions

The significant contributions of this paper are threefold:

  1. Integration of Static and Dynamic Saliency: The model consists of two integral components, each responsible for capturing distinct aspects of saliency. The static saliency network exploits fully convolutional networks to extract spatial saliency from individual frames. Meanwhile, the dynamic saliency network is trained to integrate spatiotemporal information, utilizing frame pairs and static saliency outputs to enhance saliency prediction.
  2. Synthetic Video Data Generation: The most innovative aspect of this paper is a data augmentation approach that leverages large-scale annotated image datasets to produce video-like training data. This synthetic generation technique alleviates the scarcity of labeled video data, enabling the deep network to be trained for video saliency without severe overfitting (a simplified sketch of the idea follows this list).
  3. Efficiency and Performance: The paper reports competitive performance on the DAVIS and FBMS datasets with mean absolute error (MAE) of 0.06 and 0.07, respectively. Notably, the computational efficiency is markedly improved, achieving processing speeds of 2 frames per second (fps) without relying on traditional optical flow computations. This is a substantial improvement over conventional methods burdened by computationally intensive steps.
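
To make the data-augmentation idea in point 2 concrete, the sketch below fakes a second frame from a single annotated image by shifting the salient object and its mask by a small random offset. This is a deliberately simplified, hypothetical version of the paper's synthesis procedure (which also applies richer deformations and motion patterns); the function and parameter names are illustrative only.

```python
import numpy as np

def synthesize_frame_pair(image, mask, max_shift=10, rng=None):
    """Simulate a (frame_t, frame_t+1) training pair from one annotated image.

    Simplified illustration only: the salient object defined by `mask` is
    pasted at a slightly shifted location to mimic inter-frame motion. The
    paper's actual augmentation is richer than this.
    """
    rng = rng if rng is not None else np.random.default_rng()
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)

    # Roll the image and mask by (dy, dx); wrap-around at the border is
    # acceptable here because the shifts are small.
    rolled_img = np.roll(image, shift=(int(dy), int(dx)), axis=(0, 1))
    rolled_mask = np.roll(mask, shift=(int(dy), int(dx)), axis=(0, 1))

    # Composite the moved object over the original background.
    fg = rolled_mask.astype(bool)
    next_frame = image.copy()
    next_frame[fg] = rolled_img[fg]

    return (image, mask), (next_frame, rolled_mask)

# Example usage with a toy 64x64 RGB image and binary mask.
img = np.zeros((64, 64, 3), dtype=np.uint8)
msk = np.zeros((64, 64), dtype=np.uint8)
img[20:40, 20:40] = 255
msk[20:40, 20:40] = 1
(frame0, mask0), (frame1, mask1) = synthesize_frame_pair(img, msk)
```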

Technical Approach

The architecture is rooted in fully convolutional networks tailored to output pixel-wise saliency maps, thereby advancing beyond the limitations of prior region-based CNN approaches. By pre-training the static network on large-scale image datasets such as ImageNet and further refining it on the synthetic video data, the researchers obtain a robust framework capable of effective saliency detection in complex, dynamic environments.
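
The following PyTorch sketch illustrates the fully convolutional pattern described above: convolutional features are reduced to a single-channel score map and upsampled back to the input resolution for pixel-wise prediction. The toy encoder stands in for the much deeper pretrained backbone used in the paper; the class name, layer sizes, and output normalization here are assumptions for illustration, not the authors' exact network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StaticSaliencyFCN(nn.Module):
    """Minimal static-saliency FCN sketch (not the paper's actual architecture)."""

    def __init__(self, in_channels=3):
        super().__init__()
        # Tiny encoder standing in for a deep pretrained backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
        )
        # 1x1 convolution reduces features to a single saliency score per location.
        self.score = nn.Conv2d(128, 1, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.encoder(x)
        logits = self.score(feats)
        # Upsample the coarse score map back to input size for pixel-wise output.
        logits = F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
        return torch.sigmoid(logits)

# Example: a 2-frame batch of 224x224 RGB frames -> per-pixel saliency in [0, 1].
static_net = StaticSaliencyFCN()
frames = torch.randn(2, 3, 224, 224)
saliency = static_net(frames)  # shape (2, 1, 224, 224)
```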

The dynamic network takes frame pairs to capture motion cues and integrates the static saliency predictions as priors, enabling accurate spatiotemporal saliency estimation. This design bypasses the need for expensive optical flow computation, allowing the model to run efficiently on standard graphics processing units (GPUs).
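
Continuing the sketch above (reusing StaticSaliencyFCN and static_net), the dynamic stream can be written as the same FCN body applied to a 7-channel input: the two frames concatenated with the static prediction. The concatenation order and channel layout are assumptions for illustration; the paper's exact input arrangement may differ.

```python
import torch

# Dynamic stream: same FCN body, but fed the frame pair plus the static
# prediction (3 + 3 + 1 = 7 input channels).
dynamic_net = StaticSaliencyFCN(in_channels=7)

frame_t = torch.randn(1, 3, 224, 224)
frame_t1 = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    static_prior = static_net(frame_t)                               # (1, 1, 224, 224)
    dyn_input = torch.cat([frame_t, frame_t1, static_prior], dim=1)  # (1, 7, 224, 224)
    spatiotemporal_saliency = dynamic_net(dyn_input)                 # (1, 1, 224, 224)
```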

Implications and Future Directions

The implications of this research extend towards various computer vision applications, including video summarization, object detection, and autonomous systems that require real-time processing capabilities. The method's reliance on synthetic data generation presents a versatile way forward in contexts where video data is hard to annotate or acquire.

Future developments might focus on enhancing the model’s adaptability to unseen scenarios or exploring its applicability in multi-object saliency detection tasks. Further exploration of synthetic data techniques could also provide deeper insights into generating more complex motion patterns or simulating diverse environmental conditions.

Overall, while the proposed model shows promise in alleviating specific challenges inherent in video saliency detection, continued investigations into scalability and efficiency could further strengthen its applicability across a broader range of video analysis tasks.