Video Saliency Prediction: Benchmark and Novel Model
The research work presented in the paper "Revisiting Video Saliency: A Large-scale Benchmark and a New Model" significantly advances the field of video saliency prediction, an area that has received far less attention than saliency prediction for static scenes. The authors make two key contributions: the extensive DHF1K dataset for benchmarking video saliency models, and a novel deep learning architecture that improves saliency prediction performance.
DHF1K Dataset
DHF1K is a large-scale video dataset curated specifically for dynamic-scene free-viewing, intended to overcome the limited diversity and complexity of existing datasets. It comprises 1,000 video sequences spanning a wide variety of scenes, motions, and object types with challenging backgrounds, each annotated with eye-movement data recorded from 17 observers using eye-tracking devices. The dataset stands out for its scale, its complexity, and a held-out test set reserved for unbiased model evaluation. Such an extensive resource is anticipated to be pivotal for advancing the modeling of human attentional mechanisms in dynamic scenarios.
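To make the frame-plus-annotation structure of such a dataset concrete, the snippet below sketches how one might load a short clip of frames paired with per-frame fixation maps. The directory layout, file names, and the `load_clip` helper are hypothetical illustrations for this review, not the actual organization of the DHF1K release.

```python
from pathlib import Path

import numpy as np
from PIL import Image


def load_clip(video_dir, n_frames=16):
    """Load the first `n_frames` frame/fixation-map pairs from one video folder.

    Assumes a hypothetical layout: <video_dir>/frames/*.png for the RGB frames
    and <video_dir>/maps/*.png for the per-frame fixation maps; the actual
    DHF1K release may organize its files differently.
    """
    frame_paths = sorted(Path(video_dir, "frames").glob("*.png"))[:n_frames]
    map_paths = sorted(Path(video_dir, "maps").glob("*.png"))[:n_frames]
    frames = np.stack([np.asarray(Image.open(p).convert("RGB")) for p in frame_paths])
    maps = np.stack([np.asarray(Image.open(p).convert("L")) for p in map_paths])
    # Return (T, H, W, 3) frames and (T, H, W) fixation maps, scaled to [0, 1].
    return frames / 255.0, maps / 255.0
```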
Proposed Model
The authors introduce a novel video saliency prediction model based on a CNN-LSTM architecture augmented with a supervised attention mechanism. The model uses a convolutional neural network (CNN) to encode static, per-frame information and a long short-term memory network (LSTM) to model temporal dynamics. A dedicated attention module, trained with static saliency data, captures spatial saliency explicitly, allowing the LSTM to concentrate on learning sequential saliency dynamics. This integration of static saliency encoding helps prevent overfitting and improves both training efficiency and model performance, an approach not previously explored in video saliency models.
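To illustrate this design, the following PyTorch sketch combines a per-frame CNN, an attention branch that can be supervised with static saliency maps, and a convolutional LSTM for temporal modeling. The tiny backbone, channel sizes, and class names (`AttentiveSaliencyNet`, `ConvLSTMCell`) are illustrative assumptions for this review, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: all four gates come from a single convolution."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class AttentiveSaliencyNet(nn.Module):
    """Sketch of a CNN + ConvLSTM saliency model with a supervised attention branch."""

    def __init__(self, feat_ch=64, hid_ch=32):
        super().__init__()
        # Per-frame feature extractor (a pretrained CNN would be used in practice).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(inplace=True))
        # Attention branch: a coarse static saliency map that can be trained
        # directly against image-saliency ground truth.
        self.attention = nn.Sequential(nn.Conv2d(feat_ch, 1, 1), nn.Sigmoid())
        self.convlstm = ConvLSTMCell(feat_ch, hid_ch)
        self.readout = nn.Conv2d(hid_ch, 1, 1)  # per-frame dynamic saliency map

    def forward(self, clip):
        # clip: (batch, time, 3, H, W)
        b, t, _, hgt, wdt = clip.shape
        h = clip.new_zeros(b, self.convlstm.hid_ch, hgt, wdt)
        c = torch.zeros_like(h)
        dynamic_maps, static_maps = [], []
        for step in range(t):
            feats = self.backbone(clip[:, step])
            att = self.attention(feats)                      # static attention map
            h, c = self.convlstm(feats * (1 + att), (h, c))  # residual attention gating
            dynamic_maps.append(torch.sigmoid(self.readout(h)))
            static_maps.append(att)
        return torch.stack(dynamic_maps, 1), torch.stack(static_maps, 1)
```

Given a clip tensor of shape (batch, time, 3, H, W), the sketch returns both the per-frame dynamic saliency maps and the static attention maps, so the attention branch can receive its own supervision signal during training.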
Performance Evaluation
Extensive evaluations on the DHF1K, Hollywood-2, and UCF Sports datasets, comprising over 1.2K test videos in total, underline the effectiveness of the proposed model. It consistently outperforms a range of both dynamic and static saliency models across multiple metrics, including AUC-Judd, NSS, and CC. The authors attribute this success to an architecture that separates the static and temporal components of attention prediction and then combines them effectively.
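For reference, the NumPy sketch below shows how NSS and CC are commonly computed for saliency evaluation (AUC-Judd additionally involves ranking saliency values at fixated versus non-fixated pixels and is omitted for brevity). The function names and the small epsilon guard are choices made for this sketch, not code from the paper.

```python
import numpy as np


def nss(sal_map, fixations):
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels.
    `fixations` is a boolean map with the same shape as `sal_map`."""
    s = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-8)
    return float(s[fixations].mean())


def cc(sal_map, gt_map):
    """Linear Correlation Coefficient between a predicted and a continuous
    ground-truth saliency (fixation-density) map."""
    p = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-8)
    g = (gt_map - gt_map.mean()) / (gt_map.std() + 1e-8)
    return float((p * g).mean())


# Example usage with random maps of matching shape.
pred = np.random.rand(224, 224)
density = np.random.rand(224, 224)
fix = np.zeros((224, 224), dtype=bool)
fix[np.random.randint(0, 224, 50), np.random.randint(0, 224, 50)] = True
print(nss(pred, fix), cc(pred, density))
```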
Implications and Future Work
The methodologies and findings reported in the paper mark substantial progress in video saliency prediction through robust deep learning techniques and a novel data-usage strategy. This work not only advances the state of the art in dynamic saliency modeling but also establishes DHF1K as a foundational benchmark for future research.
Despite its strengths, the model's reliance on external static saliency data for supervising its attention module points to an avenue for further innovation: models that autonomously learn and balance static and dynamic attentional cues. Additionally, exploring alternative neural architectures, such as transformer-based models that have shown promise in sequence prediction tasks, could offer further gains in predictive accuracy and computational efficiency.
In summary, this research is a commendable effort that enhances both the tools and the methods available for video saliency prediction. The proposed model and the DHF1K dataset are poised to stimulate further developments and investigations into understanding human visual attention in complex, dynamic environments.