- The paper introduces a deep learning architecture that combines multi-level feature extraction with data-driven priors for enhanced saliency prediction.
- It employs a modified VGG-16 network with adjusted pooling layers to capture low-, medium-, and high-level features for saliency map construction.
- Empirical evaluations on the SALICON and MIT300 benchmarks demonstrate superior performance, achieving state-of-the-art results on several standard saliency metrics.
A Deep Multi-Level Network for Saliency Prediction
The paper under consideration presents a refined deep learning architecture aimed at advancing saliency prediction, a crucial task in the field of computer vision. Authored by Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara, the research introduces a unique approach that leverages multi-level feature extraction for generating precise saliency maps.
Architectural Innovation
This work diverges from traditional fully convolutional networks by implementing an architecture that amalgamates features extracted across various levels of a Convolutional Neural Network (CNN). The architecture is composed of three primary blocks:
- Feature Extraction Network: A fully convolutional network based on the VGG-16 model, modified to maintain a higher resolution in deeper layers. By adjusting pooling layers, the configuration effectively extracts low, medium, and high-level features, which are crucial for the subsequent stages.
- Encoding Network: This stage weights the extracted features using a learned feature weighting function, generating saliency-specific feature maps that form the basis of the final saliency map.
- Prior Learning Network: In contrast to hand-crafted priors, this approach incorporates a data-driven method to learn appropriate priors, contributing to a more informative saliency prediction.
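The three blocks above can be sketched compactly. The following is a minimal NumPy illustration, not the authors' Keras implementation: `encode_and_fuse` is a hypothetical name, the features are assumed to already share a spatial resolution, the encoding step is reduced to a 1×1-convolution-style channel weighting, and the learned prior is assumed to modulate the map multiplicatively after upsampling.

```python
import numpy as np

def encode_and_fuse(low, mid, high, weights, prior):
    """Toy sketch of the three-block pipeline (illustrative shapes only).

    low, mid, high : (C, H, W) feature maps taken at different depths of
                     the modified VGG-16, assumed already brought to a
                     common spatial resolution.
    weights        : (3*C,) learned weighting over the stacked channels,
                     standing in for the encoding network's 1x1 conv.
    prior          : (h, w) coarse learned prior, upsampled to (H, W).
    """
    stacked = np.concatenate([low, mid, high], axis=0)          # (3C, H, W)
    # Encoding: weighted combination of channels, like a 1x1 convolution.
    saliency = np.tensordot(weights, stacked, axes=([0], [0]))  # (H, W)
    # Prior learning: upsample the coarse prior by nearest-neighbour repeat.
    H, W = saliency.shape
    up = np.repeat(np.repeat(prior, H // prior.shape[0], axis=0),
                   W // prior.shape[1], axis=1)
    # Assumption: the learned prior modulates the map multiplicatively.
    return saliency * up
```

In the actual model the channel weighting and the prior are learned end-to-end with the feature extractor; the sketch only fixes the data flow between the three blocks.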
Methodology and Evaluation Metrics
The authors employed a novel loss function to address the imbalance between the few salient and many non-salient pixels in a typical saliency map, optimizing the normalized predictions against human fixation ground truths. The research leverages benchmarks such as SALICON and MIT300 to validate the proposed model, with evaluation metrics including Pearson's linear correlation coefficient (CC), Normalized Scanpath Saliency (NSS), and several variants of the Area Under the ROC Curve (AUC).
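The general idea of such a reweighted loss can be sketched as follows. This is a hedged sketch, not a verified port of the paper's formula: `saliency_loss` is a hypothetical name, and the specific constant `alpha` and the max-normalization are illustrative choices that implement the stated goal of weighting fixated pixels more heavily.

```python
import numpy as np

def saliency_loss(pred, gt, alpha=1.1):
    """Sketch of a normalized, reweighted MSE saliency loss.

    pred, gt : (H, W) predicted map and ground-truth density in [0, 1].
    alpha    : constant chosen larger than max(gt); pixels near fixations
               (high gt) receive a larger weight because (alpha - gt)
               is small there, counteracting the salient/non-salient
               pixel imbalance.
    """
    pred_norm = pred / (pred.max() + 1e-8)   # normalize prediction to [0, 1]
    return np.mean((pred_norm - gt) ** 2 / (alpha - gt))
```

Dividing the squared error by `(alpha - gt)` is one simple way to make errors at fixated locations cost more than errors on the large background region.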
Empirical Results
The empirical results underscore the model's efficacy, particularly on the SALICON dataset, where the approach surpassed existing state-of-the-art methods on the CC, shuffled AUC, and AUC-Judd metrics. It also demonstrated competitive performance on the MIT300 benchmark, reinforcing the architecture's ability to generalize across datasets.
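Two of the metrics used in these comparisons have compact, standard definitions that are easy to state in code. The sketch below implements CC as the Pearson correlation between the predicted and ground-truth maps, and NSS as the mean z-scored saliency value at human fixation points; the function names and the small epsilon for numerical stability are my own choices, not from the paper.

```python
import numpy as np

def cc(pred, gt):
    """Pearson's linear correlation coefficient between two saliency maps."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    g = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float(np.mean(p * g))

def nss(pred, fixations):
    """Normalized Scanpath Saliency.

    fixations : boolean (H, W) mask of human fixation locations.
    Returns the mean of the z-scored predicted map at those locations;
    values well above 0 indicate the model assigns high saliency where
    people actually looked.
    """
    z = (pred - pred.mean()) / (pred.std() + 1e-8)
    return float(z[fixations].mean())
```

AUC variants (Judd, shuffled) additionally require thresholding the map and, for the shuffled form, sampling negatives from other images' fixations, so they are omitted here.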
Theoretical and Practical Implications
The proposed model's significant performance improvements emphasize the value of incorporating multi-level feature extraction and learned priors in saliency prediction tasks. By meticulously combining features from various convolutional stages, the architecture aligns more closely with the complex hierarchical nature of the human visual system. Practically, this advancement opens avenues for enhanced scene understanding in domains such as autonomous vehicles, robotic vision systems, and augmented reality applications.
Future Directions
Future research could explore further optimization of feature encoding strategies to handle even larger datasets or to reduce computational cost without compromising accuracy. Additionally, applying the model to real-time applications could provide insights into its operational efficacy in dynamic or resource-constrained environments.
Overall, this paper delivers significant contributions to the field of saliency prediction, providing a nuanced understanding of convolutional feature utilization and setting a new benchmark for future explorations.