Visual Saliency Based on Multiscale Deep Features
The paper "Visual Saliency Based on Multiscale Deep Features" by Guanbin Li and Yizhou Yu presents a novel approach to visual saliency estimation utilizing multiscale deep features derived from convolutional neural networks (CNNs). Visual saliency, a measure of the prominence of various regions in an image, has profound implications in fields ranging from cognitive sciences to computer vision. This research emphasizes the utility of deep learning techniques in advancing the precision of visual saliency models.
Neural Network Architecture and Saliency Model
The proposed neural network architecture extracts CNN features at three nested scales: the image region itself, its immediate neighboring regions, and the entire image; the three streams are collectively termed S-3CNN. The multiscale features are concatenated and fed into fully connected layers that act as a regressor for inferring a saliency score for each region. This design goes beyond simple feature extraction, allowing the model to assess region contrast both locally and globally within an image.
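To make the data flow concrete, the following is a minimal sketch of the S-3CNN idea, assuming PyTorch, a torchvision VGG16 backbone as a stand-in for the paper's pretrained CNN, and illustrative layer sizes; it is not the authors' implementation.

```python
# Minimal sketch of the S-3CNN idea. The VGG16 backbone, layer sizes, and
# sigmoid output are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
from torchvision import models


class MultiscaleSaliencyRegressor(nn.Module):
    """Concatenates CNN features from three scales and regresses a saliency score."""

    def __init__(self, hidden_dim: int = 300):
        super().__init__()
        # Shared pretrained backbone used as a fixed feature extractor (assumption).
        backbone = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        self.features = nn.Sequential(
            backbone.features, nn.AdaptiveAvgPool2d(1), nn.Flatten()
        )
        for p in self.features.parameters():
            p.requires_grad = False
        feat_dim = 512  # channel count of VGG16's final conv block

        # Fully connected layers acting as the saliency regressor.
        self.regressor = nn.Sequential(
            nn.Linear(3 * feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),  # saliency score in [0, 1]
        )

    def forward(self, region, neighborhood, full_image):
        # Each input is a batch of image crops resized to the backbone's input size.
        f_small = self.features(region)          # the segmented region itself
        f_medium = self.features(neighborhood)   # region plus surrounding context
        f_large = self.features(full_image)      # the entire image
        fused = torch.cat([f_small, f_medium, f_large], dim=1)
        return self.regressor(fused).squeeze(1)


# Example: score one region given three 224x224 crops.
model = MultiscaleSaliencyRegressor().eval()
crops = [torch.rand(1, 3, 224, 224) for _ in range(3)]
with torch.no_grad():
    score = model(*crops)
```

Sharing one backbone across the three inputs keeps the sketch compact; the essential point is that the regressor sees region, context, and global features side by side.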
Enhancement Techniques
To improve spatial coherence in the saliency results, the authors introduce a refinement step that averages saliency values over superpixels and applies an edge-preserving regularization scheme. In addition, the paper aggregates the saliency maps computed at different levels of a hierarchical image segmentation. This multi-level fusion is a weighted linear combination whose weights are learned on a validation set, as sketched below.
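As a rough illustration of the fusion step, the sketch below fits linear-combination weights to ground-truth masks on a few validation images using ordinary least squares; the function names, data shapes, and the unconstrained least-squares fit are assumptions rather than the paper's exact optimization procedure.

```python
# Sketch of multi-level saliency map fusion: maps from several segmentation
# levels are combined linearly, with weights fit on validation images.
import numpy as np


def fit_fusion_weights(level_maps, gt_masks):
    """level_maps: list over images of arrays shaped (L, H, W), one map per level.
    gt_masks: list over images of binary arrays shaped (H, W)."""
    X = np.concatenate([m.reshape(m.shape[0], -1).T for m in level_maps], axis=0)  # (pixels, L)
    y = np.concatenate([g.reshape(-1) for g in gt_masks], axis=0)                  # (pixels,)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


def fuse(level_map_stack, weights):
    """Combine an (L, H, W) stack of per-level maps into one fused saliency map."""
    fused = np.tensordot(weights, level_map_stack, axes=1)
    return np.clip(fused, 0.0, 1.0)


# Toy usage with random data standing in for real per-level saliency maps.
rng = np.random.default_rng(0)
maps = [rng.random((3, 64, 64)) for _ in range(5)]
gts = [(rng.random((64, 64)) > 0.5).astype(float) for _ in range(5)]
w = fit_fusion_weights(maps, gts)
fused = fuse(maps[0], w)
```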
New Dataset: HKU-IS
The authors address the limitations of existing datasets by constructing a new, challenging dataset, HKU-IS, comprising 4447 images with detailed pixelwise annotations. The dataset includes images with multiple salient objects, salient objects touching the image boundary, and low-contrast scenes, making it a more demanding benchmark for evaluating the proposed approach.
Experimental Validation
The experimental results support the proposed model, which achieves state-of-the-art performance on several public benchmarks, including MSRA-B, SED, SOD, and iCoSeg. On the key metrics, F-Measure and Mean Absolute Error (MAE), the gains are substantial: relative to prior methods, the approach improves the F-Measure by 5.0% on the MSRA-B dataset and lowers the MAE by 35.1% on the HKU-IS dataset.
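For reference, the snippet below computes these two metrics for a single saliency map, assuming the beta-squared = 0.3 weighting that is conventional in saliency evaluation and a fixed binarization threshold; benchmark protocols typically sweep thresholds or use an adaptive one.

```python
# Sketch of the two evaluation metrics for one predicted saliency map (values
# in [0, 1]) against a binary ground-truth mask. The fixed threshold is an
# illustrative assumption.
import numpy as np


def f_measure(pred, gt, threshold=0.5, beta_sq=0.3):
    binary = pred >= threshold
    tp = np.logical_and(binary, gt).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    denom = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denom if denom > 0 else 0.0


def mae(pred, gt):
    return np.abs(pred.astype(float) - gt.astype(float)).mean()


# Toy usage with a random prediction and mask.
rng = np.random.default_rng(1)
pred = rng.random((64, 64))
gt = rng.random((64, 64)) > 0.5
print(f_measure(pred, gt), mae(pred, gt))
```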
Practical and Theoretical Implications
The incorporation of multiscale CNN features into visual saliency models suggests a paradigm where deep learning can capture and exploit intricate feature hierarchies for superior saliency detection. Practically, these findings can enhance performance across a spectrum of computer vision tasks, including image segmentation, object recognition, and scene understanding. Theoretically, the work underscores the importance of multi-scale analysis and hierarchical structures in enhancing model accuracy and robustness.
Future Directions
Future developments could explore extending this approach to dynamic scenes or videos, taking into account temporal coherence and motion cues. Further, integrating transformer-based architectures with CNN features may propel advancements in capturing long-range dependencies and context within images.
In summary, the paper presents a comprehensive and technically sound method for visual saliency estimation, leveraging multiscale deep features to achieve superior accuracy and robustness. This approach paves the way for further explorations in leveraging deep learning for intricate visual tasks, cementing the role of hierarchical and multiscale representations in computer vision.