- The paper introduces a novel CNN model that integrates deep supervision at multiple layers to enhance human visual attention prediction.
- It employs an encoder-decoder architecture with skip connections to efficiently merge multi-scale saliency cues from different convolutional layers.
- Experimental results on five benchmark datasets show competitive accuracy and a real-time inference speed of 10 fps on a GPU.
Deep Visual Attention Prediction
The paper "Deep Visual Attention Prediction" by Wenguan Wang and Jianbing Shen presents a framework for predicting human visual attention using a convolutional neural network (CNN). The proposed model addresses a limitation of existing CNN-based attention models, which typically supervise only the final output layer, by leveraging multi-scale features from intermediate layers.
Overview of the Approach
The authors propose a skip-layer network structure that combines multi-level saliency predictions within a single network. Unlike previous methods, which supervise only the output layer, this model applies deep supervision across multiple layers. The architecture predicts human attention from hierarchical representations drawn from convolutional layers with differing receptive fields.
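The skip-layer idea of merging predictions from layers at different resolutions can be illustrated with a small numpy sketch. The uniform averaging and the specific map sizes below are illustrative assumptions; the paper learns the combination end-to-end rather than using fixed weights.

```python
import numpy as np

def upsample_nn(m, factor):
    """Nearest-neighbour upsampling of a 2-D map by an integer factor."""
    return m.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse_side_outputs(side_maps, full_size):
    """Upsample each side-output saliency map to the full resolution and
    average them into one prediction (uniform weights are an assumption;
    the actual model learns how the levels are combined)."""
    H, W = full_size
    ups = [upsample_nn(m, H // m.shape[0]) for m in side_maps]
    return np.mean(ups, axis=0)

# Hypothetical side outputs at 1/4, 1/2 and full resolution of a 32x32 map
rng = np.random.default_rng(0)
maps = [rng.random((8, 8)), rng.random((16, 16)), rng.random((32, 32))]
fused = fuse_side_outputs(maps, (32, 32))
print(fused.shape)  # (32, 32)
```

Coarse maps contribute global context while the full-resolution map contributes local detail, which is the intuition behind merging predictions across receptive-field scales.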
Methodology
The core network is an encoder-decoder: the encoder, based on the convolutional layers of VGG16, extracts hierarchical features, while the decoder uses deconvolutional layers to upsample the feature maps. Deep supervision enhances both the discriminative power and the robustness of the saliency features: supervision signals are fed directly into intermediate layers, which reduces redundant computation and aids the integration of multi-scale saliency cues.
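A quick way to see the encoder-decoder symmetry is to trace feature-map resolutions: each VGG-style pooling stage halves the spatial size, and each deconvolution stage doubles it back. The input size and stage count below are illustrative, not the paper's exact configuration.

```python
def trace_shapes(input_hw, n_stages):
    """Trace spatial sizes through a VGG16-style encoder (each pooling
    stage halves H and W) and a mirrored decoder (each deconvolution
    stage doubles them back toward the input resolution)."""
    h, w = input_hw
    encoder = [(h, w)]
    for _ in range(n_stages):
        h, w = h // 2, w // 2
        encoder.append((h, w))
    decoder = []
    for _ in range(n_stages):
        h, w = h * 2, w * 2
        decoder.append((h, w))
    return encoder, decoder

enc, dec = trace_shapes((224, 224), 4)
print(enc)  # [(224, 224), (112, 112), (56, 56), (28, 28), (14, 14)]
print(dec)  # [(28, 28), (56, 56), (112, 112), (224, 224)]
```

The mirrored resolutions are what make skip connections straightforward: each decoder stage has an encoder counterpart at the same spatial size whose features it can reuse.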
Key Components:
- Encoder-Decoder Architecture: The encoder extracts hierarchical features, while the decoder reconstructs dense saliency maps from these features.
- Deep Supervision: Supervision is applied at multiple layers to improve intermediate representations and overall prediction accuracy.
- Multi-Scale Integration: Through skip-layer connections, the model efficiently captures both local and global saliency information, addressing multiple scales within a unified framework.
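The deep-supervision component above can be sketched as a training loss: each intermediate side output is penalised against the ground-truth fixation map, and the per-layer losses are summed. This is a minimal numpy sketch; binary cross-entropy and uniform layer weights are assumptions for illustration, and the side outputs are assumed to be already upsampled to the target resolution.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Mean pixel-wise binary cross-entropy between a predicted saliency
    map and a ground-truth map, both with values in [0, 1]."""
    p = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

def deep_supervision_loss(side_preds, target, weights=None):
    """Total loss under deep supervision: every side output is compared
    to the same ground truth and the per-layer losses are summed
    (uniform weights are an assumption, not a learned configuration)."""
    if weights is None:
        weights = [1.0] * len(side_preds)
    return sum(w * bce(p, target) for w, p in zip(weights, side_preds))

# Hypothetical target: a bright central region on a 16x16 map
target = np.zeros((16, 16))
target[4:12, 4:12] = 1.0
preds = [np.full((16, 16), 0.5), target * 0.9 + 0.05]
total = deep_supervision_loss(preds, target)
```

Because every intermediate layer receives its own error signal, gradients reach early layers directly instead of only through the final output, which is the mechanism behind the improved intermediate representations.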
Experimental Results
The authors validated their model on five challenging benchmark datasets: MIT300, MIT1003, TORONTO, PASCAL-S, and DUT-OMRON, achieving competitive accuracy at an inference speed of 10 fps on a GPU. The proposed model performed on par with or better than 13 state-of-the-art models, with robust results across diverse dataset conditions.
Implications and Future Directions
The presented approach significantly contributes to the field of visual attention prediction, highlighting the efficacy of employing multi-level deep supervision within CNN architectures. By capturing fine-to-coarse level saliency features, the model provides insights that could enhance applications such as image segmentation, object recognition, and video understanding.
Future research directions could involve exploring more compact network architectures to further reduce computational load while maintaining or improving accuracy. Additionally, extending this framework to incorporate dynamic visual tasks or temporal sequences could offer enhanced solutions in video analysis and autonomous systems.
Conclusion
"Deep Visual Attention Prediction" provides a comprehensive and efficient solution to predicting human attention via an innovative CNN architecture. The integration of multi-level deep supervision and skip-layer connectivity demonstrates a robust methodology for saliency detection, offering valuable contributions to both academic research and practical applications in computer vision.