Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model
This paper introduces a novel approach to saliency prediction: an LSTM-based Saliency Attentive Model (SAM). The authors address the limitations of purely feed-forward saliency models by leveraging a Convolutional Long Short-Term Memory (ConvLSTM) network to enhance the prediction of human eye fixations on images. The method integrates a neural attention mechanism that iteratively refines the predicted saliency map by focusing on the most salient regions of the input.
Core Contributions
- Attentive ConvLSTM: The use of a ConvLSTM for saliency prediction is novel. Instead of processing a temporal sequence, the model treats successive refinement steps as the sequential dimension: the same stack of spatial feature maps is fed in at every step, and an attention mechanism selects which regions to refine next, so the initial prediction is enhanced iteratively (a minimal sketch follows this list).
- Learned Priors: The model incorporates learned priors to account for the center bias inherent in human gaze patterns. Instead of relying on predefined spatial biases, it learns the means and variances of a set of Gaussian functions, keeping the network trainable end to end. The resulting prior maps are produced automatically and improve prediction accuracy (a second sketch follows this list).
- Dilated Convolutional Networks (DCN): By employing dilated convolutions in the VGG-16 and ResNet-50 backbones, the model mitigates the loss of detail caused by spatial downscaling of feature maps. Removing stride in the later stages and dilating the subsequent filters keeps the receptive field unchanged while preserving higher spatial resolution in the output feature maps (a third sketch follows this list).
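The attentive recurrence behind the first contribution can be illustrated in a few lines of PyTorch. This is a minimal sketch, not the paper's exact architecture: the channel sizes, number of refinement steps, and gating details are illustrative, and all names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveConvLSTM(nn.Module):
    """Refines a static feature stack over a fixed number of steps;
    the 'sequence' is the refinement iteration, not time."""
    def __init__(self, channels, steps=3):
        super().__init__()
        self.steps = steps
        # Attention path: score each spatial location from input and hidden state.
        self.att_x = nn.Conv2d(channels, channels, 3, padding=1)
        self.att_h = nn.Conv2d(channels, channels, 3, padding=1)
        self.att_v = nn.Conv2d(channels, 1, 1)
        # Input/forget/output/candidate gates, computed jointly.
        self.gates = nn.Conv2d(2 * channels, 4 * channels, 3, padding=1)

    def forward(self, x):
        b, c, hgt, wid = x.shape
        h = x.new_zeros(b, c, hgt, wid)
        cell = x.new_zeros(b, c, hgt, wid)
        for _ in range(self.steps):
            # Spatial softmax attention over the (static) input features.
            score = self.att_v(torch.tanh(self.att_x(x) + self.att_h(h)))
            attn = F.softmax(score.flatten(2), dim=-1).view(b, 1, hgt, wid)
            x_tilde = attn * x
            # Standard ConvLSTM update, driven by the attended features.
            i, f, o, g = self.gates(torch.cat([x_tilde, h], 1)).chunk(4, 1)
            cell = torch.sigmoid(f) * cell + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(cell)
        return h  # refined features, same shape as the input

feats = torch.randn(1, 64, 30, 40)      # stand-in for backbone features
refined = AttentiveConvLSTM(64)(feats)
```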
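The second contribution amounts to a handful of trainable Gaussian parameters. A minimal sketch, assuming the generated prior maps are later combined with the CNN features; the number of priors and the initial values are made up:

```python
import torch
import torch.nn as nn

class LearnedGaussianPriors(nn.Module):
    """Learned center-bias priors: the means and spreads of a few 2D
    Gaussians are trainable, so the prior maps adapt end to end."""
    def __init__(self, num_priors=4):
        super().__init__()
        # Means start at the image center; log-std devs at a moderate spread.
        self.mu = nn.Parameter(torch.full((num_priors, 2), 0.5))
        self.log_sigma = nn.Parameter(torch.full((num_priors, 2), -1.0))

    def forward(self, height, width):
        ys = torch.linspace(0.0, 1.0, height)
        xs = torch.linspace(0.0, 1.0, width)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gx, gy], dim=-1)                   # (H, W, 2)
        sigma = self.log_sigma.exp()
        # Unnormalized Gaussian per prior: exp(-0.5 * ||(p - mu) / sigma||^2)
        diff = (grid.unsqueeze(0) - self.mu.view(-1, 1, 1, 2)) / sigma.view(-1, 1, 1, 2)
        return torch.exp(-0.5 * (diff ** 2).sum(dim=-1))       # (num_priors, H, W)

priors = LearnedGaussianPriors()(30, 40)  # maps to merge with CNN features
```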
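For the third contribution, torchvision already supports swapping stride for dilation in the later ResNet stages. The paper's exact dilation placement may differ, but the snippet below shows the standard mechanism:

```python
import torch
from torchvision.models import resnet50

# Swap stride for dilation in the last two ResNet stages: the overall
# output stride drops from 32 to 8 while the receptive field is preserved.
backbone = resnet50(weights=None, replace_stride_with_dilation=[False, True, True])
backbone.eval()

features = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
with torch.no_grad():
    out = features(torch.randn(1, 3, 240, 320))
print(out.shape)  # torch.Size([1, 2048, 30, 40]) rather than (1, 2048, 8, 10)
```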
Performance and Evaluation
The proposed model notably outperformed existing methods on several public benchmarks, including SALICON, MIT300, and CAT2000. In particular, the architecture showed marked improvements on the Normalized Scanpath Saliency (NSS), Correlation Coefficient (CC), and Area Under the ROC Curve (AUC) metrics, indicating a strong balance across evaluation criteria that reward different properties of a saliency map.
For example, the SAM-ResNet configuration achieved an NSS of 3.204 on the SALICON test set, surpassing earlier ResNet-based methods. The attentive mechanism and learned priors drive this improvement by progressively emphasizing different image regions, bringing the predicted saliency maps closer to human fixation data.
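NSS itself is simple to compute: the predicted map is z-scored, and the score is the mean of its values at the ground-truth fixation locations. A minimal NumPy sketch with made-up inputs:

```python
import numpy as np

def nss(saliency, fixations):
    """Normalized Scanpath Saliency: z-score the predicted map, then
    average its values at the ground-truth fixation locations."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return float(s[fixations.astype(bool)].mean())

pred = np.random.rand(30, 40)               # hypothetical predicted saliency map
fix = np.zeros((30, 40)); fix[12, 20] = 1   # hypothetical binary fixation map
print(nss(pred, fix))
```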
Implications and Future Work
The Attentive ConvLSTM offers a promising direction for integrating attention mechanisms into deep learning models for vision tasks. This paper lays the groundwork for exploring similar iterative-refinement methodologies in other domains, such as action recognition and video analysis.
Practically, the ability to predict human attention more accurately holds potential for enhancing user experience in fields such as digital marketing, video compression, and robotics. The paper also opens pathways for deeper study of automatically learned priors that adapt to complex biases in human perception.
Future research could extend this model to more dynamic visual environments, including video sequences, where temporal dependencies become significant again. The ConvLSTM framework could also be tested across varied datasets to further validate its robustness and generalization. As such models become more embedded in real-time applications, the adaptive nature of this approach could prove especially valuable.