Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model
This paper introduces a novel approach to saliency prediction: an LSTM-based Saliency Attentive Model (SAM). The authors address the limitations of purely feed-forward saliency models by leveraging a Convolutional Long Short-Term Memory (ConvLSTM) network to enhance the prediction of human eye fixations on images. The method integrates a neural attention mechanism that iteratively refines the predicted saliency map by focusing on the most salient regions of the input.
Core Contributions
- Attentive ConvLSTM: The use of a ConvLSTM for saliency prediction is novel. Instead of processing a temporal sequence, the model treats successive refinement steps as the sequential dimension: the same stack of spatial feature maps is fed in at every step, and an attention mechanism selects which regions to refine next, so the initial prediction is enhanced iteratively (a minimal sketch follows this list).
- Learned Priors: The model incorporates learned priors to account for the center bias inherent in human gaze patterns. Instead of relying on predefined spatial biases, it learns the means and variances of a set of Gaussian functions, keeping the network trainable end to end. The resulting prior maps are produced automatically and improve prediction accuracy (a second sketch follows this list).
- Dilated Convolutional Networks (DCN): By employing dilated convolutions in the VGG-16 and ResNet-50 backbones, the model mitigates the loss of detail caused by spatial downscaling of feature maps. Removing stride in the later stages and dilating the subsequent filters keeps the receptive field unchanged while preserving higher spatial resolution in the output feature maps (a third sketch follows this list).
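The attentive recurrence behind the first contribution can be illustrated in a few lines of PyTorch. This is a minimal sketch, not the paper's exact architecture: the channel sizes, number of refinement steps, and gating details are illustrative, and all names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveConvLSTM(nn.Module):
    """Refines a static feature stack over a fixed number of steps;
    the 'sequence' is the refinement iteration, not time."""
    def __init__(self, channels, steps=3):
        super().__init__()
        self.steps = steps
        # Attention path: score each spatial location from input and hidden state.
        self.att_x = nn.Conv2d(channels, channels, 3, padding=1)
        self.att_h = nn.Conv2d(channels, channels, 3, padding=1)
        self.att_v = nn.Conv2d(channels, 1, 1)
        # Input/forget/output/candidate gates, computed jointly.
        self.gates = nn.Conv2d(2 * channels, 4 * channels, 3, padding=1)

    def forward(self, x):
        b, c, hgt, wid = x.shape
        h = x.new_zeros(b, c, hgt, wid)
        cell = x.new_zeros(b, c, hgt, wid)
        for _ in range(self.steps):
            # Spatial softmax attention over the (static) input features.
            score = self.att_v(torch.tanh(self.att_x(x) + self.att_h(h)))
            attn = F.softmax(score.flatten(2), dim=-1).view(b, 1, hgt, wid)
            x_tilde = attn * x
            # Standard ConvLSTM update, driven by the attended features.
            i, f, o, g = self.gates(torch.cat([x_tilde, h], 1)).chunk(4, 1)
            cell = torch.sigmoid(f) * cell + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(cell)
        return h  # refined features, same shape as the input

feats = torch.randn(1, 64, 30, 40)      # stand-in for backbone features
refined = AttentiveConvLSTM(64)(feats)
```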
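The second contribution amounts to a handful of trainable Gaussian parameters. A minimal sketch, assuming the generated prior maps are later combined with the CNN features; the number of priors and the initial values are made up:

```python
import torch
import torch.nn as nn

class LearnedGaussianPriors(nn.Module):
    """Learned center-bias priors: the means and spreads of a few 2D
    Gaussians are trainable, so the prior maps adapt end to end."""
    def __init__(self, num_priors=4):
        super().__init__()
        # Means start at the image center; log-std devs at a moderate spread.
        self.mu = nn.Parameter(torch.full((num_priors, 2), 0.5))
        self.log_sigma = nn.Parameter(torch.full((num_priors, 2), -1.0))

    def forward(self, height, width):
        ys = torch.linspace(0.0, 1.0, height)
        xs = torch.linspace(0.0, 1.0, width)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gx, gy], dim=-1)                   # (H, W, 2)
        sigma = self.log_sigma.exp()
        # Unnormalized Gaussian per prior: exp(-0.5 * ||(p - mu) / sigma||^2)
        diff = (grid.unsqueeze(0) - self.mu.view(-1, 1, 1, 2)) / sigma.view(-1, 1, 1, 2)
        return torch.exp(-0.5 * (diff ** 2).sum(dim=-1))       # (num_priors, H, W)

priors = LearnedGaussianPriors()(30, 40)  # maps to merge with CNN features
```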
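For the third contribution, torchvision already supports swapping stride for dilation in the later ResNet stages. The paper's exact dilation placement may differ, but the snippet below shows the standard mechanism:

```python
import torch
from torchvision.models import resnet50

# Swap stride for dilation in the last two ResNet stages: the overall
# output stride drops from 32 to 8 while the receptive field is preserved.
backbone = resnet50(weights=None, replace_stride_with_dilation=[False, True, True])
backbone.eval()

features = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
with torch.no_grad():
    out = features(torch.randn(1, 3, 240, 320))
print(out.shape)  # torch.Size([1, 2048, 30, 40]) rather than (1, 2048, 8, 10)
```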
Performance and Evaluation
The proposed model notably outperformed existing methods on several public benchmarks, including SALICON, MIT300, and CAT2000. In particular, the architecture showed marked improvements on the Normalized Scanpath Saliency (NSS), Correlation Coefficient (CC), and Area Under the ROC Curve (AUC) metrics, indicating a strong balance across evaluation criteria that reward different properties of a saliency map.
For example, the SAM-ResNet configuration achieved an NSS of 3.204 on the SALICON test set, surpassing earlier ResNet-based methods. The attentive mechanism and learned priors drive this improvement by progressively emphasizing different image regions, bringing the predicted saliency maps closer to human fixation data.
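NSS itself is simple to compute: the predicted map is z-scored, and the score is the mean of its values at the ground-truth fixation locations. A minimal NumPy sketch with made-up inputs:

```python
import numpy as np

def nss(saliency, fixations):
    """Normalized Scanpath Saliency: z-score the predicted map, then
    average its values at the ground-truth fixation locations."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return float(s[fixations.astype(bool)].mean())

pred = np.random.rand(30, 40)               # hypothetical predicted saliency map
fix = np.zeros((30, 40)); fix[12, 20] = 1   # hypothetical binary fixation map
print(nss(pred, fix))
```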
Implications and Future Work
The Attentive ConvLSTM offers a promising direction for integrating attention mechanisms into deep learning models for vision tasks. This paper lays the groundwork for exploring similar iterative-refinement methodologies in other domains, such as action recognition and video analysis.
Practically, the ability to predict human attention more accurately holds potential for enhancing user experience in fields such as digital marketing, video compression, and robotics. The paper also opens pathways for deeper study of automatically learned priors that adapt to complex biases in human perception.
Future research could extend this model to more dynamic visual environments, including video sequences, where temporal dependencies become significant again. The ConvLSTM framework could also be tested across varied datasets to further validate its robustness and generalization. As such models become more embedded in real-time applications, the adaptive nature of this approach could prove especially valuable.