- The paper introduces a CNN-based saliency model that leverages fixed feature maps from an ImageNet-trained network to accurately predict human fixations.
- It modifies the Krizhevsky architecture by removing the fully connected layers, retaining only spatially resolved convolutional features.
- Empirically, explained information gain jumps from 34% to 56% on the MIT benchmark, surpassing previous models.
An Academic Overview of "Deep Gaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet"
The paper "Deep Gaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet" contributes significantly to the field of computational neuroscience and computer vision by integrating pretrained deep neural networks into models predicting human fixation points in images. This paper presents an innovative approach to addressing the limitations of existing saliency models, which traditionally struggle to incorporate high-level image features such as objects.
Key Methodology
The authors build on a convolutional neural network (CNN) originally optimized for object recognition on the large ImageNet dataset. Rather than training a deep network from scratch, which would demand far more fixation data than is available, they repurpose the Krizhevsky architecture with its parameters held fixed, using its responses as a high-dimensional feature space for fixation prediction.
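To make this concrete, the sketch below shows one way to obtain such fixed feature maps. It uses torchvision's AlexNet as a stand-in for the Krizhevsky network; the library, the weights identifier, and the input size are assumptions for illustration, not the paper's exact setup.

```python
import torch
from torchvision import models

# Load an ImageNet-pretrained AlexNet; net.features holds the convolutional
# layers, so the fully connected part is simply never used.
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
conv_stack = net.features.eval()

# Freeze every parameter: the object-recognition features stay fixed while
# only the saliency readout (not shown here) is trained.
for p in conv_stack.parameters():
    p.requires_grad = False

image = torch.randn(1, 3, 224, 224)   # placeholder for a preprocessed image
with torch.no_grad():
    feature_maps = conv_stack(image)  # (1, 256, 6, 6) for a 224x224 input
```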
The CNN is adapted by removing its fully connected layers, so that only spatially resolved convolutional features remain. The paper frames saliency prediction probabilistically, modeling fixations as a point process and comparing models by log-likelihood, which yields improved predictions of fixation points over prior approaches. A linear combination of the network's convolutional responses (normalized across the dataset), blurred with a Gaussian kernel and converted into a probability distribution by a softmax, forms the final saliency map.
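A minimal sketch of this readout stage follows, assuming feature maps already upsampled to image resolution. The function names, blur width, and sparsity weight are illustrative choices, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def saliency_log_density(feature_maps: torch.Tensor,
                         weights: torch.Tensor,
                         sigma: float = 5.0) -> torch.Tensor:
    """feature_maps: (K, H, W) fixed conv responses, upsampled to image size.
    weights: (K,) learnable linear readout. Returns (H, W) log-density."""
    K, H, W = feature_maps.shape
    s = (weights.view(K, 1, 1) * feature_maps).sum(dim=0)  # linear combination

    # Separable Gaussian blur; kernel size tied to sigma (illustrative)
    ks = int(4 * sigma) | 1  # force an odd kernel size
    x = torch.arange(ks, dtype=torch.float32) - ks // 2
    g = torch.exp(-x ** 2 / (2 * sigma ** 2))
    g = (g / g.sum()).view(1, 1, 1, ks)
    s = s.view(1, 1, H, W)
    s = F.conv2d(s, g, padding=(0, ks // 2))                  # horizontal pass
    s = F.conv2d(s, g.view(1, 1, ks, 1), padding=(ks // 2, 0))  # vertical pass

    # Softmax over all pixels turns the map into a probability distribution
    return F.log_softmax(s.view(-1), dim=0).view(H, W)

def fixation_nll(log_density, fix_rows, fix_cols, weights, l1=1e-3):
    # Training signal: maximize log-likelihood of observed fixation locations,
    # with an L1 sparsity prior on the readout weights
    nll = -log_density[fix_rows, fix_cols].mean()
    return nll + l1 * weights.abs().sum()
```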
Numerical Results and Performance
The model's log-likelihoods improve markedly over previous state-of-the-art models such as eDN. Measured as explained information gain, i.e., the fraction of a gold standard's information gain over an image-independent baseline that a model accounts for, Deep Gaze I rises from the previous best of 34% to 56%. The model also achieved top AUC (Area Under the Curve) scores on the MIT Saliency Benchmark, substantially outperforming contemporary models.
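As a rough illustration of how explained information gain is computed (the function name and the numbers are placeholders, not values from the paper):

```python
def explained_information_gain(ll_model, ll_baseline, ll_gold):
    """All arguments are average log-likelihoods (e.g., bits per fixation).
    Returns the share of the gold standard's information gain over the
    baseline that the model accounts for."""
    return (ll_model - ll_baseline) / (ll_gold - ll_baseline)

# e.g., a model halfway between the baseline and the gold standard scores 0.5
print(explained_information_gain(ll_model=0.5, ll_baseline=0.0, ll_gold=1.0))
```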
The paper reports markedly better handling of high-level content such as faces and text, with the deep feature maps also capturing effects such as visual pop-out. A layer-wise analysis shows that the convolutional features of the uppermost layer are the most predictive of human fixations, underscoring the value of high-level abstraction for this task.
Practical and Theoretical Implications
Practically, the success of object-trained networks in saliency prediction signals the broader applicability of pretrained neural networks, which can transfer to new tasks with minimal adjustment. Theoretically, the model offers insight into the neural correlates of saliency, since its fixation prediction mechanisms align closely with human visual processing. The paper suggests future work extending neural attention models with such pretrained features, potentially bringing object recognition systems closer to human perceptual processes.
Future Directions in AI
The paper suggests that future work could apply deeper networks such as VGG or GoogLeNet to saliency tasks, since deep layers trained on expansive datasets like ImageNet promise even larger information gains. Further refinement of point-process-based optimization could sharpen fixation prediction models, fostering advances at the intersection of computer vision and neuroscience.
In summary, the paper makes a compelling case that deep neural networks pretrained on large datasets can substantially improve the prediction of human fixations, setting a new benchmark in the field and paving the way for future work on the neural basis of visual attention. It stands as a testament to the power of deep learning in elucidating complex cognitive functions and opens new opportunities for cross-disciplinary advances.