DeepGaze II: Reading fixations from deep features trained on object recognition (1610.01563v1)

Published 5 Oct 2016 in cs.CV, q-bio.NC, and stat.AP

Abstract: Here we present DeepGaze II, a model that predicts where people look in images. The model uses the features from the VGG-19 deep neural network trained to identify objects in images. Contrary to other saliency models that use deep features, here we use the VGG features for saliency prediction with no additional fine-tuning (rather, a few readout layers are trained on top of the VGG features to predict saliency). The model is therefore a strong test of transfer learning. After conservative cross-validation, DeepGaze II explains about 87% of the explainable information gain in the patterns of fixations and achieves top performance in area under the curve metrics on the MIT300 hold-out benchmark. These results corroborate the finding from DeepGaze I (which explained 56% of the explainable information gain), that deep features trained on object recognition provide a versatile feature space for performing related visual tasks. We explore the factors that contribute to this success and present several informative image examples. A web service is available to compute model predictions at http://deepgaze.bethgelab.org.

Citations (279)

Summary

  • The paper introduces DeepGaze II, a novel approach that uses fixed VGG-19 features combined with a learned 1x1 convolutional readout network to predict eye fixation patterns.
  • The paper achieves a significant performance boost by explaining approximately 87% of the explainable information gain, markedly improving upon DeepGaze I's 56%.
  • The paper validates its methodology on the MIT300 benchmark, underscoring the effective application of transfer learning in saliency prediction for eye tracking.

An Analysis of "DeepGaze II: Reading Fixations from Deep Features Trained on Object Recognition"

Overview

The paper introduces DeepGaze II, a model for predicting human eye fixations in images using deep features from a pre-trained VGG-19 network. It applies transfer learning without fine-tuning the VGG features; only additional readout layers are trained to generate the saliency maps. This makes DeepGaze II a strong test of transfer learning, particularly in domains constrained by relatively small datasets.

Methodology

DeepGaze II adopts a probabilistic framework: the input image is passed through the convolutional layers of VGG-19 to produce feature maps, which are then combined by a readout network of four 1x1 convolutional layers. The readout network learns a pointwise non-linear mapping from the fixed deep features to a fixation density. Crucially, the pre-trained VGG features remain frozen throughout, so the model directly tests whether the feature space learned for object recognition transfers to saliency prediction.
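
The following is a minimal sketch of this kind of architecture, assuming PyTorch and torchvision. The layer selection, channel counts, and the omission of the blur and center-bias steps are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a DeepGaze II-style readout on frozen VGG-19 features.
# Layer choices and sizes are illustrative, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class SaliencyReadout(nn.Module):
    def __init__(self, in_channels=512, hidden=16):
        super().__init__()
        self.backbone = vgg19(weights="IMAGENET1K_V1").features
        for p in self.backbone.parameters():
            p.requires_grad = False          # VGG features stay frozen
        # Four 1x1 convolutions: a pointwise non-linear readout.
        self.readout = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 1),
        )

    def forward(self, image):
        feats = self.backbone(image)          # (B, 512, H', W') deep features
        logits = self.readout(feats)          # (B, 1, H', W') saliency logits
        # Softmax over spatial positions yields a fixation probability map.
        b, _, h, w = logits.shape
        log_density = F.log_softmax(logits.view(b, -1), dim=1).view(b, 1, h, w)
        return log_density
```

Because the backbone is frozen, only the small readout network receives gradient updates during training, which is what makes the setup a direct test of how well object-recognition features transfer.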

Training proceeds in two phases: pre-training on the SALICON dataset, followed by fine-tuning on the MIT1003 dataset with cross-validation to prevent overfitting. Evaluation via log-likelihoods and information gain positions DeepGaze II favorably against its predecessors and contemporary saliency models.
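
In this evaluation framework, information gain is the average log-likelihood advantage of the model's fixation density over a baseline density, measured in bits per fixation. The sketch below illustrates the computation; the array shapes, the baseline choice (e.g. a center-bias prior), and the gold-standard comparison are assumptions for illustration.

```python
# Sketch of information-gain evaluation: the log-likelihood advantage
# (in bits per fixation) of a model's density over a baseline density.
import numpy as np

def information_gain(log_density, baseline_log_density, fixations):
    """log_density, baseline_log_density: 2-D arrays of log-probabilities
    over image pixels; fixations: list of (row, col) fixation positions."""
    rows, cols = zip(*fixations)
    model_ll = np.mean(log_density[rows, cols])
    base_ll = np.mean(baseline_log_density[rows, cols])
    return (model_ll - base_ll) / np.log(2)   # convert nats to bits

# The "explainable information gain" percentage reported in the paper
# compares the model's gain to that of a gold-standard model:
#   ratio = information_gain(model) / information_gain(gold standard)
```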

Results and Performance

DeepGaze II demonstrates strong empirical results, explaining approximately 87% of the explainable information gain in fixation patterns. It surpasses its predecessor, DeepGaze I, which explained 56% of the explainable information gain. This improvement marks a significant step towards closing the gap with gold-standard predictions, which predict each observer's fixations from the fixations of all other observers.

The model also achieves top performance on the MIT300 hold-out benchmark under traditional evaluation metrics such as AUC and sAUC. This supports the use of VGG-19 features without fine-tuning and reinforces the finding that features learned for object recognition adapt well to related visual tasks such as fixation prediction.
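
For context, AUC-style saliency metrics treat fixated pixels as positives and sampled non-fixated pixels as negatives, with the saliency value acting as the classifier score; the shuffled variant (sAUC) draws negatives from fixation locations of other images to discount center bias. The sketch below uses a simplified uniform negative-sampling scheme and is an illustration, not the benchmark's exact protocol.

```python
# Sketch of AUC evaluation for a saliency map: fixated pixels are positives,
# randomly sampled pixels are negatives, saliency values are the scores.
import numpy as np
from sklearn.metrics import roc_auc_score

def saliency_auc(saliency, fixations, n_negatives=1000, rng=None):
    rng = rng or np.random.default_rng(0)
    rows, cols = zip(*fixations)
    pos = saliency[rows, cols]                       # scores at fixations
    neg_idx = rng.integers(0, saliency.size, n_negatives)
    neg = saliency.ravel()[neg_idx]                  # scores at sampled pixels
    scores = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return roc_auc_score(labels, scores)
```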

Implications and Future Directions

The success of DeepGaze II highlights the advantages of reusing a pre-trained network's feature hierarchy without extensive retraining, making a compelling case study for the efficacy of transfer learning in computer vision. For researchers and practitioners working on eye tracking and gaze prediction, the results support reusing networks optimized for object recognition to improve performance on related tasks without substantial additional training.

Future developments could refine this methodology by experimenting with different network configurations or with hybrid models that combine deep features with additional context-aware mechanisms. Extending such models to other gaze-related tasks could further exploit the versatility observed in DeepGaze II's architecture.

Ultimately, this work not only probes the boundaries of transfer learning in saliency prediction but also deepens our understanding of how pre-trained deep network features can be repurposed across diverse tasks with limited labeled data.