CAT2000: A Large Scale Fixation Dataset for Boosting Saliency Research

Published 14 May 2015 in cs.CV | (1505.03581v1)

Abstract: Saliency modeling has been an active research area in computer vision for about two decades. Existing state of the art models perform very well in predicting where people look in natural scenes. There is, however, the risk that these models may have been overfitting themselves to available small scale biased datasets, thus trapping the progress in a local minimum. To gain a deeper insight regarding current issues in saliency modeling and to better gauge progress, we recorded eye movements of 120 observers while they freely viewed a large number of naturalistic and artificial images. Our stimuli include 4000 images; 200 from each of 20 categories covering different types of scenes such as Cartoons, Art, Objects, Low resolution images, Indoor, Outdoor, Jumbled, Random, and Line drawings. We analyze some basic properties of this dataset and compare some successful models. We believe that our dataset opens new challenges for the next generation of saliency models and helps conduct behavioral studies on bottom-up visual attention.


Authors (2)

Citations (273)


Summary

The paper presents a comprehensive fixation dataset with 4,000 images across 20 categories collected from 120 human observers.
It employs rigorous eye-tracking methods using 1920x1080 images to capture detailed human gaze behavior in diverse settings.
The evaluation shows that current saliency models underperform relative to human benchmarks, underscoring the need for more advanced, context-aware modeling.

An Overview of the CAT2000 Dataset for Saliency Research

The paper "CAT2000: A Large Scale Fixation Dataset for Boosting Saliency Research" introduces a large dataset of eye fixations recorded from human observers viewing images from a wide variety of categories. The dataset matters for saliency modeling, where accurately predicting human gaze behavior is crucial for understanding visual attention mechanisms and improving computer vision systems.

Motivation and Dataset Composition

Saliency models aim to predict which parts of a visual scene are most likely to attract human attention. Despite advances in the field, existing models are often tested on small, potentially biased datasets, which may limit how well they generalize. The paper addresses the need for a larger and more diverse dataset by introducing CAT2000, which comprises 4,000 images in 20 categories (200 images per category), including but not limited to Action, Indoor, Outdoor natural, and Social scenes.

Each image in the dataset is accompanied by fixation data drawn from a pool of 120 observers who viewed the images freely, offering a view of both bottom-up and top-down attentional processes across varied visual contexts. The images are 1920x1080 pixels and range from abstract patterns to complex social interactions.

Methodology and Observer Details

The data were collected in controlled eye-tracking experiments. The observers were undergraduate students at the University of Southern California. Each image was shown for five seconds while a high-resolution EyeLink eye tracker recorded the observer's eye movements. Recording was split across multiple sessions to cover the full image set, and calibration steps were included to keep the data accurate and reliable.

Key Findings and Model Evaluation

The study also analyzes the dataset's properties and evaluates state-of-the-art saliency models against it. The findings suggest that the strength of center bias varies across categories, with categories such as Indoor and Social showing weaker central fixation tendencies than Sketch and Art. This underscores the importance of accounting for dataset biases when modeling human perception.
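The center-bias comparison above can be made concrete with a small sketch. It is one simple way to quantify center bias, not necessarily the measure used in the paper: the mean distance of fixations from the image center, normalized so that 0.0 means every fixation is at the center and 1.0 means every fixation is at a corner. The function name `mean_center_distance` and the example fixation coordinates are hypothetical; only the 1920x1080 image size comes from the text.

```python
import math


def mean_center_distance(fixations, width=1920, height=1080):
    """Mean normalized distance of fixations from the image center.

    fixations: iterable of (x, y) pixel coordinates.
    Returns a value in [0, 1]; lower values indicate stronger center bias.
    This is an illustrative measure, not the one reported in the paper.
    """
    fixations = list(fixations)
    if not fixations:
        raise ValueError("need at least one fixation")
    cx, cy = width / 2.0, height / 2.0
    # Distance from the center to a corner, used for normalization.
    max_dist = math.hypot(cx, cy)
    total = sum(math.hypot(x - cx, y - cy) for x, y in fixations)
    return total / (len(fixations) * max_dist)


# Hypothetical fixations for two images (made-up coordinates).
centered_image = [(950, 530), (1000, 560), (900, 500)]
spread_image = [(200, 150), (1700, 900), (960, 540)]

print(mean_center_distance(centered_image))
print(mean_center_distance(spread_image))
```

Averaging this value over all images in a category would give a per-category figure that could be compared across, say, Sketch and Social, under the assumption that fixations are stored as pixel coordinates.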

Performance analysis of popular models such as ITTI, HouCVPR, GBVS, and AWS showed that their predictive accuracy varies, with all models performing below the inter-observer benchmark. The models performed best in categories like Sketch but struggled with complex scenes such as Social and Satellite images. These results imply that while current models can approximate simple gaze patterns, they still fall short of capturing the nuances of human attention in semantically rich scenes.
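This summary does not name the metric used to score the models. As an illustration of how a saliency map can be compared against recorded fixations, the sketch below computes Normalized Scanpath Saliency (NSS), a commonly used metric: the saliency map is z-scored, and the score is the mean z-scored value at the fixated pixels. Choosing NSS here is an assumption, and the function name `nss` and the toy map are hypothetical.

```python
import math


def nss(saliency, fixations):
    """Normalized Scanpath Saliency for one image.

    saliency: 2D list of numbers, indexed as saliency[y][x].
    fixations: list of (x, y) integer pixel coordinates.
    Returns the mean z-scored saliency at fixated pixels; values above 0
    mean fixations landed on regions with above-average predicted saliency.
    """
    if not fixations:
        raise ValueError("need at least one fixation")
    values = [v for row in saliency for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0:
        raise ValueError("saliency map is constant; NSS is undefined")
    return sum((saliency[y][x] - mean) / std for x, y in fixations) / len(fixations)


# Toy 2x2 saliency map with a single salient pixel at (x=1, y=1).
toy_map = [
    [0.0, 0.0],
    [0.0, 1.0],
]

print(nss(toy_map, [(1, 1)]))  # fixation on the salient pixel: positive
print(nss(toy_map, [(0, 0)]))  # fixation on a non-salient pixel: negative
```

Averaging such per-image scores within each category is one way to obtain the kind of per-category comparison described above, for example Sketch versus Social.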

Implications and Future Directions

The CAT2000 dataset represents a significant resource for saliency research. Its large scale and diversity allow for benchmarking advances in saliency modeling while preserving the ecological validity of stimuli. Moreover, the dataset offers opportunities to study attentional behavior in naturalistic settings, facilitating investigations into how semantic and contextual factors modulate gaze behavior.

Looking ahead, this dataset could inspire the development of saliency models that integrate deeper contextual understanding and semantic interpretation, aligning more closely with human visual processing. The dataset is also publicly available, which supports collaborative research in the domain.

Conclusion

In summary, the introduction of the CAT2000 dataset marks a step forward in saliency research, providing a valuable tool for the development and evaluation of visual attention models. By addressing the limitations of prior datasets and offering comprehensive fixation data, it paves the way for more accurate and generalizable models, thereby advancing our understanding of visual attention mechanisms and their application in computer vision.
