Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition (1312.7570v1)

Published 29 Dec 2013 in cs.CV

Abstract: Systems based on bag-of-words models from image features collected at maxima of sparse interest point operators have been used successfully for both computer visual object and action recognition tasks. While the sparse, interest-point based approach to recognition is not inconsistent with visual processing in biological systems that operate in `saccade and fixate' regimes, the methodology and emphasis in the human and the computer vision communities remains sharply distinct. Here, we make three contributions aiming to bridge this gap. First, we complement existing state-of-the art large scale dynamic computer vision annotated datasets like Hollywood-2 and UCF Sports with human eye movements collected under the ecological constraints of the visual action recognition task. To our knowledge these are the first large human eye tracking datasets to be collected and made publicly available for video, vision.imar.ro/eyetracking (497,107 frames, each viewed by 16 subjects), unique in terms of their (a) large scale and computer vision relevance, (b) dynamic, video stimuli, (c) task control, as opposed to free-viewing. Second, we introduce novel sequential consistency and alignment measures, which underline the remarkable stability of patterns of visual search among subjects. Third, we leverage the significant amount of collected data in order to pursue studies and build automatic, end-to-end trainable computer vision systems based on human eye movements. Our studies not only shed light on the differences between computer vision spatio-temporal interest point image sampling strategies and the human fixations, as well as their impact for visual recognition performance, but also demonstrate that human fixations can be accurately predicted, and when used in an end-to-end automatic system, leveraging some of the advanced computer vision practice, can lead to state of the art results.

Citations (191)

Summary

  • The paper introduces large-scale dynamic gaze datasets and novel consistency metrics to merge human attention cues with machine vision.
  • The methodology leverages end-to-end systems incorporating human saliency predictors to achieve state-of-the-art results on benchmarks like Hollywood-2 and UCF Sports.
  • Results challenge conventional machine-vision interest point sampling strategies, demonstrating that human-derived fixation data enhances visual recognition in dynamic video contexts.

Overview of "Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition"

The paper "Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition" by Stefan Mathe and Cristian Sminchisescu explores the integration of human eye movement patterns within the frameworks of computer vision and action recognition systems. The paper provides significant methodological contributions to bridge the existing gap between human visual processing mechanisms and computer vision algorithms, specifically within the dynamic context of visual action recognition tasks.

Key Contributions

  1. Dynamic Gaze Datasets: The authors introduce the first large-scale video datasets annotated with human eye movements collected under task-specific conditions. These datasets, derived from the Hollywood-2 and UCF Sports datasets, comprise 497,107 frames, each viewed by 16 subjects, and are publicly available for research purposes. This resource aligns human visual attention cues with established computer vision benchmarks, facilitating the exploration of biologically inspired computational models.
  2. Consistency and Alignment Measures: Novel metrics for evaluating the spatial and sequential consistency of human fixations across video frames and subjects are introduced. These measures reveal a remarkable degree of inter-subject agreement in gaze patterns under task-controlled viewing, in contrast to the biases typically studied in free-viewing settings (a minimal consistency sketch is given after this list).
  3. Learning Saliency Predictors: Using the large corpus of collected eye-tracking data, the paper trains saliency models that accurately predict human fixations in video. Features sampled at predicted fixations enhance visual recognition, yielding robust performance improvements.
  4. End-to-End System Realization: The authors demonstrate that incorporating human attention-derived interest point operators into an end-to-end visual action recognition system can achieve state-of-the-art results on existing benchmarks. This validates the potential of a symbiotic integration of human vision characteristics and advanced computer vision methodologies.

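For intuition, the following is a minimal sketch (Python/NumPy) of one common way to quantify inter-subject spatial consistency of fixations: score each subject's fixations against a blurred fixation density map built from the remaining subjects, and compute an AUC against randomly sampled locations. This is an illustrative leave-one-subject-out measure, not the authors' exact consistency or alignment metrics; all function names and parameters below are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_density_map(fixations, height, width, sigma=25.0):
    """Build a Gaussian-blurred fixation density map from (row, col) fixations."""
    density = np.zeros((height, width), dtype=float)
    for r, c in fixations:
        if 0 <= r < height and 0 <= c < width:
            density[int(r), int(c)] += 1.0
    density = gaussian_filter(density, sigma=sigma)
    if density.max() > 0:
        density /= density.max()
    return density

def leave_one_out_auc(subject_fixations, height, width, n_negatives=1000, seed=0):
    """Average leave-one-subject-out AUC: score each subject's fixations against a
    density map built from the other subjects, vs. uniformly random locations."""
    rng = np.random.default_rng(seed)
    aucs = []
    for i, held_out in enumerate(subject_fixations):
        others = [f for j, s in enumerate(subject_fixations) if j != i for f in s]
        density = fixation_density_map(others, height, width)
        pos = np.array([density[int(r), int(c)] for r, c in held_out])
        neg_rows = rng.integers(0, height, size=n_negatives)
        neg_cols = rng.integers(0, width, size=n_negatives)
        neg = density[neg_rows, neg_cols]
        # Rank-based (Mann-Whitney) AUC: probability a fixated location
        # outscores a random location under the other-subjects density map.
        scores = np.concatenate([pos, neg])
        labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
        order = scores.argsort()
        ranks = np.empty_like(order, dtype=float)
        ranks[order] = np.arange(1, len(scores) + 1)
        u = ranks[labels == 1].sum() - len(pos) * (len(pos) + 1) / 2
        aucs.append(u / (len(pos) * len(neg)))
    return float(np.mean(aucs))
```

In the paper's task-controlled setting, this style of inter-subject score is what reveals the reported stability of visual search patterns across viewers.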
Numerical Results and Bold Claims

  • The research shows that when human-derived saliency maps are used to guide interest point sampling in visual recognition systems, classification performance improves notably; this is validated quantitatively on the Hollywood-2 and UCF Sports benchmarks (see the sketch after this list).
  • The work challenges existing paradigms by using human vision insights to question the efficacy of commonly used interest point operators such as the Harris corner detector, prompting reconsideration of how well they align with human perception.
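As a rough illustration of the pipeline implied by these results, the sketch below (Python/NumPy) samples interest point locations in proportion to a saliency map and quantizes local descriptors into a bag-of-words histogram for an action classifier. This is a simplified stand-in under assumed inputs (a per-frame saliency map, precomputed local descriptors, a learned codebook), not the authors' implementation; function names and parameters are illustrative.

```python
import numpy as np

def sample_salient_points(saliency_map, n_points=100, seed=0):
    """Sample interest point locations with probability proportional to saliency."""
    rng = np.random.default_rng(seed)
    probs = saliency_map.ravel().astype(float) + 1e-12  # avoid an all-zero map
    probs /= probs.sum()
    idx = rng.choice(probs.size, size=n_points, replace=True, p=probs)
    rows, cols = np.unravel_index(idx, saliency_map.shape)
    return np.stack([rows, cols], axis=1)

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors against a codebook and return a normalized
    bag-of-words histogram (the representation fed to an action classifier)."""
    # Nearest codeword by Euclidean distance.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)
```

In this view, the saliency map simply replaces the sparse interest point operator as the sampling distribution; the rest of the bag-of-words recognition pipeline is unchanged.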

Implications and Future Directions

The practical implications of this research extend to improving the reliability and accuracy of computer vision systems, specifically within domains requiring nuanced understanding of human activities, such as surveillance, autonomous driving, and human-computer interaction applications. Theoretically, this paper reinforces the importance of biological vision insights in advancing computational models.

Looking ahead, the integration of human-centric visual cues could transform how machines interpret visual data, potentially leading to systems capable of human-level understanding and interaction. How these findings generalize across different visual tasks and datasets remains a fertile area for further research.

Conclusion

In "Actions in the Eye," Mathe and Sminchisescu deliver a compelling paper that not only bridges human visual processes with machine learning techniques but also sets a precedent for integrating human-derived insights into artificial systems. This paper provides valuable resources and a methodological framework for researchers pursuing advancements in action recognition and dynamic visual processing, propelling future innovations in the field of computer vision.