Detecting Attended Visual Targets in Video
The paper "Detecting Attended Visual Targets in Video" presents an innovative approach towards identifying where individuals are focusing their gaze in video sequences, including managing cases where the gaze is directed out of the frame. This research is pivotal as visual attention is integral to understanding human social behavior, visual navigation, and interaction with three-dimensional environments.
Key Contributions and Methodology
The authors introduce a novel spatiotemporal deep learning architecture that models the dynamic interaction between scene content and head orientation. The network infers time-varying attention targets through an attention mechanism in which head features modulate the processing of scene features, and a ConvLSTM module captures temporal patterns so that gaze predictions remain consistent across frames.
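To make the two-branch design concrete, below is a minimal, illustrative PyTorch sketch. The layer sizes, the 7x7 attention grid, and the GRU stand-in for the ConvLSTM are all assumptions chosen for brevity; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class GazeTargetNet(nn.Module):
    """Illustrative two-branch model: scene features modulated by a head-driven
    attention map, followed by a recurrent module over time."""

    def __init__(self, feat_dim=64):
        super().__init__()
        # Scene branch: full frame (RGB + a head-position mask channel).
        self.scene_enc = nn.Sequential(
            nn.Conv2d(4, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(7),
        )
        # Head branch: cropped head image -> a single feature vector.
        self.head_enc = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Attention: head features produce a 7x7 spatial map weighting scene features.
        self.attn = nn.Sequential(nn.Linear(feat_dim, 7 * 7), nn.Softmax(dim=-1))
        # Temporal module: a GRU over flattened fused features keeps this sketch short;
        # the paper uses a convolutional LSTM over the fused feature maps instead.
        self.temporal = nn.GRU(feat_dim * 7 * 7, 256, batch_first=True)
        # Output heads: a gaze heatmap and an in-frame/out-of-frame score per frame.
        self.heatmap_head = nn.Linear(256, 64 * 64)
        self.inframe_head = nn.Linear(256, 1)

    def forward(self, frames, head_crops):
        # frames: (B, T, 4, 56, 56); head_crops: (B, T, 3, 56, 56)
        B, T = frames.shape[:2]
        fused = []
        for t in range(T):
            scene = self.scene_enc(frames[:, t])                      # (B, C, 7, 7)
            head = self.head_enc(head_crops[:, t])                    # (B, C)
            attn_map = self.attn(head).view(B, 1, 7, 7)               # (B, 1, 7, 7)
            fused.append((scene * attn_map).flatten(1))               # (B, C*49)
        seq, _ = self.temporal(torch.stack(fused, dim=1))             # (B, T, 256)
        heatmaps = self.heatmap_head(seq).view(B, T, 64, 64)
        in_frame = torch.sigmoid(self.inframe_head(seq)).squeeze(-1)  # (B, T)
        return heatmaps, in_frame

# Example forward pass on random tensors.
model = GazeTargetNet()
heat, in_fr = model(torch.randn(2, 5, 4, 56, 56), torch.randn(2, 5, 3, 56, 56))
print(heat.shape, in_fr.shape)  # torch.Size([2, 5, 64, 64]) torch.Size([2, 5])
```

The key design point this sketch mirrors is that the head branch does not predict the target directly; it gates which scene regions the model attends to, and the recurrent layer then smooths those attended features over time.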
A significant contribution of the research is the introduction of the VideoAttentionTarget dataset, annotated to capture complex and dynamic gaze behavior in real-world environments. The dataset contains over 1,300 video sequences with detailed frame-by-frame annotations of gaze targets, offering a rich resource for understanding gaze dynamics.
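As an illustration of the per-frame information such annotations carry, here is a hypothetical record layout: a head bounding box plus either an in-frame gaze point or an out-of-frame flag. The field names and values are invented for exposition and do not reflect the dataset's released file format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GazeAnnotation:
    clip_id: str                            # which video clip the frame belongs to
    frame_idx: int                          # frame index within the clip
    head_bbox: Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max) of the head
    gaze_point: Optional[Tuple[int, int]]   # pixel location of the attended target, or None
    in_frame: bool                          # False when the target lies outside the image

# Example: a person looking at a point inside the frame.
ann = GazeAnnotation("clip_001", 42, (100, 50, 160, 120), (320, 240), True)
```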
Experimental Results and Evaluation
The paper presents substantial experimental evidence for the effectiveness of the approach. On the static-image GazeFollow benchmark, the spatial component of the model surpassed previous methods, achieving an AUC of 0.921, close to the reported human-level performance (AUC of 0.924). On the new VideoAttentionTarget dataset, the full spatiotemporal model outperformed existing techniques across AUC, average distance, and out-of-frame detection.
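To give a sense of how the distance metric is computed, the sketch below measures the L2 error between the predicted heatmap's peak and the ground-truth point, normalized by image size so it is resolution-independent. The heatmap resolution and exact normalization here are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def gaze_l2_distance(pred_heatmap: np.ndarray, gt_point_xy: tuple) -> float:
    """pred_heatmap: (H, W) array; gt_point_xy: (x, y) in the same resolution."""
    h, w = pred_heatmap.shape
    # The predicted gaze target is taken as the heatmap's maximum.
    peak_y, peak_x = np.unravel_index(np.argmax(pred_heatmap), pred_heatmap.shape)
    # Normalize both points to [0, 1] before measuring the distance.
    dx = peak_x / w - gt_point_xy[0] / w
    dy = peak_y / h - gt_point_xy[1] / h
    return float(np.hypot(dx, dy))

heatmap = np.zeros((64, 64)); heatmap[30, 20] = 1.0
print(gaze_l2_distance(heatmap, (22, 31)))  # small value -> prediction near the target
```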
Notably, the model's predicted attention maps were applied to two different social gaze behavior recognition tasks, achieving state-of-the-art performance. In an experiment involving videos of toddlers, the method effectively detected clinically relevant gaze shifts, a crucial task in assessing autism-related behaviors.
Implications and Future Directions
The implications of this work are far-reaching. The ability to infer gaze targets without obtrusive wearable devices opens possibilities for large-scale studies in naturalistic settings, which matters in domains such as social interaction analysis and navigation systems. Moreover, the advancement in gaze behavior recognition paves the way for automated systems in clinical settings, particularly for developmental disorders such as autism.
Looking forward, research could explore integrating this gaze detection framework with more comprehensive social interaction models, potentially incorporating more sophisticated contextual and environmental factors. Additionally, adapting the model to varying contexts and populations could improve its generalizability and efficacy across broader applications, such as human-robot interaction and attentive user interfaces in consumer electronics.
This paper lays foundational work for dynamic gaze prediction in video sequences and establishes a pathway for future research to expand upon these methodologies, potentially fostering innovations in AI-driven social behavior analysis and interaction modeling.