- The paper presents a novel two-stage framework that combines multi-scale gaze direction fields with image content to predict gaze points accurately.
- The approach reduces average angular error from over 22° to 17.6° on benchmarks like GazeFollow and DL Gaze.
- The authors release their code and a new self-annotated dataset (DL Gaze), and the approach has practical relevance for augmented reality and consumer behavior analysis.
An Expert Review of "Believe It or Not, We Know What You Are Looking at!"
The research paper "Believe It or Not, We Know What You Are Looking at!" presents a novel approach to the task of gaze following—an integral component for understanding human interaction with objects or other individuals within a scene. The authors introduce a two-stage deep learning framework that effectively predicts gaze points by mimicking human gaze-following behavior. This work is underpinned by two primary contributions: the proposal of a psychologically plausible gaze prediction system and the development of a comprehensive gaze-following dataset.
Technical Contributions
The proposed methodology consists of two distinct stages. First, a gaze direction pathway predicts the subject's gaze direction from the cropped head image and the head's position within the scene. The innovative aspect of this stage is the generation of multi-scale gaze direction fields: dense maps that score, for every pixel, how well the direction from the head to that pixel aligns with the predicted gaze direction, without yet incorporating scene content (a sketch follows below).
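To make the idea concrete, here is a minimal NumPy sketch of one plausible way to construct such fields, assuming a cosine-similarity formulation in which larger exponents produce narrower fields; the `gammas` values are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

def gaze_direction_fields(head_pos, gaze_dir, hw, gammas=(1.0, 2.0, 5.0)):
    """Build multi-scale gaze direction fields (illustrative sketch).

    head_pos: (x, y) head location in normalized [0, 1] image coordinates.
    gaze_dir: unit 2D vector of the predicted gaze direction.
    hw:       (height, width) of the output fields.
    gammas:   sharpness exponents, one field per value (hypothetical choices).
    """
    h, w = hw
    ys, xs = np.mgrid[0:h, 0:w]
    # Vector from the head to every pixel, in normalized coordinates.
    dx = xs / (w - 1) - head_pos[0]
    dy = ys / (h - 1) - head_pos[1]
    norm = np.sqrt(dx**2 + dy**2) + 1e-8
    # Cosine similarity between each pixel's direction and the gaze direction.
    cos_sim = (dx * gaze_dir[0] + dy * gaze_dir[1]) / norm
    cos_sim = np.clip(cos_sim, 0.0, None)  # keep only the forward half-plane
    # Sharper exponents concentrate mass near the gaze line.
    return np.stack([cos_sim**g for g in gammas], axis=0)  # (len(gammas), h, w)
```

Each scale trades tolerance to direction error against localization sharpness, which is presumably why several scales are combined rather than a single field.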
The second stage is a heatmap pathway in which the multi-scale gaze direction fields are merged with the original image content to predict a probability heatmap of the gaze point, a technique inspired by how humans infer gaze from both head orientation and surrounding context (see the sketch after this paragraph). Splitting the problem this way also lets gaze-direction annotations supervise the first pathway directly, which the authors argue makes training more robust.
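The fusion itself can be as simple as channel-wise concatenation followed by convolutional encoding and decoding. The toy PyTorch module below illustrates that idea under stated assumptions; the paper's actual heatmap pathway is a much deeper network, so this is a structural sketch, not a reproduction.

```python
import torch
import torch.nn as nn

class HeatmapPathway(nn.Module):
    """Toy second stage: fuse gaze direction fields with image content."""

    def __init__(self, n_fields=3):
        super().__init__()
        # Input: RGB image (3 ch) concatenated with the direction fields.
        self.encoder = nn.Sequential(
            nn.Conv2d(3 + n_fields, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decode back up to a single-channel gaze-point heatmap.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, image, fields):
        x = torch.cat([image, fields], dim=1)  # channel-wise fusion
        return self.decoder(self.encoder(x))   # (B, 1, H, W) heatmap logits

# Usage: fields come from the first stage, resized to the image resolution.
# model = HeatmapPathway(n_fields=3)
# heatmap = model(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224))
```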
The authors demonstrate that their two-stage approach outperforms existing methods, providing empirical evidence through extensive experimentation. Notably, their method achieves superior results across multiple evaluation metrics on GazeFollow and on the newly introduced Daily Life Gaze (DL Gaze) dataset.
Numerical Results and Dataset
The paper reports significant improvements over baseline gaze-following models. For instance, the multi-scale variant reduces average angular error to 17.6 degrees, down from prior results exceeding 22 degrees. The authors also release their dataset and code, a valuable resource for future research.
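For context, angular error in gaze following is conventionally the angle between the predicted and ground-truth gaze directions, both measured from the head position. A short sketch of that standard computation:

```python
import numpy as np

def angular_error_deg(pred_point, gt_point, head_pos):
    """Angle (degrees) between predicted and ground-truth gaze directions,
    each taken as the vector from the head to the gaze point."""
    head = np.asarray(head_pos, float)
    v_pred = np.asarray(pred_point, float) - head
    v_gt = np.asarray(gt_point, float) - head
    denom = np.linalg.norm(v_pred) * np.linalg.norm(v_gt) + 1e-8
    cos = np.dot(v_pred, v_gt) / denom
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```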
DL Gaze, the newly introduced dataset, improves evaluation realism: it consists of real-world scenes in which the recorded individuals annotate their own gaze targets rather than relying on third-party annotators, yielding more reliable ground truth and more representative test conditions for gaze prediction.
Practical and Theoretical Implications
Practically, the multi-scale gaze direction fields make gaze interpretation robust enough to suggest deployments in augmented reality and consumer behavior analysis, where correctly reading a person's gaze yields valuable insight into user intent and interaction preferences.
Theoretically, the integration of human-like cognitive strategies into the architecture of machine perception systems reinforces the psychological plausibility of AI models. This aligns well with existing theories on human perception and cognitive processing, suggesting an avenue for further interdisciplinary exploration in AI and cognitive science domains.
Future Directions
Future work could focus on improving gaze prediction accuracy by incorporating additional contextual signals, such as scene depth or temporal motion cues, obtained from 3D imaging or video analysis. Interdisciplinary work with cognitive psychology could also yield useful insights for making AI models better emulate human gaze-following behavior.
Moreover, attention mechanisms within neural networks remain a promising avenue for modeling human gaze behavior more precisely. Further validation on larger and more diverse datasets would strengthen the method's applicability beyond controlled settings to complex real-world scenarios.
In summary, this paper advances the understanding of gaze following through a methodologically sound approach that not only sets a new benchmark in terms of accuracy but also introduces resources and ideas conducive to fostering further innovation in AI-driven perception systems.