- The paper introduces a dual-loss CNN that separately regresses yaw and pitch to enhance gaze estimation accuracy in real-world settings.
- It achieves mean angular errors of 3.92° on MPIIGaze and 10.41° on Gaze360, setting a new benchmark in unconstrained gaze estimation.
- The model’s robust design has practical implications for augmented reality and human-robot interaction by improving real-time gaze tracking.
L2CS-Net: Fine-Grained Gaze Estimation in Unconstrained Environments
The paper "L2CS-Net: Fine-Grained Gaze Estimation in Unconstrained Environments" presents a notable contribution to the domain of gaze estimation using convolutional neural networks (CNNs). The authors introduce a novel model, L2CS-Net, which targets the challenge of accurately estimating human gaze in environments that are not controlled, addressing variables such as eye appearance, lighting conditions, and head pose diversity.
Methodological Approach
The authors propose a CNN-based model that distinguishes itself by regressing each gaze angle—yaw and pitch—separately, each through its own fully-connected layer. This design improves per-angle prediction accuracy, which in turn improves overall gaze estimation. The model also adopts a two-loss strategy, applying an identical loss function to each angle to strengthen learning and generalization; a sketch of the resulting two-head architecture follows.
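The following PyTorch sketch illustrates the two-branch output described above. The ResNet-50 backbone and the bin count of 90 are illustrative assumptions not stated in this summary, and the sketch omits training details.

```python
import torch
import torch.nn as nn
from torchvision import models

class TwoHeadGazeNet(nn.Module):
    """Shared backbone with two separate fully-connected heads, one per gaze angle."""

    def __init__(self, num_bins: int = 90):
        super().__init__()
        backbone = models.resnet50(weights=None)   # assumed backbone choice
        feat_dim = backbone.fc.in_features         # 2048 for ResNet-50
        backbone.fc = nn.Identity()                # keep only the feature extractor
        self.backbone = backbone
        self.fc_yaw = nn.Linear(feat_dim, num_bins)    # dedicated yaw head
        self.fc_pitch = nn.Linear(feat_dim, num_bins)  # dedicated pitch head

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)
        return self.fc_yaw(feats), self.fc_pitch(feats)
```

Keeping the heads separate lets each angle be supervised with its own loss while the convolutional features remain shared.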
A distinguishing feature of L2CS-Net is the combined use of cross-entropy loss and mean-squared error (MSE) for each gaze angle. A softmax layer produces a binned gaze classification trained with cross-entropy; the expectation of the resulting bin probabilities yields a continuous angle estimate, which is then penalized with MSE against the ground-truth angle. Coupling classification and regression in this way gives finer control over training and accommodates the non-linear nature of gaze direction estimation.
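A minimal sketch of how such a combined loss can be computed for one angle is shown below. The bin centers, the use of degrees, and the weighting factor `alpha` are illustrative assumptions rather than values taken from the paper.

```python
import torch
import torch.nn.functional as F

def combined_gaze_loss(logits, bin_labels, cont_labels, bin_centers, alpha=1.0):
    """Cross-entropy on binned gaze plus MSE on the continuous angle.

    logits:      (B, num_bins) raw scores for one gaze angle (yaw or pitch)
    bin_labels:  (B,) index of the ground-truth bin
    cont_labels: (B,) ground-truth angle in degrees
    bin_centers: (num_bins,) center of each bin in degrees
    alpha:       weight of the regression term (illustrative value)
    """
    ce = F.cross_entropy(logits, bin_labels)
    probs = F.softmax(logits, dim=1)
    expected_angle = (probs * bin_centers).sum(dim=1)  # expectation over bin centers
    mse = F.mse_loss(expected_angle, cont_labels)
    return ce + alpha * mse
```

The same loss would be applied independently to the yaw and pitch heads, and the two terms summed for the total training objective.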
Empirical Evaluation
L2CS-Net was evaluated on two prominent datasets collected under unconstrained conditions: MPIIGaze and Gaze360. The model achieved state-of-the-art mean angular errors of 3.92° on MPIIGaze and 10.41° on Gaze360, surpassing prior methods on both benchmarks. These results indicate robustness and accuracy despite the variability introduced by naturalistic settings.
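For context, the mean angular error metric measures the angle between predicted and ground-truth 3D gaze vectors derived from yaw and pitch. The sketch below uses one common yaw/pitch-to-vector convention, which may differ from the exact convention used in the paper's evaluation code.

```python
import numpy as np

def mean_angular_error_deg(yaw_pred, pitch_pred, yaw_true, pitch_true):
    """Mean angular error in degrees; all angles given in radians."""
    def to_vec(yaw, pitch):
        # One common gaze-vector convention (assumption, not from the paper)
        return np.stack([-np.cos(pitch) * np.sin(yaw),
                         -np.sin(pitch),
                         -np.cos(pitch) * np.cos(yaw)], axis=-1)

    v_pred, v_true = to_vec(yaw_pred, pitch_pred), to_vec(yaw_true, pitch_true)
    cos_sim = np.sum(v_pred * v_true, axis=-1)
    cos_sim /= np.linalg.norm(v_pred, axis=-1) * np.linalg.norm(v_true, axis=-1)
    return np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0))).mean()
```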
Implications and Future Directions
The implications of this research extend to various applications, such as augmented reality and human-robot interaction, where accurate gaze estimation can enhance user experience and system efficiency. The focus on capturing gaze in "in-the-wild" environments aligns with the increasing demand for adaptable and generalizable AI systems.
On the methodological side, this work adds to the discussion on tailoring CNN architectures to specific tasks by separating output heads and combining loss functions. It opens avenues for future research on integrating multi-loss strategies into broader models and refining these approaches for greater robustness and adaptability.
In conclusion, L2CS-Net represents a meaningful advance in gaze estimation, combining per-angle output heads with a classification-plus-regression loss to address the challenges of unconstrained conditions. The work sets a new benchmark in gaze estimation performance and motivates further exploration of multi-term loss functions and task-tailored CNN architectures.