Few-Shot Adaptive Gaze Estimation (1905.01941v2)

Published 6 May 2019 in cs.CV

Abstract: Inter-personal anatomical differences limit the accuracy of person-independent gaze estimation networks. Yet there is a need to lower gaze errors further to enable applications requiring higher quality. Further gains can be achieved by personalizing gaze networks, ideally with few calibration samples. However, over-parameterized neural networks are not amenable to learning from few examples as they can quickly over-fit. We embrace these challenges and propose a novel framework for Few-shot Adaptive GaZE Estimation (FAZE) for learning person-specific gaze networks with very few (less than or equal to 9) calibration samples. FAZE learns a rotation-aware latent representation of gaze via a disentangling encoder-decoder architecture along with a highly adaptable gaze estimator trained using meta-learning. It is capable of adapting to any new person to yield significant performance gains with as few as 3 samples, yielding state-of-the-art performance of 3.18 degrees on GazeCapture, a 19% improvement over prior art. We open-source our code at https://github.com/NVlabs/few_shot_gaze

Citations (183)

Summary

  • The paper introduces FAZE, a framework that adapts gaze estimation to individual users with as few as three calibration samples.
  • It combines a Disentangling Transforming Encoder-Decoder with a meta-learning-based Adaptable Gaze Estimation Network to overcome inter-personal differences.
  • Evaluated on GazeCapture and MPIIGaze, FAZE achieves state-of-the-art angular errors of 3.18° and 3.14°, a 19% improvement over previous methods.

Essay on "Few-Shot Adaptive Gaze Estimation"

The paper "Few-Shot Adaptive Gaze Estimation" introduces a novel approach to personalized gaze estimation utilizing few-shot learning techniques. The motivation behind this work stems from the inherently inter-personal anatomical differences which pose challenges to achieving high accuracy in gaze estimation from person-independent networks. The proposed solution, referred to as Faze, leverages a combination of disentangled latent representations, rotation-aware embedding, and meta-learning, ultimately aiming to achieve significant performance gains with as few as three calibration samples per individual.

The Faze framework comprises three main components:

  1. Disentangling Transforming Encoder-Decoder (DT-ED): This component learns a rotation-aware latent representation of gaze by disentangling it from other factors such as appearance and head pose. The encoder-decoder structure enforces equivariance to gaze and head rotations, which yields a gaze representation that is robust to variation across individuals (see the rotation sketch after this list).
  2. Adaptable Gaze Estimation Network (AdaGEN): Built on the meta-learning framework MAML (Model-Agnostic Meta-Learning), this component is designed to adapt quickly to new subjects with minimal calibration data. The critical insight is that person-specific factors vary only slightly across a population, which makes meta-learning well suited to the task. The network learns a set of initial weights that can be efficiently fine-tuned with limited subject-specific data to yield personalized models.
  3. Fine-tuning and Performance Evaluation: The adaptation phase fine-tunes AdaGEN on the calibration samples to produce a person-specific gaze estimation model (a minimal adaptation sketch follows this list). The paper evaluates the framework on two benchmark datasets, GazeCapture and MPIIGaze, exhibiting substantial improvements over prior methods. Notably, FAZE achieves state-of-the-art errors of 3.18° and 3.14° on GazeCapture and MPIIGaze, respectively, using only nine calibration samples.
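The rotation-equivariant design of DT-ED can be pictured with a small sketch. The following is a minimal, hypothetical illustration (not the authors' implementation; function names and the pitch/yaw convention are assumptions): the latent code is arranged as a stack of 3-vectors so that a 3×3 rotation matrix can act on it directly, letting the decoder render the same eye under a rotated gaze or head pose.

```python
import torch

def rotation_matrix(pitch: torch.Tensor, yaw: torch.Tensor) -> torch.Tensor:
    """Batch of 3x3 rotation matrices from pitch/yaw angles in radians."""
    cp, sp = torch.cos(pitch), torch.sin(pitch)
    cy, sy = torch.cos(yaw), torch.sin(yaw)
    zero, one = torch.zeros_like(pitch), torch.ones_like(pitch)
    # Rotate about the x-axis (pitch), then about the y-axis (yaw).
    Rx = torch.stack([one, zero, zero,
                      zero, cp, -sp,
                      zero, sp, cp], dim=-1).view(-1, 3, 3)
    Ry = torch.stack([cy, zero, sy,
                      zero, one, zero,
                      -sy, zero, cy], dim=-1).view(-1, 3, 3)
    return Ry @ Rx

def rotate_code(z: torch.Tensor, pitch: torch.Tensor, yaw: torch.Tensor) -> torch.Tensor:
    """Rotate a latent code of shape (batch, features, 3): each 3-vector
    sub-code is transformed by its sample's rotation matrix."""
    R = rotation_matrix(pitch, yaw)               # (batch, 3, 3)
    return torch.einsum('bij,bfj->bfi', R, z)

# Training idea: encode image A of an eye, rotate its gaze/head sub-codes by
# the relative rotation between images A and B of the same eye, decode, and
# penalize the reconstruction against image B. This pushes the latent space
# to be equivariant to gaze and head rotations.
```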
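The meta-learned initialization pays off at deployment time. Below is a minimal, hypothetical sketch of the adaptation step (names such as `meta_net` and the choice of L1 loss are assumptions, not the authors' exact code): a copy of the meta-learned estimator is fine-tuned for a few gradient steps on the k ≤ 9 calibration pairs.

```python
import copy
import torch
import torch.nn.functional as F

def adapt(meta_net: torch.nn.Module,
          calib_feats: torch.Tensor,   # latent features of the k calibration images
          calib_gaze: torch.Tensor,    # corresponding ground-truth gaze directions
          inner_lr: float = 1e-3,
          steps: int = 5) -> torch.nn.Module:
    """Clone the meta-learned gaze estimator and fine-tune it on a handful
    of calibration samples, yielding a person-specific model."""
    person_net = copy.deepcopy(meta_net)
    opt = torch.optim.SGD(person_net.parameters(), lr=inner_lr)
    for _ in range(steps):
        opt.zero_grad()
        F.l1_loss(person_net(calib_feats), calib_gaze).backward()
        opt.step()
    return person_net
```

During meta-training, MAML runs this same inner loop on each training person's support set and backpropagates the query-set error through it, so the shared initialization is explicitly optimized to adapt well in only a few steps.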

The headline result is a 19% improvement in gaze estimation error on the GazeCapture dataset over the previous best-performing methods. The gain is attributed to the embedding consistency loss and the rotation-equivariant design, which together produce latent representations that remain consistent and informative across individual differences (a hypothetical sketch of such a consistency term follows). Meta-learning then makes efficient use of minimal subject-specific data, which is crucial for practical applications demanding high accuracy.
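To illustrate the consistency idea, the sketch below (hypothetical; the paper's exact formulation differs in details) frontalizes each gaze code with the inverse of its known gaze rotation and pulls one person's codes toward their mean: after frontalization, codes of the same eye should coincide regardless of where it was looking.

```python
import torch
import torch.nn.functional as F

def embedding_consistency_loss(z_gaze: torch.Tensor,
                               R_gaze: torch.Tensor) -> torch.Tensor:
    """z_gaze: gaze sub-codes of shape (batch, features, 3) for one person.
    R_gaze: (batch, 3, 3) rotations implied by each sample's ground-truth
    gaze. Multiplying by the transpose undoes the rotation (frontalizes)."""
    frontal = torch.einsum('bji,bfj->bfi', R_gaze, z_gaze)   # R^T @ z
    mean = frontal.mean(dim=0, keepdim=True).expand_as(frontal)
    # Penalize angular deviation of each frontalized code from the mean code.
    cos = F.cosine_similarity(frontal.flatten(1), mean.flatten(1), dim=1)
    return (1.0 - cos).mean()
```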

This research has direct implications for various applications, including human-computer interaction, VR/AR devices, automotive UX, and more, where precise gaze tracking is critical. It paves the way for using deep learning-based techniques in gaze estimation with compact, adaptable models that can be personalized on consumer devices with limited resources.

Future directions may include reducing the dependence on calibration samples further, for example by strengthening the unsupervised learning components or integrating domain adaptation strategies. Exploring alternative model architectures or meta-learning techniques could also improve adaptability and generalization. The open-sourced code, which includes a real-time demonstration, underscores the practical applicability of the framework and provides a foundation for further work on adaptive gaze estimation systems.