- The paper presents a streamlined gaze estimation model that reduces learnable parameters to roughly 5% of those used by traditional multi-branch approaches.
- It pairs a frozen DINOv2 transformer with a lightweight decoder and head positional prompts, so scene features are extracted once and reused for person-specific gaze decoding.
- Experiments on multiple benchmarks demonstrate state-of-the-art performance with rapid training times and robust cross-dataset generalization.
Gaze-LLE: A Novel Approach to Gaze Target Estimation with Large-Scale Learned Encoders
The paper "Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders" presents a streamlined approach to gaze target estimation that offers fundamental advancements over traditional multi-branch architectures. This work leverages the DINOv2 transformer as a foundational model to simplify the architecture, demonstrating the potency of using large-scale unsupervised learned feature representations for dense prediction tasks beyond their common applications.
Overview of Gaze Target Estimation
Gaze target estimation is the task of predicting where in a scene a person is looking. It is an integral component of understanding human behavior, with particular relevance to social interaction and human-computer interaction. Historically, the problem has been approached with elaborate multi-branch methods that fuse the outputs of several encoders, each devoted to a different cue such as depth, pose, or a crop of the person's head. These methods, while successful, are inherently complex and challenging to train, requiring substantial computational resources and time.
The Gaze-LLE Model
The authors propose Gaze-LLE, a model that simplifies gaze target estimation to a single backbone while achieving state-of-the-art performance. Gaze-LLE uses a frozen DINOv2 image encoder to extract scene features and replaces the elaborate head-specific branches of prior work with a lightweight decoder conditioned on a head positional prompt. This architecture capitalizes on the robustness of DINOv2's learned feature representations while reducing the number of learnable parameters to roughly 5% of that of traditional methods. Because the encoder captures the full scene in one pass, no multi-modal fusion is needed, yet performance remains competitive.
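To make the "frozen encoder, small trainable decoder" idea concrete, the sketch below loads a DINOv2 backbone via torch.hub (the entry-point name and the forward_features interface follow the public DINOv2 release; treat them as assumptions), freezes it, and counts parameters against a placeholder decoder. The decoder here is only a stand-in, and the ~5% figure is the paper's claim, not something this snippet reproduces.

```python
import torch

# Load a DINOv2 backbone from the public hub (ViT-S/14 as an example; downloads weights).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False  # the scene encoder stays frozen during training

# Placeholder for the lightweight gaze decoder that would be trained on top.
decoder = torch.nn.Sequential(
    torch.nn.Linear(384, 256), torch.nn.GELU(), torch.nn.Linear(256, 1)
)

frozen = sum(p.numel() for p in backbone.parameters())
trainable = sum(p.numel() for p in decoder.parameters())
print(f"frozen backbone params:   {frozen / 1e6:.1f}M")
print(f"trainable decoder params: {trainable / 1e6:.2f}M")

# DINOv2 exposes per-patch tokens, which is what a dense gaze decoder consumes.
image = torch.randn(1, 3, 224, 224)  # dummy input; real use needs ImageNet-style normalization
with torch.no_grad():
    feats = backbone.forward_features(image)["x_norm_patchtokens"]  # (1, 256, 384) for ViT-S/14
print(feats.shape)
```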
Methodology and Results
Key to Gaze-LLE's architecture is a positional prompt that injects a person-specific head-position embedding into the scene representation extracted by DINOv2. The prompted tokens are processed by a compact transformer module that enriches the features with global context before they are decoded into a gaze heatmap. This design not only simplifies the training pipeline but also accelerates it markedly, with state-of-the-art performance reached in less than 1.5 GPU hours.
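A minimal sketch of this decoder stage is shown below. It is not the authors' implementation: the dimensions, grid size, and the additive form of the head prompt are illustrative assumptions, and the scene tokens are taken as given (e.g., from a frozen DINOv2 encoder as above).

```python
import torch
import torch.nn as nn

class GazeLLEStyleDecoder(nn.Module):
    """Sketch of a prompt-conditioned gaze decoder over frozen scene tokens (not the authors' code)."""

    def __init__(self, enc_dim=384, d_model=256, grid=16, num_layers=3):
        super().__init__()
        self.grid = grid
        self.proj = nn.Linear(enc_dim, d_model)                       # map frozen features to decoder width
        self.pos_embed = nn.Parameter(torch.zeros(grid * grid, d_model))
        self.head_prompt = nn.Parameter(torch.zeros(d_model))         # learned head-position embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)   # compact module adding global context
        self.heatmap_head = nn.Conv2d(d_model, 1, kernel_size=1)      # per-location gaze logits

    def forward(self, scene_tokens, head_mask):
        """scene_tokens: (B, grid*grid, enc_dim) patch features from a frozen encoder.
        head_mask:    (B, grid, grid) binary mask marking the person's head location."""
        x = self.proj(scene_tokens) + self.pos_embed
        # Head positional prompt: add the learned embedding only at tokens covered by the head.
        x = x + head_mask.flatten(1).unsqueeze(-1) * self.head_prompt
        x = self.transformer(x)
        x = x.transpose(1, 2).reshape(x.size(0), -1, self.grid, self.grid)
        return self.heatmap_head(x).squeeze(1)    # (B, grid, grid) gaze heatmap logits

# Usage with dummy inputs (real features would come from the frozen encoder):
decoder = GazeLLEStyleDecoder()
tokens = torch.randn(2, 16 * 16, 384)
mask = torch.zeros(2, 16, 16)
mask[:, 2:4, 5:7] = 1.0                           # pretend the head occupies these grid cells
heatmap = decoder(tokens, mask)                   # (2, 16, 16)
```

The point of the sketch is the division of labor: all person-specific conditioning happens through the small prompt added to shared scene tokens, so only the decoder's few layers need to be trained.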
The authors conduct extensive experiments on well-established gaze benchmarks including GazeFollow, VideoAttentionTarget, ChildPlay, and GOO-Real. The results show that Gaze-LLE achieves state-of-the-art performance on heatmap AUC and L2 distance metrics, demonstrating its efficacy and generalizability across domains. Notably, Gaze-LLE maintains strong cross-dataset performance without additional fine-tuning, suggesting its versatility in real-world applications.
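For reference, the L2 metric on GazeFollow-style benchmarks is typically the Euclidean distance between the predicted gaze point (taken from the heatmap maximum) and the annotated gaze point, both in normalized image coordinates. The snippet below is a small sketch of that computation, not the paper's evaluation code.

```python
import numpy as np

def l2_gaze_error(heatmap: np.ndarray, gt_point: tuple) -> float:
    """L2 distance between the heatmap argmax and the ground-truth gaze point.

    heatmap:  (h, w) predicted gaze heatmap.
    gt_point: (x, y) ground-truth gaze location in normalized [0, 1] coordinates.
    """
    h, w = heatmap.shape
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    pred = (col / (w - 1), row / (h - 1))          # convert (row, col) to normalized (x, y)
    return float(np.hypot(pred[0] - gt_point[0], pred[1] - gt_point[1]))

# Example: a peak near the upper-left versus a target at the image center.
hm = np.zeros((64, 64), dtype=np.float32)
hm[10, 12] = 1.0
print(l2_gaze_error(hm, (0.5, 0.5)))               # ~0.46
```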
Implications and Future Directions
The practical implications of Gaze-LLE are substantial: it greatly reduces computational demands and simplifies model design for gaze estimation tasks. Theoretically, the work shows that pre-trained foundation models can be effectively adapted to specialized dense prediction tasks, which may encourage researchers to explore similar adaptations elsewhere in computer vision.
Looking forward, Gaze-LLE's framework establishes a baseline for using frozen foundation encoders in perception tasks. As better and faster foundation models become available, Gaze-LLE provides a tractable strategy for integrating them into gaze estimation, potentially improving both the performance and the efficiency of systems that depend on understanding human visual behavior. The public release of the code makes it easier to build more efficient and accurate models on this foundation and to explore larger and more complex settings, including multi-person and social gaze estimation. This work thus marks a shift from highly specialized architectures toward general-purpose feature extractors in the gaze estimation domain.