RTGaze: Real-Time Gaze Tracking & Redirection
- RTGaze is a suite of methods for real-time gaze tracking, redirection, and estimation that combines neural rendering pipelines with lightweight segmentation.
- Its neural gaze redirection system achieves photorealistic synthesis at 16 FPS with an 800× speedup and employs geometric prior distillation for improved identity retention.
- Distinct frameworks in RTGaze deliver dynamic depth-level estimation with sub-millisecond latency and event-driven tracking with <0.5° mean gaze error, targeting robust AR and VR use cases.
RTGaze refers to multiple families of methods and systems addressing the challenges of real-time gaze tracking, estimation, and redirection. These methods cover gaze-controllable neural rendering, dynamic target tracking in AR environments, and lightweight model-based eye tracking. The term encompasses both neural image synthesis pipelines and frameworks for high-speed, robust eye-gaze inference, as evidenced in three major strands of the literature: real-time neural gaze redirection from single images (Wang et al., 14 Nov 2025), dynamic gaze monitoring and depth-level estimation for transparent displays and AR (Seraj et al., 9 Jun 2024), and event-driven gaze tracking with lightweight segmentation (Feng et al., 2022).
1. Real-Time 3D-Aware Gaze Redirection from a Single Image
RTGaze (Wang et al., 14 Nov 2025) introduces a real-time, 3D-aware gaze redirection framework that synthesizes photorealistic face images with controllable eye movements from only a single input image and a 2D gaze prompt. The system targets the long-standing challenge of combining 3D consistency, high-quality synthesis, and inference efficiency for practical AR/VR and telepresence.
The pipeline consists of the following core stages:
- Gaze-controllable representation learning: Dual encoders extract high-frequency features (a CNN capturing eyelash and iris texture) and low-frequency features (a DeepLabV3 backbone with ViT encoder for global facial structure). The gaze prompt is projected by an MLP and injected via cross-attention into the high-frequency branch, then fused into a gaze-controllable latent code (see the sketch after this list).
- Triplane decoding and volumetric neural rendering: The latent code is decoded to a triplane feature tensor (as in EG3D, representing three orthogonal feature planes). A neural renderer using ray marching and volume rendering synthesizes the redirected image for any camera pose.
- Geometric prior distillation: Depth maps from a pretrained 3D-GAN portrait generator serve as a geometric teacher. The system uses L1 loss between teacher and student depth maps for shape regularization.
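A minimal PyTorch-style sketch of the gaze-prompt injection in the first stage above. Channel sizes, head counts, and the residual/concatenation fusion are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GazeCrossAttention(nn.Module):
    """Inject a 2D gaze prompt into high-frequency eye features via cross-attention.

    Dimensions and fusion details are illustrative; the paper's exact design
    (channel counts, number of heads, fusion operator) is not reproduced here.
    """
    def __init__(self, feat_dim=256, prompt_dim=64, num_heads=4):
        super().__init__()
        # Project the 2D gaze prompt (e.g., pitch/yaw) to a single token embedding.
        self.prompt_mlp = nn.Sequential(
            nn.Linear(2, prompt_dim), nn.ReLU(), nn.Linear(prompt_dim, feat_dim)
        )
        # Image tokens act as queries; the gaze token supplies keys/values.
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, high_freq_feats, low_freq_feats, gaze_prompt):
        # high_freq_feats: (B, N, C) tokens from the CNN (eyelash/iris) branch
        # low_freq_feats:  (B, N, C) tokens from the DeepLabV3/ViT (structure) branch
        # gaze_prompt:     (B, 2) target gaze direction
        gaze_token = self.prompt_mlp(gaze_prompt).unsqueeze(1)        # (B, 1, C)
        attended, _ = self.cross_attn(high_freq_feats, gaze_token, gaze_token)
        gaze_controlled = high_freq_feats + attended                  # residual injection
        # Fuse the two branches into a gaze-controllable latent code.
        return self.fuse(torch.cat([gaze_controlled, low_freq_feats], dim=-1))
```

Under this sketch, the fused latent code is what the triplane decoder in the second stage consumes.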
At inference, only a single feedforward pass (no GAN inversion) is required. The end-to-end network requires $0.06$ s per image on an NVIDIA RTX 3090 GPU, supporting real-time rates ($16$ FPS) and an approximately 800× speedup over previous 3D-aware neural radiance field (NeRF) approaches.
2. Dynamic Gaze Target Tracking and Depth-Level Estimation for Transparent Displays
A distinct RTGaze framework (Seraj et al., 9 Jun 2024) addresses gaze monitoring for AR/transparent display (TD) applications, such as automotive heads-up displays. It combines dynamic 2D widget (“gaze target”) tracking with categorical depth estimation of human gaze (on-plane, out-plane-near, out-plane-far).
Key architectural modules include:
- Dynamic Quadtree Target Tracking: The display region is managed by a Quadtree data structure, enabling widget insertion/deletion and sub-millisecond gaze queries. Each gaze point is mapped to its leaf node, and the overlapping/focused widget(s) are resolved by geometric containment and z-order priority (see the sketch after this list).
- Multi-Stream Self-Attention Depth Inference: Parallel neural streams process gaze rotation vectors, eye positions, and the intersection/distance data. Intra-stream and inter-stream self-attention capture modality- and cross-modality dependencies, culminating in a lightweight softmax classifier for discrete depth assignment.
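A simplified sketch of the Quadtree-based gaze-target lookup described in the first bullet, with hypothetical `Widget` and `QuadtreeNode` types; node capacity, depth limit, and z-order resolution are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class Widget:
    x: float; y: float; w: float; h: float   # top-left corner and size
    z_order: int = 0                          # higher z_order wins on overlap
    name: str = ""

    def contains(self, px, py):
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h

@dataclass
class QuadtreeNode:
    x: float; y: float; w: float; h: float
    depth: int = 0
    capacity: int = 4                         # widgets per leaf before splitting
    max_depth: int = 8                        # cap recursion for heavily overlapping widgets
    widgets: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def insert(self, wd):
        if self.children:
            for child in self.children:
                if child._overlaps(wd):
                    child.insert(wd)          # a widget may span several children
            return
        self.widgets.append(wd)
        if len(self.widgets) > self.capacity and self.depth < self.max_depth:
            self._split()

    def remove(self, wd):
        if self.children:
            for child in self.children:
                if child._overlaps(wd):
                    child.remove(wd)
        elif wd in self.widgets:
            self.widgets.remove(wd)

    def query(self, gx, gy):
        """Return the focused widget under gaze point (gx, gy), resolved by z-order."""
        if self.children:
            for child in self.children:
                if child.x <= gx <= child.x + child.w and child.y <= gy <= child.y + child.h:
                    return child.query(gx, gy)
            return None
        hits = [wd for wd in self.widgets if wd.contains(gx, gy)]
        return max(hits, key=lambda wd: wd.z_order) if hits else None

    def _overlaps(self, wd):
        return not (wd.x > self.x + self.w or wd.x + wd.w < self.x or
                    wd.y > self.y + self.h or wd.y + wd.h < self.y)

    def _split(self):
        hw, hh = self.w / 2, self.h / 2
        self.children = [QuadtreeNode(self.x + dx, self.y + dy, hw, hh, self.depth + 1)
                         for dy in (0, hh) for dx in (0, hw)]
        for wd in self.widgets:
            self.insert(wd)   # redistribute into the new children
        self.widgets = []
```

A per-sample call such as `root.query(gx, gy)` is the operation the paper reports as sub-millisecond; the splitting policy shown here is one plausible choice, not the published implementation.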
Latency is sub-millisecond on both CPUs and automotive-grade SoCs (e.g., TI TDA4 VM), with a quantized inference speed of $0.0007$ s per sample. The framework supports highly dynamic, multi-widget AR interfaces and robustly segments gaze between the display and the real-world background.
3. Event-Driven Gaze Tracking with Lightweight Segmentation
The RTGaze method (Feng et al., 2022) targets high-frequency, resource-constrained environments, such as near-eye AR/VR cameras operating at 30 Hz on mobile processors. The architecture is built around “Auto-ROI”:
- Software-emulated event camera: A binary event map is generated by thresholding normalized pixel intensity differences between consecutive frames, directly mimicking a hardware event camera’s output (a minimal emulation sketch follows this list).
- ROI prediction network: A compact convolutional neural network (3 conv + 2 FC layers, 4.2K parameters) predicts the next region-of-interest from the event map, prior ROI, and edge maps.
- ROI-based segmentation: Only the predicted ROI (covering a subset of the frame’s pixels) is processed by a U-Net-style encoder-decoder with depthwise separable convolutions. When the eye is still, the previous segmentation is extrapolated, reducing the compute load.
- Model-based gaze vector estimation: The segmented ROI is fit to a geometric eye model for 3D gaze estimation.
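A minimal NumPy sketch of the software event-camera emulation from the first bullet, plus a heuristic ROI fallback; the normalization scheme, threshold, and margin are illustrative assumptions (the actual pipeline uses the learned ROI prediction network described above).

```python
import numpy as np

def emulate_event_map(prev_frame, curr_frame, threshold=0.05, eps=1e-6):
    """Binary event map from two consecutive grayscale frames.

    Pixels whose normalized intensity change exceeds `threshold` fire an event,
    mimicking the sparse output of a hardware event camera. The exact
    normalization and threshold used by Auto-ROI are not reproduced here.
    """
    prev = prev_frame.astype(np.float32)
    curr = curr_frame.astype(np.float32)
    # Normalize intensities to [0, 1] so the threshold is exposure-independent.
    prev /= max(prev.max(), eps)
    curr /= max(curr.max(), eps)
    return (np.abs(curr - prev) > threshold).astype(np.uint8)

def roi_from_events(event_map, margin=8):
    """Tight bounding box around fired events, padded by `margin` pixels.

    The learned ROI predictor replaces this heuristic in the actual pipeline;
    this only illustrates how sparse events localize eye motion.
    """
    ys, xs = np.nonzero(event_map)
    if len(xs) == 0:
        return None  # eye is still: reuse the previous ROI / extrapolate segmentation
    h, w = event_map.shape
    return (max(xs.min() - margin, 0), max(ys.min() - margin, 0),
            min(xs.max() + margin, w - 1), min(ys.max() + margin, h - 1))
```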
The complete pipeline achieves sub-0.5° mean gaze error and real-time operation above $30$ Hz, with a substantial compute speedup over state-of-the-art full-frame gaze networks.
4. Mathematical Formulations and Losses
Each RTGaze variant incorporates distinct mathematical machinery:
- Neural gaze redirection (Wang et al., 14 Nov 2025): The composite loss is a weighted sum $\mathcal{L} = \lambda_{\text{rec}}\,\mathcal{L}_{\text{rec}} + \lambda_{\text{depth}}\,\mathcal{L}_{\text{depth}} + \lambda_{\text{perc}}\,\mathcal{L}_{\text{perc}}$ of a mask-guided 2D image reconstruction loss $\mathcal{L}_{\text{rec}}$ over eye and face masks, a depth distillation loss $\mathcal{L}_{\text{depth}}$ matching teacher and student geometries, and a perceptual VGG-based loss $\mathcal{L}_{\text{perc}}$ (a code sketch of this weighting follows the list).
- Depth estimation (Seraj et al., 9 Jun 2024): A categorical cross-entropy loss supervises the multi-class depth classifier, with intra- and inter-stream dependencies captured by scaled dot-product self-attention, $\operatorname{Attn}(Q, K, V) = \operatorname{softmax}\!\big(QK^{\top}/\sqrt{d_k}\big)V$, applied within and across the modality streams.
- Event-driven tracking (Feng et al., 2022): Event maps are generated by thresholding normalized intensity differences between consecutive frames, $E_t(x, y) = \mathbb{1}\big[\,|\hat{I}_t(x, y) - \hat{I}_{t-1}(x, y)| > \tau\,\big]$, where $\hat{I}$ denotes normalized pixel intensity and $\tau$ the event threshold. The segmentation loss is a composite of cross-entropy, edge, and shape-consistency terms, and the ROI predictor is trained with mean-squared error.
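A minimal PyTorch-style sketch of the gaze-redirection loss above. The use of L1 for reconstruction, the mask combination, and the $\lambda$ weights are illustrative assumptions; the paper's exact terms and weights are not reproduced here.

```python
import torch
import torch.nn.functional as F

def redirection_loss(pred_img, gt_img, eye_mask, face_mask,
                     pred_depth, teacher_depth, vgg_features,
                     lambda_rec=1.0, lambda_depth=0.5, lambda_perc=0.1):
    """Composite loss: mask-guided reconstruction + depth distillation + perceptual.

    `vgg_features` is assumed to be a callable returning VGG feature maps; the
    lambda weights are placeholders, not the paper's published values.
    """
    # Mask-guided reconstruction, emphasizing the eye and face regions.
    mask = torch.clamp(eye_mask + face_mask, 0.0, 1.0)
    l_rec = F.l1_loss(pred_img * mask, gt_img * mask)

    # Geometric prior distillation: L1 between student and teacher depth maps.
    l_depth = F.l1_loss(pred_depth, teacher_depth)

    # Perceptual loss computed on VGG feature maps.
    l_perc = F.l1_loss(vgg_features(pred_img), vgg_features(gt_img))

    return lambda_rec * l_rec + lambda_depth * l_depth + lambda_perc * l_perc
```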
5. Quantitative Evaluation and Benchmarking
Evaluation across all RTGaze systems uses standardized datasets, latency metrics, and accuracy scores.
Gaze Redirection (Wang et al., 14 Nov 2025):
| Metric | RTGaze | GazeNeRF | HeadNeRF |
|---|---|---|---|
| FID | 38.3 | 81.8 | 69.5 |
| PSNR (dB) | 19.0 | 15.45 | – |
| LPIPS | 0.262 | 0.291 | – |
| SSIM | 0.715 | 0.733 | – |
| Time (s) | 0.061 | 60.06 | – |
| Gaze Error (°) | 9.05 | 6.94 | – |
| ID Score | 60.7 | 45.2 | – |
RTGaze provides real-time performance with state-of-the-art FID and identity retention versus NeRF-based baselines, albeit with a higher angular gaze error than GazeNeRF.
Dynamic Tracking (Seraj et al., 9 Jun 2024):
Maintains sub-millisecond end-to-end processing, covering both categorical depth classification and gaze-target queries over 12 dynamic widgets.
Event-Driven Tracking (Feng et al., 2022):
Sub-0.5° end-to-end gaze error at $32.5$ Hz ($40.2$ Hz for the small segmentation variant), with a compute speedup over RITnet on the TEyeD dataset.
6. Ablation Studies, Limitations, and Prospects
All three research lines present systematic ablations:
- Gaze feature injection in neural gaze redirection is most effective in high-frequency branches; other injection points degrade FID or convergence (Wang et al., 14 Nov 2025). 3D geometric distillation substantially improves identity and visual quality.
- Self-attention modules in AR gaze depth estimation are critical: removing intra-stream or inter-stream attention leads to 20–35% drops in classification accuracy (Seraj et al., 9 Jun 2024).
- ROI and event map feedback in eye-tracking are essential for speed and mIoU; omitting these cues or using non-learning baselines increases error or compute (Feng et al., 2022).
Reported limitations include reliance on frontal priors, sensitivity to rapid head movements or widget overlaps, and imperfect robustness to occlusion and low-light scenarios. Prospective directions span multi-pose or self-supervised geometry learning, explicit eyeball modeling, temporal consistency for video, 3D widget layouts, and in-sensor integration for further latency and energy savings.
7. Contributions and Impact
RTGaze, in its multiple instantiations:
- Demonstrates efficient, real-time 3D-aware gaze synthesis for image redirection, with hybrid latent codes and geometric teacher distillation (Wang et al., 14 Nov 2025).
- Enables low-latency, high-accuracy attentional monitoring and interaction in AR environments with robust widget tracking and depth discrimination (Seraj et al., 9 Jun 2024).
- Establishes a scalable paradigm for event-driven eye tracking optimized for mobile and embedded hardware, supporting rapid updates and low-power consumption (Feng et al., 2022).
The cross-domain applicability underlines RTGaze as an evolving suite of gaze analysis and manipulation tools for AR/VR, telepresence, mobile interaction, and automotive displays.