RTGaze: Real-Time Gaze Tracking & Redirection

Updated 21 November 2025
  • RTGaze is a family of methods for real-time gaze tracking, redirection, and estimation, spanning neural rendering pipelines and lightweight segmentation-based eye tracking.
  • Its neural gaze redirection system achieves photorealistic synthesis at 16 FPS with an 800× speedup and employs geometric prior distillation for improved identity retention.
  • Distinct frameworks in RTGaze deliver dynamic gaze-target tracking and depth-level estimation with sub-millisecond latency, and event-driven eye tracking with <0.5° mean gaze error, for robust AR and VR use cases.

RTGaze refers to multiple families of methods and systems addressing the challenges of real-time gaze tracking, estimation, and redirection. These methods cover gaze-controllable neural rendering, dynamic target tracking in AR environments, and lightweight model-based eye tracking. The term encompasses both neural image synthesis pipelines and frameworks for high-speed, robust eye-gaze inference, as evidenced in three major strands of the literature: real-time neural gaze redirection from single images (Wang et al., 14 Nov 2025), dynamic gaze monitoring and depth-level estimation for transparent displays and AR (Seraj et al., 9 Jun 2024), and event-driven gaze tracking with lightweight segmentation (Feng et al., 2022).

1. Real-Time 3D-Aware Gaze Redirection from a Single Image

RTGaze (Wang et al., 14 Nov 2025) introduces a real-time, 3D-aware gaze redirection framework to synthesize photorealistic face images with controllable eye movements from only a single input image and a 2D gaze prompt g = (pitch, yaw). The system targets the long-standing challenge of combining 3D consistency, high-quality synthesis, and inference efficiency for practical AR/VR and telepresence.

The pipeline consists of the following core stages:

  • Gaze-controllable representation learning: Dual encoders extract high-frequency features (a CNN capturing eyelash and iris texture) and low-frequency features (a DeepLabV3 backbone with ViT encoder for global facial structure). The gaze prompt is projected by an MLP and injected via cross-attention into the high-frequency branch, then fused into a gaze-controllable latent code (see the sketch after this list).
  • Triplane decoding and volumetric neural rendering: The latent code is decoded to a triplane feature tensor (as in EG3D, representing three orthogonal feature planes). A neural renderer using ray marching and volume rendering synthesizes the redirected image for any camera pose.
  • Geometric prior distillation: Depth maps from a pretrained 3D-GAN portrait generator serve as a geometric teacher. The system uses L1 loss between teacher and student depth maps for shape regularization.
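
The gaze-prompt injection described above can be pictured with a minimal PyTorch-style sketch; the feature sizes, token counts, and module names below are illustrative assumptions rather than the exact RTGaze architecture.

```python
# Minimal sketch (assumed dimensions and module names) of injecting a 2D gaze prompt
# into the high-frequency feature branch via cross-attention, then fusing with the
# low-frequency branch into a gaze-controllable latent code.
import torch
import torch.nn as nn

class GazePromptInjection(nn.Module):
    def __init__(self, feat_dim=256, prompt_dim=64, n_heads=4):
        super().__init__()
        # MLP lifting the 2D gaze prompt (pitch, yaw) to a conditioning token
        self.prompt_mlp = nn.Sequential(
            nn.Linear(2, prompt_dim), nn.ReLU(), nn.Linear(prompt_dim, feat_dim))
        # High-frequency tokens attend to the gaze token (cross-attention)
        self.cross_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        # Fuse gaze-conditioned high-frequency and low-frequency features
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, high_freq_tokens, low_freq_tokens, gaze):
        # high_freq_tokens, low_freq_tokens: (B, N, C); gaze: (B, 2)
        gaze_token = self.prompt_mlp(gaze).unsqueeze(1)                # (B, 1, C)
        attended, _ = self.cross_attn(high_freq_tokens, gaze_token, gaze_token)
        conditioned = high_freq_tokens + attended                      # residual injection
        return self.fuse(torch.cat([conditioned, low_freq_tokens], dim=-1))

# Usage with random stand-in features
inj = GazePromptInjection()
hf, lf = torch.randn(1, 196, 256), torch.randn(1, 196, 256)
gaze = torch.tensor([[0.10, -0.20]])            # (pitch, yaw), e.g. in radians
latent = inj(hf, lf, gaze)                      # (1, 196, 256) gaze-controllable code
```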

At inference, only a single feedforward pass (no GAN inversion) is required. The end-to-end network achieves ~0.06 s per 512×512 image on an NVIDIA RTX 3090 GPU, supporting real-time rates (16 FPS) with an 800× speedup compared to previous 3D-aware neural radiance field (NeRF) approaches.
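
For the triplane decoding and volume-rendering stage, the following sketch shows EG3D-style triplane feature sampling; the plane resolution, channel count, and small decoder are assumptions, and the actual RTGaze renderer likely differs in detail.

```python
# Minimal sketch (assumed sizes) of EG3D-style triplane sampling: project a 3D sample
# point onto the XY, XZ, and YZ feature planes, sum the bilinear samples, and decode
# colour and density for volume rendering along each camera ray.
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_triplane(planes: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
    # planes: (B, 3, C, H, W) feature planes; pts: (B, N, 3) points in [-1, 1]^3
    projections = [pts[..., [0, 1]], pts[..., [0, 2]], pts[..., [1, 2]]]   # XY, XZ, YZ
    feats = 0.0
    for i, uv in enumerate(projections):
        grid = uv.unsqueeze(2)                                             # (B, N, 1, 2)
        sampled = F.grid_sample(planes[:, i], grid, align_corners=False)   # (B, C, N, 1)
        feats = feats + sampled.squeeze(-1).transpose(1, 2)                # (B, N, C)
    return feats

decoder = nn.Sequential(nn.Linear(32, 64), nn.Softplus(), nn.Linear(64, 4))  # RGB + density

planes = torch.randn(1, 3, 32, 64, 64)              # decoded from the gaze-controllable latent
pts = torch.rand(1, 4096, 3) * 2 - 1                # ray-marching sample positions
rgb_sigma = decoder(sample_triplane(planes, pts))   # (1, 4096, 4), fed to volume rendering
```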

2. Dynamic Gaze Target Tracking and Depth-Level Estimation for Transparent Displays

A distinct RTGaze framework (Seraj et al., 9 Jun 2024) addresses gaze monitoring for AR/transparent display (TD) applications, such as automotive heads-up displays. It combines dynamic 2D widget (“gaze target”) tracking with categorical depth estimation of human gaze (on-plane, out-plane-near, out-plane-far).

Key architectural modules include:

  • Dynamic Quadtree Target Tracking: The display region is managed by a Quadtree data structure, enabling O(log n) widget insertion/deletion and sub-millisecond gaze queries. Each gaze point is mapped to its leaf node, and overlapping or focused widgets are resolved by geometric containment and z-order priority (a sketch follows this list).
  • Multi-Stream Self-Attention Depth Inference: Parallel neural streams process gaze rotation vectors, eye positions, and the intersection/distance data. Intra-stream and inter-stream self-attention capture modality- and cross-modality dependencies, culminating in a lightweight softmax classifier for discrete depth assignment.
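
A minimal sketch of quadtree-based gaze-target resolution is given below; the node capacity, rectangle handling, and z-order tie-breaking are illustrative assumptions rather than the exact data structure used in the paper.

```python
# Minimal sketch (assumed capacity and splitting rule) of a quadtree over axis-aligned
# widget rectangles, supporting widget insertion and gaze-point lookup with z-order priority.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Widget:
    wid: int
    rect: Tuple[float, float, float, float]    # (x0, y0, x1, y1) on the display plane
    z: int = 0                                 # z-order priority for overlaps

@dataclass
class QuadNode:
    bounds: Tuple[float, float, float, float]
    capacity: int = 4
    widgets: List[Widget] = field(default_factory=list)
    children: Optional[List["QuadNode"]] = None

    def _overlaps(self, r):
        x0, y0, x1, y1 = self.bounds
        a0, b0, a1, b1 = r
        return not (a1 < x0 or a0 > x1 or b1 < y0 or b0 > y1)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [QuadNode((x0, y0, mx, my)), QuadNode((mx, y0, x1, my)),
                         QuadNode((x0, my, mx, y1)), QuadNode((mx, my, x1, y1))]
        for w in self.widgets:
            for c in self.children:
                if c._overlaps(w.rect):
                    c.insert(w)
        self.widgets = []

    def insert(self, w: Widget):
        if self.children is None:
            self.widgets.append(w)
            if len(self.widgets) > self.capacity:
                self._split()
        else:
            for c in self.children:
                if c._overlaps(w.rect):
                    c.insert(w)

    def query(self, gx: float, gy: float) -> Optional[Widget]:
        """Return the focused widget containing the gaze point (highest z wins)."""
        hits = [w for w in self.widgets
                if w.rect[0] <= gx <= w.rect[2] and w.rect[1] <= gy <= w.rect[3]]
        if self.children is not None:
            for c in self.children:
                x0, y0, x1, y1 = c.bounds
                if x0 <= gx <= x1 and y0 <= gy <= y1:
                    hit = c.query(gx, gy)
                    if hit is not None:
                        hits.append(hit)
        return max(hits, key=lambda w: w.z, default=None)

# Usage: insert widgets, then resolve each incoming gaze sample to a widget
root = QuadNode((0.0, 0.0, 1920.0, 1080.0))
root.insert(Widget(1, (100, 100, 400, 300), z=0))
root.insert(Widget(2, (350, 250, 600, 500), z=1))
print(root.query(380, 280).wid)   # -> 2 (higher z-order wins in the overlap)
```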

Latency is sub-millisecond on both CPUs and automotive-grade SoCs (e.g., TI TDA4VM), with a measured mean accuracy of 97.1% ± 1.2% and inference speed of 0.0007 s per sample (quantized). The framework supports highly dynamic, multi-widget AR interfaces and robustly segments gaze between the display and real-world background.
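
The multi-stream self-attention depth classifier described above can be sketched similarly; the stream definitions and three depth classes (on-plane, out-plane-near, out-plane-far) follow the description, while the layer sizes and token layout are assumptions.

```python
# Minimal sketch (assumed dimensions) of a multi-stream classifier with intra-stream
# and inter-stream self-attention producing discrete gaze depth levels.
import torch
import torch.nn as nn

class MultiStreamDepthClassifier(nn.Module):
    def __init__(self, stream_dims=(3, 3, 2), d_model=32, n_heads=4, n_classes=3):
        super().__init__()
        # Each scalar feature of a stream becomes one token so intra-stream attention applies
        self.embed = nn.ModuleList(nn.Linear(1, d_model) for _ in stream_dims)
        self.intra_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.inter_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model * len(stream_dims), n_classes)

    def forward(self, streams):
        # streams: list of (B, dim_i) tensors: gaze rotation, eye position, intersection/distance
        pooled = []
        for emb, x in zip(self.embed, streams):
            tokens = emb(x.unsqueeze(-1))                           # (B, dim_i, D)
            tokens, _ = self.intra_attn(tokens, tokens, tokens)     # intra-stream dependencies
            pooled.append(tokens.mean(dim=1))                       # (B, D) stream summary
        stream_tokens = torch.stack(pooled, dim=1)                  # (B, S, D)
        stream_tokens, _ = self.inter_attn(stream_tokens, stream_tokens, stream_tokens)
        return self.head(stream_tokens.flatten(1))                  # logits over depth levels

clf = MultiStreamDepthClassifier()
logits = clf([torch.randn(4, 3), torch.randn(4, 3), torch.randn(4, 2)])
probs = logits.softmax(dim=-1)    # categorical depth-level probabilities
```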

3. Event-Driven Gaze Tracking with Lightweight Segmentation

The RTGaze method (Feng et al., 2022) targets high-frequency, resource-constrained environments, such as near-eye AR/VR cameras operating at >30 Hz on mobile processors. The architecture is built around “Auto-ROI”:

  • Software-emulated event camera: A binary event map is generated by thresholding normalized pixel intensity differences between consecutive frames, directly mimicking a hardware event camera’s output (see the sketch after this list).
  • ROI prediction network: A compact convolutional neural network (3 conv + 2 FC layers, ≈4.2K parameters) predicts the next region-of-interest from the event map, prior ROI, and edge maps.
  • ROI-based segmentation: Only the predicted ROI (covering 18–32% of pixels) is processed by a U-Net-style encoder-decoder with depthwise separable convolutions. When the eye is still, segmentation is extrapolated, reducing the compute load.
  • Model-based gaze vector estimation: The segmented ROI is fit to a geometric eye model for 3D gaze estimation.
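
The software-emulated event map can be sketched as below; the threshold value, normalization constant, and the crude bounding-box stand-in for the learned ROI predictor are assumptions.

```python
# Minimal sketch (assumed threshold and ROI heuristic) of the software-emulated event map:
# binary events from normalized intensity differences between consecutive frames, plus a
# bounding box of active pixels as a crude stand-in for the learned ROI predictor.
import numpy as np

def event_map(prev_frame: np.ndarray, frame: np.ndarray, sigma: float = 0.15,
              eps: float = 1e-6) -> np.ndarray:
    """E_{t+1}(x,y) = 1 where |F_t - F_{t+1}| / F_t exceeds the threshold sigma."""
    prev = prev_frame.astype(np.float32)
    cur = frame.astype(np.float32)
    rel_change = np.abs(prev - cur) / (prev + eps)   # normalized per-pixel change
    return (rel_change > sigma).astype(np.uint8)     # binary event map

def active_bbox(events: np.ndarray):
    """Bounding box of event pixels; empty map means the eye is still."""
    ys, xs = np.nonzero(events)
    if len(xs) == 0:
        return None                                  # eye still: reuse previous segmentation
    return xs.min(), ys.min(), xs.max(), ys.max()

# Usage with two synthetic eye-camera frames
f0 = np.random.rand(240, 320).astype(np.float32)
f1 = f0.copy()
f1[100:140, 150:200] += 0.3                          # simulated iris/eyelid motion
ev = event_map(f0, f1)
print(ev.mean(), active_bbox(ev))
```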

The complete pipeline achieves sub-0.5° mean gaze error and >30 Hz real-time operation with a 5.5× compute speedup over state-of-the-art full-frame gaze networks.

4. Mathematical Formulations and Losses

Each RTGaze variant incorporates distinct mathematical machinery:

  • Neural gaze redirection (Wang et al., 14 Nov 2025): The composite loss combines a mask-guided 2D image reconstruction loss L_R over eye and face masks, a depth distillation loss L_D for matching teacher/student geometries, and a perceptual VGG-based loss L_P (a code sketch follows this list), weighted as

\mathcal{L}_\text{total} = \alpha\,\mathcal{L}_R + \beta\,\mathcal{L}_D + \gamma\,\mathcal{L}_P.

  • Depth estimation (Seraj et al., 9 Jun 2024): The categorical cross-entropy loss supervises the multi-class depth classifier, with intra- and inter-stream self-attention:

\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{p}_{i,c}.

  • Event-driven tracking (Feng et al., 2022): The software-emulated event map is obtained by thresholding normalized intensity differences between consecutive frames,

E_{t+1}(x,y) = \Phi\left(\frac{|F_t(x,y) - F_{t+1}(x,y)|}{F_t(x,y)},\ \sigma\right).

The segmentation loss is a composite of cross-entropy, edge, and shape-consistency terms, and the ROI predictor is trained with mean-squared error.
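
A minimal sketch of the composite redirection loss is shown below; the loss weights, the extra weighting of the eye region, and the pluggable perceptual feature extractor are assumptions rather than the published training recipe.

```python
# Minimal sketch (assumed weights and mask handling) of the composite redirection loss
# L_total = alpha * L_R + beta * L_D + gamma * L_P.
import torch
import torch.nn.functional as F

def composite_loss(pred_img, target_img, face_mask, eye_mask,
                   pred_depth, teacher_depth, perceptual_fn,
                   alpha=1.0, beta=0.5, gamma=0.1):
    # Mask-guided reconstruction over face and (more heavily weighted) eye regions
    l_rec = F.l1_loss(pred_img * face_mask, target_img * face_mask) + \
            2.0 * F.l1_loss(pred_img * eye_mask, target_img * eye_mask)
    # Geometric prior distillation: match the 3D-GAN teacher's depth map
    l_depth = F.l1_loss(pred_depth, teacher_depth)
    # Perceptual term; perceptual_fn is any frozen feature extractor (e.g., VGG features)
    l_perc = F.l1_loss(perceptual_fn(pred_img), perceptual_fn(target_img))
    return alpha * l_rec + beta * l_depth + gamma * l_perc

# Usage with dummy tensors and an identity "feature extractor" as a placeholder
B = 2
img = torch.rand(B, 3, 64, 64)
loss = composite_loss(img, torch.rand_like(img),
                      torch.ones(B, 1, 64, 64), torch.ones(B, 1, 64, 64),
                      torch.rand(B, 1, 64, 64), torch.rand(B, 1, 64, 64),
                      perceptual_fn=lambda x: x)
```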

5. Quantitative Evaluation and Benchmarking

Evaluation of the three RTGaze systems reports image-quality, latency, and accuracy metrics on standard benchmarks.

Gaze Redirection (Wang et al., 14 Nov 2025):

Metric          RTGaze   GazeNeRF   HeadNeRF
FID             38.3     81.8       69.5
PSNR (dB)       19.0     15.45      —
LPIPS           0.262    0.291      —
SSIM            0.715    0.733      —
Time (s)        0.061    60.06      —
Gaze Error (°)  9.05     6.94       —
ID Score        60.7     45.2       —

RTGaze provides real-time performance with state-of-the-art FID and identity retention versus NeRF-based baselines, albeit with a higher angular gaze error than GazeNeRF.

Dynamic Tracking (Seraj et al., 9 Jun 2024):

Achieves 97.1% classification accuracy for depth levels and 0.31 ms latency with 12 dynamic widgets, maintaining sub-millisecond end-to-end processing.

Event-Driven Tracking (Feng et al., 2022):

End-to-end gaze error below 0.5° at 32.5 Hz (40.2 Hz for the small segmentation variant), with a 5.7× speedup over RITnet on the TEyeD dataset.

6. Ablation Studies, Limitations, and Prospects

All three research lines present systematic ablations:

  • Gaze feature injection in neural gaze redirection is most effective in high-frequency branches; other injection points degrade FID or convergence (Wang et al., 14 Nov 2025). 3D geometric distillation substantially improves identity and visual quality.
  • Self-attention modules in AR gaze depth estimation are critical: removing intra-stream or inter-stream attention leads to roughly 20–35% drops in classification accuracy (Seraj et al., 9 Jun 2024).
  • ROI and event map feedback in eye-tracking are essential for speed and mIoU; omitting these cues or using non-learning baselines increases error or compute (Feng et al., 2022).

Reported limitations include reliance on frontal priors, sensitivity to rapid head movements or widget overlaps, and imperfect robustness to occlusion and low-light scenarios. Prospective directions span multi-pose or self-supervised geometry learning, explicit eyeball modeling, temporal consistency for video, 3D widget layouts, and in-sensor integration for further latency and energy savings.

7. Contributions and Impact

RTGaze, in its multiple instantiations:

  • Demonstrates efficient, real-time 3D-aware gaze synthesis for image redirection, with hybrid latent codes and geometric teacher distillation (Wang et al., 14 Nov 2025).
  • Enables low-latency, high-accuracy attentional monitoring and interaction in AR environments with robust widget tracking and depth discrimination (Seraj et al., 9 Jun 2024).
  • Establishes a scalable paradigm for event-driven eye tracking optimized for mobile and embedded hardware, supporting rapid updates and low-power consumption (Feng et al., 2022).

This cross-domain applicability underscores RTGaze as an evolving suite of gaze analysis and manipulation tools for AR/VR, telepresence, mobile interaction, and automotive displays.
