ST-Gaze: Spatio-Temporal Gaze Estimation

Updated 2 July 2026

ST-Gaze is a deep neural architecture that integrates spatial attention and temporal recurrence to achieve state-of-the-art video-based gaze estimation.
It employs dedicated CNN backbones, transformer-based self-attention, and GRU recurrence to extract and fuse intra-frame and inter-frame features.
Person-specific adaptation via SCPT or Gaussian processes enables rapid calibration, significantly improving accuracy on benchmark datasets.

ST-Gaze is a class of deep neural architectures designed for video-based gaze estimation that explicitly captures both intra-frame spatial relationships and inter-frame temporal dependencies. This family of models combines spatial attention mechanisms, temporal sequence modeling, and, in advanced variants, person-specific adaptation using Gaussian processes. By integrating these components, ST-Gaze achieves state-of-the-art accuracy on standard benchmarks and can further adapt to individual users with minimal calibration data, establishing itself as a robust approach for gaze estimation in naturalistic video settings (Jindal et al., 2024, Personnic et al., 19 Dec 2025).

1. Core Architectural Components

The ST-Gaze framework is fundamentally modular and factorizes the gaze estimation task into distinct processing stages, each targeting a specific aspect of the input signal and learning problem (Jindal et al., 2024, Personnic et al., 19 Dec 2025):

Input Preparation: For each frame, typically three cropped patches are extracted: left eye, right eye (horizontally flipped), and face (e.g., $3\times128\times128$ RGB crops).
Backbone Feature Extraction: Each patch is passed through a CNN backbone: dedicated EfficientNet-B3 networks (for eye/face) (Personnic et al., 19 Dec 2025) or a shared, pretrained ResNet-18 (for full-face crops) (Jindal et al., 2024). The resulting features are concatenated channel-wise (e.g., $X_t\in\mathbb{R}^{160\times8\times8}$).
Channel Attention (ECA): Efficient Channel Attention weights features across channels using global pooling and a lightweight 1D convolution, boosting feature selectivity with negligible overhead (Personnic et al., 19 Dec 2025).
Spatial Self-Attention: The channel-attended features are reshaped into a spatial sequence and enriched via transformer encoder layers, enabling long-range modeling across distant spatial locations within a frame (Personnic et al., 19 Dec 2025).
Spatio-Temporal Recurrence: A two-layer GRU is used to propagate spatial context both within a frame (64-step “spatial scan”) and temporally, by forwarding the last spatial hidden state as initialization for the next time step. This design ensures that the model maintains persistent memory of global spatial context across the video (Personnic et al., 19 Dec 2025).
Gaze Regression: The output sequence is pooled and passed to a regression head (typically an MLP with Tanh activation), which produces the 2D gaze vector (pitch, yaw). The bounded Tanh output is scaled so that both angles lie in $[-\pi/2, \pi/2]$, which keeps the prediction a valid direction (Personnic et al., 19 Dec 2025).
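The channel-attention and reshaping steps in the list above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the kernel weights, the kernel size of 3, the zero padding, and the function names `eca` and `to_spatial_sequence` are assumptions made for the example, and a real model would learn the 1D convolution weights.

```python
import math

def eca(feature_map, kernel=(0.25, 0.5, 0.25)):
    """Efficient Channel Attention on a C x H x W nested list.

    `kernel` stands in for the learned 1D convolution weights (hypothetical values).
    """
    channels = len(feature_map)
    # Global average pooling: one scalar descriptor per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in feature_map]
    half = len(kernel) // 2
    gates = []
    for c in range(channels):
        # 1D convolution across neighbouring channels, zero-padded at the ends.
        acc = 0.0
        for k, w in enumerate(kernel):
            j = c + k - half
            if 0 <= j < channels:
                acc += w * pooled[j]
        gates.append(1.0 / (1.0 + math.exp(-acc)))  # sigmoid gate in (0, 1)
    # Rescale every spatial position of channel c by that channel's gate.
    attended = [[[v * gates[c] for v in row] for row in ch]
                for c, ch in enumerate(feature_map)]
    return attended, gates

def to_spatial_sequence(feature_map):
    """Reshape C x H x W into H*W tokens of dimension C (row-major order)."""
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [[feature_map[c][i][j] for c in range(channels)]
            for i in range(height) for j in range(width)]

# Toy input with the shape quoted above: 160 channels on an 8 x 8 grid.
x = [[[0.01 * c for _ in range(8)] for _ in range(8)] for c in range(160)]
attended, gates = eca(x)
tokens = to_spatial_sequence(attended)
print(len(tokens), len(tokens[0]))  # 64 160
```

The 64 tokens produced here correspond to the 64-step spatial scan mentioned above: one token per location of the 8 x 8 grid.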

The variant described by Jindal et al. (2024) uses a Hybrid Spatial Attention Module that directly processes the feature maps of adjacent frame pairs, computes cross-attention with positional encodings, and outputs a motion-aware representation that captures short-term dynamics.
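As a rough illustration of cross-attention between adjacent frames, the sketch below uses standard scaled dot-product attention with queries taken from the current frame and keys and values from the previous one. The query/key assignment, the toy token values, and the `add_positions` helper (which appends a position index instead of using a learned or sinusoidal encoding) are assumptions for this example; the exact formulation of the Hybrid Spatial Attention Module is not given here.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all key tokens."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value tokens.
        outputs.append([sum(w[n] * values[n][d] for n in range(len(values)))
                        for d in range(dim)])
    return outputs, weights

def add_positions(tokens):
    """Append a normalised position index (stand-in for positional encodings)."""
    n = len(tokens)
    return [tok + [i / max(n - 1, 1)] for i, tok in enumerate(tokens)]

# Hypothetical 2-D features for four spatial locations in two adjacent frames.
frame_prev = add_positions([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
frame_curr = add_positions([[0.9, 0.1], [0.1, 0.9], [0.6, 0.4], [0.2, 0.8]])

fused, attn = cross_attention(frame_curr, frame_prev, frame_prev)
```

Each row of `attn` is a distribution over locations in the previous frame, so `fused` mixes previous-frame features according to how well they match each current-frame location.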

2. Spatio-Temporal Fusion and Feature Design

ST-Gaze distinguishes itself by its hierarchical approach to fusing spatial and temporal information (Personnic et al., 19 Dec 2025, Jindal et al., 2024):

Channel-then-Spatial Attention: Attention is first applied to the channel dimension (for feature selection), followed by spatial attention via transformers (for modeling intra-frame dependencies).
Intra-Frame Recurrence: Patches from each frame are processed in sequence by a recurrent network, enforcing an order and capturing dependencies that would be lost with premature spatial pooling.
Inter-Frame Recurrence: By carrying forward the GRU’s hidden state from one frame to the next, temporal continuity and motion cues are preserved.
Hybrid Spatial Attention (Alternative): The Hybrid-SAM variant in (Jindal et al., 2024) operates on the difference between consecutive feature maps, explicitly emphasizing motion and enabling the network to localize dynamic regions relevant for gaze.
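The intra-frame scan and inter-frame carryover described above can be illustrated with a scalar GRU. This is a toy sketch, not the two-layer GRU of the paper: the weights in `params` are hypothetical and chosen so that the update gate stays close to 1 (slow forgetting), the tokens are scalars rather than feature vectors, and `scan_video` is a name introduced for the example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h, p):
    """One scalar GRU update (same gate convention as PyTorch's GRU)."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])          # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])          # reset gate
    n = math.tanh(p["wn"] * x + r * (p["un"] * h) + p["bn"])  # candidate state
    return (1.0 - z) * n + z * h

def scan_video(frames, p, carry_over=True):
    """Scan each frame's spatial tokens; optionally carry state across frames."""
    h = 0.0
    last_states = []
    for tokens in frames:
        if not carry_over:
            h = 0.0               # discard everything seen in earlier frames
        for x in tokens:          # spatial scan within the frame
            h = gru_step(x, h, p)
        last_states.append(h)     # final spatial state summarises the frame
    return last_states

# Hypothetical weights; bz = 4 keeps the update gate near 0.98.
params = {"wz": 0.0, "uz": 0.0, "bz": 4.0,
          "wr": 1.0, "ur": 0.0, "br": 0.0,
          "wn": 1.0, "un": 0.5, "bn": 0.0}

# Three frames, each with 64 scalar tokens (an 8 x 8 grid flattened).
frames = [[0.5 + 0.5 * ((i + f) % 8) / 8 for i in range(64)] for f in range(3)]

carried = scan_video(frames, params, carry_over=True)
reset = scan_video(frames, params, carry_over=False)
```

The first frame gives the same state in both runs, while later frames differ: with carryover, the state at the end of frame 2 still depends on what was seen in frame 1.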

Ablation studies show that omitting the spatial attention module (SAM) nearly doubles the angular error (from 2.58° to 4.84°), while removing temporal recurrence causes a smaller but still significant increase (to 2.88°), indicating that both components are needed (Personnic et al., 19 Dec 2025).

3. Training Objectives and Person-Specific Adaptation

ST-Gaze is trained using a composite loss function designed to support both 3D angular accuracy and 2D screen point-of-gaze (PoG) localization (Personnic et al., 19 Dec 2025, Jindal et al., 2024):

Angular Loss:

$L_{\text{ang}}(t) = \arccos \left( \frac{ \hat{y}_t \cdot g_t }{ \|\hat{y}_t\|\,\|g_t\| } \right)$

where $g_t$ and $\hat{y}_t$ are ground truth and predicted gaze vectors, respectively.

PoG Loss: 2D Euclidean distances in screen or pixel coordinates are computed if camera geometry permits.
Total Loss: A weighted sum combines these error terms, with the angular component dominating and the PoG term weighted by a small factor (in practice, $\lambda_{\text{ang}}=1.0$, $\lambda_{\text{cm}}=0.01$).
Person-Specific Adaptation: Post-hoc adaptation is realized differently across variants:
- SCPT approach (Personnic et al., 19 Dec 2025): Learn a per-person bias $\delta_p$ and affine transform $(W_p, b_p)$ from a small calibration set, then use them to correct the model's predictions for that person.
- Gaussian Process Correction (Jindal et al., 2024): Collect a few calibration samples (as few as three), fit a GP prior in feature space, and use the predictive mean to correct the model’s output on a per-frame basis.
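The angular term and the weighted total above can be written out as follows. This is a sketch under stated assumptions: the pitch/yaw-to-vector axis convention in `pitch_yaw_to_vector` is one common choice and may not match the datasets used, the loss is returned in radians, the PoG term is taken to be a Euclidean distance in centimetres, and all function names are introduced for the example.

```python
import math

def pitch_yaw_to_vector(pitch, yaw):
    """Map (pitch, yaw) in radians to a unit 3D gaze direction.

    The axis convention is an assumption; datasets differ on signs and order.
    """
    return (-math.cos(pitch) * math.sin(yaw),
            -math.sin(pitch),
            -math.cos(pitch) * math.cos(yaw))

def angular_loss(pred, target):
    """Angle in radians between two 3D vectors (L_ang above)."""
    dot = sum(p * g for p, g in zip(pred, target))
    norm = (math.sqrt(sum(p * p for p in pred))
            * math.sqrt(sum(g * g for g in target)))
    cos_sim = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos_sim)

def pog_loss(pred_cm, target_cm):
    """Euclidean distance between predicted and true on-screen points."""
    return math.dist(pred_cm, target_cm)

def total_loss(ang, pog, lambda_ang=1.0, lambda_cm=0.01):
    """Weighted sum with the weights quoted above."""
    return lambda_ang * ang + lambda_cm * pog

# Hypothetical prediction and ground truth for one frame.
pred = pitch_yaw_to_vector(0.10, -0.20)
target = pitch_yaw_to_vector(0.12, -0.18)
ang = angular_loss(pred, target)
loss = total_loss(ang, pog_loss((10.0, 5.0), (11.0, 5.5)))
print(f"angular error: {math.degrees(ang):.2f} deg, total loss: {loss:.4f}")
```

Clamping the cosine before `acos` avoids a domain error when rounding pushes the normalised dot product slightly outside [-1, 1], which can happen for nearly identical vectors.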

4. Comparative Performance and Generalization

ST-Gaze demonstrates competitive or superior results across multiple large-scale benchmarks, both in person-agnostic and person-adapted regimes. The figures reported in (Personnic et al., 19 Dec 2025, Jindal et al., 2024) include:

Method	Dataset	Angular Error (°)	Adaptation
ST-Gaze (Personnic et al., 19 Dec 2025)	EVE	2.58	None
ST-Gaze + SCPT	EVE	2.03 (offline)	SCPT, small calibration set
Hybrid-SAM+LSTM (Jindal et al., 2024)	Gaze360	10.05	None
Hybrid-SAM+LSTM+GP	EyeDiap	5.90	GP, 3 calibration samples
CapStARE (Samaniego et al., 24 Sep 2025)	ETH-XGaze	3.36	N/A

On Gaze360, the ST-Gaze architecture with Hybrid-SAM and LSTM achieves a mean angular error of 10.05°, outperforming prior approaches (e.g., EyeNet at 12.53° and SwAT at 11.60°). Personalization with three calibration frames improves performance on EyeDiap by a further 0.8° (Jindal et al., 2024). On the EVE dataset, ST-Gaze with SCPT achieves a state-of-the-art angular error of 1.87° in online adaptation (Personnic et al., 19 Dec 2025).
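To put the Gaze360 comparison in relative terms, the short calculation below converts the quoted errors into percentage reductions. Only the numbers stated above are used; the helper name `relative_reduction` is introduced for the example.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

gaze360_baselines = {"EyeNet": 12.53, "SwAT": 11.60}  # mean angular error in degrees
hybrid_sam_lstm = 10.05

for name, err in gaze360_baselines.items():
    print(f"{name}: {relative_reduction(err, hybrid_sam_lstm):.1f}% lower error")
# EyeNet: 19.8% lower error
# SwAT: 13.4% lower error
```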

5. Algorithmic Workflow and Implementation Insights

The inference procedure in ST-Gaze follows an explicit stepwise flow (Personnic et al., 19 Dec 2025, Jindal et al., 2024):

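Crop: Extract the left-eye, right-eye (flipped), and face patches from each frame.
Encode: Pass each patch through its CNN backbone and concatenate the features channel-wise.
Attend: Apply channel attention (ECA), then transformer self-attention over the spatial sequence.
Recur: Scan the spatial sequence with the GRU, initialized from the final hidden state of the previous frame.
Regress: Pool the GRU outputs and predict (pitch, yaw) with the regression head.
Adapt (optional): Apply the person-specific correction (SCPT or GP).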

Training uses standard SGD with cosine annealing, a batch size of 16, and sequences of 30 video frames. The ResNet-18 backbone is generally pretrained using contrastive gaze representation learning (GazeCLR) (Jindal et al., 2024).
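For reference, a cosine-annealed learning rate follows the schedule sketched below. The peak rate, minimum rate, and step count are hypothetical values chosen for illustration, since the source gives only the optimizer, the batch size, and the sequence length.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate that decays from lr_max to lr_min along a half cosine."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

BATCH_SIZE = 16        # from the text
FRAMES_PER_SEQ = 30    # from the text
TOTAL_STEPS = 1000     # hypothetical
LR_MAX = 0.01          # hypothetical

frames_per_batch = BATCH_SIZE * FRAMES_PER_SEQ  # 480 frames per optimisation step
schedule = [cosine_annealing_lr(s, TOTAL_STEPS, LR_MAX) for s in (0, 250, 500, 750, 1000)]
```

The rate starts at `LR_MAX`, passes through half of it at the midpoint, and reaches `lr_min` at the final step.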

6. Significance, Limitations, and Comparisons

Ablation analyses highlight critical findings:

Spatial attention is essential: Omission leads to severe performance drop (2.58°→4.84°).
Temporal modeling enhances generalization: Removing GRU raises error from 2.58° to 2.88° (Personnic et al., 19 Dec 2025).
Premature spatial pooling degrades performance: Moving the spatial scan after pooling yields errors of 2.79°–3.13°, confirming that sequence modeling across spatial positions before pooling is critical (Personnic et al., 19 Dec 2025).
Channel attention (ECA) matters for adaptation: Removing ECA degrades SCPT performance (offline error 1.87°→1.97°), despite having little effect on raw accuracy (Personnic et al., 19 Dec 2025).
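One way to see why pooling before the spatial scan can hurt is that average pooling ignores the order of spatial positions, whereas a recurrent scan does not. The toy comparison below is only an illustration of that property, not a reproduction of the ablation; the scalar recurrence and its weights `w` and `u` are assumptions.

```python
import math

def mean_pool(tokens):
    """Order-invariant summary: any permutation gives the same value."""
    return sum(tokens) / len(tokens)

def recurrent_scan(tokens, w=1.0, u=0.5):
    """Order-sensitive summary: a simple tanh recurrence over the tokens."""
    h = 0.0
    for x in tokens:
        h = math.tanh(w * x + u * h)
    return h

tokens = [0.1, 0.9, 0.3, 0.7]      # hypothetical per-location activations
shuffled = [0.9, 0.1, 0.7, 0.3]    # same values at different locations

same_pool = math.isclose(mean_pool(tokens), mean_pool(shuffled))
scan_gap = abs(recurrent_scan(tokens) - recurrent_scan(shuffled))
print(same_pool, scan_gap > 0.05)  # True True
```

Both layouts collapse to the same pooled value, so a model that pools first cannot tell them apart, while the scan produces clearly different states.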

The person-specific adaptation schemes (SCPT, Gaussian process correction) enable ST-Gaze to adapt with as few as three calibration samples, a property not matched by backbone-only methods. This enables both in-lab and in-the-wild personalization, which is critical for deployment in varied conditions (Jindal et al., 2024).
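A Gaussian process correction from three calibration samples can be sketched as below. Everything specific here is an assumption made for illustration: the RBF kernel and its length scale, the noise level, the two-dimensional feature vectors, the residual values, and the choice to regress the residual of a single angle. The source states only that a GP is fitted in feature space and that its predictive mean corrects the output.

```python
import math

def rbf(a, b, length_scale=1.0):
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * sq / length_scale ** 2)

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def fit_gp(features, residuals, noise=1e-3):
    """Return alpha = (K + noise * I)^-1 y for a zero-mean GP."""
    k = [[rbf(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(features)]
         for i, a in enumerate(features)]
    return solve(k, residuals)

def gp_mean(query, features, alpha):
    """Predictive mean of the GP at `query`."""
    return sum(a * rbf(query, f) for a, f in zip(alpha, features))

# Three hypothetical calibration samples: feature vector and pitch residual
# (ground truth minus model prediction, in degrees).
calib_features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
calib_residuals = [1.2, 0.8, 1.0]
alpha = fit_gp(calib_features, calib_residuals)

raw_pitch_deg = 4.0                      # hypothetical model output
query = (0.5, 0.5)                       # features of the new frame
corrected = raw_pitch_deg + gp_mean(query, calib_features, alpha)
```

With a zero-mean prior the correction fades to zero for frames whose features lie far from all calibration samples, so the model falls back to its uncorrected prediction there.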

ST-Gaze advances over image-based methods and earlier video-based architectures by introducing explicit spatio-temporal recurrence and self-attention that preserve spatial order and propagate temporal states:

It surpasses the hybrid-ViT [Cheng & Lu 2021] and GRU-based EyeNet [Park 2020] in both angular and PoG metrics.
Jindal et al. (2024) demonstrate that ST-Gaze with Hybrid-SAM and LSTM achieves substantially lower mean angular error than EyeNet and Concat-Residual, even without personalization.
CapStARE (Samaniego et al., 24 Sep 2025) uses capsule formation and dual-stream GRUs for disentangled modeling of eye and head dynamics. In contrast to it and to methods such as SwAT and GazeCapsNet, ST-Gaze combines a spatial scan and temporal carryover with transformers, which yields superior results on the EVE dataset.

A plausible implication is that long-range spatial attention, maintained in recurrent state and explicitly modeled without premature pooling, is a key determinant of high performance in gaze tracking from videos using consumer-grade cameras.

References:

(Personnic et al., 19 Dec 2025) "Learning Spatio-Temporal Feature Representations for Video-Based Gaze Estimation"
(Jindal et al., 2024) "Spatio-Temporal Attention and Gaussian Processes for Personalized Video Gaze Estimation"
(Samaniego et al., 24 Sep 2025) "CapStARE: Capsule-based Spatiotemporal Architecture for Robust and Efficient Gaze Estimation"