
GazeShiftNet: Shift-Invariant CNN for Gaze Mapping

Updated 30 December 2025
  • GazeShiftNet is a compact convolutional neural network that uses shift-augmented training and translation-equivariant convolutions to achieve robust gaze mapping in photosensor oculography systems.
  • It features a lightweight architecture with two convolutional layers and four fully connected layers, maintaining full spatial resolution by omitting pooling and batch normalization.
  • Initialization strategies, including from-scratch training and fine-tuning with a leave-one-subject-out protocol, ensure resource-efficient calibration and adaptability to sensor shifts.

GazeShiftNet is a compact convolutional neural network (CNN) proposed for robust, shift-invariant gaze mapping from photosensor outputs in photosensor oculography (PS-OG) systems. Its primary application is in head-mounted devices (HMDs) requiring low power consumption and reliable performance under device shifts, as PS-OG sensors may experience performance degradation due to movement or repositioning of the headset. GazeShiftNet leverages the translation-equivariance of convolutional layers, shift-augmented training, and multi-stage initialization to enhance gaze prediction accuracy within and slightly beyond the trained range of sensor translations (Griffith et al., 2019).

1. Layerwise Architecture of GazeShiftNet

GazeShiftNet is defined by a resource-constrained design optimized for embedded HMD settings. The layer-by-layer structure is as follows:

Stage     Configuration                            Activation
Input     3×5×1 array (PS-OG sensor intensities)   None
Conv-1    4 filters                                ReLU
Conv-2    4 filters                                ReLU
Flatten   -                                        -
FC-1      20 units                                 ReLU
FC-2      20 units                                 ReLU
FC-3      20 units                                 ReLU
FC-4      20 units                                 ReLU
Output    2 units (gaze_x, gaze_y)                 Linear

The network ingests one "image" per inference, corresponding to the raw, unnormalized 3×5 grid of PS-OG sensor signals; no per-pixel preprocessing is performed. The convolutional front end comprises two layers, each producing four feature maps, with details such as kernel size and stride determined via grid search but not reported. Max pooling, batch normalization, dropout, and residual/skip connections are not utilized. The feature maps from Conv-2 are flattened before passing through four fully connected (FC) layers, each with 20 units and ReLU activation. The output head is a fully connected linear regression layer that returns two continuous values, the horizontal and vertical gaze angles in degrees.
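
A minimal PyTorch sketch of this layer stack is given below. The 3×3 kernels, "same" padding, and unit stride are illustrative assumptions, since the paper selects these hyperparameters by grid search without reporting them; everything else follows the table above.

```python
import torch
import torch.nn as nn

class GazeShiftNet(nn.Module):
    """Sketch of the GazeShiftNet layer stack described above.

    The 3x3 kernel size and 'same' padding are assumptions; the paper
    chooses these hyperparameters by grid search but does not list them.
    """

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # Conv-1 and Conv-2: 4 feature maps each, ReLU, no pooling/batchnorm
            nn.Conv2d(1, 4, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(4, 4, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),                        # 4 x 3 x 5 = 60 features under 'same' padding
            nn.Linear(4 * 3 * 5, 20), nn.ReLU(),
            nn.Linear(20, 20), nn.ReLU(),
            nn.Linear(20, 20), nn.ReLU(),
            nn.Linear(20, 20), nn.ReLU(),
            nn.Linear(20, 2),                    # linear output: (gaze_x, gaze_y) in degrees
        )

    def forward(self, x):
        # x: (batch, 1, 3, 5) raw, unnormalized PS-OG sensor intensities
        return self.regressor(self.features(x))

model = GazeShiftNet()
out = model(torch.randn(8, 1, 3, 5))   # -> shape (8, 2)
```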

2. Shift-Invariance Mechanism

GazeShiftNet achieves shift-invariant performance primarily through its reliance on convolutional operations and data-driven training procedures. No explicit regularization or losses enforcing translation-invariance are introduced. The model architecture capitalizes on the translation-equivariance of 2D convolutions:

$$z_{i,j}^{(k)} = \sum_{m}\sum_{n} W^{(k)}_{m,n}\, x_{i+m,\,j+n} + b^{(k)}, \qquad a_{i,j}^{(k)} = \mathrm{ReLU}\big(z_{i,j}^{(k)}\big)$$
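
The short check below (illustrative, not from the paper) verifies this equivariance numerically: convolving a shifted input matches shifting the convolved output. The 3×3 kernel and circular padding are assumptions chosen so the identity holds exactly on the small 3×5 grid; with zero padding it holds only approximately near the borders.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 3, 5)                      # one PS-OG "image"
w = torch.randn(1, 1, 3, 3)                      # one 3x3 filter (assumed size)

def conv(t):
    # circular padding so the convolution commutes exactly with circular shifts
    return F.conv2d(F.pad(t, (1, 1, 1, 1), mode="circular"), w)

shifted_then_conv = conv(torch.roll(x, shifts=(0, 1), dims=(2, 3)))
conv_then_shifted = torch.roll(conv(x), shifts=(0, 1), dims=(2, 3))
assert torch.allclose(shifted_then_conv, conv_then_shifted)
```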

Shift robustness is induced by augmenting the training data with random ±2 mm (or, in extended settings, ±5 mm) translations of the PS-OG "image" crops. Under this regime, the network parameters are optimized to map shifted sensor readings to the correct gaze coordinates, relying on the translation-equivariance of convolutions to generalize across expected device movements. No pooling is present, so full spatial resolution is preserved throughout the convolutional stages (Griffith et al., 2019).
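
A conceptual sketch of this crop-based shift augmentation follows. It assumes a hypothetical higher-resolution simulated intensity field from which the 3×5 sensor "image" is cropped at a randomly translated offset; the mm-to-pixel factor, field dimensions, and center coordinates are placeholders rather than values from the paper. The gaze label is left unchanged, since only the device, not the gaze direction, has moved.

```python
import numpy as np

PX_PER_MM = 2            # hypothetical sampling density of the simulated field
MAX_SHIFT_MM = 2         # +/- 2 mm baseline setting; +/- 5 mm in the extended setting
rng = np.random.default_rng(0)

def random_shifted_crop(field, center_row, center_col):
    """Crop a 3x5 sensor window at a randomly translated position."""
    max_px = MAX_SHIFT_MM * PX_PER_MM
    dr = rng.integers(-max_px, max_px + 1)
    dc = rng.integers(-max_px, max_px + 1)
    r0, c0 = center_row + dr, center_col + dc
    return field[r0:r0 + 3, c0:c0 + 5]           # shifted 3x5 "image", same gaze label

field = rng.random((40, 60))                     # stand-in simulated intensity field
crop = random_shifted_crop(field, center_row=18, center_col=27)
assert crop.shape == (3, 5)
```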

3. Training Objective and Optimization

GazeShiftNet is trained under a mean squared error (MSE) regression objective, directly predicting the gaze angles from PS-OG sensor intensities:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left[ (\hat{x}_i - x_i)^2 + (\hat{y}_i - y_i)^2 \right]$$

where $x_i$ and $y_i$ are the ground-truth horizontal and vertical gaze angles and $\hat{x}_i$, $\hat{y}_i$ are the corresponding predicted values. The output layer uses a linear activation to satisfy the regression requirement. The loss is optimized over examples spanning the target distribution of sensor shifts, thereby enforcing mapping consistency across plausible device configurations encountered in the field.
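
A minimal training-loop sketch under this objective is shown below. The Adam optimizer, learning rate, and epoch-over-batches structure are assumptions for illustration, not optimization settings reported for GazeShiftNet; `GazeShiftNet` refers to the architecture sketch above.

```python
import torch
import torch.nn as nn

def fit(model, loader, epochs, lr=1e-3):
    """Optimize the MSE gaze-regression objective over (sensor, gaze) batches.

    loader yields sensor batches of shape (N, 1, 3, 5) and gaze targets of
    shape (N, 2) in degrees. Adam and lr=1e-3 are assumed, not reported values.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()   # averages over both gaze components; differs from the
                         # written loss only by a constant factor
    for _ in range(epochs):
        for sensors, gaze in loader:
            opt.zero_grad()
            loss = mse(model(sensors), gaze)
            loss.backward()
            opt.step()
    return model
```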

4. Initialization Strategies and Transferability

Two training protocols are evaluated for model initialization:

  • From-Scratch (FS): All weights randomly initialized, with subsequent optimization performed solely on subject-specific data.
  • Fine-Tuning (FT): A leave-one-subject-out protocol in which pre-training is performed on data from all subjects except the current user, followed by subject-specific fine-tuning.

The FT strategy, which implements a transfer-and-tune paradigm, accelerates convergence and improves consistency when limited calibration data per subject are available. In this scheme, the network benefits from prior exposure to generic eye-to-gaze mappings, reducing the number of epochs required for adaptation to new users by roughly half compared to FS, while achieving comparable accuracy (Griffith et al., 2019).
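
The two protocols can be expressed with the `fit` helper sketched in the previous section. The epoch counts and the synthetic stand-in loaders below are placeholders used only to show the control flow; the paper's observation is that FT needs roughly half the adaptation epochs of FS.

```python
import copy
import torch

# Stand-in data: replace with real per-subject and pooled PS-OG recordings.
subject_loader = [(torch.randn(16, 1, 3, 5), torch.randn(16, 2))]
pooled_loader = [(torch.randn(64, 1, 3, 5), torch.randn(64, 2))]

# FS: random initialization, trained only on the target subject's data.
fs_model = fit(GazeShiftNet(), subject_loader, epochs=40)

# FT: pre-train on all other subjects (leave-one-subject-out), then fine-tune
# on the held-out subject with roughly half as many epochs.
pretrained = fit(GazeShiftNet(), pooled_loader, epochs=40)
ft_model = fit(copy.deepcopy(pretrained), subject_loader, epochs=20)
```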

5. Resource Efficiency and Deployment Considerations

GazeShiftNet is explicitly designed for the low-power, low-latency scenarios typical of embedded HMD systems. Its compact size, with only two small convolutional layers and four 20-unit FC layers, keeps it within hardware resource budgets and real-time requirements. No resource-intensive regularization or architectural elements (such as pooling, batch normalization, dropout, or residual connections) are employed, which further suits the model to on-device inference and edge deployment.
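
Under the kernel-size assumption made in the architecture sketch, the footprint is straightforward to estimate; the figure below (roughly 2.7k trainable parameters) holds only for the assumed 3×3 "same"-padded kernels and would change with the unreported hyperparameters.

```python
# Rough parameter count for the sketched GazeShiftNet under assumed 3x3 kernels.
n_params = sum(p.numel() for p in GazeShiftNet().parameters())
print(f"{n_params} trainable parameters")   # ~2.7k under these assumptions
```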

6. Practical Outcomes and Limitations

Experiments confirm that GazeShiftNet maintains robust mapping accuracy when trained with sensor-position shifts sampled from distributions that reflect manual HMD repositioning, and accuracy remains reasonable even for shifts marginally exceeding those introduced during training. This contributes to the practical feasibility of in-field setup and short per-user calibration sessions. No explicit techniques to enforce shift-invariance beyond data augmentation and architectural choices are incorporated; shift robustness emerges via training on sensor-shifted examples. Notably, no information is provided about the optimal kernel sizes, strides, or filter shapes, which may limit reproducibility. The architecture is sequential and minimal, prioritizing hardware compatibility and calibration efficiency over exhaustive feature capacity (Griffith et al., 2019).

