fMRI-to-Gaze Reconstruction Networks
- The paper presents an end-to-end framework that maps raw fMRI volumes to 2D gaze coordinates without relying on template-based co-registration.
- It couples eyeball localization, via morphological isolation or a trainable 3D Retina-Net, with residual-network gaze regression to improve computational efficiency and accuracy.
- Results indicate improved metrics including lower mean absolute error and higher Pearson correlation, enabling robust, real-time gaze tracking across varied cognitive tasks.
fMRI2GES refers to the automated reconstruction of eye gaze points from functional magnetic resonance imaging (fMRI) data, as operationalized in the MRGazer framework. This approach dispenses with traditional co-registration to a standard template and instead achieves gaze estimation in “individual space” through end-to-end learning, with substantial advances in both computational efficiency and accuracy compared to prior art that relies on co-registered eyeball masks. The method is characterized by an initial step of eyeball localization—either via morphological analysis or a trainable 3D Retina-Net—followed by residual network-based gaze regression. fMRI2GES establishes a robust, template-free pipeline for camera-less gaze tracking in cognitive neuroscience experiments (Wu et al., 2023).
1. Formal Problem Setting
The core task is to learn a mapping from raw or minimally processed fMRI volumes to the corresponding 2D eye gaze point. Given an fMRI sample $X_t$ (or a sequence $\{X_t\}_{t=1}^{T}$ for time series), the goal is to predict $y_t = (y_t^{x}, y_t^{y})$, the gaze coordinates on the stimulus screen. The mapping is implemented by a parameterized function $f_\theta$, trained to minimize the mean squared error

$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{t=1}^{N} \lVert \hat{y}_t - y_t \rVert_2^2, \qquad \hat{y}_t = f_\theta(X_t).$$

This direct regression framing enables applicability across a diverse array of tasks, from controlled fixation to naturalistic viewing and saccade paradigms.
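As a minimal illustration of this objective, the sketch below wires a placeholder 3D network to the MSE loss in PyTorch. The architecture, tensor shapes, and names here are assumptions for demonstration, not the model from the paper:

```python
import torch
import torch.nn as nn

class GazeRegressor(nn.Module):
    """Placeholder f_theta: any 3D feature extractor ending in a 2D head."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # global pooling over the volume
        )
        self.head = nn.Linear(16, 2)   # predicts (x, y) on the screen

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = GazeRegressor()
criterion = nn.MSELoss()                  # the objective L(theta) above

volumes = torch.randn(8, 1, 16, 32, 32)   # dummy batch of fMRI crops
targets = torch.randn(8, 2)               # ground-truth gaze coordinates
loss = criterion(model(volumes), targets)
loss.backward()
```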
2. Eyeball Localization and Preprocessing
Accurate eyeball localization is fundamental, as downstream gaze inference exploits local fMRI signals from the ocular globes. Two main strategies are defined:
- Morphological Eyeball Isolation (Offline): After standard motion correction, a rough region of interest (ROI) is binarized at an intensity threshold computed from the volume statistics. A morphological opening operation disconnects brain and ocular signals, and connected-component analysis isolates the two largest non-brain clusters as eyeball candidates. A fixed-size, axis-aligned bounding box is extracted per eye.
- 3D Retina-Net Eyeball Detector (Online): A trainable pipeline takes a full-brain volume as input and processes it through 3D ResNet blocks. A feature pyramid network (FPN) fuses multiscale features, and per-level heads classify "eye" versus "background" via focal loss while refining bounding boxes through regression. At inference, this module runs at roughly 0.01 s per volume.
These approaches eliminate the need for template-based co-registration and promote efficient processing at scale (Wu et al., 2023).
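The offline variant maps naturally onto standard scientific-Python tooling (scipy and scikit-image appear in the project stack, Section 6). The threshold heuristic, opening iterations, and function name below are illustrative assumptions, not the published parameters:

```python
import numpy as np
from scipy import ndimage

def extract_eyeball_boxes(volume, threshold=None, opening_iters=2):
    """Sketch of morphological eyeball isolation on a motion-corrected
    volume; parameter values are assumptions, not the paper's."""
    if threshold is None:
        threshold = volume.mean() + volume.std()  # assumed heuristic
    mask = volume > threshold

    # Morphological opening disconnects ocular and brain signal.
    mask = ndimage.binary_opening(mask, iterations=opening_iters)

    # Connected components: skip the largest (brain), keep the next
    # two largest clusters as left/right eyeball candidates.
    labels, n = ndimage.label(mask)
    if n < 3:
        raise ValueError("expected brain plus two eyeball components")
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    ids_by_size = np.argsort(sizes)[::-1] + 1   # component ids, largest first
    eye_ids = ids_by_size[1:3]

    # Axis-aligned bounding box (a tuple of slices) per candidate.
    return [ndimage.find_objects(labels == cid)[0] for cid in eye_ids]
```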
3. Residual Network-Based Gaze Regression
Following extraction, each cropped eyeball volume serves as input to a residual network regressor. The standard architecture comprises two identical 12-layer 3D ResNets, one per screen direction (x and y):
- Initial 3D convolution with 64 filters
- Three stacked residual blocks, each consisting of two 3D convolutions with a skip connection
- Global average pooling, flattening
- Sequential fully connected layers: FC(512) → ReLU → FC(2048) → ReLU → FC(512) → ReLU
- Final linear output (no activation) to yield the predicted gaze coordinate
When outlier detection is required, the same backbone is reused with a binary classification head trained with cross-entropy loss. No geometric augmentations are applied by default, though random left/right flips are feasible.
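A minimal sketch of one per-direction regressor following the layer list above; kernel sizes, padding, and the single-channel input are assumptions, while the FC widths match those stated:

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Two 3D convolutions with an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        h = self.conv2(self.relu(self.conv1(x)))
        return self.relu(h + x)

class GazeResNet3D(nn.Module):
    """Initial conv (64 filters), three residual blocks, global average
    pooling, FC(512)-ReLU-FC(2048)-ReLU-FC(512)-ReLU, linear output."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv3d(1, 64, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ResidualBlock3D(64) for _ in range(3)])
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Sequential(
            nn.Linear(64, 512), nn.ReLU(),
            nn.Linear(512, 2048), nn.ReLU(),
            nn.Linear(2048, 512), nn.ReLU(),
            nn.Linear(512, 1),   # no activation on the output
        )

    def forward(self, x):
        h = self.pool(self.blocks(self.stem(x))).flatten(1)
        return self.fc(h)

# Two identical networks, one per screen direction, as in the paper.
net_x, net_y = GazeResNet3D(), GazeResNet3D()
crop = torch.randn(4, 1, 16, 24, 24)                  # dummy eyeball crops
gaze = torch.cat([net_x(crop), net_y(crop)], dim=1)   # (batch, 2)
```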
4. Training Protocols, Datasets, and Metrics
fMRI2GES is validated across multiple datasets and gaze tasks:
- Datasets:
- Fixation (HBN Biobank Peer1–3): hundreds of volumes per subject, labeled for 27 discrete screen locations.
- Naturalistic viewing: e.g., “The Present” (∼214 volumes; ground truth is the eye-tracker trace averaged across participants).
- Pro-/anti-saccade (OpenNeuro ds000120/119): both training and held-out testing for generalization.
- Splits: Across-scan (train on some Peer datasets, test on others), across-individual (random split with held-out subjects), and fivefold (for saccade data).
- Optimization: Adam (with a fixed regression learning rate), batch size 128, early stopping at 60 epochs.
- Performance Metrics:
- Mean absolute error (MAE, in visual degrees)
- Pearson correlation ($r_x$, $r_y$) per screen direction
- Euclidean error per volume
- Saccade: pixel-distance error, precision, recall, F1
| Task/Method | MAE$_x$ (°) | $r_x$ | MAE$_y$ (°) | $r_y$ |
|---|---|---|---|---|
| MRGazer (ResNet12) | 1.11±0.69 | 0.91±0.10 | 1.12±0.48 | 0.87±0.11 |
Top-quartile outlier removal further refines the error metrics. Naturalistic viewing yields high correlation against averaged eye-tracking traces. Runtime is 0.02 s/volume (morphological extraction + Retina-Net + regression), markedly faster than the template-based DeepMReye (0.3 s/volume).
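For concreteness, the per-direction MAE and Pearson correlation, plus the per-volume Euclidean error, can be computed as in this illustrative helper (not code from the MRGazer repository):

```python
import numpy as np
from scipy.stats import pearsonr

def gaze_metrics(pred, true):
    """pred, true: (N, 2) arrays of gaze coordinates in visual degrees."""
    mae = np.abs(pred - true).mean(axis=0)           # MAE_x, MAE_y
    r_x = pearsonr(pred[:, 0], true[:, 0])[0]
    r_y = pearsonr(pred[:, 1], true[:, 1])[0]
    euclidean = np.linalg.norm(pred - true, axis=1)  # error per volume
    return {"mae_x": mae[0], "mae_y": mae[1],
            "r_x": r_x, "r_y": r_y,
            "euclidean_mean": euclidean.mean()}
```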
5. Task-Specific Results and Comparative Analysis
In controlled fixation, MRGazer surpasses DeepMReye in Pearson correlation and achieves lower MAE in both screen directions (Wu et al., 2023). Across-individual test splits confirm generalizability. In naturalistic viewing, correlation with eye-tracker ground truth is high. For pro-/anti-saccade paradigms, accuracy is competitive: the anti-saccade "visual guide" condition achieves an F1 of 88.5%.
The pipeline is robust to low SNR, but sensitivity to anatomical variability and head motion remains. Outlier detection via classification achieves an F1 of 88.2%.
A plausible implication is that individualized preprocessing and learning-based spatial localization are critical for robust, generalizable fMRI2GES performance across heterogeneous acquisition settings.
6. Technical Implementation and Environment
MRGazer is implemented under a standard modern deep learning stack. Key details:
- Hardware: NVIDIA Tesla V100 (16 GB), Intel Xeon Gold 6230R, 32 GB RAM.
- Software: Ubuntu 18.04, Python 3.8, PyTorch 1.8 + CUDA 10.2, torchvision, scikit-image, nibabel, numpy, scipy, scikit-learn.
- Modular codebase:
1. fmri_preproc.py: motion correction, intensity normalization
2. eye_extract.py: ROI / bounding-box logic
3. retina_net.py: Retina-Net training/inference
4. resnet_gaze.py: 3D ResNet regression/outlier detection
5. train.py / eval.py: orchestration
The total pipeline achieves real-time operation (0.02 s/volume, roughly 50 Hz) for single-volume inference.
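A rough way to sanity-check such throughput figures on one's own hardware, reusing the GazeResNet3D sketch from Section 3 (input shape and iteration counts are arbitrary):

```python
import time
import torch

model = GazeResNet3D().eval()
vol = torch.randn(1, 1, 16, 24, 24)   # dummy single-volume crop

with torch.no_grad():
    for _ in range(10):                # warm-up iterations
        model(vol)
    t0 = time.perf_counter()
    n = 100
    for _ in range(n):
        model(vol)
    dt = (time.perf_counter() - t0) / n

print(f"{dt * 1e3:.2f} ms/volume ({1 / dt:.0f} Hz)")
```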
7. Limitations and Prospects
- Eyeball Extraction Robustness: Morphological methods may fail under low-SNR or anatomical variability. Parameter tuning and alternative segmenters (e.g., 3D U-Net) are advised.
- Regression Heads: Separate networks for x/y could be unified to improve parameter efficiency and reduce overfitting.
- Generalization: Screen calibration parameters are assumed constant; cross-scanner adaptation requires new strategies.
- Head Motion and Demographics: Performance is degraded in high-motion or young subjects. Integrating movement estimates or demographic covariates may mitigate this.
- Future Extensions: Performance on blinks or pursuit, higher temporal resolution, and leveraging multimodal (e.g., anatomical, head-pose) inputs are active areas of investigation.
By combining bespoke spatial preprocessing, lightweight end-to-end learning, and carefully benchmarked evaluation, fMRI2GES establishes the state-of-the-art for non-invasive, template-free eye tracking from fMRI, and offers a modular platform for further exploration of visual attention and oculomotor control (Wu et al., 2023).