fMRI2GES: fMRI-to-Gesture Decoding
- fMRI2GES is a framework that decodes fMRI signals to reconstruct co-speech gestures and eye gaze, bridging neural imaging with behavioral synthesis.
- It employs dual brain decoding alignment and conditional diffusion models to map high-dimensional fMRI data to structured behavioral outputs.
- The system achieves improved gesture and gaze reconstruction by leveraging ROI-specific inputs and self-supervised training protocols.
fMRI2GES refers to "fMRI-to-Gesture Estimation System," a class of frameworks for reconstructing temporally resolved behavioral outputs—such as eye gaze points or complex co-speech gestures—from functional magnetic resonance imaging (fMRI) data. Two notable paradigms are: (1) co-speech gesture reconstruction via Dual Brain Decoding Alignment as in "fMRI2GES: Co-speech Gesture Reconstruction from fMRI Signal with Dual Brain Decoding Alignment" (Zhu et al., 1 Dec 2025) and (2) gaze point decoding from fMRI (exemplified by the MRGazer system) (Wu et al., 2023). Both represent advances in non-invasive brain decoding and multi-modal behavior synthesis, differing fundamentally in architectural and neuroscientific emphasis.
1. Problem Formulation and Mathematical Foundations
fMRI2GES systems formalize decoding as a multivariate regression or generative mapping $f_\theta: \mathbf{x} \mapsto \mathbf{y}$ from a high-dimensional fMRI signal $\mathbf{x}$ to a structured behavioral output $\mathbf{y}$:
- Gestures: $\mathbf{y} = G \in \mathbb{R}^{T \times 49 \times 2}$, where $G$ encodes $T$ frames of 2D keypoints (49 per frame) (Zhu et al., 1 Dec 2025).
- Gaze: $\mathbf{y} = (g_x, g_y)$, mapping a (3D or 4D) fMRI volume to gaze coordinates (Wu et al., 2023).
Distinctly, the co-speech gesture variant must overcome the lack of directly paired fMRI–gesture data, motivating indirect supervision via intermediate representations (e.g., text embeddings) and multi-branch dual alignment.
Training objectives center on supervised mean-squared error (MSE) for direct regression and self-supervised noise-prediction losses for diffusion-based reconstruction. In DDPM-based gesture synthesis, the core objective is the standard noise-prediction loss

$$\mathcal{L}_{\text{DDPM}} = \mathbb{E}_{\mathbf{y}_0,\, \boldsymbol{\epsilon} \sim \mathcal{N}(0,\mathbf{I}),\, t}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{y}_t, t, \mathbf{c})\right\|^2\right],$$

where $\mathbf{y}_t$ is the noised gesture sequence at diffusion step $t$ and $\mathbf{c}$ is the conditioning signal (text or fMRI). For fMRI-to-gaze, the loss is standard MSE per output dimension:

$$\mathcal{L}_{\text{gaze}} = \frac{1}{N}\sum_{i=1}^{N}\left[\left(\hat{g}_x^{(i)} - g_x^{(i)}\right)^2 + \left(\hat{g}_y^{(i)} - g_y^{(i)}\right)^2\right].$$
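The PyTorch sketch below illustrates these two objectives in minimal form; `eps_model` (a conditional denoising network) and the noise schedule `alphas_cumprod` are hypothetical placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ddpm_noise_prediction_loss(eps_model, y0, cond, alphas_cumprod):
    """Standard DDPM objective: predict the noise injected into a clean
    gesture sequence y0 at a random timestep, conditioned on `cond`
    (text embeddings or fMRI features). `eps_model` and `alphas_cumprod`
    are assumed placeholders."""
    b = y0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=y0.device)
    eps = torch.randn_like(y0)
    a_bar = alphas_cumprod[t].view(b, *([1] * (y0.dim() - 1)))
    y_t = a_bar.sqrt() * y0 + (1.0 - a_bar).sqrt() * eps   # forward diffusion step
    return F.mse_loss(eps_model(y_t, t, cond), eps)

def gaze_mse_loss(pred_xy, target_xy):
    """Per-dimension MSE used for direct fMRI-to-gaze regression."""
    return F.mse_loss(pred_xy, target_xy)
```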
2. Model Architectures and Learning Paradigms
fMRI-to-gaze (MRGazer):
- Utilizes a two-stage pipeline: (1) morphological or deep-learning-based eyeball extraction; (2) residual 3D Res-Net regression per gaze dimension (Wu et al., 2023).
- The fMRI input, preprocessed to isolate the eyeballs, is fed into parallel Res-Net12 architectures for the x and y gaze outputs, employing 3D convolutions with batch normalization and ReLU activations throughout.
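A minimal PyTorch sketch of a residual 3D-convolutional gaze regressor in this spirit is shown below; layer widths and block counts are illustrative assumptions, not the published Res-Net12 configuration. One such regressor would be instantiated per gaze dimension.

```python
import torch
import torch.nn as nn

class ResBlock3D(nn.Module):
    """Basic 3D residual block: two conv-BN-ReLU layers with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm3d(channels)

    def forward(self, x):
        h = torch.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return torch.relu(x + h)

class GazeRegressor3D(nn.Module):
    """Regresses a single gaze coordinate from an eyeball-cropped fMRI volume."""
    def __init__(self, in_channels=1, width=16, n_blocks=3):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv3d(in_channels, width, 3, padding=1),
                                  nn.BatchNorm3d(width), nn.ReLU())
        self.blocks = nn.Sequential(*[ResBlock3D(width) for _ in range(n_blocks)])
        self.head = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                                  nn.Linear(width, 1))

    def forward(self, vol):  # vol: (B, 1, D, H, W) eyeball-cropped volume
        return self.head(self.blocks(self.stem(vol)))
```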
fMRI-to-gesture (fMRI2GES):
- Encompasses three pathways spanning the fMRI, text, and gesture modalities:
- F2T (fMRI-to-text): Encoder is a ridge regression mapping from GPT-2 embeddings to observed fMRI patterns; decoder is GPT-2 with nucleus sampling and a learned “likelihood” scorer.
- T2G (text-to-gesture): Conditional U-Net backbone for diffusion denoising, with cross-attention injecting word embeddings into each computational block.
- F2G (fMRI-to-gesture): Mirrors T2G with the text condition replaced by the fMRI signal, trained via a self-supervised dual alignment loss.
The architecture leverages conditional DDPMs, with denoising U-Nets parameterized to condition on auxiliary input (text or fMRI). For gesture outputs, the noise-prediction targets are aligned in latent space under self-supervision.
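A minimal sketch of how conditioning could be injected into a denoising block via cross-attention is given below; the module layout (1D convolution plus multi-head cross-attention) is an illustrative assumption rather than the exact U-Net block used in the paper.

```python
import torch
import torch.nn as nn

class CrossAttnBlock(nn.Module):
    """One denoising block: convolution over gesture latents, followed by
    cross-attention over conditioning tokens (text embeddings for T2G,
    fMRI features for F2G). Illustrative sketch only."""
    def __init__(self, dim, cond_dim, n_heads=4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)

    def forward(self, x, cond):
        # x: (B, T, dim) noisy gesture latents; cond: (B, S, cond_dim) condition tokens
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        attn_out, _ = self.attn(self.norm(h), cond, cond)  # inject condition
        return h + attn_out
```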
3. Training Procedures and Dual Alignment
Supervised phases:
- F2T: Trained on paired (fMRI, text) data (mapping fMRI to text).
- T2G: Trained on textual speech embeddings paired with corresponding gesture sequences.
Self-supervised/unsupervised phase:
- Dual Brain Decoding Alignment: Key innovation for fMRI-to-gesture in the absence of direct (fMRI, gesture) pairs. From the same fMRI input, two gesture reconstructions are generated:
- A) Cascaded branch: fMRI → text → gesture, chaining F2T and T2G, with the cascaded output acting as a pseudo-label.
- B) Direct branch: fMRI → gesture directly (F2G).
- Self-supervised loss is computed as MSE between the noise-prediction outputs of these branches at shared diffusion timesteps: $\mathcal{L}_{\text{align}} = \mathbb{E}_{t}\big[\|\boldsymbol{\epsilon}_\theta^{\text{F2G}}(\mathbf{y}_t, t, \mathbf{x}_{\text{fMRI}}) - \boldsymbol{\epsilon}_\theta^{\text{T2G}}(\mathbf{y}_t, t, \hat{\mathbf{c}}_{\text{text}})\|^2\big]$.
- The overall phase II objective blends the unconditional diffusion loss on real paired data with dual alignment on unpaired fMRI.
Implementation pseudo-code and procedural details are given in Algorithm 1 of (Zhu et al., 1 Dec 2025).
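A minimal sketch of the dual alignment step is given below, under assumed function signatures (`eps_f2g`, `eps_t2g` as the two branches' noise predictors); it illustrates the alignment mechanism rather than reproducing Algorithm 1.

```python
import torch
import torch.nn.functional as F

def dual_alignment_loss(eps_f2g, eps_t2g, y_t, t, fmri, pseudo_text):
    """Dual Brain Decoding Alignment sketch (assumed form): for the same noisy
    gesture latent y_t at timestep t, the trainable F2G branch conditions on the
    fMRI signal while the T2G branch conditions on the pseudo-text decoded from
    that same fMRI; their noise predictions are pulled together by MSE."""
    eps_direct = eps_f2g(y_t, t, fmri)                # direct branch (trainable)
    with torch.no_grad():
        eps_cascaded = eps_t2g(y_t, t, pseudo_text)   # cascaded branch as pseudo-label
    return F.mse_loss(eps_direct, eps_cascaded)
```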
4. Region-of-Interest Analysis and Neuroscientific Interpretation
Brain regional specificity is interrogated by restricting input fMRI voxels to various ROIs:
- Auditory cortex and speech areas (Broca’s, area Spt) yield optimal gesture reconstruction (lower MAE, APE; higher PCK), outperforming “all” or “motor” (hand region) voxel subsets.
- Whole-brain encoding models: Ridge regression from latent F2G network states to observed brain activity pinpoints information flow through EBA, IPS, S1H, M1H, and frontal eye fields, implying that high-level gesture representation is more tightly coupled to semantic/auditory networks than primary motor outputs.
This supports the embodied-cognition hypothesis that gestural intent during co-speech is encoded predominantly in auditory and language cortices.
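As a sketch of the encoding-model analysis, ridge regression from network latent states to voxel activity might look like the following; the function and variable names are illustrative, and the per-voxel correlation would be computed on held-out runs in practice.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_encoding_model(latent_states, voxel_activity, alpha=10.0):
    """Whole-brain encoding sketch: ridge regression from F2G latent states
    (n_timepoints x n_features) to voxel activity (n_timepoints x n_voxels).
    Per-voxel prediction correlation highlights ROIs carrying gesture information."""
    model = Ridge(alpha=alpha)
    model.fit(latent_states, voxel_activity)
    pred = model.predict(latent_states)  # use held-out data in practice
    r = np.array([np.corrcoef(pred[:, v], voxel_activity[:, v])[0, 1]
                  for v in range(voxel_activity.shape[1])])
    return model, r
```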
5. Experimental Protocols and Quantitative Performance
Datasets
| Task | Dataset | Modalities |
|---|---|---|
| F2T | 7 subjects, 3T fMRI | fMRI, text (story listening) |
| T2G | 144 h of video, 10 speakers | Text, 2D gesture keypoints |
| Gaze decoding | HBN Biobank, OpenNeuro | fMRI, gaze point labels |
Key preprocessing includes upsampling word/gesture sequences to synchronize with fMRI temporal resolution, and cropping fMRI volumes to relevant ROIs.
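A simple resampling routine for the temporal synchronization step, assuming linear interpolation (the published pipeline may use a different scheme), could look like this:

```python
import numpy as np

def resample_to_fmri_grid(sequence, seq_rate_hz, tr_seconds, n_volumes):
    """Resample a behavioral sequence (e.g., gesture keypoints sampled at the
    video frame rate) onto the fMRI acquisition grid by linear interpolation.
    Illustrative assumption, not the authors' exact preprocessing."""
    sequence = np.asarray(sequence, dtype=float)        # (T, ...) behavioral samples
    src_t = np.arange(len(sequence)) / seq_rate_hz      # source timestamps (s)
    tgt_t = np.arange(n_volumes) * tr_seconds           # fMRI volume timestamps (s)
    flat = sequence.reshape(len(sequence), -1)
    resampled = np.stack([np.interp(tgt_t, src_t, flat[:, k])
                          for k in range(flat.shape[1])], axis=1)
    return resampled.reshape((n_volumes,) + sequence.shape[1:])
```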
Metrics
| Metric | Definition |
|---|---|
| MAE | Mean absolute error over predicted coordinates |
| APE | Average Euclidean keypoint error (gesture) |
| PCK@δ | Proportion of keypoints within δ threshold |
| FGD | Fréchet distance in gesture feature space |
| BC | Beat Consistency |
| Diversity | Number of unique gesture modes |
| Gaze: EE | Euclidean error per gaze prediction |
| Gaze: r | Pearson correlation with ground-truth sequence |
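The keypoint-based metrics are straightforward to compute; the sketch below shows MAE and PCK@δ under the assumption that predictions and ground truth are arrays of per-frame 2D keypoints.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error over all predicted coordinates."""
    return float(np.abs(np.asarray(pred) - np.asarray(gt)).mean())

def pck(pred_kpts, gt_kpts, delta):
    """PCK@delta: fraction of keypoints whose Euclidean distance to the ground
    truth is within delta. Inputs: (n_frames, n_keypoints, 2) arrays."""
    dist = np.linalg.norm(np.asarray(pred_kpts) - np.asarray(gt_kpts), axis=-1)
    return float((dist <= delta).mean())
```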
Key Results
- Gesture decoding: Conditioning F2G on the true fMRI signal yields an MAE reduction (0.929→0.603), a PCK increase (0.206→0.451), and a significant diversity improvement (90→164 gesture modes) relative to noise input (Zhu et al., 1 Dec 2025).
- ROI effects: Auditory or speech area voxels yield best reconstruction, while "All" or "Motor" areas degrade performance.
- Human evaluation: F2G outputs outperform five state-of-the-art gesture generators in naturalness, content relevance, and diversity by human ratings.
- Gaze decoding (MRGazer): Achieves MAE_x = 1.11±0.69° and r = 0.91, outperforming prior pipelines in both accuracy and throughput (~0.02 s/volume).
6. Implementation Practices and Computational Considerations
For gaze decoding, extraction of the eyeball ROI may be morphological or deep neural (3D Retina-Net), with Res-Net12 regressors per gaze dimension. Deployment on modern GPUs (e.g., NVIDIA Tesla V100) yields inference rates of ~0.02 s/volume, supporting near-real-time application (Wu et al., 2023).
In gesture decoding, the model architecture unifies UNet-like denoising diffusion models for both T2G and F2G, leveraging modern deep NLP (GPT-2) components for text synthesis and cross-modal alignment.
Software stack relies on PyTorch (≥1.8), with supplementary tools (scikit-image, nibabel, scikit-learn) for fMRI preprocessing, component analysis, and metrics.
7. Limitations and Future Directions
- Temporal Resolution: Hemodynamic lag (∼10s) in fMRI necessitates upsampling of discrete behavioral sequences. Integration of high-temporal-resolution modalities (EEG/MEG) is a prospective solution (Zhu et al., 1 Dec 2025).
- Modality Gap: Datasets for narration-driven fMRI and large-scale video (gestures, gaze) exhibit distributional gaps that may hamper generalization. Multi-domain adaptation and co-training protocols are suggested.
- Extension Potential: Current systems produce only 2D behavioral outputs; extension to 3D gestures, richer facial or eye motor behavior, or even continuous sign language recognition is suggested.
- ROI Dependence: Robustness of the decoding to individual anatomical/functional variability, head motion, and scanner-specific confounds remains an open question, with voxel segmentation or additional covariate modeling plausible future improvements.
A plausible implication is that future BCI applications for gesture or gaze restoration in impaired populations may directly benefit from these architectures, especially as multi-modal brain data and behavioral annotation sets proliferate. However, cross-domain synthesis and longitudinal stability have yet to be comprehensively demonstrated.
For further technical and neuroscientific details, see "fMRI2GES: Co-speech Gesture Reconstruction from fMRI Signal with Dual Brain Decoding Alignment" (Zhu et al., 1 Dec 2025) and "MRGazer: Decoding Eye Gaze Points from Functional Magnetic Resonance Imaging in Individual Space" (Wu et al., 2023).