fMRI2GES: fMRI-to-Gesture Decoding

Updated 8 December 2025
  • fMRI2GES is a framework that decodes fMRI signals to reconstruct co-speech gestures and eye gaze, bridging neural imaging with behavioral synthesis.
  • It employs dual brain decoding alignment and conditional diffusion models to map high-dimensional fMRI data to structured behavioral outputs.
  • Accuracy improves when decoding is restricted to informative ROI-specific voxels and trained with self-supervised protocols, yielding better gesture and gaze reconstruction.

fMRI2GES refers to "fMRI-to-Gesture Estimation System," a class of frameworks for reconstructing temporally resolved behavioral outputs—such as eye gaze points or complex co-speech gestures—from functional magnetic resonance imaging (fMRI) data. Two notable paradigms are: (1) co-speech gesture reconstruction via Dual Brain Decoding Alignment as in "fMRI2GES: Co-speech Gesture Reconstruction from fMRI Signal with Dual Brain Decoding Alignment" (Zhu et al., 1 Dec 2025) and (2) gaze point decoding from fMRI (exemplified by the MRGazer system) (Wu et al., 2023). Both represent advances in non-invasive brain decoding and multi-modal behavior synthesis, differing fundamentally in architectural and neuroscientific emphasis.

1. Problem Formulation and Mathematical Foundations

fMRI2GES systems formalize decoding as a multivariate regression or generative mapping from the high-dimensional fMRI signal $c_f$ to structured behavior $y$:

  • Gestures: $\mathrm{F2G}: c_f \mapsto x$, where $x \in \mathbb{R}^{N \times 98}$ encodes $N$ frames of 2D keypoints (49 per frame) (Zhu et al., 1 Dec 2025).
  • Gaze: $f_\theta: X \mapsto y$, mapping the (3D or 4D) fMRI volume $X$ to gaze coordinates $y = (y_x, y_y) \in \mathbb{R}^2$ (Wu et al., 2023).

Distinctly, the co-speech gesture variant must overcome a lack of paired $\{c_f, x\}$ data, motivating indirect supervision via intermediate representations (e.g., text embeddings $c_x$) and multi-branch dual alignment.

Training objectives center on supervised mean-squared error (MSE) for direct regression and self-supervised noise-prediction losses for diffusion-based reconstruction. In DDPM-based gesture synthesis, the core objective is

$$\mathcal{L}_{\rm DDPM}(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I),\, t} \left\| \epsilon - \epsilon_\theta(x_t, t, c) \right\|_2^2 .$$

For fMRI-to-gaze decoding, the loss is the standard MSE per output dimension:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^N \left\| y^{(i)} - f_\theta(X^{(i)}) \right\|_2^2 .$$
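The two objectives translate directly into code. Below is a minimal PyTorch sketch, assuming a generic conditional denoiser `eps_model` and a gaze regressor `gaze_model` with illustrative interfaces (these names and the noise-schedule handling are assumptions, not the published implementations):

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, cond, alpha_bar):
    """Noise-prediction objective for conditional DDPM gesture synthesis.

    eps_model : callable epsilon_theta(x_t, t, cond)  (illustrative interface)
    x0        : clean gesture sequence, shape (B, N, 98)
    cond      : conditioning signal (text embedding c_x or fMRI feature c_f)
    alpha_bar : cumulative noise schedule, shape (T,)
    """
    B = x0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (B,), device=x0.device)
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(B, 1, 1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps      # forward diffusion q(x_t | x_0)
    return F.mse_loss(eps_model(x_t, t, cond), eps)   # || eps - eps_theta(x_t, t, c) ||^2

def gaze_mse_loss(gaze_model, fmri_vol, gaze_xy):
    """Direct MSE regression loss for fMRI-to-gaze decoding (MRGazer-style)."""
    pred = gaze_model(fmri_vol)                       # (B, 2) predicted (y_x, y_y)
    return F.mse_loss(pred, gaze_xy)
```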

2. Model Architectures and Learning Paradigms

fMRI-to-gaze (MRGazer):

  • Utilizes a two-stage pipeline: (1) morphological or deep-learning-based eyeball extraction; (2) residual 3D Res-Net regression per gaze dimension (Wu et al., 2023).
  • The fMRI input, preprocessed to isolate the eyeballs, is fed into parallel Res-Net12 architectures for the $x$ and $y$ outputs, employing 3D convolutions with batch normalization and ReLU activations throughout.
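The per-dimension regressor can be approximated by a compact 3D residual CNN. The sketch below is an assumption-laden stand-in (layer counts, channel widths, and the pooling head are illustrative and do not reproduce the published Res-Net12 configuration):

```python
import torch
import torch.nn as nn

class Residual3DBlock(nn.Module):
    """Basic 3D residual block: Conv3d -> BN -> ReLU -> Conv3d -> BN + skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm3d(channels)

    def forward(self, x):
        h = torch.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return torch.relu(x + h)

class GazeRegressor3D(nn.Module):
    """Regresses one gaze coordinate from an eyeball-cropped fMRI volume."""
    def __init__(self, in_channels=1, width=32, n_blocks=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(in_channels, width, 3, padding=1),
            nn.BatchNorm3d(width), nn.ReLU(),
        )
        self.blocks = nn.Sequential(*[Residual3DBlock(width) for _ in range(n_blocks)])
        self.head = nn.Linear(width, 1)

    def forward(self, vol):                      # vol: (B, 1, D, H, W)
        h = self.blocks(self.stem(vol))
        h = h.mean(dim=(2, 3, 4))                # global average pooling over the volume
        return self.head(h)                      # (B, 1) gaze coordinate
```

In the MRGazer setup, one such regressor would be instantiated per gaze dimension ($x$ and $y$).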

fMRI-to-gesture (fMRI2GES):

  • Encompasses three modalities (fMRI, text, gesture) connected by three pathways:
    • F2T (fMRI-to-text): Encoder is a ridge regression mapping from GPT-2 embeddings to observed fMRI patterns; decoder is GPT-2 with nucleus sampling and a learned “likelihood” scorer.
    • T2G (text-to-gesture): Conditional U-Net backbone for diffusion denoising, with cross-attention injecting word embeddings into each computational block.
    • F2G (fMRI-to-gesture): Mirrors T2G with $c_x$ replaced by $c_f$, trained via a self-supervised dual alignment loss.

The architecture leverages conditional DDPMs, with denoising U-Nets parameterized to condition on auxiliary input (text or fMRI). For gesture outputs, the noise-prediction targets are aligned in latent space under self-supervision.
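As an illustration of the conditioning mechanism, a denoiser block might inject the auxiliary signal through cross-attention roughly as follows; this is a hedged sketch, and the paper's exact U-Net block layout is not reproduced here:

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """One illustrative denoiser block: normalize the noisy gesture tokens,
    cross-attend to the conditioning tokens (text c_x or fMRI c_f), then apply
    a position-wise feed-forward layer."""
    def __init__(self, dim, cond_dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, x, cond):
        # x: (B, N, dim) noisy gesture tokens; cond: (B, M, cond_dim) condition tokens
        h = self.norm(x)
        attn_out, _ = self.attn(query=h, key=cond, value=cond)
        x = x + attn_out                  # inject the condition via cross-attention
        return x + self.ff(self.norm(x))  # residual feed-forward
```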

3. Training Procedures and Dual Alignment

Supervised phases:

  • F2T: Trained on $(c_f, c_x)$ pairs (mapping fMRI to text).
  • T2G: Trained on textual speech embeddings and corresponding gesture pairs.

Self-supervised/unsupervised phase:

  • Dual Brain Decoding Alignment: Key innovation for fMRI-to-gesture in the absence of direct pairs. From the same $c_f$, two gesture reconstructions are generated:
    • A) Cascaded branch: $c_f \rightarrow c_x' \rightarrow x_t' \rightarrow \hat{x}$, with $\hat{x}$ acting as a pseudo-label.
    • B) Direct branch: $c_f \rightarrow x_t'' \rightarrow \tilde{x}$ directly.
  • The self-supervised loss is computed as a timestep-weighted MSE between the noise-prediction outputs of the two branches at shared diffusion timesteps: $\mathcal{L}_{\rm dual}(\theta_x, \theta_f) = \mathbb{E}_{t,\, x_t'} \sqrt{ \frac{1 - \bar{\alpha}_t}{\bar{\alpha}_t} \left\| \epsilon_{\theta_x}(x_t', c_x', t) - \epsilon_{\theta_f}(x_t', c_f, t) \right\|_2^2 }$.
  • The overall phase II objective blends the unconditional diffusion loss on real $(x, c_x)$ pairs with dual alignment on unpaired $(c_f, c_x')$.

Implementation pseudo-code and procedural details are given in Algorithm 1 of (Zhu et al., 1 Dec 2025).
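As a rough, non-authoritative illustration of the dual alignment objective (not the paper's Algorithm 1), the loss could be computed as below; `eps_text` and `eps_fmri` stand for the two conditional denoisers $\epsilon_{\theta_x}$ and $\epsilon_{\theta_f}$, `cascade_text` is the F2T pathway producing a pseudo text embedding $c_x'$, and `x_pseudo` is the pseudo-label $\hat{x}$ from the cascaded branch (all names, shapes, and schedule handling are assumptions):

```python
import torch

def dual_alignment_loss(eps_text, eps_fmri, cascade_text, c_f, x_pseudo, alpha_bar):
    """Self-supervised dual brain decoding alignment loss (illustrative sketch).

    eps_text     : denoiser epsilon_{theta_x}(x_t, c_x, t) of the T2G branch
    eps_fmri     : denoiser epsilon_{theta_f}(x_t, c_f, t) of the F2G branch
    cascade_text : F2T pathway producing pseudo text embedding c_x' from c_f
    x_pseudo     : pseudo gesture x_hat from the cascaded branch, (B, N, 98)
    alpha_bar    : cumulative noise schedule, shape (T,)
    """
    B = x_pseudo.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (B,), device=x_pseudo.device)
    a = alpha_bar[t].view(B, 1, 1)
    noise = torch.randn_like(x_pseudo)
    x_t = a.sqrt() * x_pseudo + (1.0 - a).sqrt() * noise   # shared noisy sample x_t'

    c_x_prime = cascade_text(c_f)                          # pseudo text condition c_x'
    e_x = eps_text(x_t, c_x_prime, t)                      # epsilon_{theta_x}(x_t', c_x', t)
    e_f = eps_fmri(x_t, c_f, t)                            # epsilon_{theta_f}(x_t', c_f, t)

    sq_norm = (e_x - e_f).pow(2).flatten(1).sum(dim=1)     # ||e_x - e_f||_2^2 per sample
    w = (1.0 - alpha_bar[t]) / alpha_bar[t]                # timestep weight (1 - a_bar)/a_bar
    return torch.sqrt(w * sq_norm).mean()                  # expectation over t and x_t'
```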

4. Region-of-Interest Analysis and Neuroscientific Interpretation

Brain regional specificity is interrogated by restricting input fMRI voxels to various ROIs:

  • Auditory cortex and speech areas (Broca’s, area Spt) yield optimal gesture reconstruction (lower MAE, APE; higher PCK), outperforming “all” or “motor” (hand region) voxel subsets.
  • Whole-brain encoding models: Ridge regression from latent F2G network states to observed brain activity pinpoints information flow through EBA, IPS, S1H, M1H, and frontal eye fields, implying that high-level gesture representation is more tightly coupled to semantic/auditory networks than primary motor outputs.

This supports the embodied-cognition hypothesis that gestural intent during co-speech is encoded predominantly in auditory and language cortices.
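The whole-brain encoding analysis amounts to a voxel-wise ridge regression from F2G latent states to BOLD responses. The following scikit-learn sketch is a minimal stand-in (regularization grid, split, and scoring are illustrative choices, not the paper's protocol):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def fit_encoding_model(latents, voxels, alphas=(1.0, 10.0, 100.0, 1000.0)):
    """Fit a voxel-wise ridge encoding model: BOLD activity ~ F2G latent states.

    latents : (T, d) latent features from the F2G network over T fMRI volumes
    voxels  : (T, V) observed BOLD responses for V voxels
    Returns per-voxel prediction correlations on a held-out temporal split.
    """
    X_tr, X_te, Y_tr, Y_te = train_test_split(latents, voxels,
                                              test_size=0.2, shuffle=False)
    model = RidgeCV(alphas=alphas).fit(X_tr, Y_tr)
    Y_hat = model.predict(X_te)

    # Pearson correlation per voxel between predicted and observed activity.
    Yc = Y_te - Y_te.mean(0)
    Hc = Y_hat - Y_hat.mean(0)
    r = (Yc * Hc).sum(0) / (np.linalg.norm(Yc, axis=0) * np.linalg.norm(Hc, axis=0) + 1e-8)
    return r
```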

5. Experimental Protocols and Quantitative Performance

Datasets

| Task | Dataset | Modalities |
|------|---------|------------|
| F2T | 7 subjects, 3T fMRI | fMRI, text (story listening) |
| T2G | 144 hr video, 10 speakers | Text, 2D gesture keypoints |
| Gaze decoding | HBN Biobank, OpenNeuro | fMRI, gaze point labels |

Key preprocessing includes upsampling word/gesture sequences to synchronize with fMRI temporal resolution, and cropping fMRI volumes to relevant ROIs.
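A hedged sketch of these preprocessing steps, resampling a behavioral sequence onto the fMRI acquisition grid and masking volumes to an ROI (the TR value and the source of the mask are assumptions):

```python
import numpy as np

def resample_to_fmri(seq_values, seq_times, n_vols, tr=2.0):
    """Linearly resample a behavioral sequence onto fMRI volume times.

    seq_values : (T_beh, D) behavioral samples (e.g., flattened 2D gesture keypoints)
    seq_times  : (T_beh,) timestamps of behavioral samples in seconds
    n_vols     : number of fMRI volumes
    tr         : repetition time in seconds (illustrative value)
    """
    vol_times = np.arange(n_vols) * tr
    resampled = np.stack([np.interp(vol_times, seq_times, seq_values[:, d])
                          for d in range(seq_values.shape[1])], axis=1)
    return resampled                                   # (n_vols, D)

def crop_to_roi(fmri_4d, roi_mask):
    """Keep only voxels inside a binary ROI mask.

    fmri_4d  : (X, Y, Z, T) fMRI data
    roi_mask : (X, Y, Z) boolean mask (e.g., auditory cortex)
    Returns a (T, V) matrix of ROI voxel time series.
    """
    return fmri_4d[roi_mask].T                         # (T, n_roi_voxels)
```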

Metrics

| Metric | Definition |
|--------|------------|
| MAE | $\frac{1}{N}\sum_n \lVert \hat{x}_n - x_n \rVert_1$ |
| APE | Average Euclidean keypoint error (gesture) |
| PCK@δ | Proportion of keypoints within threshold δ |
| FGD | Fréchet distance in gesture feature space |
| BC | Beat Consistency |
| Diversity | Number of unique gesture modes |
| EE (gaze) | Euclidean error per gaze prediction |
| $r$ (gaze) | Pearson correlation with the ground-truth sequence |
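For concreteness, minimal reference implementations of the keypoint metrics above might look as follows (array shapes and the normalization of the δ threshold are assumptions):

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error over all coordinates; pred, gt: (N, K, 2) keypoints."""
    return np.abs(pred - gt).mean()

def ape(pred, gt):
    """Average per-keypoint Euclidean error."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pck(pred, gt, delta=0.1):
    """Proportion of keypoints whose Euclidean error falls below delta."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    return (dist < delta).mean()
```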

Key Results

  • Gesture decoding: Conditioning F2G on true $c_f$ yields MAE reduction (0.929→0.603), PCK increase (0.206→0.451), and significant diversity improvement (90→164 gesture modes) relative to noise input (Zhu et al., 1 Dec 2025).
  • ROI effects: Auditory or speech area voxels yield best reconstruction, while "All" or "Motor" areas degrade performance.
  • Human evaluation: F2G outputs outperform five state-of-the-art gesture generators in human ratings of naturalness, content relevance, and diversity.
  • Gaze decoding (MRGazer): Achieves MAE$_x$ = 1.11±0.69° and $r_x$ = 0.91, outperforming prior pipelines in both accuracy and throughput (~0.02 s/vol).

6. Implementation Practices and Computational Considerations

For gaze decoding, extraction of the eyeball ROI may be morphological or deep neural (3D Retina-Net), with Res-Net12 regressors per gaze dimension. Deployment on modern GPUs (e.g., NVIDIA Tesla V100) yields inference rates of 0.02s/volume, supporting near-real-time application (Wu et al., 2023).

In gesture decoding, the model architecture unifies U-Net-like denoising diffusion models for both T2G and F2G, leveraging modern deep NLP components (GPT-2) for text synthesis and cross-modal alignment.

The software stack relies on PyTorch (≥1.8), with supplementary tools (scikit-image, nibabel, scikit-learn) for fMRI preprocessing, component analysis, and metric computation.

7. Limitations and Future Directions

  • Temporal Resolution: Hemodynamic lag (∼10s) in fMRI necessitates upsampling of discrete behavioral sequences. Integration of high-temporal-resolution modalities (EEG/MEG) is a prospective solution (Zhu et al., 1 Dec 2025).
  • Modality Gap: Datasets for narration-driven fMRI and large-scale video (gestures, gaze) exhibit distributional gaps that may hamper generalization. Multi-domain adaptation and co-training protocols are suggested.
  • Extension Potential: Current systems produce only 2D behavioral outputs; extension to 3D gestures, richer facial or eye motor behavior, or even continuous sign language recognition is suggested.
  • ROI Dependence: Robustness of the decoding to individual anatomical/functional variability, head motion, and scanner-specific confounds remains an open question; improved voxel segmentation or additional covariate modeling are plausible future improvements.

A plausible implication is that future BCI applications for gesture or gaze restoration in impaired populations may directly benefit from these architectures, especially as multi-modal brain data and behavioral annotation sets proliferate. However, cross-domain synthesis and longitudinal stability have yet to be comprehensively demonstrated.


For further technical and neuroscientific details, see "fMRI2GES: Co-speech Gesture Reconstruction from fMRI Signal with Dual Brain Decoding Alignment" (Zhu et al., 1 Dec 2025) and "MRGazer: Decoding Eye Gaze Points from Functional Magnetic Resonance Imaging in Individual Space" (Wu et al., 2023).
