fMRI2GES: fMRI-to-Gesture Decoding

Updated 8 December 2025
  • fMRI2GES is a framework that decodes fMRI signals to reconstruct co-speech gestures and eye gaze, bridging neural imaging with behavioral synthesis.
  • It employs dual brain decoding alignment and conditional diffusion models to map high-dimensional fMRI data to structured behavioral outputs.
  • The system demonstrates enhanced accuracy by leveraging ROI-specific data and self-supervised training protocols for improved gesture and gaze reconstruction.

fMRI2GES refers to "fMRI-to-Gesture Estimation System," a class of frameworks for reconstructing temporally resolved behavioral outputs—such as eye gaze points or complex co-speech gestures—from functional magnetic resonance imaging (fMRI) data. Two notable paradigms are: (1) co-speech gesture reconstruction via Dual Brain Decoding Alignment as in "fMRI2GES: Co-speech Gesture Reconstruction from fMRI Signal with Dual Brain Decoding Alignment" (Zhu et al., 1 Dec 2025) and (2) gaze point decoding from fMRI (exemplified by the MRGazer system) (Wu et al., 2023). Both represent advances in non-invasive brain decoding and multi-modal behavior synthesis, differing fundamentally in architectural and neuroscientific emphasis.

1. Problem Formulation and Mathematical Foundations

fMRI2GES systems formalize decoding as a multivariate regression or generative mapping from a high-dimensional fMRI signal $c_f$ to structured behavior $y$:

  • Gestures: $F2G: c_f \mapsto x$, where $x \in \mathbb{R}^{N \times 98}$ encodes $N$ frames of 2D keypoints (49 per frame) (Zhu et al., 1 Dec 2025).
  • Gaze: $f_\theta: X \mapsto y$, mapping a (3D or 4D) fMRI volume $X$ to gaze coordinates $y = (y_x, y_y) \in \mathbb{R}^2$ (Wu et al., 2023).

In contrast to gaze decoding, the co-speech gesture variant must overcome a lack of paired $\{c_f, x\}$ data, motivating indirect supervision via intermediate representations (e.g., text embeddings $c_x$) and multi-branch dual alignment.

Training objectives center on supervised mean-squared error (MSE) for direct regression and self-supervised noise-prediction losses for diffusion-based reconstruction. In DDPM-based gesture synthesis, the core objective is

$$\mathcal{L}_{\rm DDPM}(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I),\, t} \left\| \epsilon - \epsilon_\theta(x_t, t, c) \right\|_2^2.$$

For fMRI-to-gaze decoding, the loss is a standard MSE per output dimension:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^N \| y^{(i)} - f_\theta(X^{(i)}) \|_2^2.$$
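The two objectives can be sketched in PyTorch as follows (a minimal illustration; the model callables, tensor shapes, and noise-schedule handling are assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, cond, alphas_bar):
    """Noise-prediction loss: sample a timestep and noise, diffuse x0, and
    regress the injected noise with the conditional denoiser eps_model."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_bar), (b,), device=x0.device)
    a_bar = alphas_bar[t].view(b, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # forward diffusion q(x_t | x_0)
    return F.mse_loss(eps_model(x_t, t, cond), eps)

def gaze_mse_loss(f_theta, X, y):
    """Standard MSE between predicted and true gaze coordinates."""
    return F.mse_loss(f_theta(X), y)
```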

2. Model Architectures and Learning Paradigms

fMRI-to-gaze (MRGazer):

  • Utilizes a two-stage pipeline: (1) morphological or deep-learning-based eyeball extraction; (2) residual 3D Res-Net regression per gaze dimension (Wu et al., 2023).
  • The fMRI input, preprocessed to isolate the eyeballs, is fed into parallel Res-Net12 architectures for the $x$ and $y$ outputs, employing 3D convolutions with batch normalization and ReLU activations throughout.
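A simplified per-dimension 3D convolutional regressor in this spirit might look as follows (layer widths, crop size, and names are illustrative assumptions, not the published Res-Net12 configuration):

```python
import torch
import torch.nn as nn

class GazeRegressor3D(nn.Module):
    """Toy 3D CNN mapping an eyeball-cropped fMRI volume to one gaze coordinate."""
    def __init__(self, in_channels: int = 1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.BatchNorm3d(16), nn.ReLU(inplace=True),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm3d(32), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, vol: torch.Tensor) -> torch.Tensor:
        # vol: (batch, channels, depth, height, width)
        return self.head(self.features(vol).flatten(1))

# One regressor per gaze dimension, applied to the same cropped volume.
reg_x, reg_y = GazeRegressor3D(), GazeRegressor3D()
vol = torch.randn(2, 1, 16, 32, 32)                  # assumed crop size
gaze = torch.cat([reg_x(vol), reg_y(vol)], dim=1)    # (batch, 2)
```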

fMRI-to-gesture (fMRI2GES):

  • Encompasses three modalities and pathways:
    • F2T (fMRI-to-text): Encoder is a ridge regression mapping from GPT-2 embeddings to observed fMRI patterns; decoder is GPT-2 with nucleus sampling and a learned “likelihood” scorer (a sketch of this encoder follows the list).
    • T2G (text-to-gesture): Conditional U-Net backbone for diffusion denoising, with cross-attention injecting word embeddings into each computational block.
    • F2G (fMRI-to-gesture): Mirrors T2G with $c_x$ replaced by $c_f$, trained via a self-supervised dual alignment loss.
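For the F2T encoder, a minimal scikit-learn sketch of a ridge encoding model plus a likelihood-style scorer over candidate decodings (array shapes and the scoring rule are assumptions for illustration, not the paper's exact procedure):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Assumed shapes: GPT-2 embeddings of presented words (T x 768) aligned in time
# with fMRI responses (T x V voxels).
emb = np.random.randn(500, 768)
fmri = np.random.randn(500, 2000)

# Encoding model: predict voxel responses from text embeddings.
encoder = Ridge(alpha=1.0).fit(emb, fmri)

def score_candidate(candidate_emb: np.ndarray, observed_fmri: np.ndarray) -> float:
    """Score a candidate text continuation by how well its predicted fMRI
    response matches the observed response (higher is better)."""
    pred = encoder.predict(candidate_emb)
    return -float(np.mean((pred - observed_fmri) ** 2))
```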

The architecture leverages conditional DDPMs, with denoising U-Nets parameterized to condition on auxiliary input (text or fMRI). For gesture outputs, the noise-prediction targets are aligned in latent space under self-supervision.
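The cross-attention conditioning mechanism can be sketched with a single block (dimensions and module layout are assumptions rather than the published U-Net):

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Injects a conditioning sequence (text or fMRI embeddings) into
    denoiser features via multi-head cross-attention."""
    def __init__(self, dim: int = 256, cond_dim: int = 768, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.to_kv = nn.Linear(cond_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, h: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # h: (batch, frames, dim) denoiser features; cond: (batch, tokens, cond_dim)
        kv = self.to_kv(cond)
        out, _ = self.attn(self.norm(h), kv, kv)
        return h + out  # residual connection around the attention update
```

The same block serves both T2G and F2G: only the conditioning tensor changes (word embeddings $c_x$ versus fMRI features $c_f$).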

3. Training Procedures and Dual Alignment

Supervised phases:

  • F2T: Trained on $(c_f, c_x)$ pairs (mapping fMRI to text).
  • T2G: Trained on textual speech embeddings and corresponding gesture pairs.

Self-supervised/unsupervised phase:

  • Dual Brain Decoding Alignment: the key innovation for fMRI-to-gesture decoding in the absence of direct pairs. From the same $c_f$, two gesture reconstructions are generated:
    • A) Cascaded branch: $c_f \rightarrow c_x' \rightarrow x_t' \rightarrow \hat{x}$, with $\hat{x}$ acting as a pseudo-label.
    • B) Direct branch: $c_f \rightarrow x_t'' \rightarrow \tilde{x}$.
  • The self-supervised loss penalizes disagreement between the noise predictions of the two branches at shared diffusion timesteps: $$\mathcal{L}_{\rm dual}(\theta_x, \theta_f) = \mathbb{E}_{t, x_t'} \sqrt{ \frac{1 - \bar{\alpha}_t}{\bar{\alpha}_t} \| \epsilon_{\theta_x}(x_t', c_x', t) - \epsilon_{\theta_f}(x_t', c_f, t) \|_2^2 }.$$
  • The overall phase II objective blends the unconditional diffusion loss on real $(x, c_x)$ pairs with dual alignment on unpaired $(c_f, c_x')$.
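A condensed training-step sketch of the dual alignment loss (a simplification under assumed callables and shapes, not the paper's Algorithm 1):

```python
import torch

def dual_alignment_step(eps_text, eps_fmri, f2t, x_shape, c_f, alphas_bar):
    """One self-supervised step: the text-conditioned and fMRI-conditioned
    denoisers must agree on the noise prediction for a shared noised gesture
    latent, with the text condition decoded from the same fMRI signal."""
    c_x_prime = f2t(c_f)                                   # cascaded branch: fMRI -> pseudo-text
    t = torch.randint(0, len(alphas_bar), (c_f.shape[0],), device=c_f.device)
    x_t = torch.randn(x_shape, device=c_f.device)          # stand-in for the shared latent x_t'
    eps_a = eps_text(x_t, t, c_x_prime)                    # cascaded branch prediction
    eps_b = eps_fmri(x_t, t, c_f)                          # direct branch prediction
    sq_norm = ((eps_a - eps_b) ** 2).flatten(1).sum(dim=1)
    ratio = (1 - alphas_bar[t]) / alphas_bar[t]
    return (ratio * sq_norm).sqrt().mean()                 # matches the L_dual form above
```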

Implementation pseudo-code and procedural details are given in Algorithm 1 of (Zhu et al., 1 Dec 2025).

4. Region-of-Interest Analysis and Neuroscientific Interpretation

Brain regional specificity is interrogated by restricting input fMRI voxels to various ROIs:

  • Auditory cortex and speech areas (Broca’s, area Spt) yield optimal gesture reconstruction (lower MAE, APE; higher PCK), outperforming “all” or “motor” (hand region) voxel subsets.
  • Whole-brain encoding models: Ridge regression from latent F2G network states to observed brain activity pinpoints information flow through EBA, IPS, S1H, M1H, and frontal eye fields, implying that high-level gesture representation is more tightly coupled to semantic/auditory networks than primary motor outputs.

This supports the embodied-cognition hypothesis that gestural intent during co-speech is encoded predominantly in auditory and language cortices.

5. Experimental Protocols and Quantitative Performance

Datasets

| Task | Dataset | Modalities |
|---|---|---|
| F2T | 7 subjects, 3T fMRI | fMRI, text (story listening) |
| T2G | 144 hr video, 10 speakers | Text, 2D gesture keypoints |
| Gaze decoding | HBN Biobank, OpenNeuro | fMRI, gaze point labels |

Key preprocessing includes upsampling word/gesture sequences to synchronize with fMRI temporal resolution, and cropping fMRI volumes to relevant ROIs.
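As a minimal illustration of placing the two streams on a common temporal grid (the sampling rates and the use of linear interpolation are assumptions for illustration):

```python
import numpy as np

def resample(seq: np.ndarray, src_hz: float, tgt_hz: float) -> np.ndarray:
    """Linearly resample a (frames, features) sequence from src_hz to tgt_hz."""
    duration = len(seq) / src_hz
    src_t = np.arange(len(seq)) / src_hz
    tgt_t = np.arange(0.0, duration, 1.0 / tgt_hz)
    return np.stack(
        [np.interp(tgt_t, src_t, seq[:, k]) for k in range(seq.shape[1])], axis=1
    )

# Example: bring TR = 2 s fMRI features (0.5 Hz) up to a 25 Hz gesture frame rate
# before pairing them for training.
fmri_feats = np.random.randn(15, 512)                           # 30 s, one row per TR
fmri_upsampled = resample(fmri_feats, src_hz=0.5, tgt_hz=25.0)  # (750, 512)
```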

Metrics

| Metric | Definition |
|---|---|
| MAE | $\frac{1}{N}\sum_n \lVert \hat{x}_n - x_n \rVert_1$ |
| APE | Average Euclidean keypoint error (gesture) |
| PCK@δ | Proportion of keypoints within threshold δ |
| FGD | Fréchet distance in gesture feature space |
| BC | Beat Consistency |
| Diversity | Number of unique gesture modes |
| Gaze: $E$ | Euclidean error per gaze prediction |
| Gaze: $r$ | Pearson correlation with the ground-truth sequence |
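Two of these metrics are simple enough to state directly (a sketch; the PCK threshold convention and normalization vary across papers and are assumed here):

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error over all frames and keypoint coordinates."""
    return float(np.mean(np.abs(pred - gt)))

def pck(pred: np.ndarray, gt: np.ndarray, delta: float = 0.1) -> float:
    """PCK@delta: fraction of keypoints whose Euclidean distance to the
    ground truth falls below the threshold delta.
    pred, gt: (frames, keypoints, 2) arrays of 2D keypoints."""
    dist = np.linalg.norm(pred - gt, axis=-1)   # (frames, keypoints)
    return float(np.mean(dist < delta))
```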

Key Results

  • Gesture decoding: Conditioning F2G on the true $c_f$ yields an MAE reduction (0.929→0.603), a PCK increase (0.206→0.451), and a significant diversity improvement (90→164 gesture modes) relative to noise input (Zhu et al., 1 Dec 2025).
  • ROI effects: Auditory or speech area voxels yield best reconstruction, while "All" or "Motor" areas degrade performance.
  • Human evaluation: F2G outputs outperform five state-of-the-art gesture generators in naturalness, content relevance, and diversity by human ratings.
  • Gaze decoding (MRGazer): Achieves MAE$_x$ = 1.11±0.69°, $r_x$ = 0.91, outperforming prior pipelines in both accuracy and throughput (~0.02 s/volume).

6. Implementation Practices and Computational Considerations

For gaze decoding, extraction of the eyeball ROI may be performed morphologically or with a deep network (3D Retina-Net), followed by Res-Net12 regressors per gaze dimension. Deployment on modern GPUs (e.g., NVIDIA Tesla V100) yields inference rates of 0.02 s/volume, supporting near-real-time application (Wu et al., 2023).

In gesture decoding, the model architecture unifies U-Net-like denoising diffusion models for both T2G and F2G, leveraging modern deep NLP (GPT-2) components for text synthesis and cross-modal alignment.

The software stack relies on PyTorch (≥1.8), with supplementary tools (scikit-image, nibabel, scikit-learn) for fMRI preprocessing, component analysis, and metrics.

7. Limitations and Future Directions

  • Temporal Resolution: Hemodynamic lag (∼10s) in fMRI necessitates upsampling of discrete behavioral sequences. Integration of high-temporal-resolution modalities (EEG/MEG) is a prospective solution (Zhu et al., 1 Dec 2025).
  • Modality Gap: Datasets for narration-driven fMRI and large-scale video (gestures, gaze) exhibit distributional gaps that may hamper generalization. Multi-domain adaptation and co-training protocols are suggested.
  • Extension Potential: Current systems produce only 2D behavioral outputs; extension to 3D gestures, richer facial or eye motor behavior, or even continuous sign language recognition is suggested.
  • ROI Dependence: Robustness of the decoding to individual anatomical/functional variability, head motion, and scanner-specific confounds remains an open question, with voxel segmentation or additional covariate modeling plausible future improvements.

A plausible implication is that future BCI applications for gesture or gaze restoration in impaired populations may directly benefit from these architectures, especially as multi-modal brain data and behavioral annotation sets proliferate. However, cross-domain synthesis and longitudinal stability have yet to be comprehensively demonstrated.


For further technical and neuroscientific details, see "fMRI2GES: Co-speech Gesture Reconstruction from fMRI Signal with Dual Brain Decoding Alignment" (Zhu et al., 1 Dec 2025) and "MRGazer: Decoding Eye Gaze Points from Functional Magnetic Resonance Imaging in Individual Space" (Wu et al., 2023).
