This paper, "Dynadiff: Single-stage Decoding of Images from Continuously Evolving fMRI" (Careil et al., 20 May 2025 ), introduces a novel pipeline for reconstructing images from fMRI brain activity. It addresses two key challenges in the field: the common practice of preprocessing fMRI data in a way that discards its temporal dimension (e.g., using time-collapsed beta values) and the increasing complexity of multi-stage decoding pipelines. Dynadiff proposes a simplified, single-stage approach that operates directly on the continuous fMRI time-series data and achieves state-of-the-art performance.
The core idea of Dynadiff is to directly fine-tune a pretrained image-generation diffusion model using fMRI signals as conditioning information. Unlike previous methods that might involve separate training stages for fMRI encoding, feature alignment, and image generation, Dynadiff trains a single brain module jointly with a modified diffusion model.
The proposed method consists of two main components:
- Brain Module: This module projects the fMRI time-series into the conditioning space of the diffusion model. The input to the brain module is an fMRI time-series X ∈ ℝ^(V×T), where V is the number of voxels and T is the number of time samples in a given window. The module's architecture (see the PyTorch sketch after this list) includes:
- A subject-specific linear layer to project each volume (of V voxels) to 1552 channels, maintaining the T time samples.
- A timestep-specific linear layer, applying distinct weights per time sample.
- Layer normalization, GELU activation, and dropout.
- A linear temporal aggregation layer to merge the temporal dimension.
- A final linear layer that outputs an fMRI embedding with the same shape as the target conditional embedding (e.g., 257 patches, 768 channels, matching CLIP embeddings). This brain module is a single MLP block containing approximately 400M parameters.
- Brain-conditioned Diffusion Model: The paper uses a pretrained latent diffusion model (similar to those used in [ozcelik2023brain, scotti2023reconstructing]) which was originally trained to generate images conditioned on text and image embeddings (e.g., from CLIP [radford2021learning]). Dynadiff replaces the image embeddings with the fMRI embeddings generated by the brain module. Text embeddings are set to null (or a learned constant embedding for classifier-free guidance). This allows leveraging the pretrained model's generative capabilities.
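As a concrete illustration, here is a minimal PyTorch sketch of the brain module described above. The dimensions (1552 hidden channels, a 257×768 CLIP-shaped output) follow the paper's description, but the class, its initialization, and the exact wiring are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class BrainModule(nn.Module):
    """Maps an fMRI window (T time samples x V voxels) to a CLIP-shaped embedding."""

    def __init__(self, n_voxels: int, n_timesteps: int, n_subjects: int,
                 hidden: int = 1552, n_patches: int = 257, clip_dim: int = 768):
        super().__init__()
        # Subject-specific input projection (one weight matrix per subject;
        # in practice each subject may have a different voxel count).
        self.subject_proj = nn.ModuleList(
            nn.Linear(n_voxels, hidden) for _ in range(n_subjects)
        )
        # Timestep-specific linear layer: distinct weights per time sample.
        self.time_proj = nn.Parameter(0.02 * torch.randn(n_timesteps, hidden, hidden))
        self.norm = nn.LayerNorm(hidden)
        self.act = nn.GELU()
        self.drop = nn.Dropout(0.1)  # dropout rate is an assumption
        # Linear temporal aggregation: merges the T dimension.
        self.temporal_agg = nn.Linear(n_timesteps, 1)
        # Final projection to the target conditioning shape (257 x 768).
        self.out = nn.Linear(hidden, n_patches * clip_dim)
        self.n_patches, self.clip_dim = n_patches, clip_dim

    def forward(self, x: torch.Tensor, subject: int) -> torch.Tensor:
        # x: (B, T, V) fMRI window for one subject.
        h = self.subject_proj[subject](x)                     # (B, T, hidden)
        h = torch.einsum("btc,tcd->btd", h, self.time_proj)   # per-timestep weights
        h = self.drop(self.act(self.norm(h)))
        h = self.temporal_agg(h.transpose(1, 2)).squeeze(-1)  # (B, hidden)
        return self.out(h).view(-1, self.n_patches, self.clip_dim)
```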
The crucial aspect of Dynadiff is its single-stage training. The brain module and LoRA adapters [hu2021lora] applied to the cross-attention layers of the diffusion model (approximately 25M additional parameters) are trained jointly from scratch, while the main weights of the pretrained diffusion model remain frozen. The training optimizes a standard diffusion loss. Techniques like bicubic sampling for early timesteps and offset noise are used. Classifier-free guidance is enabled by using null conditioning in 10% of training iterations.
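A sketch of the single-stage training step, using Hugging Face diffusers and peft as stand-ins (the paper's actual backbone differs from Stable Diffusion, and its exact LoRA rank and noise settings, e.g., offset noise, are not reproduced here); the frozen base weights, cross-attention LoRA, standard diffusion loss, and 10% null-conditioning drop follow the description above:

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, UNet2DConditionModel
from peft import LoraConfig

# Stand-in pretrained backbone (the paper fine-tunes a different latent diffusion model).
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="scheduler")

# Freeze the base model, then inject LoRA adapters into the cross-attention
# ("attn2") projections only; the rank is an assumption.
unet.requires_grad_(False)
unet.add_adapter(LoraConfig(
    r=16, lora_alpha=16,
    target_modules=["attn2.to_q", "attn2.to_k", "attn2.to_v", "attn2.to_out.0"]))

def training_step(latents, fmri_embed, null_embed, p_null=0.1):
    # latents: (B, 4, 64, 64) VAE latents of the target images;
    # fmri_embed: (B, 257, 768) brain-module output; null_embed: (1, 257, 768).
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    # Classifier-free guidance training: drop the conditioning ~10% of the time.
    drop = torch.rand(latents.shape[0], 1, 1, device=latents.device) < p_null
    cond = torch.where(drop, null_embed, fmri_embed)
    pred = unet(noisy, t, encoder_hidden_states=cond).sample
    return F.mse_loss(pred, noise)  # standard epsilon-prediction diffusion loss
```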
For inference, the fMRI time-series is passed through the trained brain module to get the fMRI embedding. This embedding conditions the diffusion model's U-Net, which denoises an initial random Gaussian noise over a set number of steps (e.g., 20 DDIM steps with a guidance scale of 3). The resulting latent embedding is then decoded by the autoencoder to produce the reconstructed image.
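The corresponding inference loop, sketched under the same stand-in assumptions (the 20 DDIM steps and guidance scale of 3 follow the paper; the checkpoint and latent size are assumptions):

```python
import torch
from diffusers import AutoencoderKL, DDIMScheduler

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

@torch.no_grad()
def reconstruct(fmri_embed, null_embed, unet, steps=20, guidance=3.0):
    # fmri_embed, null_embed: (1, 257, 768) conditioning from the brain module.
    sched = DDIMScheduler.from_pretrained(
        "CompVis/stable-diffusion-v1-4", subfolder="scheduler")
    sched.set_timesteps(steps)
    latents = torch.randn(1, 4, 64, 64)  # start from pure Gaussian noise
    for t in sched.timesteps:
        # Two U-Net passes per step for classifier-free guidance.
        eps_c = unet(latents, t, encoder_hidden_states=fmri_embed).sample
        eps_u = unet(latents, t, encoder_hidden_states=null_embed).sample
        eps = eps_u + guidance * (eps_c - eps_u)
        latents = sched.step(eps, t, latents).prev_sample
    # Decode the final latents to pixel space with the frozen autoencoder.
    return vae.decode(latents / vae.config.scaling_factor).sample
```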
The experiments are conducted using the Natural Scenes Dataset (NSD) [allen2022massive], which provides 7T fMRI data from participants viewing a large set of natural images. The paper focuses on data from subjects who completed all 40 sessions (subjects 1, 2, 5, and 7), totaling 27,000 training trials and 3,000 test trials per subject (from 10,000 unique images, each presented three times). Crucially, Dynadiff uses the standard-resolution BOLD fMRI time-series data, which maintains the temporal dimension, unlike many previous methods that used time-collapsed beta values. The preprocessing involves temporal upsampling, spatial resampling, detrending, and z-score normalization within a posterior cortex ROI (nsdgeneral).
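A minimal sketch of the detrending and z-scoring steps (temporal upsampling and spatial resampling are omitted; the array shapes and ROI-mask handling are assumptions):

```python
import numpy as np
from scipy.signal import detrend

def preprocess_bold(bold: np.ndarray, roi_mask: np.ndarray) -> np.ndarray:
    # bold: (T, X, Y, Z) BOLD series; roi_mask: boolean (X, Y, Z), e.g. nsdgeneral.
    ts = bold[:, roi_mask]          # (T, V): keep only voxels inside the ROI
    ts = detrend(ts, axis=0)        # remove linear drift per voxel
    mu = ts.mean(axis=0, keepdims=True)
    sd = ts.std(axis=0, keepdims=True)
    return (ts - mu) / (sd + 1e-8)  # per-voxel z-score over time
```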
The paper evaluates Dynadiff against several baselines, including Brain-Diffuser [ozcelik2023brain], MindEye [scotti2023reconstructing], MindEye2 [scotti2024mindeye2], and WAVE [wang2024wave], on single-trial data. Metrics cover both low-level image similarity (SSIM, PixCorr, AlexNet) and high-level/semantic resemblance (CLIP, Inception, EfficientNet, SwAV, DreamSim [fu2023dreamsimlearningnewdimensions], mIoU).
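Several of these metrics (e.g., AlexNet(2/5), Inception, CLIP) are conventionally reported as two-way identification accuracy over paired features; a sketch of that computation, with feature extraction omitted:

```python
import torch
import torch.nn.functional as F

def two_way_identification(feat_rec: torch.Tensor, feat_gt: torch.Tensor) -> float:
    # feat_rec, feat_gt: (N, D) features of reconstructions and ground-truth
    # images; row i of each is a matched pair.
    sims = F.normalize(feat_rec, dim=1) @ F.normalize(feat_gt, dim=1).T  # (N, N)
    # Correct against distractor j when sim(rec_i, gt_i) > sim(rec_i, gt_j).
    correct = sims.diag().unsqueeze(1) > sims
    off_diag = ~torch.eye(sims.shape[0], dtype=torch.bool)
    return correct[off_diag].float().mean().item()
```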
Results demonstrate that Dynadiff achieves state-of-the-art performance on fMRI time-series decoding. It outperforms baselines, including the current state-of-the-art MindEye2 (which primarily uses beta values), on several key metrics like AlexNet(2/5), DreamSim, and CLIP-12. This indicates improved reconstruction of both low-level details and high-level semantic content. Qualitative results also show Dynadiff producing reconstructions that better capture object identity and scene composition compared to baselines.
A significant finding is the investigation into time-resolved decoding. By evaluating models on fMRI windows shifted relative to stimulus onset, the paper reveals that while a "General" model (trained on a fixed window) can generalize somewhat to nearby time points and even decode preceding/succeeding images at extreme shifts, "Specialized" models trained specifically for different time windows achieve the best performance at those respective times. This suggests that image representations in fMRI signals are dynamic and change over time, a phenomenon typically associated with higher temporal resolution techniques like M/EEG, but surprisingly observed here with fMRI.
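Schematically, this time-resolved evaluation amounts to sliding the decoding window relative to stimulus onset and scoring reconstructions at each offset (the decoder and scoring interfaces here are hypothetical):

```python
def evaluate_time_shifts(decoder, ts, onset, window_len, shifts, score_fn, target):
    # ts: (T, V) preprocessed time-series; onset: sample index of stimulus onset.
    # shifts: offsets (in samples) of the decoding window relative to onset.
    scores = {}
    for s in shifts:
        window = ts[onset + s : onset + s + window_len]  # (window_len, V)
        recon = decoder(window[None])                    # decode a single trial
        scores[s] = score_fn(recon, target)
    return scores
```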
Ablation studies confirm the importance of key design choices:
- Using an fMRI time window duration of 3-6 TRs (approx. 3.9-7.8s) is crucial for optimal performance (Figure 5).
- The timestep-specific layers and the position of the temporal aggregation layer in the brain module significantly impact performance (Table 2), highlighting the value of processing temporal information.
- The strategy for finetuning the diffusion model is critical, with LoRA on cross-attention layers proving most effective (Table 4). Freezing the diffusion model or finetuning larger subsets of parameters leads to worse results.
- The specific fMRI preprocessing pipeline impacts results, with the custom NSD preprocessing outperforming a standard fMRIPrep approach (Table 5).
- The multi-subject training capability of Dynadiff (sharing parameters across subjects except for subject-specific and timestep-specific layers) performs competitively with single-subject models and demonstrates improved data efficiency when finetuning on new subjects with limited data (Figure 6, Table 6); see the adaptation sketch after this list.
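Building on the BrainModule sketch above, adapting to a new subject with limited data might look like the following (note the paper also keeps timestep-specific layers subject-specific, which this simplified sketch omits):

```python
import torch.nn as nn

def add_subject(brain_module, n_voxels_new: int):
    # Append a fresh subject-specific input projection and freeze everything
    # else, so only the new subject's layer trains on its limited data.
    hidden = brain_module.subject_proj[0].out_features
    new_proj = nn.Linear(n_voxels_new, hidden)
    brain_module.subject_proj.append(new_proj)
    for p in brain_module.parameters():  # freeze shared (and old) parameters
        p.requires_grad_(False)
    for p in new_proj.parameters():      # train only the new projection
        p.requires_grad_(True)
    return brain_module
```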
Practical Implications and Implementation:
- Simplified Architecture: Dynadiff offers a much simpler implementation compared to multi-stage pipelines. A single joint training process for the brain module and LoRA adapters is less complex to manage and optimize.
- Direct Timeseries Decoding: The ability to decode directly from continuous fMRI timeseries removes the need for time-collapsing preprocessing steps like GLM beta value computation, which simplifies the overall pipeline and allows for time-resolved analysis.
- Leveraging Pretrained Diffusion Models: The approach successfully adapts powerful, off-the-shelf diffusion models for brain decoding, benefiting from their strong generative capabilities without needing to train a generative model from scratch. LoRA finetuning keeps the number of trainable parameters relatively small (Brain Module ~400M + LoRA ~25M), making training feasible.
- Hardware Requirements: Training requires significant resources: the authors report 8 A100 GPUs for 2.5 days with a batch size of 320 and DeepSpeed ZeRO stage 2 Offload. While the architecture is simpler, the computational cost of training a diffusion model end-to-end thus remains high. Inference is likely faster, but the model size (~425M trainable parameters plus the frozen diffusion model) implies substantial memory requirements.
- Data Requirements: The model requires a large dataset like NSD (tens of thousands of trials per subject) for optimal performance. While multi-subject pretraining helps, reliably decoding from arbitrary participants or with limited data remains an open challenge.
- Preprocessing Sensitivity: The choice of fMRI preprocessing significantly affects results, suggesting that careful consideration and potentially optimization of this step are crucial for practical applications.
- Ethical Considerations: The paper highlights the importance of ethical safeguards for brain decoding technologies, such as ensuring participant consent and mitigating privacy risks, for example by blurring faces in reconstructions.
In summary, Dynadiff presents a streamlined and effective method for image reconstruction from continuous fMRI signals. By adopting a single-stage training approach that directly fine-tunes a pretrained diffusion model, it simplifies the decoding pipeline while achieving state-of-the-art results on fMRI time-series data. The paper also provides intriguing evidence for dynamic image encoding in fMRI, paving the way for future research into time-resolved brain representations. However, deploying such models in practice requires addressing significant data requirements, computational resources, and ethical considerations.