NeuroSwift: A Lightweight Cross-Subject Framework for fMRI Visual Reconstruction of Complex Scenes

Published 2 Oct 2025 in cs.CV and cs.HC | (2510.02266v1)

Abstract: Reconstructing visual information from brain activity via computer vision technology provides an intuitive understanding of visual neural mechanisms. Despite progress in decoding fMRI data with generative models, achieving accurate cross-subject reconstruction of visual stimuli remains challenging and computationally demanding. This difficulty arises from inter-subject variability in neural representations and the brain's abstract encoding of core semantic features in complex visual inputs. To address these challenges, we propose NeuroSwift, which integrates complementary adapters via diffusion: AutoKL for low-level features and CLIP for semantics. NeuroSwift's CLIP Adapter is trained on Stable Diffusion generated images paired with COCO captions to emulate higher visual cortex encoding. For cross-subject generalization, we pretrain on one subject and then fine-tune only 17 percent of parameters (fully connected layers) for new subjects, while freezing other components. This enables state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090), and it outperforms existing methods.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a dual-adapter framework that enables efficient reconstruction of complex visual scenes from fMRI data.
It uses AutoKL and CLIP Adapters to separately capture low-level spatial details and high-level semantic content, ensuring robust cross-subject generalization.
The framework reduces training complexity by fine-tuning only 17% of parameters, outperforming previous models in spatial and semantic accuracy.

NeuroSwift: A Lightweight Cross-Subject Framework for fMRI Visual Reconstruction of Complex Scenes

Introduction

NeuroSwift introduces a hierarchical, dual-adapter framework for reconstructing complex visual scenes from fMRI data, addressing the persistent challenges of cross-subject generalization and computational efficiency in brain-to-image synthesis. The model leverages biologically inspired mechanisms, integrating a structural pathway (AutoKL Adapter) for low-level spatial features and a semantic pathway (CLIP Adapter) for high-level conceptual information. The CLIP Adapter is trained using synthetic semantic images generated by Stable Diffusion and paired COCO captions, emulating the abstract encoding of the human visual cortex. Cross-subject adaptation is achieved by fine-tuning only the fully connected layers (17% of parameters) for new subjects, with all other components frozen, enabling rapid transfer with minimal computational resources.

Figure 1: (a) Subjects imagine semantic content during scanning; COCO Captions and Semantic Images are used to emulate this. (b) Cross-subject adaptation is achieved by fine-tuning only 17% of parameters in one hour on 3×RTX4090 GPUs.

NeuroSwift Architecture

The NeuroSwift framework is composed of two main pipelines: structural generation and semantic reinforcement, followed by a diffusion-based reconstruction process.

Figure 2: NeuroSwift architecture in single-subject mode. fMRI voxels are processed through structural and semantic pipelines, with outputs fused in a guided diffusion process for image reconstruction.

Structural Generation: AutoKL Adapter

The AutoKL Adapter projects fMRI voxels into the latent space of Versatile Diffusion, capturing spatial primitives and color distributions. Training minimizes MSE between predicted and ground-truth latent variables, ensuring alignment with the visual stimulus. The AutoKL Decoder, kept frozen, decodes the final latent vector into the reconstructed image, preserving low-level spatial fidelity.

Semantic Reinforcement: CLIP Adapter

The CLIP Adapter maps fMRI voxels into CLIP image and text embeddings. Semantic images, generated from COCO captions via Stable Diffusion, are used for image embedding alignment, while text embeddings are derived from the captions themselves. Training employs a combination of SoftCLIP contrastive loss and MSE regularization to maximize semantic consistency and mitigate artifacts from noisy fMRI-CLIP mappings.

Diffusion Reconstruction

The reconstruction process initializes from the structural latent variable, with noise added according to a structural strength coefficient. Semantic conditioning is imposed via cross-attention mechanisms in the UNet, integrating both text and image CLIP embeddings at each denoising step. The final latent vector is decoded into pixel space, achieving decoupling of structure and semantics.

Data and ROI Selection

Experiments utilize the NSD dataset, focusing on four subjects with complete sessions. Manually delineated ROI masks are employed instead of standardized templates, accounting for neuroanatomical variability and enhancing spatial precision. This individualized mapping is critical for accurate downstream decoding and reconstruction.

Figure 3: Main cortical regions involved in visual perception, highlighting the hierarchical organization of the visual pathway.

Experimental Results

Image Reconstruction

NeuroSwift demonstrates robust reconstruction of complex scenes, accurately recovering both spatial layout and semantic content. The hierarchical pipeline enables the model to reconstruct fine details such as small objects and cluttered backgrounds, even under cross-subject generalization with only one hour of training data.

Figure 4: NeuroSwift reconstructions from complex visual stimuli, showing high fidelity in both spatial and semantic domains.

Qualitative Comparison

NeuroSwift outperforms MindEye2 and MindTuner in cross-subject adaptation, reconstructing scenes with superior spatial and semantic accuracy. In single-subject mode, it surpasses MindBridge, achieving precise spatial recovery alongside semantic correctness.

Figure 5: NeuroSwift vs. MindEye2 and MindTuner in cross-subject adaptation with 1h training data.

Figure 6: NeuroSwift vs. MindBridge in single-subject mode with 40h training data.

Quantitative Evaluation

NeuroSwift achieves top scores in PixCorr, SSIM, CLIP, and AlexNet metrics, indicating superior performance in both low-level spatial and high-level semantic reconstruction. With 66.3M parameters, it is more efficient than MindEye2 and MindTuner, requiring only 3×RTX4090 GPUs for training.

Ablation Studies

Component ablations reveal the necessity of both structural and semantic pathways. Removing the structural prior degrades spatial fidelity, while omitting semantic guidance collapses semantic metrics. The use of individualized ROI masks further enhances spatial and texture fidelity compared to standardized templates.

Figure 7: Reconstruction examples from Subj01 with various ablations, illustrating the contributions of each model component.

Interpretability

Analysis of adapter weight distributions shows that the AutoKL Adapter predominantly activates early visual regions (V1-V3), consistent with its role in structural generation. The CLIP Adapter exhibits distributed activation across ventral, lateral, and parietal regions, reflecting its multimodal semantic integration.

Figure 8: Spatial distribution of brain-region contributions for AutoKL and CLIP Adapters in Subj01.

Figure 9: Spatial mapping of brain region contributions to AutoKL and CLIP Adapters on Subj05 and Subj07, confirming consistent activation patterns across subjects.

Implications and Future Directions

NeuroSwift's dual-pathway design and efficient cross-subject adaptation represent a significant advance in fMRI-based visual reconstruction. The model's ability to generalize with minimal data and computational resources has practical implications for scalable brain-computer interfaces and neuroimaging research. The use of synthetic semantic images for CLIP alignment is a notable methodological innovation, improving semantic extraction from noisy fMRI signals.

Theoretical implications include the validation of hierarchical, multimodal integration in neural decoding, supporting models of distributed visual processing in the cortex. Future work should address dataset generalizability, extending validation to datasets such as BOLD5000, and explore further optimization of adapter architectures for even greater efficiency and robustness.

Conclusion

NeuroSwift achieves state-of-the-art performance in fMRI visual reconstruction of complex scenes, combining a biologically inspired dual-adapter architecture with efficient cross-subject adaptation. Its design enables high-fidelity reconstruction with minimal computational resources, outperforming existing methods in both spatial and semantic domains. Limitations include dataset specificity, warranting future cross-dataset validation to ensure broader applicability.