MindTuner: Cross-Subject Visual Decoding with Visual Fingerprint and Semantic Correction (2404.12630v2)

Published 19 Apr 2024 in cs.CV and cs.MM

Abstract: Decoding natural visual scenes from brain activity has flourished, with extensive research on single-subject tasks but far less on cross-subject tasks. Reconstructing high-quality images in cross-subject settings is challenging due to profound individual differences between subjects and the scarcity of data annotation. In this work, we propose MindTuner for cross-subject visual decoding, which achieves high-quality, semantically rich reconstructions using only 1 hour of fMRI training data, benefiting from the phenomenon of visual fingerprints in the human visual system and a novel fMRI-to-text alignment paradigm. First, we pre-train a multi-subject model on 7 subjects and fine-tune it with scarce data from new subjects, using LoRAs with Skip-LoRAs to learn the visual fingerprint. We then take the image modality as an intermediate pivot to achieve fMRI-to-text alignment, which yields impressive fMRI-to-text retrieval performance and corrects fMRI-to-image reconstruction with fine-tuned semantics. Both qualitative and quantitative analyses demonstrate that MindTuner surpasses state-of-the-art cross-subject visual decoding models on the Natural Scenes Dataset (NSD), whether using 1 hour or 40 hours of training data.

References (42)
  1. A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature neuroscience 25, 1 (2022), 116–126.
  2. Local optimal transport for functional brain template estimation. In Information Processing in Medical Imaging: 26th International Conference, IPMI 2019, Hong Kong, China, June 2–7, 2019, Proceedings 26. Springer, 237–248.
  3. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems 33 (2020), 9912–9924.
  4. Shared memories reveal shared structure in neural activity across individuals. Nature neuroscience 20, 1 (2017), 115–125.
  5. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22710–22720.
  6. Decoding visual neural representations by multimodal learning of brain-visual-linguistic features. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
  7. Through their eyes: multi-subject Brain Decoding with simple alignment techniques. arXiv preprint arXiv:2309.00627 (2023).
  8. Functional connectivity in the brain—is it an elusive concept? Neuroscience & Biobehavioral Reviews 28, 8 (2005), 827–836.
  9. Pycortex: an interactive surface visualizer for fMRI. Frontiers in neuroinformatics 9 (2015), 23.
  10. Tomoyasu Horikawa and Yukiyasu Kamitani. 2017. Generic decoding of seen and imagined objects using hierarchical visual features. Nature communications 8, 1 (2017), 15037.
  11. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
  12. Learning shared neural manifolds from multi-subject fMRI data. In 2022 IEEE 32nd International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 1–6.
  13. MixCo: Mix-up contrastive learning for visual representation. arXiv preprint arXiv:2010.06300 (2020).
  14. Brain-optimized inference improves reconstructions of fMRI brain activity. arXiv preprint arXiv:2312.07705 (2023).
  15. Mind reader: Reconstructing complex images from brain activities. Advances in Neural Information Processing Systems 35 (2022), 29624–29636.
  16. David Linden. 2021. Section 3 - Introduction. In fMRI Neurofeedback, Michelle Hampson (Ed.). Academic Press, 161–169. https://doi.org/10.1016/B978-0-12-822421-2.00008-9
  17. Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
  18. MindDiffuser: Controlled Image Reconstruction from Human Brain Activity with Semantic and Structural Diffusion. In Proceedings of the 31st ACM International Conference on Multimedia. 5899–5908.
  19. Weijian Mai and Zhijun Zhang. 2023. UniBrain: Unify image reconstruction and captioning all in one diffusion model from human brain activity. arXiv preprint arXiv:2308.07428 (2023).
  20. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 4296–4304.
  21. BMI-Net: A Brain-inspired Multimodal Interaction Network for Image Aesthetic Assessment. In Proceedings of the 31st ACM International Conference on Multimedia (Ottawa, ON, Canada) (MM ’23). Association for Computing Machinery, New York, NY, USA, 5514–5522. https://doi.org/10.1145/3581783.3611996
  22. Furkan Ozcelik and Rufin VanRullen. 2023. Brain-Diffuser: Natural scene reconstruction from fMRI signals using generative latent diffusion. arXiv preprint arXiv:2303.05334 (2023).
  23. fMRI-PTE: A large-scale fMRI pretrained transformer encoder for multi-subject brain activity decoding. arXiv preprint arXiv:2311.00342 (2023).
  24. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
  25. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022).
  26. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.
  27. Reconstructing the Mind’s Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors. arXiv preprint arXiv:2305.18274 (2023).
  28. MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data. arXiv preprint arXiv:2403.11207 (2024).
  29. Deep image reconstruction from human brain activity. PLoS computational biology 15, 1 (2019), e1006633.
  30. Leslie N Smith and Nicholay Topin. 2019. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial intelligence and machine learning for multi-domain operations applications, Vol. 11006. SPIE, 369–386.
  31. Brain-optimized neural networks learn non-hierarchical models of representation in human visual cortex. bioRxiv (2022), 2022–01.
  32. High-dimensional geometry of population responses in visual cortex. Nature 571, 7765 (2019), 361–365.
  33. Yu Takagi and Shinji Nishimoto. 2023. High-resolution image reconstruction with latent diffusion models from human brain activity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14453–14463.
  34. Mingxing Tan and Quoc V. Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946 (2019).
  35. Aligning brain functions boosts the decoding of visual semantics in novel subjects. arXiv preprint arXiv:2312.06467 (2023).
  36. Brain state decoding for rapid image retrieval. In Proceedings of the 17th ACM International Conference on Multimedia (Beijing, China) (MM ’09). Association for Computing Machinery, New York, NY, USA, 945–954. https://doi.org/10.1145/1631272.1631463
  37. GIT: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100 (2022).
  38. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13, 4 (2004), 600–612.
  39. Idiosyncratic perception: a link between acuity, perceived position and apparent size. Proceedings of the Royal Society B 287, 1930 (2020), 20200825.
  40. Dream: Visual decoding from reversing human visual system. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 8226–8235.
  41. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3836–3847.
  42. Decoding Auditory Saliency from FMRI Brain Imaging. In Proceedings of the 22nd ACM International Conference on Multimedia (Orlando, Florida, USA) (MM ’14). Association for Computing Machinery, New York, NY, USA, 873–876. https://doi.org/10.1145/2647868.2655039
Authors (7)
  1. Zixuan Gong
  2. Qi Zhang
  3. Guangyin Bao
  4. Lei Zhu
  5. Ke Liu
  6. Liang Hu
  7. Duoqian Miao
Citations (4)

Summary

An Analysis of "MindTuner: Cross-Subject Visual Decoding with Visual Fingerprint and Semantic Correction"

Recent work on cross-subject visual decoding via functional MRI (fMRI) has made significant strides toward overcoming the challenges of individual neural variability and limited data annotation. The paper "MindTuner: Cross-Subject Visual Decoding with Visual Fingerprint and Semantic Correction" introduces a novel approach that leverages the concept of visual fingerprints and an fMRI-to-text alignment strategy to substantially enhance visual reconstruction performance.

Summary of Contributions

The authors propose MindTuner, a method designed to optimize cross-subject visual decoding through a blend of visual fingerprint learning and semantic correction. The method proceeds in two phases: multi-subject pre-training followed by new-subject fine-tuning. The pre-training phase exploits characteristics shared across subjects to establish a robust base model, while the fine-tuning phase integrates subject-specific visual fingerprints to adapt the model to individual differences, as sketched below.
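
A minimal PyTorch-style sketch of this two-phase recipe (module names, dimensions, and the freezing strategy are illustrative assumptions, not the authors' code): pre-train a shared decoding backbone on the pooled subjects, then freeze it and train only small subject-specific adapters on the new subject's scarce data.

```python
import torch.nn as nn

# Hypothetical shared backbone mapping fMRI features to a CLIP-like embedding
# space; the input/output dimensions here are illustrative, not the paper's.
class SharedBackbone(nn.Module):
    def __init__(self, in_dim=4096, emb_dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 2048), nn.GELU(),
            nn.Linear(2048, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

backbone = SharedBackbone()

# Phase 1: multi-subject pre-training -- train the full backbone on pooled
# data from the 7 pre-training subjects (training loop omitted).

# Phase 2: new-subject fine-tuning with ~1 hour of data -- freeze the shared
# weights and train only lightweight subject-specific adapters (e.g. the
# LoRA-style modules sketched in the next listing).
for p in backbone.parameters():
    p.requires_grad = False
```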

A distinctive feature of the method is its use of Low-Rank Adaptation (LoRA), augmented with Skip-LoRA structures, to capture non-linear structure in fMRI data. This is particularly noteworthy given how easily non-linear models overfit under the low signal-to-noise ratio inherent in fMRI data. Additionally, a 'Pivot' mechanism bridges the fMRI and text domains by using images as an intermediate modality, refining the semantic content of the reconstructed imagery.
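
The paper's exact Skip-LoRA formulation is not reproduced here; the sketch below is one plausible reading under stated assumptions: a standard LoRA adapter wraps each frozen linear layer, and a second low-rank "skip" path carries the block's input directly to its output, giving the adapter a route around the frozen non-linearity. All class names and hyperparameters (rank, alpha) are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (standard LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pre-trained weights stay frozen
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)        # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

class SkipLoRABlock(nn.Module):
    """One plausible 'Skip-LoRA': besides the per-layer LoRA update, an extra
    low-rank path skips the frozen non-linearity entirely (an illustrative
    assumption, not the paper's implementation)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.lora = LoRALinear(base, rank)
        self.skip_A = nn.Linear(base.in_features, rank, bias=False)
        self.skip_B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.skip_B.weight)

    def forward(self, x):
        return F.gelu(self.lora(x)) + self.skip_B(self.skip_A(x))
```

Only the low-rank matrices train during new-subject fine-tuning, which keeps the per-subject parameter count small, consistent with the paper's emphasis on adaptation from roughly one hour of data.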

Key Findings and Results

MindTuner performs robustly against state-of-the-art methods across comprehensive qualitative and quantitative evaluations. Notably, it achieves substantial improvements in high-level image fidelity metrics and retrieval accuracy, particularly in data-scarce settings, demonstrating its potential for practical application where fMRI samples are limited.

Quantitatively, MindTuner's advantage shows up as improvements in retrieval accuracy and in various image quality metrics over benchmark models such as MindEye2. The semantic correction step further improves the semantic fidelity of the reconstructed images, mitigating the misalignment issues seen in prior methods.
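
To make the pivot idea concrete, here is a hedged sketch (function names and the loss form are assumptions; the paper's actual objective may differ): the fMRI embedding is contrastively aligned to CLIP image embeddings, and because CLIP already aligns images with text, the same fMRI embedding can then be ranked against caption embeddings for fMRI-to-text retrieval or used to correct reconstruction semantics.

```python
import torch
import torch.nn.functional as F

def pivot_alignment_loss(fmri_emb, clip_img_emb, temperature=0.07):
    """InfoNCE-style loss pulling fMRI embeddings toward the CLIP image
    embeddings of the seen pictures. Images act as the pivot: once fMRI
    lives in CLIP space, it can be compared against text embeddings too."""
    f = F.normalize(fmri_emb, dim=-1)
    i = F.normalize(clip_img_emb, dim=-1)
    logits = f @ i.t() / temperature
    targets = torch.arange(len(f), device=f.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

@torch.no_grad()
def fmri_to_text_retrieval(fmri_emb, clip_text_embs):
    """Rank candidate caption embeddings by cosine similarity in the shared
    space; the top-ranked caption can then steer semantic correction."""
    f = F.normalize(fmri_emb, dim=-1)
    t = F.normalize(clip_text_embs, dim=-1)
    return (f @ t.t()).argsort(dim=-1, descending=True)
```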

Implications and Future Directions

The implications of MindTuner extend beyond enhanced visual decoding; they suggest a pathway towards developing universally applicable brain-computer interface models that capitalize on shared neural patterns across subjects. This research can catalyze advancements in neural decoding frameworks, enabling efficient application in real-world settings with constrained data resources.

Looking forward, the paper's treatment of the degree of non-linearity involved in visual fingerprint acquisition opens avenues for future exploration. An intriguing challenge lies in balancing non-linear modeling capacity against overfitting risk, particularly given the diverse neural organization across subjects.

Conclusion

The proposed MindTuner methodology sets a promising precedent in cross-subject visual decoding by successfully integrating concepts from neuroscience, machine learning, and computational linguistics. Its ability to adapt lightweight fine-tuning mechanisms for subject-specific traits, while maintaining high-quality reconstructions with minimal data, marks a significant contribution to the field. Further research building on these findings may ultimately pave the way for adaptive, generalized brain-computer interface models that efficiently harness the underlying commonalities within human neural responses.
