Lite-Mind: Towards Efficient and Robust Brain Representation Network (2312.03781v4)
Abstract: The limited availability of fMRI data and the low signal-to-noise ratio of fMRI signals make fMRI-to-image retrieval a challenging task. The state-of-the-art MindEye markedly improves fMRI-to-image retrieval performance by leveraging a large model, i.e., a 996M-parameter MLP backbone per subject, to align fMRI embeddings to the final hidden layer of CLIP's Vision Transformer (ViT). However, significant individual variation exists among subjects, even under identical experimental setups, mandating the training of large subject-specific models. The resulting parameter counts pose significant challenges to deploying fMRI decoding on practical devices. To this end, we propose Lite-Mind, a lightweight, efficient, and robust brain representation learning paradigm based on the Discrete Fourier Transform (DFT), which efficiently aligns fMRI voxels to fine-grained information of CLIP. We carefully design a DFT backbone with Spectrum Compression and Frequency Projector modules to learn informative and robust voxel embeddings. Our experiments demonstrate that Lite-Mind achieves an impressive 94.6% fMRI-to-image retrieval accuracy on the NSD dataset for Subject 1, with 98.7% fewer parameters than MindEye. Lite-Mind also transfers to smaller fMRI datasets and establishes a new state of the art for zero-shot classification on the GOD dataset.
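The abstract's pipeline (frequency-domain transform, spectrum compression, projection to an embedding) can be illustrated with a minimal sketch. This is a hypothetical illustration of the general idea, not the paper's implementation: the function name `dft_project`, the keep ratio, the shapes, and the random linear map (a stand-in for a learned projector) are all assumptions.

```python
# Hypothetical sketch of a DFT-based voxel projection in the spirit of
# Lite-Mind's backbone. All names and shapes here are illustrative
# assumptions, not the authors' actual architecture.
import numpy as np

rng = np.random.default_rng(0)


def dft_project(voxels, keep_ratio=0.25, out_dim=64):
    """Map a 1-D fMRI voxel vector to a compact embedding via the DFT.

    1. The real FFT moves the voxels into the frequency domain.
    2. "Spectrum compression" (illustrative): keep only the
       lowest-frequency bins, discarding high-frequency noise.
    3. "Frequency projector" (illustrative): a linear map, here random
       as a stand-in for a learned layer, projects the real and
       imaginary parts to an out_dim-dimensional embedding.
    """
    spectrum = np.fft.rfft(voxels)                  # complex, length n//2 + 1
    k = max(1, int(len(spectrum) * keep_ratio))     # number of bins to keep
    kept = spectrum[:k]                             # compressed spectrum
    feats = np.concatenate([kept.real, kept.imag])  # real feature vector, 2k
    w = rng.standard_normal((feats.size, out_dim)) / np.sqrt(feats.size)
    return feats @ w                                # (out_dim,) embedding


emb = dft_project(rng.standard_normal(1024))
print(emb.shape)  # (64,)
```

In a trained model the projection would be a learned module optimized (e.g., contrastively) to match CLIP image embeddings; the sketch only shows why operating on a truncated spectrum shrinks the input that any subsequent layers must process.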
- A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature neuroscience, 25(1):116–126, 2022.
- Romain Beaumont. Clip retrieval: Easily compute clip embeddings and build a clip retrieval system with them, 2022.
- Spectral temporal graph neural network for multivariate time-series forecasting. Advances in neural information processing systems, 33:17766–17778, 2020.
- Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33:9912–9924, 2020.
- Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22710–22720, 2023.
- Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
- Decoding visual neural representations by multimodal learning of brain-visual-linguistic features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- Deep residual learning in the JPEG transform domain. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3484–3493, 2019.
- Decoding word embeddings with brain-based semantic features. Computational Linguistics, 47(3):663–698, 2021.
- Decoding natural image stimuli from fMRI data with a surface-based convolutional network. arXiv preprint arXiv:2212.02409, 2022.
- Faster neural networks straight from JPEG. Advances in Neural Information Processing Systems, 31, 2018.
- Adaptive fourier neural operators: Efficient token mixers for transformers. arXiv preprint arXiv:2111.13587, 2021.
- Generic decoding of seen and imagined objects using hierarchical visual features. Nature communications, 8(1):15037, 2017.
- Decoding the visual and subjective contents of the human brain. Nature neuroscience, 8(5):679–685, 2005.
- Evidence of human-like visual-linguistic integration in multimodal large language models during predictive language processing. arXiv preprint arXiv:2308.06035, 2023.
- Fractional Fourier transform in time series prediction. IEEE Signal Processing Letters, 29:2542–2546, 2022.
- From Fourier to Koopman: Spectral methods for long-term time series prediction. The Journal of Machine Learning Research, 22(1):1881–1918, 2021.
- Fnet: Mixing tokens with Fourier transforms. arXiv preprint arXiv:2105.03824, 2021.
- Mind reader: Reconstructing complex images from brain activities. Advances in Neural Information Processing Systems, 35:29624–29636, 2022.
- David Linden. Section 3 - introduction. In fMRI Neurofeedback, pages 161–169. Academic Press, 2021.
- Brainclip: Bridging brain and visual-linguistic representation via CLIP for generic natural visual stimulus decoding from fMRI. arXiv preprint arXiv:2302.12971, 2023.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Unibrain: Unify image reconstruction and captioning all in one diffusion model from human brain activity. arXiv preprint arXiv:2308.07428, 2023.
- Encoding and decoding in fMRI. Neuroimage, 56(2):400–410, 2011.
- Brain-diffuser: Natural scene reconstruction from fMRI signals using generative latent diffusion. arXiv preprint arXiv:2303.05334, 2023.
- Decoding brain representations by multimodal learning of neural activity and visual features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3833–3849, 2020.
- Toward a universal decoder of linguistic meaning from brain activation. Nature communications, 9(1):963, 2018.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
- Global filter networks for image classification. Advances in neural information processing systems, 34:980–993, 2021.
- Generalized zero-and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8247–8255, 2019.
- Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
- Reconstructing the mind’s eye: fMRI-to-image with contrastive learning and diffusion priors. arXiv preprint arXiv:2305.18274, 2023.
- Deep image reconstruction from human brain activity. PLoS computational biology, 15(1):e1006633, 2019.
- Variational mixture-of-experts autoencoders for multi-modal deep generative models. Advances in neural information processing systems, 32, 2019.
- High-resolution image reconstruction with latent diffusion models from human brain activity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14453–14463, 2023.
- Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2020.
- Frequency-domain MLPs are more effective learners in time series forecasting. Advances in Neural Information Processing Systems, 36, 2023.
- Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
- Multimodal generative models for scalable weakly-supervised learning. Advances in neural information processing systems, 31, 2018.
- Learning in the frequency domain. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1740–1749, 2020.
- Unsupervised time-series representation learning with iterative bilinear temporal-spectral fusion. In International Conference on Machine Learning, pages 25038–25054. PMLR, 2022.
- Fda: Fourier domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4085–4095, 2020.
- Cross-modal cloze task: A new task to brain-to-word decoding. In Findings of the Association for Computational Linguistics: ACL 2022, pages 648–657, 2022.
- Zixuan Gong
- Qi Zhang
- Duoqian Miao
- Guangyin Bao
- Liang Hu
- Lei Zhu
- Yu Zhang
- Ke Liu