Modality-Agnostic fMRI Decoding of Vision and Language (2403.11771v1)
Abstract: Previous studies have shown that it is possible to map brain activation data of subjects viewing images onto the feature representation space of not only vision models (modality-specific decoding) but also language models (cross-modal decoding). In this work, we introduce and use a new large-scale fMRI dataset (~8,500 trials per subject) of people watching both images and text descriptions of such images. This novel dataset enables the development of modality-agnostic decoders: a single decoder that can predict which stimulus a subject is seeing, irrespective of the modality (image or text) in which the stimulus is presented. We train and evaluate such decoders to map brain signals onto stimulus representations from a wide range of publicly available vision, language, and multimodal (vision+language) models. Our findings reveal that (1) modality-agnostic decoders perform as well as (and sometimes even better than) modality-specific decoders; (2) modality-agnostic decoders mapping brain data onto representations from unimodal models perform as well as decoders relying on multimodal representations; and (3) while language and low-level visual (occipital) brain regions are best at decoding text and image stimuli, respectively, high-level visual (temporal) regions perform well on both stimulus types.
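To make the decoding setup concrete, here is a minimal sketch of a modality-agnostic decoder with retrieval-style evaluation. It is not the authors' exact pipeline: the ridge regression, the CLIP-sized embedding dimension, the train/test split, the regularization strength, and all variable names are illustrative assumptions; only the general idea (one linear decoder trained on pooled image and text trials, scored by nearest-neighbor retrieval of the true stimulus embedding) follows the abstract.

```python
# Illustrative sketch only: shapes, alpha, and the retrieval metric are
# assumptions, not the paper's reported configuration.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
fmri = rng.standard_normal((8500, 5000))       # placeholder brain data: (trials, voxels),
                                               # image and text trials pooled together
embeddings = rng.standard_normal((8500, 512))  # placeholder stimulus features from a
                                               # pretrained model (e.g. CLIP)

X_train, X_test, Y_train, Y_test = train_test_split(
    fmri, embeddings, test_size=0.1, random_state=0
)

# One decoder for all trials, regardless of stimulus modality.
decoder = Ridge(alpha=1e4).fit(X_train, Y_train)
Y_pred = decoder.predict(X_test)

# Retrieval evaluation: a test trial counts as correct if its predicted
# embedding is closer (cosine similarity) to the true stimulus embedding
# than to the embeddings of every other test stimulus.
def normalize(a):
    return a / np.linalg.norm(a, axis=1, keepdims=True)

sims = normalize(Y_pred) @ normalize(Y_test).T   # (n_test, n_test) similarity matrix
ranks = (sims >= np.diag(sims)[:, None]).sum(1)  # rank of the true stimulus per trial
print(f"top-1 retrieval accuracy: {(ranks == 1).mean():.3f}")
```

Because the decoder targets a shared embedding space rather than raw pixels or tokens, the same weights can be probed on image-only or text-only test trials to compare modality-specific and modality-agnostic performance.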
Authors: Mitja Nikolaus, Milad Mozafari, Nicholas Asher, Leila Reddy, Rufin VanRullen