
Modality-Agnostic fMRI Decoding of Vision and Language (2403.11771v1)

Published 18 Mar 2024 in cs.CV and cs.CL

Abstract: Previous studies have shown that it is possible to map brain activation data of subjects viewing images onto the feature representation space of not only vision models (modality-specific decoding) but also LLMs (cross-modal decoding). In this work, we introduce and use a new large-scale fMRI dataset (~8,500 trials per subject) of people watching both images and text descriptions of such images. This novel dataset enables the development of modality-agnostic decoders: a single decoder that can predict which stimulus a subject is seeing, irrespective of the modality (image or text) in which the stimulus is presented. We train and evaluate such decoders to map brain signals onto stimulus representations from a large range of publicly available vision, language, and multimodal (vision+language) models. Our findings reveal that (1) modality-agnostic decoders perform as well as (and sometimes even better than) modality-specific decoders; (2) modality-agnostic decoders mapping brain data onto representations from unimodal models perform as well as decoders relying on multimodal representations; and (3) while language and low-level visual (occipital) brain regions are best at decoding text and image stimuli, respectively, high-level visual (temporal) regions perform well on both stimulus types.
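
To make the setup concrete, here is a minimal sketch of such a modality-agnostic decoder. Everything in it is an illustrative assumption rather than the authors' exact pipeline: the random arrays stand in for real fMRI betas and stimulus embeddings, and ridge regression with a retrieval-style evaluation is a common choice for this kind of brain-to-feature mapping.

```python
# A minimal, self-contained sketch of the decoding setup described in the
# abstract. The random arrays below stand in for real fMRI responses and
# stimulus embeddings; ridge regression and the retrieval metric are
# assumed here for illustration, not taken from the paper.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Hypothetical sizes (the real dataset has ~8,500 trials per subject;
# kept small here so the sketch runs instantly).
n_train, n_test, n_voxels, n_dims = 700, 150, 2000, 768

X_train = rng.standard_normal((n_train, n_voxels))  # fMRI responses
Y_train = rng.standard_normal((n_train, n_dims))    # model embeddings
X_test = rng.standard_normal((n_test, n_voxels))
Y_test = rng.standard_normal((n_test, n_dims))

# A single decoder is fit on image AND text trials pooled together, so
# the learned voxel-to-embedding mapping is modality-agnostic by
# construction.
decoder = Ridge(alpha=1e4)  # regularization strength is a guess
decoder.fit(X_train, Y_train)
Y_pred = decoder.predict(X_test)

# Retrieval-style evaluation: a test trial counts as decoded correctly
# when its predicted embedding is closest to the true stimulus embedding
# among all test candidates (top-1 accuracy).
sims = cosine_similarity(Y_pred, Y_test)              # (n_test, n_test)
top1 = (sims.argmax(axis=1) == np.arange(n_test)).mean()
print(f"top-1 retrieval accuracy: {top1:.3f}")
```

Swapping Y for embeddings taken from a vision model, a language model, or a multimodal model is what lets decoding performance be compared across representation spaces, as in findings (1) and (2).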

Authors (5)
  1. Mitja Nikolaus
  2. Milad Mozafari
  3. Nicholas Asher
  4. Leila Reddy
  5. Rufin VanRullen