BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity (2310.04420v3)
Abstract: Understanding the functional organization of higher visual cortex is a central focus in neuroscience. Past studies have primarily mapped the visual and semantic selectivity of neural populations using hand-selected stimuli, which may bias results towards pre-existing hypotheses of visual cortex functionality. Moving beyond conventional approaches, we introduce a data-driven method that generates natural language descriptions for images predicted to maximally activate individual voxels of interest. Our method -- Semantic Captioning Using Brain Alignments ("BrainSCUBA") -- builds upon the rich embedding space learned by a contrastive vision-language model and utilizes a pre-trained large language model to generate interpretable captions. We validate our method through fine-grained voxel-level captioning across higher-order visual regions. We further perform text-conditioned image synthesis with the captions, and show that our images are semantically coherent and yield high predicted activations. Finally, to demonstrate how our method enables scientific discovery, we perform exploratory investigations on the distribution of "person" representations in the brain, and discover fine-grained semantic selectivity in body-selective areas. Unlike earlier studies that decode text, our method derives voxel-wise captions of semantic selectivity. Our results show that BrainSCUBA is a promising means for understanding functional preferences in the brain, and provides motivation for further hypothesis-driven investigation of visual cortex.
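The abstract describes the pipeline only at a high level: a voxel-wise encoder built on a contrastive vision-language embedding space, whose per-voxel weights are converted into natural language via a pre-trained captioner. The sketch below is a minimal, illustrative rendering of one way such a pipeline could be wired together, assuming a ridge-regularized linear encoder over CLIP image embeddings, a softmax projection of each voxel's weight vector onto a pool of natural-image embeddings, and a black-box CLIP-conditioned captioner (e.g., a ClipCap-style prefix model). All function names, the projection choice, and the hyperparameters are assumptions for exposition, not the paper's exact implementation.

```python
# Illustrative BrainSCUBA-style sketch (assumptions noted above; not the authors' code).
import numpy as np


def fit_linear_encoder(clip_embeds, voxel_responses, alpha=1.0):
    """Ridge regression from CLIP image embeddings (n_images x d)
    to voxel responses (n_images x n_voxels); returns weights (d x n_voxels)."""
    d = clip_embeds.shape[1]
    gram = clip_embeds.T @ clip_embeds + alpha * np.eye(d)
    return np.linalg.solve(gram, clip_embeds.T @ voxel_responses)


def project_to_image_manifold(weight, natural_embeds, temperature=0.05):
    """Softmax-weighted average of real image embeddings most aligned with a
    voxel's weight vector, keeping the result near the natural-image embedding
    manifold (one plausible choice of projection, assumed here)."""
    w = weight / np.linalg.norm(weight)
    e = natural_embeds / np.linalg.norm(natural_embeds, axis=1, keepdims=True)
    probs = np.exp(e @ w / temperature)
    probs /= probs.sum()
    return probs @ natural_embeds


def caption_voxel(weight, natural_embeds, captioner):
    """Turn one voxel's encoder weights into a caption. `captioner` is any
    CLIP-embedding-conditioned text decoder, treated as a black box."""
    return captioner(project_to_image_manifold(weight, natural_embeds))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip_embeds = rng.normal(size=(1000, 512))      # stimulus image embeddings
    voxel_responses = rng.normal(size=(1000, 20))   # fMRI betas per voxel
    natural_embeds = rng.normal(size=(5000, 512))   # large natural-image pool

    W = fit_linear_encoder(clip_embeds, voxel_responses)
    dummy_captioner = lambda z: f"<caption for embedding with norm {np.linalg.norm(z):.2f}>"
    print(caption_voxel(W[:, 0], natural_embeds, dummy_captioner))
```

In this reading, the projection step is what keeps the optimized embedding interpretable: instead of captioning an arbitrary weight vector, the caption is conditioned on a point supported by real natural-image embeddings.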
- Andrew F. Luo
- Margaret M. Henderson
- Michael J. Tarr
- Leila Wehbe