
BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity (2310.04420v3)

Published 6 Oct 2023 in cs.LG and q-bio.NC

Abstract: Understanding the functional organization of higher visual cortex is a central focus in neuroscience. Past studies have primarily mapped the visual and semantic selectivity of neural populations using hand-selected stimuli, which may potentially bias results towards pre-existing hypotheses of visual cortex functionality. Moving beyond conventional approaches, we introduce a data-driven method that generates natural language descriptions for images predicted to maximally activate individual voxels of interest. Our method -- Semantic Captioning Using Brain Alignments ("BrainSCUBA") -- builds upon the rich embedding space learned by a contrastive vision-language model and utilizes a pre-trained large language model to generate interpretable captions. We validate our method through fine-grained voxel-level captioning across higher-order visual regions. We further perform text-conditioned image synthesis with the captions, and show that our images are semantically coherent and yield high predicted activations. Finally, to demonstrate how our method enables scientific discovery, we perform exploratory investigations on the distribution of "person" representations in the brain, and discover fine-grained semantic selectivity in body-selective areas. Unlike earlier studies that decode text, our method derives voxel-wise captions of semantic selectivity. Our results show that BrainSCUBA is a promising means for understanding functional preferences in the brain, and provides motivation for further hypothesis-driven investigation of visual cortex.

Analysis of BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity

The paper "BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity" presents an innovative approach to understanding the semantic selectivity of the human visual cortex. By leveraging recent advancements in vision-LLMs and large-scale neural datasets, the research aims to provide interpretable natural language descriptions of neural selectivity on a voxel level, thus enhancing the exploration of higher visual cortex functionality.

Key Contributions

BrainSCUBA introduces a novel methodology for generating voxel-wise captions that characterize the visual stimuli likely to maximally activate specific brain regions. The method pairs a contrastive vision-language pre-trained model, CLIP, with a linear projector that bridges the modality gap between neural activations and natural images. The resulting voxel-wise captions are rich, interpretable, and fine-grained, positioning BrainSCUBA as a valuable tool for neuroscientific research.
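
The encoder at the heart of this pipeline is simple to picture: a voxel-wise linear probe fit on top of CLIP image embeddings. Below is a minimal sketch (not the authors' released code); the synthetic arrays, shapes, and the choice of ridge regression are illustrative assumptions.

```python
# Minimal sketch of the image-to-brain encoder: a voxel-wise linear probe on
# top of CLIP image embeddings. The synthetic data, array shapes, and the use
# of ridge regression are illustrative assumptions, not the authors' exact setup.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Stand-ins for precomputed quantities:
#   clip_embeddings: (n_images, d) CLIP image features for the stimulus set
#   voxel_responses: (n_images, n_voxels) fMRI beta estimates per image
clip_embeddings = rng.standard_normal((1000, 512)).astype(np.float32)
clip_embeddings /= np.linalg.norm(clip_embeddings, axis=1, keepdims=True)
voxel_responses = rng.standard_normal((1000, 2000)).astype(np.float32)

encoder = Ridge(alpha=1.0)                 # one linear weight vector per voxel
encoder.fit(clip_embeddings, voxel_responses)

# Each row of coef_ is a voxel's preferred direction in CLIP space; these are
# the weight vectors that the later projection and captioning steps operate on.
voxel_weights = encoder.coef_              # shape: (n_voxels, d)
print(voxel_weights.shape)
```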

Methodology

The BrainSCUBA framework comprises three main components:

  1. Image-to-Brain Encoder Construction: A frozen CLIP backbone extracts semantic embeddings of images, and a linear probe trained on these embeddings predicts voxel-wise brain activations.
  2. Interpretable Captioning: Instead of mapping brain activations directly to images, BrainSCUBA generates semantic captions by projecting each voxel's encoder weights into the space of CLIP embeddings. The projection decouples direction from magnitude so that the result lies close to the distribution of natural image embeddings (see the projection sketch after this list).
  3. Text-Guided Image Synthesis: The generated captions are fed to a text-conditioned diffusion model to synthesize novel images. This step both validates the quality of the captions and yields visual stimuli for further neuroscientific experimentation.
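
The projection in step 2 can be pictured as pulling a voxel's weight vector onto the manifold of natural-image CLIP embeddings before handing it to a text decoder. The sketch below assumes a softmax-weighted average over an image-embedding bank; the temperature, normalization, and the omitted captioning decoder are assumptions rather than the paper's exact settings.

```python
# Sketch of the decoupled projection: move a voxel's encoder weight vector close
# to the distribution of natural-image CLIP embeddings so a CLIP-space text
# decoder can caption it. Temperature and normalization choices are assumptions.
import numpy as np

def project_voxel_weight(w, image_bank, temperature=0.01):
    """Project a voxel weight vector onto a bank of natural-image embeddings.

    w          : (d,) weight vector for one voxel from the linear encoder
    image_bank : (n, d) unit-normalized CLIP embeddings of natural images
    returns    : (d,) unit-normalized embedding near the natural-image manifold
    """
    direction = w / np.linalg.norm(w)          # direction handled separately from magnitude
    sims = image_bank @ direction              # cosine similarity to every bank image
    weights = np.exp(sims / temperature)
    weights /= weights.sum()                   # softmax over the image bank
    projected = weights @ image_bank           # convex combination of real image embeddings
    return projected / np.linalg.norm(projected)

# Toy usage with random stand-ins for the image bank and one voxel's weights.
rng = np.random.default_rng(0)
bank = rng.standard_normal((5000, 512)).astype(np.float32)
bank /= np.linalg.norm(bank, axis=1, keepdims=True)
embedding = project_voxel_weight(rng.standard_normal(512), bank)

# The projected embedding would then be decoded into a caption with a
# CLIP-conditioned text decoder (e.g. a CLIPCap/DeCap-style model); that call
# is omitted here because its interface depends on the specific checkpoint used.
```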

Results and Implications

The researchers evaluated BrainSCUBA on the Natural Scenes Dataset and demonstrated its ability to produce reliable, category-specific captions across functional regions of the brain. Notably, the method discerned fine-grained semantic selectivity in face- and body-selective regions, corroborating established neuroscientific findings. In some cases, BrainSCUBA uncovered previously unreported patterns, such as variation within the extrastriate body area, suggesting its potential to inform new hypotheses.
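
As a rough picture of the caption-to-image validation loop, the sketch below feeds a caption to an off-the-shelf diffusion pipeline and notes how the encoder would score the result. The specific Stable Diffusion checkpoint and the commented-out helper names are assumptions, not the paper's exact pipeline.

```python
# Illustrative caption -> image -> predicted-activation loop. The checkpoint and
# the commented-out helpers are assumptions; the paper's pipeline may differ.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

caption = "a group of people standing together and talking"  # an example voxel-wise caption
image = pipe(caption).images[0]
image.save("voxel_caption_sample.png")

# To validate, the generated image would be embedded with the same CLIP backbone
# and passed through the trained voxel-wise encoder; a high predicted response
# for the target voxel indicates the caption captured its selectivity.
# predicted = encoder.predict(clip_embed(image)[None, :])   # hypothetical helpers
```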

The implications of this research are noteworthy. BrainSCUBA paves the way for a deeper, more comprehensive understanding of the visual cortex by providing a tool that outputs human-readable explanations for neural activations. This capability could empower researchers to pursue hypothesis-driven inquiries more effectively and guide the development of new experiments targeting unexplored cortical regions.

Future Directions

While BrainSCUBA offers substantial insight into cortical selectivity, there remains room for extension and refinement. Future work could focus on mitigating biases inherited from pre-trained language models, ensuring that the generated captions are comprehensive and free of stereotyped associations. Integrating more powerful language models could also enhance the depth and diversity of the captions, enabling broader neuroscientific investigations.

In conclusion, BrainSCUBA represents a significant step forward in interpreting neural selectivity by transforming complex brain activation patterns into human-readable semantic descriptions. This progress opens new avenues for exploring the neural substrates of vision and could have a lasting impact on cognitive neuroscience and artificial intelligence.

Authors (4)
  1. Andrew F. Luo
  2. Margaret M. Henderson
  3. Michael J. Tarr
  4. Leila Wehbe