BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity (2310.04420v3)
Abstract: Understanding the functional organization of higher visual cortex is a central focus in neuroscience. Past studies have primarily mapped the visual and semantic selectivity of neural populations using hand-selected stimuli, which may bias results towards pre-existing hypotheses of visual cortex functionality. Moving beyond conventional approaches, we introduce a data-driven method that generates natural language descriptions for images predicted to maximally activate individual voxels of interest. Our method -- Semantic Captioning Using Brain Alignments ("BrainSCUBA") -- builds upon the rich embedding space learned by a contrastive vision-language model and utilizes a pre-trained large language model to generate interpretable captions. We validate our method through fine-grained voxel-level captioning across higher-order visual regions. We further perform text-conditioned image synthesis with the captions, and show that our images are semantically coherent and yield high predicted activations. Finally, to demonstrate how our method enables scientific discovery, we perform exploratory investigations on the distribution of "person" representations in the brain, and discover fine-grained semantic selectivity in body-selective areas. Unlike earlier studies that decode text, our method derives voxel-wise captions of semantic selectivity. Our results show that BrainSCUBA is a promising means for understanding functional preferences in the brain, and provides motivation for further hypothesis-driven investigation of visual cortex.
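The abstract describes the pipeline only at a high level: a voxel-wise encoder built on a contrastive vision-language embedding space, whose per-voxel weights are converted into natural language via a pre-trained captioner. The sketch below is a minimal, illustrative rendering of one way such a pipeline could be wired together, assuming a ridge-regularized linear encoder over CLIP image embeddings, a softmax projection of each voxel's weight vector onto a pool of natural-image embeddings, and a black-box CLIP-conditioned captioner (e.g., a ClipCap-style prefix model). All function names, the projection choice, and the hyperparameters are assumptions for exposition, not the paper's exact implementation.

```python
# Illustrative BrainSCUBA-style sketch (assumptions noted above; not the authors' code).
import numpy as np


def fit_linear_encoder(clip_embeds, voxel_responses, alpha=1.0):
    """Ridge regression from CLIP image embeddings (n_images x d)
    to voxel responses (n_images x n_voxels); returns weights (d x n_voxels)."""
    d = clip_embeds.shape[1]
    gram = clip_embeds.T @ clip_embeds + alpha * np.eye(d)
    return np.linalg.solve(gram, clip_embeds.T @ voxel_responses)


def project_to_image_manifold(weight, natural_embeds, temperature=0.05):
    """Softmax-weighted average of real image embeddings most aligned with a
    voxel's weight vector, keeping the result near the natural-image embedding
    manifold (one plausible choice of projection, assumed here)."""
    w = weight / np.linalg.norm(weight)
    e = natural_embeds / np.linalg.norm(natural_embeds, axis=1, keepdims=True)
    probs = np.exp(e @ w / temperature)
    probs /= probs.sum()
    return probs @ natural_embeds


def caption_voxel(weight, natural_embeds, captioner):
    """Turn one voxel's encoder weights into a caption. `captioner` is any
    CLIP-embedding-conditioned text decoder, treated as a black box."""
    return captioner(project_to_image_manifold(weight, natural_embeds))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip_embeds = rng.normal(size=(1000, 512))      # stimulus image embeddings
    voxel_responses = rng.normal(size=(1000, 20))   # fMRI betas per voxel
    natural_embeds = rng.normal(size=(5000, 512))   # large natural-image pool

    W = fit_linear_encoder(clip_embeds, voxel_responses)
    dummy_captioner = lambda z: f"<caption for embedding with norm {np.linalg.norm(z):.2f}>"
    print(caption_voxel(W[:, 0], natural_embeds, dummy_captioner))
```

In this reading, the projection step is what keeps the optimized embedding interpretable: instead of captioning an arbitrary weight vector, the caption is conditioned on a point supported by real natural-image embeddings.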
- Andrew F. Luo
- Margaret M. Henderson
- Michael J. Tarr
- Leila Wehbe