Visual representations in the human brain are aligned with large language models (2209.11737v2)

Published 23 Sep 2022 in cs.CV, cs.LG, and q-bio.NC

Abstract: The human brain extracts complex information from visual inputs, including objects, their spatial and semantic interrelations, and their interactions with the environment. However, a quantitative approach for studying this information remains elusive. Here, we test whether the contextual information encoded in LLMs is beneficial for modelling the complex visual information extracted by the brain from natural scenes. We show that LLM embeddings of scene captions successfully characterise brain activity evoked by viewing the natural scenes. This mapping captures selectivities of different brain areas, and is sufficiently robust that accurate scene captions can be reconstructed from brain activity. Using carefully controlled model comparisons, we then proceed to show that the accuracy with which LLM representations match brain representations derives from the ability of LLMs to integrate complex information contained in scene captions beyond that conveyed by individual words. Finally, we train deep neural network models to transform image inputs into LLM representations. Remarkably, these networks learn representations that are better aligned with brain representations than a large number of state-of-the-art alternative models, despite being trained on orders-of-magnitude less data. Overall, our results suggest that LLM embeddings of scene captions provide a representational format that accounts for complex information extracted by the brain from visual inputs.

The Role of Semantic Scene Descriptions in Human Visual Processing

The paper "Semantic scene descriptions as an objective of human vision" by Doerig et al. offers a significant contribution to the field of visual neuroscience by proposing semantic scene descriptions as a core objective of human visual processing rather than mere object categorization. The authors employ a vast array of techniques, including high-resolution 7T fMRI data analysis and advanced neural network models, to substantiate their claims.

Main Findings

The research uses a large-scale fMRI dataset (the Natural Scenes Dataset) to explore how the brain represents complex natural scenes. A key finding is a distributed brain network, extending into regions not traditionally associated with high-level semantic processing, whose activity correlates better with semantic embeddings of scene captions than with object category labels. This correlation persists even when participants are not performing a semantic task, suggesting a default mode of human vision that processes scenes semantically.
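To make the encoding-model logic concrete, here is a minimal sketch, not the authors' exact pipeline: captions are embedded with an off-the-shelf sentence encoder, and a cross-validated ridge regression maps embeddings to voxel responses. The choice of embedding model, the input variables (`captions`, `voxels`), and the train/test split are illustrative assumptions.

```python
# Minimal sketch of a caption-embedding encoding model (illustrative only).
# Assumes `captions` is a list of scene captions and `voxels` is an
# (n_images, n_voxels) array of fMRI responses, e.g. from NSD.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer  # assumed encoder choice

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
X = encoder.encode(captions)                       # (n_images, embed_dim)

X_tr, X_te, y_tr, y_te = train_test_split(X, voxels, test_size=0.2, random_state=0)

# One ridge regression per voxel, regularization strength chosen by CV.
model = RidgeCV(alphas=np.logspace(-2, 4, 13))
model.fit(X_tr, y_tr)

# Score each voxel by correlating predicted and held-out responses.
pred = model.predict(X_te)
r = [np.corrcoef(pred[:, v], y_te[:, v])[0, 1] for v in range(y_te.shape[1])]
print(f"median held-out voxel correlation: {np.median(r):.3f}")
```

In a setup like this, the per-voxel held-out correlations play the role of the brain-alignment scores that analyses of this kind report.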

Furthermore, the authors successfully decode semantic scene descriptions from brain activity and show that a recurrent convolutional neural network (RCNN) trained to map images onto these semantic embeddings predicts neural responses better than a large set of state-of-the-art alternative models, despite being trained on orders-of-magnitude less data. These results collectively indicate that human visual processing may primarily aim to construct comprehensive semantic scene descriptions.
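A much-simplified version of that image-to-embedding training objective is sketched below. For brevity, a pretrained feedforward ResNet stands in for the paper's recurrent convolutional network, and the cosine loss and embedding dimensionality are assumptions; the point is only to illustrate regressing caption embeddings from pixels.

```python
# Sketch of training an image model to output caption embeddings (simplified:
# a feedforward ResNet stands in for the paper's recurrent network).
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

EMBED_DIM = 384  # must match the sentence-embedding dimensionality (assumed)

backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
backbone.fc = nn.Linear(backbone.fc.in_features, EMBED_DIM)  # regression head

optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)
cosine = nn.CosineEmbeddingLoss()

def train_step(images, caption_embeddings):
    """One update: push image features toward the captions' LLM embeddings."""
    backbone.train()
    optimizer.zero_grad()
    pred = backbone(images)              # (batch, EMBED_DIM)
    target = torch.ones(images.size(0))  # +1 target = maximize cosine similarity
    loss = cosine(pred, caption_embeddings, target)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Once trained, the network's internal activations, rather than its outputs, are what one would compare against brain responses in an analysis of this kind.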

Implications and Future Speculations

The implications of this research are manifold. It challenges the traditional view of the visual system's function as predominantly object recognition, expanding that function to include the processing of relational and contextual semantics. Practically, this can inform computational models used in AI, steering them towards semantic scene descriptions rather than isolated objects and potentially improving scene understanding in technology.

Theoretically, this work aligns with the idea that visual perception exists on a continuum of representation from low-level features to high-level semantics, integrated across brain regions. It encourages rethinking visual processing as part of a broader semantic system that benefits from cross-modal interactions and feedback loops.

Future research could further elucidate the mechanisms by which the brain accomplishes these visuo-semantic transformations. Efforts might focus on characterizing the temporal dynamics of feedback mechanisms identified in the paper or comparing different semantic embeddings to optimize fMRI predictions. Additionally, unraveling the influence of semantics in early visual processing stages might refine our understanding of how vision integrates detailed sensory input into comprehensive mental representations.

In conclusion, this paper paves the way for new explorations in visual neuroscience by positing that the transformation of visual inputs into rich semantic descriptions is central to human vision. By focusing on semantic processing, it proposes a potentially more holistic approach to both human and artificial vision systems.

Authors (7)
  1. Adrien Doerig
  2. Emily Allen
  3. Yihan Wu
  4. Thomas Naselaris
  5. Kendrick Kay
  6. Ian Charest
  7. Tim C. Kietzmann