Visual representations in the human brain are aligned with large language models (2209.11737v2)

Published 23 Sep 2022 in cs.CV, cs.LG, and q-bio.NC

Abstract: The human brain extracts complex information from visual inputs, including objects, their spatial and semantic interrelations, and their interactions with the environment. However, a quantitative approach for studying this information remains elusive. Here, we test whether the contextual information encoded in LLMs is beneficial for modelling the complex visual information extracted by the brain from natural scenes. We show that LLM embeddings of scene captions successfully characterise brain activity evoked by viewing the natural scenes. This mapping captures selectivities of different brain areas, and is sufficiently robust that accurate scene captions can be reconstructed from brain activity. Using carefully controlled model comparisons, we then proceed to show that the accuracy with which LLM representations match brain representations derives from the ability of LLMs to integrate complex information contained in scene captions beyond that conveyed by individual words. Finally, we train deep neural network models to transform image inputs into LLM representations. Remarkably, these networks learn representations that are better aligned with brain representations than a large number of state-of-the-art alternative models, despite being trained on orders-of-magnitude less data. Overall, our results suggest that LLM embeddings of scene captions provide a representational format that accounts for complex information extracted by the brain from visual inputs.

The Role of Semantic Scene Descriptions in Human Visual Processing

The paper "Semantic scene descriptions as an objective of human vision" by Doerig et al. offers a significant contribution to the field of visual neuroscience by proposing semantic scene descriptions as a core objective of human visual processing rather than mere object categorization. The authors employ a vast array of techniques, including high-resolution 7T fMRI data analysis and advanced neural network models, to substantiate their claims.

Main Findings

The research uses a large-scale fMRI dataset (the Natural Scenes Dataset) to explore how the brain represents complex natural scenes. A key finding is a distributed brain network, extending into regions not traditionally associated with high-level semantic processing, whose activity correlates better with semantic embeddings of scene captions than with object category labels. This correlation persists even when participants are not performing a semantic task, suggesting a default mode of human vision that processes scenes semantically.
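To make the encoding-model logic concrete, here is a minimal sketch, not the authors' exact pipeline: captions are embedded with an off-the-shelf sentence encoder, and a cross-validated ridge regression maps embeddings to voxel responses. The choice of embedding model, the input variables (`captions`, `voxels`), and the train/test split are illustrative assumptions.

```python
# Minimal sketch of a caption-embedding encoding model (illustrative only).
# Assumes `captions` is a list of scene captions and `voxels` is an
# (n_images, n_voxels) array of fMRI responses, e.g. from NSD.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer  # assumed encoder choice

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
X = encoder.encode(captions)                       # (n_images, embed_dim)

X_tr, X_te, y_tr, y_te = train_test_split(X, voxels, test_size=0.2, random_state=0)

# One ridge regression per voxel, regularization strength chosen by CV.
model = RidgeCV(alphas=np.logspace(-2, 4, 13))
model.fit(X_tr, y_tr)

# Score each voxel by correlating predicted and held-out responses.
pred = model.predict(X_te)
r = [np.corrcoef(pred[:, v], y_te[:, v])[0, 1] for v in range(y_te.shape[1])]
print(f"median held-out voxel correlation: {np.median(r):.3f}")
```

In a setup like this, the per-voxel held-out correlations play the role of the brain-alignment scores that analyses of this kind report.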

Furthermore, the authors successfully decode semantic scene descriptions from brain activity and show that a recurrent convolutional neural network (RCNN) trained to map images onto these semantic embeddings predicts neural responses better than a large set of state-of-the-art alternative models, despite being trained on orders-of-magnitude less data. These results collectively indicate that human visual processing may primarily aim to construct comprehensive semantic scene descriptions.
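A much-simplified version of that image-to-embedding training objective is sketched below. For brevity, a pretrained feedforward ResNet stands in for the paper's recurrent convolutional network, and the cosine loss and embedding dimensionality are assumptions; the point is only to illustrate regressing caption embeddings from pixels.

```python
# Sketch of training an image model to output caption embeddings (simplified:
# a feedforward ResNet stands in for the paper's recurrent network).
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

EMBED_DIM = 384  # must match the sentence-embedding dimensionality (assumed)

backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
backbone.fc = nn.Linear(backbone.fc.in_features, EMBED_DIM)  # regression head

optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)
cosine = nn.CosineEmbeddingLoss()

def train_step(images, caption_embeddings):
    """One update: push image features toward the captions' LLM embeddings."""
    backbone.train()
    optimizer.zero_grad()
    pred = backbone(images)              # (batch, EMBED_DIM)
    target = torch.ones(images.size(0))  # +1 target = maximize cosine similarity
    loss = cosine(pred, caption_embeddings, target)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Once trained, the network's internal activations, rather than its outputs, are what one would compare against brain responses in an analysis of this kind.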

Implications and Future Speculations

The implications of this research are manifold. It challenges the traditional view of the visual system's function as predominantly object recognition, expanding that function to include the processing of relational and contextual semantics. Practically, this can inform computational models used in AI, steering them towards semantic scene descriptions rather than isolated objects and potentially improving scene understanding in technology.

Theoretically, this work aligns with the idea that visual perception exists on a continuum of representation from low-level features to high-level semantics, integrated across brain regions. It encourages rethinking visual processing as part of a broader semantic system that benefits from cross-modal interactions and feedback loops.

Future research could further elucidate the mechanisms by which the brain accomplishes these visuo-semantic transformations. Efforts might focus on characterizing the temporal dynamics of feedback mechanisms identified in the paper or comparing different semantic embeddings to optimize fMRI predictions. Additionally, unraveling the influence of semantics in early visual processing stages might refine our understanding of how vision integrates detailed sensory input into comprehensive mental representations.

In conclusion, this paper paves the way for new explorations in visual neuroscience by positing that the transformation of visual inputs into rich semantic descriptions is central to human vision. By focusing on semantic processing, it proposes a potentially more holistic approach to both human and artificial vision systems.

Authors (7)
  1. Adrien Doerig
  2. Emily Allen
  3. Yihan Wu
  4. Thomas Naselaris
  5. Kendrick Kay
  6. Ian Charest
  7. Tim C. Kietzmann