Language modulates vision: Evidence from neural networks and human brain-lesion models (2501.13628v1)

Published 23 Jan 2025 in q-bio.NC

Abstract: Comparing information structures between deep neural networks (DNNs) and the human brain has become a key method for exploring their similarities and differences. Recent research has shown better alignment of vision-language DNN models, such as CLIP, with the activity of the human ventral occipitotemporal cortex (VOTC) than earlier vision models, supporting the idea that language modulates human visual perception. However, interpreting the results from such comparisons is inherently limited due to the "black box" nature of DNNs. To address this, we combined model-brain fitness analyses with human brain lesion data to examine how disrupting the communication pathway between the visual and language systems causally affects the ability of vision-language DNNs to explain the activity of the VOTC. Across four diverse datasets, CLIP consistently outperformed both label-supervised (ResNet) and unsupervised (MoCo) models in predicting VOTC activity. This advantage was left-lateralized, aligning with the human language network. Analyses of data from 33 stroke patients revealed that reduced white matter integrity between the VOTC and the language region in the left angular gyrus was correlated with decreased CLIP performance and increased MoCo performance, indicating a dynamic influence of language processing on the activity of the VOTC. These findings support the integration of language modulation in neurocognitive models of human vision, reinforcing concepts from vision-language DNN models. The sensitivity of model-brain similarity to specific brain lesions demonstrates that leveraging manipulation of the human brain is a promising framework for evaluating and developing brain-like computer models.

Summary

  • The paper uses computational models and human brain-lesion data to demonstrate that language actively modulates visual processing in the ventral occipitotemporal cortex (VOTC).
  • Using fMRI data, the study showed that vision-language trained models like CLIP better fit human VOTC activity than image-centric models, especially in the left hemisphere.
  • Brain-lesion analysis provides causal evidence: damage to the white matter tracts connecting the VOTC and language regions impairs vision-language model fit, highlighting language's essential role in visual processing and informing AI development.

Influence of Language on Visual Processing in the Human Brain and Neural Networks

This paper examines how language modulates vision, combining neural network models with human brain data to probe the interaction between visual and language processing. The authors ask how well multimodal vision-language deep neural networks (DNNs), i.e., models trained to align visual features with linguistic input, capture the response profile of the human ventral occipitotemporal cortex (VOTC), a region central to visual perception and object recognition. Vision-language models, particularly CLIP, turn out to mirror VOTC activity more closely than purely visual models.

Methodology and Results Overview

Study 1: Comparative Analysis of Vision Models

The first part of the paper runs model-brain fitness analyses across four fMRI datasets, evaluating how well language-trained models explain VOTC activity in the human brain. The datasets vary in stimuli, tasks, and participant demographics, including both speaking and sign-language users. Three models were compared: CLIP, trained to align images with sentence-level language; ResNet-50, supervised with word-level category labels; and MoCo, a self-supervised, image-only model.
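
The summary does not spell out the exact fitting procedure; the sketch below illustrates one common way to quantify model-brain fit, a representational similarity analysis (RSA) over stimulus-by-feature matrices. The variable names (clip_feats, votc_patterns, etc.) are assumptions for illustration, not names from the paper.

```python
# Minimal RSA-style sketch of a model-brain fit, assuming precomputed inputs:
# model_features: (n_stimuli, n_units) DNN activations for the stimuli,
# votc_patterns:  (n_stimuli, n_voxels) fMRI response patterns from the VOTC.
# The paper's actual fitting procedure may differ.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(patterns: np.ndarray) -> np.ndarray:
    """Vectorized representational dissimilarity (1 - Pearson correlation)."""
    return pdist(patterns, metric="correlation")

def model_brain_fit(model_features: np.ndarray, votc_patterns: np.ndarray) -> float:
    """Spearman correlation between model and VOTC dissimilarity structures."""
    rho, _ = spearmanr(rdm(model_features), rdm(votc_patterns))
    return float(rho)

# Example comparison across models (feature arrays assumed to exist):
# fits = {name: model_brain_fit(feats, votc_patterns)
#         for name, feats in [("CLIP", clip_feats),
#                             ("ResNet-50", resnet_feats),
#                             ("MoCo", moco_feats)]}
```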

  • CLIP Outperformance: Across all four datasets, CLIP fit VOTC activity better than ResNet-50 and MoCo, most notably in the left hemisphere. The authors attribute this advantage to CLIP's integration of higher-order relational structure learned from both visual and linguistic input.
  • Lateralization and Language Integration: CLIP's advantage in the VOTC was consistently left-lateralized, mirroring the lateralization of the human language network and reinforcing the hypothesis that linguistic integration supports the interpretation of visual stimuli (a hemispheric comparison is sketched after this list).
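
One way to express the left-hemisphere advantage is a simple lateralization index over per-hemisphere fits, reusing model_brain_fit from the sketch above. The hemisphere-specific arrays (lh_patterns, rh_patterns) are hypothetical; the paper's actual index and statistics may differ.

```python
# Hypothetical hemispheric comparison: fit the same model to left- and
# right-VOTC response patterns and take the difference. Positive values
# indicate a left-hemisphere advantage for that model.
def lateralization_index(model_features, lh_patterns, rh_patterns) -> float:
    return (model_brain_fit(model_features, lh_patterns)
            - model_brain_fit(model_features, rh_patterns))
```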

Study 2: Examining Causal Connections Using Brain Lesion Models

The second study tested the causal influence of language on visual processing using data from individuals with brain damage. Across 33 stroke patients, reduced white matter integrity between the VOTC and language regions, primarily the left angular gyrus (L-AG), was associated with poorer CLIP model fit and, conversely, better MoCo fit, suggesting a compensatory reliance on lower-level visual structure when language input to the VOTC is disrupted.
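
The summary describes the lesion analysis only at a high level; the sketch below captures its logic, correlating per-patient tract integrity with per-patient model fit across the 33 patients. The variable names (tract_integrity, clip_fits, moco_fits) are illustrative assumptions, not the paper's actual measures.

```python
# Hedged sketch of the lesion-analysis logic: across patients, correlate the
# integrity of the VOTC-to-left-angular-gyrus tract with each model's fit.
# Arrays are length 33 (one value per patient); metrics are illustrative.
import numpy as np
from scipy.stats import spearmanr

def lesion_correlation(tract_integrity: np.ndarray,
                       per_patient_fit: np.ndarray) -> tuple[float, float]:
    """Spearman rho and p-value between tract integrity and model fit."""
    rho, p = spearmanr(tract_integrity, per_patient_fit)
    return float(rho), float(p)

# Pattern reported in the summary:
# lesion_correlation(tract_integrity, clip_fits)  # expected: positive rho
# lesion_correlation(tract_integrity, moco_fits)  # expected: negative rho
```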

Implications and Speculations

  • Cognitive Neuroscience Paradigm Shift: Incorporating language into neurocognitive models of vision clarifies how semantic (and possibly syntactic) information shapes visual perception, underscoring language's influence on visual cortex representations.
  • Advances in DNN Model Design: The results suggest that language-aligned training benefits DNNs and can inform the development of more brain-like artificial intelligence systems whose perception and interpretation are closer to human cognition.
  • Future Directions: Further work could map the specific pathways and interconnections that support multimodal integration in the brain, and test whether other high-order cognitive functions benefit from multimodal AI training regimes.

In sum, this paper demonstrates a substantive interplay between language and vision in both human and artificial systems. The differential model performance, combined with the brain-lesion evidence, makes a coherent case that language plays an integral role in modulating the architecture of visual cognition.