
Do language-only models learn the same universal visual dimensions?

Determine whether neural networks trained exclusively on language data learn the same universal dimensions of natural scene representation as image-trained vision networks. Here, "universal dimensions" are latent principal-component dimensions of image representations that are convergently learned across diverse vision architectures and training tasks and that align with human visual cortex responses.


Background

The paper identifies universal dimensions of natural image representation that are shared across diverse trained vision networks and strongly aligned with responses in human visual cortex. These dimensions are defined as latent components that are consistently predictable across networks and from fMRI responses, and they drive conventional representational similarity between models and the brain.
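To make the notion of cross-network predictability concrete, the sketch below shows one way per-dimension scores could be computed. This is an illustrative reconstruction under stated assumptions (PCA latents, cross-validated ridge regression, held-out correlation as the score), not the paper's released code; the function name and hyperparameters are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def dimension_predictability(X_target, X_source, n_components=100, seed=0):
    """Score how well each latent PCA dimension of one network's image
    representation (X_target: images x units) is predictable from another
    network's representation (X_source: images x units)."""
    # Latent dimensions of the target network.
    Z = PCA(n_components=n_components).fit_transform(X_target)

    Z_tr, Z_te, X_tr, X_te = train_test_split(
        Z, X_source, test_size=0.25, random_state=seed)

    scores = np.empty(n_components)
    for k in range(n_components):
        # Cross-validated ridge regression from source units to one component.
        model = RidgeCV(alphas=np.logspace(-2, 5, 8)).fit(X_tr, Z_tr[:, k])
        pred = model.predict(X_te)
        # Held-out correlation serves as the dimension's predictability score.
        scores[k] = np.corrcoef(pred, Z_te[:, k])[0, 1]
    return scores
```

On this reading, dimensions that score highly regardless of which trained network (or fMRI dataset) supplies X_source are the universal dimensions in question.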

Prior work suggests that language-model embeddings of object names and scene captions can predict high-level visual cortical representations, motivating the question of whether language-only training can give rise to the same universal visual dimensions found in image-trained networks.
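As a hypothetical illustration of how this open question could be operationalized, the same scoring function could be applied with caption embeddings from a language-only model standing in for a second vision network. The embedding model, file paths, and caption source below are placeholder assumptions, not the paper's protocol.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # illustrative choice

# Hypothetical inputs: activations of an image-trained network on N scenes
# (the file paths are placeholders) and one caption per scene, row-aligned.
X_vision = np.load("vision_net_activations.npy")           # N x units
captions = open("scene_captions.txt").read().splitlines()  # N captions

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
X_language = encoder.encode(captions)              # N x embedding_dims

# Reusing dimension_predictability from the sketch above: if language-only
# training yields the same universal dimensions, the dimensions that are
# cross-predictable among image-trained networks should also score highly
# when predicted from caption embeddings.
scores = dimension_predictability(X_vision, X_language)
print("mean score of first 10 dimensions:", scores[:10].mean())
```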

References

An open question is whether networks trained on language data alone learn the same universal dimensions of natural scene representation as image-trained networks.

Chen et al., "Universal dimensions of visual representation," arXiv:2408.12804, 23 Aug 2024; Discussion, future-work paragraph.