- The paper reveals that cross-modal alignment emerges predominantly in the mid-to-late layers of vision models and large language models (LLMs).
- It demonstrates that semantic content, rather than superficial attributes, drives robust alignment between modalities.
- The study finds that this alignment mirrors human judgments in image-caption tasks, and that averaging embeddings across exemplars of the same concept strengthens it further.
Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and LLMs
Overview
The paper "Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and LLMs" investigates the degree of alignment between separately trained vision and LLMs and explores where in these models such convergence occurs. The paper provides insights into the shared semantic space between unimodal networks and how they capture human-like judgments.
Cross-Modal Alignment Emergence
The research identifies that alignment emerges prominently in the mid-to-late layers of both vision models and LLMs. This observation points to a transition from modality-specific processing to a shared, abstract conceptual space: early layers exhibit weak cross-modal predictivity, which then strengthens markedly toward the later layers.
Figure 1: Layer-wise alignment, measured by linear predictivity, between an example vision model (ViT-Large-Dinov2) and all LLMs.
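To make the layer-wise analysis concrete, here is a minimal sketch of how linear predictivity between a vision model's per-layer embeddings and LLM caption embeddings could be computed. The paper does not publish this exact code; the ridge penalty, train/test split, and all names below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def linear_predictivity(X, Y, alpha=1.0, seed=0):
    """Fit a ridge regression from source embeddings X (n, d_src) to
    target embeddings Y (n, d_tgt); the score is the mean Pearson r
    between predicted and held-out target dimensions."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=seed)
    Y_hat = Ridge(alpha=alpha).fit(X_tr, Y_tr).predict(X_te)
    rs = [np.corrcoef(Y_hat[:, d], Y_te[:, d])[0, 1] for d in range(Y.shape[1])]
    return float(np.mean(rs))

# vision_layers: {layer_idx: (n_images, d_v) array}, one entry per layer
# text_emb: (n_images, d_t) LLM caption embeddings, same stimulus order
# curve = {L: linear_predictivity(emb, text_emb) for L, emb in vision_layers.items()}
```

Sweeping this score over each layer's embeddings produces the kind of layer-wise curve shown in Figure 1.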
Dependency on Semantic Content
The paper probes the factors that drive alignment by manipulating input data at both the visual and linguistic levels. The findings demonstrate that semantic content, not surface attributes, primarily drives alignment. For images, superficial manipulations (e.g., grayscale conversion) caused negligible changes in alignment, whereas semantic disruptions (e.g., object removal) degraded it significantly. Similarly, for text, disruptions of semantic structure (e.g., noun-only captions) diminished alignment.
Figure 2: (A) Example 'thing-only' and 'stuff-only' images created by manipulating the original image using masks. (B) Alignment by image manipulations.
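As an illustration of the two manipulation types, the sketch below contrasts a surface-level edit (grayscale conversion) with a semantic one (masking out object pixels). The masking logic is an assumption based on the 'thing'/'stuff' terminology in Figure 2; the paper's exact procedure may differ.

```python
import numpy as np
from PIL import Image

def grayscale(img: Image.Image) -> Image.Image:
    """Surface-level manipulation: discard color, keep semantic content."""
    return img.convert("L").convert("RGB")

def stuff_only(img: Image.Image, thing_mask: np.ndarray) -> Image.Image:
    """Semantic manipulation: gray out 'thing' (object) pixels given a
    boolean (H, W) mask, leaving only the background 'stuff'."""
    arr = np.array(img)
    arr[thing_mask] = 128  # neutral gray over object pixels
    return Image.fromarray(arr)

# Alignment is then re-measured on the manipulated images, e.g.:
# emb_gray = embed_images([grayscale(im) for im in images])  # embed_images: your vision encoder
# linear_predictivity(emb_gray, text_emb)                    # from the earlier sketch
```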
Mirroring Human Judgments
In their analyses, the researchers find that representational alignment mirrors fine-grained human preferences. Evaluations on the Pick-a-Pic task and CLIP-Score-based assessments revealed that human-preferred image-caption pairs exhibit stronger cross-modal alignment, suggesting that the shared representation captures the subtle semantic distinctions underlying human judgments.
Figure 3: (A) Pick-a-Pic dataset: linear predictivity scores grouped by image variation, based on human judgments. (B) MS-COCO dataset: linear predictivity scores grouped by caption variation.
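For the CLIP-Score-based assessment, a common recipe is the cosine similarity between CLIP's image and text embeddings. The sketch below uses the Hugging Face transformers implementation as a stand-in; the paper's exact scoring setup is not reproduced here.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image, caption: str) -> float:
    """Cosine similarity between CLIP's image and text embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# For a Pick-a-Pic pair, the human-preferred image is expected to score higher:
# better = img_a if clip_score(img_a, prompt) > clip_score(img_b, prompt) else img_b
```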
Enhancing Alignment Through Aggregation
Interestingly, averaging embeddings across multiple exemplars (captions or images) of the same concept amplifies alignment. This runs counter to the intuition that averaging blurs information; instead, aggregation appears to distill a stable semantic core. These findings challenge assumptions about the necessity of exemplar-specific representations and suggest aggregation as a robust technique for strengthening cross-modal alignment.
Figure 4: Effect of aggregation on alignment. Cross-modal aggregation enhances language–vision and vision–language predictivity.
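The aggregation step itself is simple: embed several exemplars of the same concept and average them before measuring alignment. A minimal sketch, assuming caption embeddings are already computed (e.g., the multiple MS-COCO captions per image); all names are illustrative.

```python
import numpy as np

def aggregate(embs_per_item):
    """Average the embeddings of all exemplars of one item, e.g. the
    several MS-COCO captions describing the same image."""
    return np.stack([e.mean(axis=0) for e in embs_per_item])

# caption_embs: list of (n_captions_i, d) arrays, one list entry per image
# text_core = aggregate(caption_embs)          # (n_images, d)
# linear_predictivity(image_embs, text_core)   # alignment vs. aggregated targets
```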
Broader Implications and Future Directions
The paper enriches cognitive and computational models by bridging how artificial and biological systems may converge on amodal representations of semantics. The findings open inquiries into how abstract versus concrete concepts influence alignment, and how different visual media (e.g., illustrations vs. photographs) alter semantic convergence. The paper also prompts exploration of the geometric properties of averaged embeddings as a possible denoising mechanism. Tracking alignment across training stages may further shed light on developmental trajectories in semantic learning.
Conclusion
This paper provides evidence that a shared semantic space emerges between sophisticated, independently trained unimodal vision and language networks. Importantly, this shared space aligns closely with human cognitive judgments, and techniques such as exemplar aggregation can further strengthen multi-modal semantic coherence. The findings suggest multiple research avenues into the intrinsic properties of cross-modal shared representations, their evolution over training, and their implications for human-like semantic inference.
Figure 5: Comparing the alignment of different vision models with LLMs after averaging caption embeddings and image embeddings.