Interpreting the linear structure of vision-language model embedding spaces (2504.11695v4)

Published 16 Apr 2025 in cs.CV, cs.MM, and cs.CL

Abstract: Vision-language models encode images and text in a joint space, minimizing the distance between corresponding image and text pairs. How are language and images organized in this joint space, and how do the models encode meaning and modality? To investigate this, we train and release sparse autoencoders (SAEs) on the embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, and AIMv2). SAEs approximate model embeddings as sparse linear combinations of learned directions, or "concepts". We find that, compared to other methods of linear feature learning, SAEs are better at reconstructing the real embeddings, while also able to retain the most sparsity. Retraining SAEs with different seeds or different data diet leads to two findings: the rare, specific concepts captured by the SAEs are liable to change drastically, but we also show that commonly-activating concepts are remarkably stable across runs. Interestingly, while most concepts activate primarily for one modality, we find they are not merely encoding modality per se. Many are almost orthogonal to the subspace that defines modality, and the concept directions do not function as good modality classifiers, suggesting that they encode cross-modal semantics. To quantify this bridging behavior, we introduce the Bridge Score, a metric that identifies concept pairs which are both co-activated across aligned image-text inputs and geometrically aligned in the shared space. This reveals that even single-modality concepts can collaborate to support cross-modal integration. We release interactive demos of the SAEs for all models, allowing researchers to explore the organization of the concept spaces. Overall, our findings uncover a sparse linear structure within VLM embedding spaces that is shaped by modality, yet stitched together through latent bridges, offering new insight into how multimodal meaning is constructed.

Summary

  • The paper demonstrates that sparse autoencoders and Top-K methods extract semantically meaningful linear directions from VLM embedding spaces.
  • It shows that high-energy concepts are robust across training runs and contribute substantially to reconstruction quality.
  • The study introduces the Bridge Score to measure cross-modal alignment, highlighting interactions between text and image features.

Interpreting the Linear Structure of Vision-Language Model Embedding Spaces

The paper "Interpreting the Linear Structure of Vision-LLM Embedding Spaces" (2504.11695) explores the organization of joint embedding spaces in vision-LLMs (VLMs). This paper involves training sparse autoencoders (SAEs) on embeddings from multiple VLMs to extract semantically meaningful linear directions termed "concepts." This work provides insights into the expressivity and sparsity trade-offs in dictionary learning methods and introduces metrics for evaluating concept robustness and cross-modal alignment.

Dictionary Learning and Sparse Autoencoders

The authors employed several dictionary learning methods, including Sparse Autoencoders (SAEs), Semi-NMF, and Top-K SAEs, to analyze their performance in capturing the linear structure of vision-language embeddings. The evaluation was based on expressivity, measured by $R^2$ scores, and sparsity, quantified by the $\ell_0$ norm of the code matrix $\mathbf{Z}$.

Figure 1: Selecting a sparse dictionary learning method: the Expressivity-Sparsity trade-off. Pareto fronts for five dictionary learning methods applied to four vision-language models.

The paper found that Top-K SAEs generally outperform the other methods in balancing reconstruction quality and code sparsity, making them well suited to extracting interpretable features from VLMs.
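
To make the Top-K mechanism concrete, the sketch below applies a Top-K SAE to pre-computed VLM embeddings and evaluates it with the $R^2$ and $\ell_0$ metrics described above. The architectural details (ReLU encoder, dictionary size, the TopKSAE class itself) are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    """Minimal Top-K sparse autoencoder for d-dimensional embeddings.

    Hyperparameters (dictionary size, k) are illustrative, not the
    values used in the paper.
    """

    def __init__(self, d_embed: int, n_concepts: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_embed, n_concepts)
        self.decoder = nn.Linear(n_concepts, d_embed)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.encoder(x))
        # Keep only the k largest activations per sample; zero out the rest.
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z)
        z_sparse.scatter_(-1, topk.indices, topk.values)
        return z_sparse

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(x))


def r2_score(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    """Fraction of embedding variance explained by the reconstruction."""
    ss_res = ((x - x_hat) ** 2).sum()
    ss_tot = ((x - x.mean(dim=0)) ** 2).sum()
    return (1.0 - ss_res / ss_tot).item()


def l0_sparsity(z: torch.Tensor) -> float:
    """Average number of nonzero code entries per embedding."""
    return (z != 0).float().sum(dim=-1).mean().item()


# Usage sketch on random stand-ins for 768-dimensional VLM embeddings.
x = torch.randn(1024, 768)
sae = TopKSAE(d_embed=768, n_concepts=8192, k=32)
z = sae.encode(x)
x_hat = sae.decoder(z)
print(r2_score(x, x_hat), l0_sparsity(z))  # expressivity vs. sparsity
```

In practice the SAE would be trained to minimize reconstruction error before these metrics are reported; the untrained forward pass above only illustrates the shapes and the Top-K masking.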

Stability and Robustness of Conceptual Features

To assess the stability of extracted concepts, SAEs were retrained with different random seeds and data mixtures. Although overall dictionary stability was low, high-energy concepts were remarkably consistent across training runs. These concepts account for a substantial portion of the reconstruction mass, underscoring their functional importance.

Figure 2: The concepts that use most of the energy are stable.

Figure 3: The geometry of high-energy concepts is stable across data mixtures.
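
A plausible way to operationalize this stability analysis is sketched below: score each concept by the share of activation mass it carries ("energy"), then check whether high-energy decoder directions from one run have close cosine matches in another run. The energy definition and matching procedure are our reading of the setup, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def concept_energy(z: torch.Tensor) -> torch.Tensor:
    """Share of total activation mass carried by each concept.

    z: (n_samples, n_concepts) sparse codes from one trained SAE. This
    energy definition is an illustrative proxy for the paper's measure.
    """
    per_concept = z.abs().sum(dim=0)
    return per_concept / per_concept.sum()


def max_cosine_match(D_a: torch.Tensor, D_b: torch.Tensor) -> torch.Tensor:
    """Best cosine similarity in dictionary D_b for each direction in D_a.

    D_a, D_b: (n_concepts, d_embed) decoder directions from two runs.
    High values for high-energy concepts indicate that those directions
    are recovered across seeds or data mixtures.
    """
    sims = F.normalize(D_a, dim=-1) @ F.normalize(D_b, dim=-1).T
    return sims.max(dim=-1).values


# Usage sketch: compare two runs, restricted to the top-energy concepts.
# D_run0, D_run1 are decoder weight matrices; z_run0 are codes from run 0.
# energy = concept_energy(z_run0)
# top = energy.topk(256).indices
# stability = max_cosine_match(D_run0[top], D_run1)
```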

Single-Modality Activation and Cross-Modal Semantics

The paper found that most concepts are single-modality: their modality scores cluster at the extremes, meaning a concept typically activates either for text or for images, despite residing in a joint embedding space. However, many concept directions are nearly orthogonal to the modality-defining subspace and perform poorly as modality classifiers, suggesting that they encode shared, cross-modal semantics rather than modality itself.

Figure 4: Most concepts are single-modality.

Figure 5: Many concepts are not aligned with the modality directions.
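
The two measurements in this section can be sketched as follows: a per-concept modality score (the fraction of activation mass coming from image inputs) and the cosine of each concept direction with a simple image-minus-text mean direction standing in for the modality-defining subspace. Both formulations are illustrative; the paper's exact definitions may differ.

```python
import torch
import torch.nn.functional as F


def modality_score(z: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
    """Per-concept fraction of activation mass coming from image inputs.

    z: (n_samples, n_concepts) codes; is_image: (n_samples,) boolean mask.
    Scores near 1 or 0 indicate single-modality concepts.
    """
    img_mass = z[is_image].abs().sum(dim=0)
    txt_mass = z[~is_image].abs().sum(dim=0)
    return img_mass / (img_mass + txt_mass + 1e-9)


def modality_alignment(D: torch.Tensor,
                       img_emb: torch.Tensor,
                       txt_emb: torch.Tensor) -> torch.Tensor:
    """Cosine of each concept direction with the image-minus-text mean
    direction, a simple stand-in for the modality-defining subspace.

    Values near zero mean a concept is nearly orthogonal to modality,
    consistent with it encoding cross-modal semantics rather than modality.
    """
    direction = img_emb.mean(dim=0) - txt_emb.mean(dim=0)
    direction = direction / direction.norm()
    return F.normalize(D, dim=-1) @ direction
```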

Bridging Concepts Across Modalities

The paper introduces the Bridge Score, which combines co-activation across aligned image-text inputs with geometric alignment in the shared space, to identify concept pairs that facilitate cross-modal connections within the embedding space. Despite the unimodal activation tendencies of individual concepts, the Bridge Score reveals interactions between them that support semantic integration across modalities.

Figure 6: Image and text activations lie in separate cones.

Figure 7: Bridge score identifies semantically aligned concept pairs across modalities.
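
The sketch below combines the two ingredients named above, co-activation across aligned image-text pairs and geometric alignment of the concept directions, into one plausible pairwise score. It mirrors the spirit of the Bridge Score but is not the paper's exact formula.

```python
import torch
import torch.nn.functional as F


def bridge_score(z_img: torch.Tensor,
                 z_txt: torch.Tensor,
                 D: torch.Tensor) -> torch.Tensor:
    """Illustrative pairwise bridge score for all concept pairs (i, j).

    z_img, z_txt: (n_pairs, n_concepts) codes for aligned image-text pairs.
    D: (n_concepts, d_embed) decoder directions. Combines (a) how strongly
    concept i fires on an image while concept j fires on the matching
    caption, averaged over pairs, with (b) the geometric alignment of the
    two decoder directions. This mirrors the two ingredients of the paper's
    Bridge Score but is not its exact formula.
    """
    coact = z_img.T @ z_txt / z_img.shape[0]        # (n_concepts, n_concepts)
    geom = (F.normalize(D, dim=-1) @ F.normalize(D, dim=-1).T).clamp(min=0.0)
    return coact * geom


# High-scoring off-diagonal pairs (i != j) are candidate "bridges" between
# an image-dominant concept and a text-dominant one.
```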

Conclusion

This work provides a nuanced understanding of how vision-language models organize semantic content in their embedding spaces. By uncovering the sparse linear structure and highlighting mechanisms of cross-modal alignment, it lays a foundation for more interpretable multimodal systems. The accompanying VLM-Explore tool further enables interactive exploration of the concept spaces of these models, supporting future research in multimodal interpretability.
