Papers
Topics
Authors
Recent
Search
2000 character limit reached

Lost in Embeddings: Information Loss in Vision-Language Models

Published 15 Sep 2025 in cs.CV and cs.CL | (2509.11986v1)

Abstract: Vision--LLMs (VLMs) often process visual inputs through a pretrained vision encoder, followed by a projection into the LLM's embedding space via a connector component. While crucial for modality fusion, the potential information loss induced by this projection step and its direct impact on model capabilities remain understudied. We introduce two complementary approaches to examine and quantify this loss by analyzing the latent representation space. First, we evaluate semantic information preservation by analyzing changes in k-nearest neighbor relationships between image representations, before and after projection. Second, we directly measure information loss by reconstructing visual embeddings from the projected representation, localizing loss at an image patch level. Experiments reveal that connectors substantially distort the local geometry of visual representations, with k-nearest neighbors diverging by 40--60\% post-projection, correlating with degradation in retrieval performance. The patch-level embedding reconstruction provides interpretable insights for model behavior on visually grounded question-answering tasks, finding that areas of high information loss reliably predict instances where models struggle.

Summary

  • The paper quantifies connector-induced information loss by introducing KNOR and patch-level reconstruction metrics to reveal geometric and semantic distortions in VLM connectors.
  • It employs a dual-method approach—k-nearest neighbor overlap and per-patch reconstruction loss—to analyze how connector architectures affect downstream tasks like retrieval, captioning, and VQA.
  • The study demonstrates that increased information loss in critical image regions correlates with degraded model performance, underscoring the need for adaptive connector designs.

Information Loss in Vision-LLM Connectors: Quantification and Implications

Introduction

This paper presents a systematic investigation into the information loss induced by connector modules in vision-LLMs (VLMs). Connectors, typically implemented as MLPs or attention-based modules, project high-dimensional visual features from pretrained vision encoders into the embedding space of LLMs. While these connectors are essential for modality fusion, their impact on the preservation of semantic and structural information in visual representations has not been rigorously quantified. The authors introduce two complementary methodologies—kk-nearest neighbor overlap ratio (KNOR) and patch-level embedding reconstruction—to measure and localize information loss, and empirically analyze its consequences for downstream tasks such as retrieval, captioning, and visual question answering (VQA).

Quantifying Information Loss: Methodological Framework

The first approach, KNOR, evaluates the preservation of local geometric structure in the latent space by measuring the overlap of kk-nearest neighbors for each image before and after the connector projection. A high overlap ratio indicates faithful retention of semantic relationships, while a low ratio signals substantial geometric distortion. Figure 1

Figure 1: The k-nearest neighbors overlap ratio measures the overlap of an image's neighbors before and after projection. In this example, with k=3, the overlap ratio is 0.67 because two out of the three nearest neighbors are identical in both representation spaces.

The second approach involves training a high-capacity reconstruction model to recover the original vision encoder embeddings from the connector outputs. The per-patch reconstruction loss quantifies the degree of information loss at a fine-grained spatial level, enabling the identification of image regions where critical visual features are lost. Figure 2

Figure 2

Figure 2

Figure 2: Input image with red answer mask.

Empirical Analysis: Structural and Localized Information Loss

Neighborhood Overlap and Retrieval Performance

Across three representative open-weight VLMs—LLaVA, Idefics2, and Qwen2.5-VL—the KNOR analysis reveals that connectors induce substantial reordering of local neighborhoods in the embedding space. For all models and datasets, the average overlap ratio remains below 0.62, with LLaVA achieving a maximum of 61.6% at k=100k=100 and Qwen2.5-VL exhibiting the most severe loss (overlap ratio near 0.1 for small kk). Figure 3

Figure 3: Neighborhood overlap ratios across three datasets: SeedBench validation, a 10,000-sample subset of VQAv2 validation, and Vizwiz grounding VQA validation. Analysis using 10, 50, and 100 nearest neighbors shows overlap ratios below 0.62 for all models, suggesting connectors poorly preserve geometric relationships and neighbor rankings for the visual representations.

This geometric distortion correlates with degraded retrieval performance. For LLaVA and Idefics2, recall@5 on the CUB-200-2011 retrieval task drops by 41.4% and 18.8%, respectively, when using post-projection embeddings. Qwen2.5-VL, despite a low overlap ratio, paradoxically improves in semantic retrieval, likely due to continued pretraining of its vision encoder and aggressive patch merging, which fundamentally alters the representation space. Figure 4

Figure 4

Figure 4

Figure 4: Five nearest neighbors of LLaVA image embeddings.

Patch-Level Reconstruction and Downstream Task Performance

Patch-level reconstruction loss is strongly predictive of downstream task failures in LLaVA and Idefics2. For image captioning, higher average reconstruction loss correlates with lower CIDEr scores. The per-sample Spearman correlation between reconstruction loss and CIDEr is negative and statistically significant for both models, with the highest-loss quartile of images exhibiting markedly worse captioning performance.

For VQA tasks requiring fine-grained visual grounding (e.g., VizWiz), high reconstruction loss in answer-relevant patches reliably predicts incorrect model responses. Visualization of these high-loss regions demonstrates that critical details (such as digits or text) are often lost in the connector projection, directly constraining the LLM's reasoning capabilities. Figure 5

Figure 5: Correlation between reconstruction loss and question-answering accuracy on the VizWiz grounding VQA task. For LLaVA and Idefics2, all correlations have a p-value < \num{5e-5}.

Figure 6

Figure 6

Figure 6

Figure 6

Figure 6

Figure 6

Figure 6

Figure 6

Figure 6

Figure 6: Additional visualization of high reconstruction loss patches that contributes to model's failure on answering questions that requires recognizing text in the objects. Left: input images with answer-relevant regions in red masks. Middle: signed difference between post-projection embeddings norms and pre-projection embedding norms. Right: normalized norm differences overlay with the input image, with highest loss patches marked in yellow.

Analysis of Connector Architectures

The study distinguishes between feature-preserving connectors (e.g., LLaVA's MLP) and feature-compressing connectors (e.g., Idefics2's perceiver resampler, Qwen2.5-VL's patch merger). Feature-compressing connectors exhibit more severe geometric distortion and higher patch-level information loss, but may yield more semantically meaningful representations for certain tasks, as observed in Qwen2.5-VL.

Procrustes analysis further demonstrates that linear alignment is insufficient to recover the pre-projection structure from post-projection embeddings, especially for LLaVA and Qwen2.5-VL, indicating that the information loss is not merely a rigid transformation but involves non-linear collapse or entanglement. Figure 7

Figure 7: Alignment visualization for LLaVA pre- and post-projection embeddings through PCA.

Implications and Future Directions

The findings have several implications for the design and evaluation of VLMs:

  • Connector-induced information loss is substantial and measurable: Both global (KNOR) and local (reconstruction loss) metrics reveal that current connector architectures often fail to preserve the semantic and structural integrity of visual representations.
  • Task performance is bottlenecked by connector fidelity: For tasks requiring fine-grained visual reasoning or grounding, information loss at the connector directly limits model accuracy.
  • Connector design should be context-sensitive: The trade-off between semantic abstraction and structural preservation must be explicitly managed. Incorporating reconstruction loss as a regularization objective during pretraining may improve downstream performance.
  • Visualization of patch-level loss provides interpretability: Localizing information loss enables targeted diagnosis of model failures and can inform the development of more robust connector architectures.

The authors suggest that future work should explore dynamic or adaptive projection mechanisms, improved feature selection for modality fusion, and the integration of information-preserving objectives into connector training.

Conclusion

This paper establishes a rigorous framework for quantifying and localizing information loss in VLM connectors, demonstrating that current architectures often induce significant geometric and semantic degradation in visual representations. The proposed metrics—KNOR and patch-level reconstruction loss—are shown to be predictive of downstream task performance and provide actionable insights for model development. These results underscore the necessity of rethinking connector design to balance semantic abstraction with the preservation of task-relevant visual information, and open avenues for principled improvements in multimodal model architectures.

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Explain it Like I'm 14

What this paper is about

This paper looks at a hidden but very important step inside vision–LLMs (VLMs)—AI systems that look at pictures and talk about them. Most VLMs take the picture, turn it into numbers (a “visual embedding”), then pass those numbers through a small bridge called a connector so a LLM can understand them. The big question: does this connector accidentally throw away useful visual information?

The main questions, in simple terms

  • When a VLM turns image features into a form the LLM can read, how much detail is lost?
  • Does this loss of detail make the model worse at tasks like finding similar images, writing captions, or answering questions about pictures?
  • Can we see where in the image that information gets lost?

How the researchers studied it

They used two ideas to measure information loss. Think of an “embedding” as a compact digital summary of an image, like a special fingerprint made of numbers.

1) Neighborhood comparisons (k-nearest neighbors overlap)

Imagine every image lives on a big map, and each image has its “closest neighbors” (the most similar images). The team checked how an image’s nearest neighbors change:

  • Before the connector (raw visual features).
  • After the connector (features translated for the LLM).

If the “neighborhood” changes a lot after the connector, it means the connector is reshaping the image information, possibly losing important structure. They measure this with a score called k-NN overlap ratio: the higher the overlap, the better the preservation.

Analogy: If your top 10 best friends suddenly become a mostly different group after translation, something changed in how you’re represented.

2) Patch-by-patch reconstruction (finding where details go missing)

Images are split into small tiles called patches. The researchers trained a separate model to reverse the connector: given the connector’s output, try to rebuild the original visual features, patch by patch. If some patches are hard to reconstruct, that suggests the connector discarded or scrambled details there.

Analogy: It’s like compressing a photo and then trying to uncompress it. Where the “uncompression” fails, you likely lost detail.

Models and tasks they tested

They tested three popular, open models with different connector designs:

  • LLaVA (uses an MLP connector and keeps all patches).
  • Idefics2 (uses a Perceiver-like connector that compresses patches).
  • Qwen2.5-VL (merges patches; also continues training the vision part).

They evaluated on:

  • Finding similar images (image retrieval).
  • Writing captions (COCO, Flickr30k).
  • Answering questions about images, including a dataset that marks which image regions matter for the answer (VizWiz Grounding VQA).

What they discovered and why it matters

  • Connectors change neighborhoods a lot. Across datasets and models, 40–60% of the nearest neighbors changed after the connector. This means the connector significantly reshapes how images are organized in the model’s “map.”
  • This reshaping often hurts image retrieval. When neighborhoods change more, the model gets worse at finding the right similar images—there’s a measurable drop in retrieval scores for LLaVA and Idefics2.
  • One exception: Qwen2.5-VL. Even though its neighbor overlap was low, its post-connector features sometimes worked better for retrieval. The authors suspect this is because Qwen2.5-VL keeps training its vision encoder and merges patches, so the “after” space is just a different, possibly more semantic space—not a simpler copy of the “before” space.
  • Patch-level loss predicts performance. When reconstruction shows high loss in important patches (the regions that matter for the answer), models are more likely to get visual questions wrong. For captioning, images with higher reconstruction loss tend to get worse scores, especially for LLaVA and Idefics2.
  • Visual explanations help diagnose failures. The patch-level maps highlight exactly where information is lost, making it clearer why a model missed a detail (for example, failing to read a specific number in the image).

What this means going forward

Connectors are not just simple translators—they can reshape and sometimes degrade visual information that the LLM relies on. This research suggests:

  • Good connectors should keep meaningful image relationships intact or improve them in a thoughtful way.
  • They should preserve the parts of the image that matter for the current task (like text regions when the question asks about a number).
  • Training methods could include a “reconstruction loss” (a penalty when important details can’t be recovered) to encourage connectors to retain key information.
  • Future designs might use smarter, dynamic projection layers that adapt to the image and the question, or better ways to select which visual features to pass on.

In short: the bridge between seeing and talking matters a lot. Measuring and reducing information loss at that bridge can make multimodal AI both smarter and more reliable.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains missing, uncertain, or unexplored, framed to be directly actionable for future research.

  • Metric validity and alternatives
    • Validate whether low kk-NN overlap necessarily indicates harmful information loss, given cases like Qwen2.5-VL where overlap is low yet retrieval improves.
    • Compare KNOR with manifold-preservation metrics (e.g., trustworthiness/continuity, rank-based Kendall’s tau, CCA-based similarity, Centered Kernel Alignment) and isotropy measures (e.g., IsoScore) to test robustness.
    • Assess sensitivity of KNOR to choices of distance metric (cosine vs L2L^2), normalization, dataset size, class balance, and kk values beyond 10/50/100.
    • Provide formal statistical testing and confidence intervals for KNOR across datasets and models, not only correlations on a per-sample basis.
  • Causality vs correlation
    • Establish causal links between connector-induced changes and downstream errors via controlled interventions (e.g., patch-level feature masking, adversarial modifications, connector regularization during fine-tuning).
    • Run ablations where only the connector is altered (keeping vision encoder/LLM frozen) vs end-to-end co-training to isolate the source of observed effects.
  • Scope of models and architectures
    • Extend analysis beyond three connector-based VLMs to cover more connector variants (e.g., InternVL/InternVL2, LLaVA variants, LION, Gemini-like systems) and different compression ratios.
    • Include cross-attention fusion architectures (e.g., BLIP-2-style, Flamingo-like) to determine whether the documented information loss is connector-specific or a general property of multimodal fusion.
    • Evaluate scaling effects by testing small and very large models (>30B) to characterize how model size modulates information loss.
  • Reconstruction methodology and identifiability
    • Disentangle whether reconstruction error reflects true information loss vs limits of the trained reconstructor; quantify identifiability in many-to-one mappings (e.g., when connectors compress sequences).
    • Benchmark multiple reconstruction families (MLP/Transformer/U-Net, diffusion-based inversion, invertible networks) and report capacity–performance curves for all three VLMs (not only LLaVA).
    • Measure other reconstruction fidelity metrics beyond MSE/L2 norm differences (e.g., cosine similarity at patch level, feature-level LPIPS, CLIP score between original and reconstructed visual features).
    • Provide theoretical bounds or identifiability conditions under which pre-projection embeddings are recoverable from post-projection embeddings.
  • Dependence on training data and generalization
    • Test reconstruction models trained on COCO against out-of-domain datasets (medical, satellite, documents) to assess generalization of the inferred “loss maps.”
    • Investigate whether reconstruction models overfit to COCO statistics; include cross-dataset and few-shot evaluation of reconstructor robustness.
  • Patch-level grounding and relevance
    • Go beyond VizWiz masks by using diverse grounding sources (RefCOCO/RefCOCO+, PhraseCut, Visual Genome) to validate that high-loss regions consistently align with task-critical regions.
    • Compare multiple patch-importance estimators (e.g., Grad-CAM/attention rollout/attribution maps from the LLM) to verify that “relevant” patches are correctly identified and that loss in those patches best predicts errors.
  • Task coverage and evaluation breadth
    • Extend beyond captioning/VQA/retrieval to fine-grained recognition, OCR/text-in-image, referring expression comprehension, detection/segmentation, and robustness (occlusion, low-light, blur).
    • Evaluate instruction-following tasks requiring multi-hop visual reasoning to see if loss concentrates on compositional cues.
  • Connector design and training signals
    • Systematically vary connector architectures (depth, width, bottleneck size, residual/skip connections, attention vs MLP) and quantify how each design choice impacts KNOR and reconstruction loss.
    • Integrate reconstruction-based regularizers (e.g., embedding-level distillation, rate–distortion objectives, Jacobian penalties) during connector training and report improvements in geometry preservation and downstream tasks.
    • Explore dynamic or content-adaptive connectors (e.g., routing, conditional computation, mixture-of-experts, token pruning guided by textual context) to preferentially preserve task-relevant visual features.
  • Role of the vision encoder
    • Analyze the impact of freezing vs unfreezing the vision encoder on KNOR and reconstruction (e.g., Qwen2.5-VL continues to pretrain the vision encoder).
    • Quantify how different vision encoders (CLIP, SigLIP, ConvNeXt, Swin, EVA) affect pre/post projection geometry and the connector’s learned mapping.
  • Interaction with the LLM and flattener
    • Measure how the flattener ordering and positional encodings affect post-projection utility; test whether reordering tokens or altering positional schemes changes downstream performance.
    • Evaluate information preservation not only at the connector output but after the first few LLM layers to capture how the LLM transforms and potentially discards visual signals.
  • Theoretical characterization
    • Develop information-theoretic measures (e.g., mutual information estimates, rate–distortion trade-offs) to complement geometric metrics and provide principled notions of “loss.”
    • Analyze connector mappings through Jacobian spectra, Lipschitz constants, and volume-preservation (or collapse) to relate local geometry distortions to global structure changes.
  • Procrustes and manifold alignment
    • Go beyond linear Procrustes by testing nonlinear manifold alignment (e.g., autoencoder alignment, optimal transport) to quantify whether nonlinearity suffices to reconcile pre/post spaces.
    • Compare alignment errors across pooled vs token-level embeddings and across multiple pooling strategies (mean/max/attention pooling).
  • Retrieval evaluation details
    • Report full inner-product vs L2L^2 retrieval comparisons on CUB (and other retrieval datasets) in the main text; evaluate sensitivity to index settings (FAISS variants) and gallery size.
    • Conduct class- and attribute-level error analysis to determine which fine-grained attributes are most affected by connector-induced loss.
  • Robustness and uncertainty
    • Test metric stability under input perturbations (noise, blur, compression, adversarial edits) to see whether connectors amplify vulnerability to distribution shifts.
    • Quantify uncertainty in reconstruction (e.g., via ensembles or Bayesian models) and examine whether high predictive uncertainty correlates better with downstream failures than raw reconstruction error.
  • Practicality and efficiency
    • Assess the computational cost of large reconstructors (up to 844M params) and develop lightweight, training-free proxies for estimating information loss (e.g., local intrinsic dimensionality).
    • Provide guidelines for estimating loss during pretraining without expensive auxiliary models.
  • Reporting and reproducibility
    • Release per-sample KNOR and loss maps for all datasets to enable meta-analysis; document seeds, data splits, and hyperparameters for KNOR, retrieval, and reconstruction.
    • Perform significance testing for model–dataset differences in KNOR and reconstruction metrics, not only per-sample correlations.
  • Open question on metric alignment with human utility
    • Determine how well KNOR and reconstruction loss predict human-perceived failures in explanations or visual reasoning; include human studies or expert annotations linking loss regions to missed concepts.
  • Pixel-level reconstruction and perceptual fidelity
    • Revisit pixel-level inversion with stronger generative priors (e.g., diffusion-based inversion, high-capacity VAEs, NeRF-style priors) and larger/cleaner training sets to test whether connector outputs retain sufficient information for faithful image recovery.
  • Time and video
    • Extend to video-LLMs to measure whether connectors disproportionately discard temporal cues or motion features relative to spatial content.
  • Multi-image and document settings
    • Evaluate multi-image/context fusion and document understanding (tables/charts) where spatial layout and fine-grained symbols are critical, testing whether connectors lose structured relations.

These gaps collectively point to the need for broader architectural coverage, causal experiments, more principled metrics, richer tasks, and targeted connector training strategies that explicitly preserve task-relevant visual information while improving semantic alignment.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 10 tweets with 239 likes about this paper.