
Visual Representations inside the Language Model (2510.04819v1)

Published 6 Oct 2025 in cs.CV and cs.CL

Abstract: Despite interpretability work analyzing ViT encoders and transformer activations, we don't yet understand why Multimodal LLMs (MLMs) struggle on perception-heavy tasks. We offer an under-studied perspective by examining how popular MLMs (LLaVA-OneVision, Qwen2.5-VL, and Llama-3-LLaVA-NeXT) process their visual key-value tokens. We first study the flow of visual information through the LLM, finding that image value tokens encode sufficient information to perform several perception-heavy tasks zero-shot: segmentation, semantic correspondence, temporal correspondence, and referring expression detection. We find that while the LLM does augment the visual information received from the projection of input visual encodings, which we reveal correlates with overall MLM perception capability, it contains less visual information on several tasks than the equivalent visual encoder (SigLIP) that has not undergone MLM finetuning. Further, we find that the visual information corresponding to input-agnostic image key tokens in later layers of LLMs contains artifacts which reduce the perception capability of the overall MLM. Next, we discuss controlling visual information in the LLM, showing that adding a text prefix to the image input improves the perception capabilities of visual representations. Finally, we reveal that if LLMs were able to better control their visual information, their perception would significantly improve; e.g., in 33.3% of Art Style questions in the BLINK benchmark, perception information present in the LLM is not surfaced to the output! Our findings reveal insights into the role of key-value tokens in multimodal systems, paving the way for deeper mechanistic interpretability of MLMs and suggesting new directions for training their visual encoder and LLM components.
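
The probing setup the abstract describes, reading out image value tokens from inside the LLM, can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration, not the paper's actual evaluation pipeline: the checkpoint id, decoder-layer module path, prompt format, config field name, and choice of layer 12 are all assumptions that vary across transformers versions and models.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Assumed checkpoint id; any LLaVA-OneVision checkpoint on the Hub should work.
model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

captured = {}  # layer index -> value-projection output, shape (batch, seq, dim)

def make_hook(idx):
    def hook(module, args, output):
        captured[idx] = output.detach()
    return hook

# Assumed module path to the decoder layers; it differs across transformers
# versions (e.g. model.language_model.layers in newer releases), so inspect
# `model` if this raises an AttributeError.
layers = model.language_model.model.layers
handles = [layer.self_attn.v_proj.register_forward_hook(make_hook(i))
           for i, layer in enumerate(layers)]

image = Image.open("example.jpg")        # any local test image
prompt = "<image>\nDescribe the image."  # prompt format is an assumption
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**inputs)  # single prefill pass; hooks capture every layer's values
for h in handles:
    h.remove()

# Select the sequence positions holding image tokens; the config field name is
# an assumption (some versions expose image_token_id instead).
image_token_id = model.config.image_token_index
img_mask = inputs["input_ids"][0] == image_token_id
value_tokens = captured[12][0, img_mask]  # e.g. layer 12's image value tokens
print(value_tokens.shape)                 # (num_image_tokens, v_proj_dim)
```

From tensors like `value_tokens`, zero-shot readouts in the spirit of the paper can be built, e.g. semantic correspondence by matching the patches of two images via cosine similarity of their value tokens. The text-prefix intervention the abstract mentions likewise amounts to prepending task-relevant text before the image placeholder in the prompt.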

