Retention of VLM Visual–Language Representations After VLA Adaptation

Determine to what extent pretrained Vision–Language Models (VLMs) preserve their original visual–language representations and world knowledge after adaptation to the action modality in Vision–Language–Action (VLA) models via supervised fine-tuning for robotic control.

Background

The paper studies how adapting pretrained Vision–Language Models (VLMs) to robotic action prediction in Vision–Language–Action (VLA) models can alter their inherited visual–language (VL) representations. Recent evidence suggests that VLA fine-tuning may degrade visual semantics and attention localization, raising concerns about loss of semantic grounding and generalization.
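One way to make this notion of representational drift concrete (an illustrative probe, not necessarily the paper's own protocol) is to compare activations from the pretrained VLM and its VLA-fine-tuned counterpart on the same images with linear centered kernel alignment (CKA); low CKA at a given layer indicates the fine-tuned features have moved away from the originals. The sketch below is self-contained; the feature-extraction step and variable names are assumptions.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    X: features from the pretrained VLM; Y: features from the VLA-fine-tuned
    model, both extracted at the same layer for the same batch of images.
    Returns a scalar in [0, 1]; values near 1 mean the representations are
    (linearly) similar, values near 0 indicate strong drift.
    """
    X = X - X.mean(dim=0, keepdim=True)  # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    hsic = (Y.T @ X).norm() ** 2
    return hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())

# Hypothetical usage: compare layer-l hidden states on a shared image batch.
# cka_l = linear_cka(vlm_hidden[l], vla_hidden[l])
```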

Motivated by this uncertainty, the authors analyze attention maps and the structure of latent representations, and design VL-Think tasks to probe whether VL capabilities are retained after fine-tuning. They further propose a lightweight visual representation alignment method (see the sketch below) intended to mitigate such degradation, but the broader question of how much is retained during VLA adaptation is explicitly identified as open.
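A minimal sketch of such an alignment objective, assuming it works by regularizing the fine-tuned model's visual tokens toward features from a frozen copy of the pretrained VLM; the projection head, cosine loss, and weight lambda_align are illustrative assumptions, not the paper's confirmed design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAlignmentLoss(nn.Module):
    """Auxiliary loss that pulls VLA visual tokens back toward frozen
    pretrained-VLM features during action fine-tuning (illustrative)."""

    def __init__(self, d_vla: int, d_vlm: int):
        super().__init__()
        self.proj = nn.Linear(d_vla, d_vlm)  # learned map into VLM feature space

    def forward(self, vla_feats: torch.Tensor, vlm_feats: torch.Tensor):
        # vla_feats: (B, T, d_vla) visual tokens from the model being fine-tuned.
        # vlm_feats: (B, T, d_vlm) targets from the frozen pretrained VLM,
        #            computed on the same images (gradients are blocked).
        pred = self.proj(vla_feats)
        return 1.0 - F.cosine_similarity(pred, vlm_feats.detach(), dim=-1).mean()

# Hypothetical combined objective during VLA fine-tuning:
# loss = action_loss + lambda_align * align_loss(vla_feats, vlm_feats)
```

The design intuition is that the action-prediction loss alone has no term anchoring visual features to their pretrained semantics, so a cheap auxiliary regularizer can preserve them without a second full forward model at deployment time.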

References

Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved.

Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization (2510.25616 - Kachaev et al., 29 Oct 2025) in Section 1: Introduction