
Generative VLM design decisions and their impact on encoder repurposing

Determine whether training and architectural decisions characteristic of generative Vision–Language Models hinder performance when these models are repurposed for representation learning as multimodal encoders for visual document retrieval.


Background

Recent work repurposes large generative vision–language models as multimodal encoders through contrastive post-training for retrieval. However, generative models typically employ decoder-oriented architectures and causal attention masks tailored to next-token prediction, which may be suboptimal for embedding tasks. The paper poses this uncertainty at the outset to motivate controlled experiments on attention masking, image resolution, modality alignment objectives, and late interaction mechanisms.
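
To make the attention-masking contrast concrete, the minimal PyTorch sketch below (illustrative only, not the paper's implementation; the single attention layer and mean pooling are assumptions) shows how a causal mask limits each token to earlier positions before pooling into an embedding, whereas a bidirectional encoder lets every token attend to the full sequence.

```python
import torch
import torch.nn.functional as F

def pooled_embedding(hidden: torch.Tensor, causal: bool) -> torch.Tensor:
    """One self-attention pass over `hidden`, then mean pooling into an embedding.

    causal=True: each position attends only to itself and earlier positions
    (the next-token-prediction setup of generative decoders).
    causal=False: every position attends to the whole sequence, as in a
    bidirectional encoder.
    """
    seq_len = hidden.size(1)
    mask = None
    if causal:
        # Boolean mask where True = "may attend"; lower triangle blocks future tokens.
        mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    attended = F.scaled_dot_product_attention(hidden, hidden, hidden, attn_mask=mask)
    return attended.mean(dim=1)  # one embedding vector per sequence

# Toy input: batch of 2 sequences, 8 tokens, 16-dim hidden states.
h = torch.randn(2, 8, 16)
causal_emb = pooled_embedding(h, causal=True)
bidir_emb = pooled_embedding(h, causal=False)
print(causal_emb.shape, bidir_emb.shape)  # torch.Size([2, 16]) in both cases
```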

Establishing whether these generative-oriented design choices impede representation quality when such models are converted into encoders is crucial for guiding model selection and training strategies in visual document retrieval, where compact and efficient encoders can rival much larger decoders.
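
One of the mechanisms under study, late interaction, scores a query against a document page by matching each query token embedding to its most similar page patch embedding. The sketch below is a generic ColBERT-style MaxSim scorer under assumed tensor shapes, not the paper's exact retrieval pipeline.

```python
import torch
import torch.nn.functional as F

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim late interaction.

    query_emb: (num_query_tokens, dim) token-level query embeddings
    doc_emb:   (num_doc_patches, dim) patch-level document-page embeddings
    Returns a scalar relevance score: each query token is matched to its most
    similar document patch, and the per-token maxima are summed.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    sim = q @ d.T                        # (num_query_tokens, num_doc_patches)
    return sim.max(dim=-1).values.sum()

# Toy example: 6 query tokens, 100 document patches, 128-dim embeddings.
score = late_interaction_score(torch.randn(6, 128), torch.randn(100, 128))
print(score.item())
```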

References

While this approach capitalizes on the inherent capabilities of the backbone generative model to address complex visual tasks, it is unclear whether the design decisions made for generative purposes hinder the models when repurposed for representation learning.

ModernVBERT: Towards Smaller Visual Document Retrievers (Teiletche et al., arXiv:2510.01149, 1 Oct 2025), Introduction (Section 1)