Generative VLM design decisions and their impact on encoder repurposing
Determine whether the training and architectural decisions characteristic of generative Vision–Language Models (VLMs) hinder performance when these models are repurposed as multimodal encoders for representation learning in visual document retrieval.
References
While this approach capitalizes on the inherent capabilities of the backbone generative model to address complex visual tasks, it is unclear whether the design decisions made for generative purposes hinder the models when repurposed for representation learning.
— ModernVBERT: Towards Smaller Visual Document Retrievers
(2510.01149 - Teiletche et al., 1 Oct 2025) in Introduction (Section 1)