Exploration of visual priors for dynamic modalities (video)

Investigate whether and how the visual priors acquired during language-only pre-training extend to dynamic modalities such as video understanding; in particular, determine how different textual sources contribute to priors that support temporal reasoning, action recognition, and causal inference in videos.

Background

The paper systematically studies how LLMs trained only on text acquire visual priors that benefit multimodal tasks, decomposing these priors into perception and reasoning components and analyzing their data origins and scaling behaviors. All empirical evaluations and analyses focus on static images.

The authors explicitly note that their paper does not address dynamic modalities such as video. They highlight the need to understand how language-only pre-training might induce visual priors for temporal phenomena, suggesting that certain textual sources (e.g., story-like literature) could be especially relevant for learning the temporal reasoning and causality required for video understanding.

References

Finally, our study is confined to the domain of static images, leaving the exploration of visual priors for dynamic modalities, such as video understanding, as an open question.

Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training (2509.26625 - Han et al., 30 Sep 2025) in Section 7: Limitations and Future Research Directions