Exploration of visual priors for dynamic modalities (video)
Investigate whether and how visual priors acquired during language-only pre-training extend to dynamic modalities such as video; specifically, determine how much different sources of pre-training text contribute to the priors that support temporal reasoning, action recognition, and causal reasoning in videos.
References
Finally, our study is confined to the domain of static images, leaving the exploration of visual priors for dynamic modalities, such as video understanding, as an open question.
— Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
(2509.26625 - Han et al., 30 Sep 2025) in Section 7: Limitations and Future Research Directions