Effective prompting and output formatting for baseline models on video spatio-temporal pointing
Develop a prompting and output-format formulation that enables baseline video–language models (e.g., GPT-5, Gemini 3/2.5, Qwen3-VL) to achieve very strong performance on the Molmo2-VideoPoint spatio-temporal pointing benchmark, which requires returning precise timestamps and pixel locations for objects or events across video frames.
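One direction such a formulation could take is sketched below: a prompt that states the frame geometry explicitly and requests a strict JSON point schema, plus a tolerant parser for the model's reply. The `{"t", "x", "y"}` schema, the helper names, and the validation rules are illustrative assumptions for this idea, not the benchmark's required format or any model's API.

```python
import json


def build_pointing_prompt(question: str, width: int, height: int, fps: float) -> str:
    """Assemble a hypothetical spatio-temporal pointing prompt.

    Assumption: the model is asked for a JSON array of
    {"t": seconds, "x": pixel column, "y": pixel row} objects, one per
    frame in which the target is visible. Coordinates are absolute
    pixels, so the frame size is stated up front.
    """
    return (
        f"The video is {width}x{height} pixels, sampled at {fps} frames per second. "
        f"{question} "
        "Answer with only a JSON array of objects, each of the form "
        '{"t": <timestamp in seconds>, "x": <pixel column>, "y": <pixel row>}, '
        "with one object per frame in which the target is visible. "
        "If the target never appears, answer with an empty array []."
    )


def parse_points(response: str, width: int, height: int):
    """Parse the model's reply into (t, x, y) tuples.

    Drops malformed or out-of-bounds points instead of failing, since
    baseline models often emit near-valid but imperfect JSON.
    """
    try:
        raw = json.loads(response)
    except json.JSONDecodeError:
        return []
    points = []
    for item in raw if isinstance(raw, list) else []:
        try:
            t, x, y = float(item["t"]), float(item["x"]), float(item["y"])
        except (KeyError, TypeError, ValueError):
            continue
        if t >= 0 and 0 <= x < width and 0 <= y < height:
            points.append((t, x, y))
    return points
```

A usage sketch: given a hypothetical reply `'[{"t": 1.5, "x": 320, "y": 180}, {"t": 2.0, "x": 900, "y": 50}]'` for a 640x360 video, `parse_points` keeps the first point and discards the second as out of bounds. Variants of the same scaffold (normalized 0-1 coordinates, frame indices instead of seconds, or bounding boxes instead of points) are the kinds of alternatives the task would compare.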
References
For Molmo2-VP, we carefully tune the prompts and try both point and bounding-box formats for our baseline models; however, we were unable to find a formulation that achieved very strong performance.
— Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
(2601.10611 - Clark et al., 15 Jan 2026) in Evaluation, Grounding results subsection (Video counting and pointing)