Effective prompting/formulation for baseline models on video spatio-temporal pointing

Develop a prompting and output-format formulation that enables baseline video–language models such as GPT-5, Gemini 3/2.5, and Qwen3-VL to achieve very strong performance on the Molmo2-VideoPoint spatio-temporal pointing benchmark, where the task requires returning precise timestamps and pixel locations for objects or events across video frames.
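Since the paper does not publish the winning output format, one plausible starting point is to constrain the model to a machine-parseable JSON schema of spatio-temporal points and to validate replies strictly. The prompt template, field names, and percentage-coordinate convention below are all illustrative assumptions, not the paper's formulation:

```python
import json

# Hypothetical prompt (assumption, not from the paper): request a bare JSON
# list of {t, x, y} points, with x/y as percentages of frame width/height
# so the format is resolution-independent.
EXAMPLE_PROMPT = (
    "Point to every instance of '{label}' in the video. "
    "Reply with only a JSON list, where each element is "
    '{{"t": <seconds>, "x": <0-100>, "y": <0-100>}}.'
)

def parse_points(reply: str):
    """Parse a model reply into (t, x, y) tuples.

    Malformed replies yield an empty list, so scoring can treat
    unparseable answers as zero predictions rather than crashing.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return []
    points = []
    for item in data if isinstance(data, list) else []:
        try:
            t, x, y = float(item["t"]), float(item["x"]), float(item["y"])
        except (TypeError, KeyError, ValueError):
            continue
        # Keep only points with valid percentage coordinates and timestamps.
        if t >= 0.0 and 0.0 <= x <= 100.0 and 0.0 <= y <= 100.0:
            points.append((t, x, y))
    return points
```

Strict validation matters here because partial credit is impossible once a reply fails to parse; a robust formulation should make malformed output rare rather than rely on lenient parsing.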

Background

The paper introduces Molmo2-VideoPoint (Molmo2-VP), an evaluation built by pairing annotated spatio-temporal points with segmentation masks produced by SAM 2, and measures F1, recall, and precision for video pointing. The authors compare Molmo2 models against strong proprietary and open-weight baselines.
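The exact matching protocol is not restated here, but a common scheme for mask-based pointing metrics is to count a predicted point as a true positive when it lands inside the SAM 2 mask of a not-yet-matched ground-truth instance on a temporally close frame. The sketch below uses that assumption, with a hypothetical timestamp tolerance `t_tol`:

```python
def score_points(preds, gt_masks, t_tol=0.5):
    """Greedy one-to-one matching of predicted points to ground-truth masks.

    preds:    list of (t, x, y) points, with x/y in pixel coordinates.
    gt_masks: list of (t, mask) pairs; mask is a 2D boolean grid
              (e.g. from SAM 2) for the annotated frame.
    A prediction is a true positive if its timestamp is within t_tol
    seconds of an unmatched mask's frame and the point falls inside
    that mask. Returns (precision, recall, f1).
    """
    matched = [False] * len(gt_masks)
    tp = 0
    for (t, x, y) in preds:
        for i, (gt_t, mask) in enumerate(gt_masks):
            if matched[i] or abs(t - gt_t) > t_tol:
                continue
            h, w = len(mask), len(mask[0])
            xi, yi = int(round(x)), int(round(y))
            if 0 <= yi < h and 0 <= xi < w and mask[yi][xi]:
                matched[i] = True
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gt_masks) if gt_masks else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

This kind of scorer also clarifies why formulation matters for baselines: both over-prediction (hurting precision) and missed or temporally offset points (hurting recall) are penalized, so the prompt must elicit complete yet non-redundant point sets.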

Despite careful prompt tuning and experiments with both point and bounding-box output formats, the authors report that no baseline achieved strong performance on Molmo2-VP. Gemini Pro 3.0 achieved the best baseline score (F1=20.0), but Molmo2 models substantially outperformed it (F1≈38–40). This leaves the formulation challenge unresolved for non-Molmo baselines.

References

For Molmo2-VP, we carefully tune the prompts and try both point and bounding-box formats for our baseline models; however, we were unable to find a formulation that achieved very strong performance.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding  (2601.10611 - Clark et al., 15 Jan 2026) in Evaluation, Grounding results subsection (Video counting and pointing)