
Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data (2509.03501v1)

Published 3 Sep 2025 in cs.CV, cs.AI, cs.HC, and cs.LG

Abstract: Next-generation AI companions must go beyond general video understanding to resolve spatial and temporal references in dynamic, real-world environments. Existing Video Large Language Models (Video LLMs), while capable of coarse-level comprehension, struggle with fine-grained, spatiotemporal reasoning, especially when user queries rely on time-based event references for temporal anchoring, or gestural cues for spatial anchoring to clarify object references and positions. To bridge this critical gap, we introduce Strefer, a synthetic instruction data generation framework designed to equip Video LLMs with spatiotemporal referring and reasoning capabilities. Strefer produces diverse instruction-tuning data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata, capturing rich spatial and temporal information in a structured manner, including subjects, objects, their locations as masklets, and their action descriptions and timelines. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions. Without using proprietary models, costly human annotation, or the need to annotate large volumes of new videos, experimental evaluations show that models trained with data produced by Strefer outperform baselines on tasks requiring spatial and temporal disambiguation. Additionally, these models exhibit enhanced space-time-aware reasoning, establishing a new foundation for perceptually grounded, instruction-tuned Video LLMs.


Summary

  • The paper presents a modular data engine that generates 947,854 spatiotemporally grounded instruction-response pairs to boost fine-grained video reasoning.
  • It leverages a combination of Video LLMs and pixel-level models like GroundingDINO and SAM2 to create detailed masklets and temporal annotations.
  • Results show consistent benchmark improvements on spatial-temporal tasks, while indicating that future work is needed for occlusion handling and long-range dependencies.

Strefer: Synthetic Instruction Data for Space-Time Referring and Reasoning in Video LLMs

Motivation and Problem Statement

The Strefer framework addresses a critical gap in Video LLMs: the inability to resolve fine-grained spatial and temporal references in dynamic, real-world video environments. While recent Video LLMs demonstrate progress in general video understanding, they lack the granularity to track and reason about specific object states, movements, and temporal relations, especially when user queries involve explicit space-time references (e.g., "What is the man in the red shirt doing at 00:01:23?"). This limitation is primarily due to the scarcity of instruction-tuning data that is both object-centric and spatiotemporally grounded.

Strefer Data Engine: Architecture and Pipeline

Strefer introduces a modular, fully-automated data engine for generating synthetic instruction-response pairs that are richly grounded in both space and time. The pipeline orchestrates multiple open-source, pre-trained models—Video LLMs, LLMs, and pixel-level vision foundation models (e.g., RexSeek, GroundingDINO, SAM2)—to pseudo-annotate videos with temporally dense, object-centric metadata. This metadata includes active entities, their locations as masklets (segmentation masks tracked over time), and their action descriptions and timelines.

Figure 1: Example of Strefer-annotated instruction-response pairs and video metadata, showing masklets and action timelines for multiple entities in complex video scenarios.
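
The exact serialization of this metadata is not given in the summary; the following is a minimal sketch of what the per-video pseudo-annotations could look like, assuming masklets are stored as per-frame binary masks and action timelines as timestamped segments (all field names are illustrative):

```python
from dataclasses import dataclass, field
from typing import Dict, List

import numpy as np


@dataclass
class Masklet:
    """Segmentation mask tracked over time: frame index -> binary mask (H x W)."""
    masks: Dict[int, np.ndarray] = field(default_factory=dict)


@dataclass
class ActionSegment:
    """An observed behavior of an entity within a time interval (seconds)."""
    start: float
    end: float
    description: str


@dataclass
class EntityAnnotation:
    """One active entity: referring expression, masklet, and action timeline."""
    referring_expression: str          # e.g. "the man in the red shirt" (illustrative)
    noun_category: str                 # e.g. "man"
    generalized_category: str          # e.g. "person"
    masklet: Masklet = field(default_factory=Masklet)
    actions: List[ActionSegment] = field(default_factory=list)


@dataclass
class VideoMetadata:
    """Temporally dense, object-centric pseudo-annotations for one video."""
    video_id: str
    duration_s: float
    entities: List[EntityAnnotation] = field(default_factory=list)
```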

The pipeline consists of the following key modules; a minimal orchestration sketch follows the list:

  • Entity Recognizer: Identifies all active entities in the video using a Video LLM, guided by explicit prompts to distinguish between similar entities.
  • Referring Parser: Extracts structured lists of entity-referring expressions, noun categories, and generalized categories from the recognizer's output.
  • Referring Masklet Generator: Generates masklets for each referring expression by combining GroundingDINO (for object detection with generalized nouns), SAM2 (for bidirectional tracking and segmentation), and RexSeek (for disambiguating multi-word referring expressions and assigning them to the correct masklets). This module robustly handles cases where entities are absent in the first frame or temporarily leave and re-enter the scene.

    Figure 2: Overview of the Referring Masklet Generation Pipeline, addressing multi-entity, occlusion, and re-entry challenges.

  • Video Clipper: Segments videos into semantically distinct clips using PySceneDetect and SigLIP-based hierarchical clustering, enabling temporally dense annotation.
  • Video Transcriber: Generates behavior-centric descriptions for each entity in each clip, focusing on explicit, observable dynamics.
  • Instruction Data Generator: Synthesizes instruction-style QA pairs using both templates and LLMs, covering a diverse set of spatiotemporal reasoning tasks. Questions are designed to require masklet and/or timestamp references, and language-based references are replaced with pronouns or generic terms to enforce reliance on visual grounding.
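
The sketch below shows one plausible way these modules could be chained for a single video. Every stage is a placeholder callable: the real interfaces of the Video LLM, GroundingDINO, SAM2, RexSeek, PySceneDetect, and SigLIP are not shown in this summary and are assumed here, as are the dictionary keys produced by the parser.

```python
from typing import Callable, Dict, List


def annotate_video(
    video_path: str,
    recognize_entities: Callable[[str], str],          # Video LLM: list active entities
    parse_references: Callable[[str], List[Dict]],     # LLM: structured referring expressions
    detect_boxes: Callable[[str, str], List],          # GroundingDINO-style open-vocab detector
    track_masklet: Callable[[str, object], object],    # SAM2-style bidirectional tracker
    assign_expression: Callable[[str, List], object],  # RexSeek-style expression-to-track matcher
    split_into_clips: Callable[[str], List],           # PySceneDetect + SigLIP clustering
    transcribe_clip: Callable[[object, object], str],  # behavior-centric per-entity description
    generate_qa: Callable[..., List[Dict]],            # template/LLM QA synthesis
) -> List[Dict]:
    """Sketch of a Strefer-style data engine: one video in, instruction QA pairs out.

    All stage callables are placeholders; this is an illustration of the module
    order described above, not the paper's implementation.
    """
    entity_text = recognize_entities(video_path)              # Entity Recognizer
    references = parse_references(entity_text)                # Referring Parser

    masklets = []                                             # Referring Masklet Generator
    for ref in references:
        boxes = detect_boxes(video_path, ref["generalized_category"])  # keys assumed
        tracks = [track_masklet(video_path, box) for box in boxes]
        masklets.append(assign_expression(ref["expression"], tracks))

    qa_pairs: List[Dict] = []
    for clip in split_into_clips(video_path):                 # Video Clipper
        timelines = [transcribe_clip(clip, m) for m in masklets]      # Video Transcriber
        qa_pairs.extend(generate_qa(clip, masklets, timelines))       # Instruction Data Generator
    return qa_pairs
```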

Data Composition and Task Coverage

Strefer generated 947,854 instruction-response pairs from only 4,253 NExT-QA videos (averaging 40 seconds, up to 3 minutes long). The data is organized into 8 groups covering 11 distinct question task types (an illustrative pair is sketched below, after Figure 3), including:

  • Mask-referred regional description and QA
  • Timestamp-referred video QA
  • Temporal ordering and event sequencing
  • Entity behavior captioning within specific time intervals
  • Multi-entity and multi-masklet reasoning

    Figure 3: Data composition of the final Strefer recipe, illustrating the diversity of mask/timestamp-referred and template/LLM-generated tasks.
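
The summary does not show the exact format of a generated pair; the following is a hypothetical serialization of one mask- and timestamp-referred QA item, with field names and values chosen purely for illustration:

```python
# Hypothetical serialization of one Strefer-style instruction-response pair.
# Field names, IDs, and the answer text are illustrative, not the paper's schema.
example_pair = {
    "video_id": "nextqa_0001",                 # placeholder ID
    "task_type": "timestamp_referred_qa",      # one of the 11 task types (name assumed)
    "question": "What is <region_1> doing between 00:00:05 and 00:00:12?",
    "references": {
        "region_1": {"type": "masklet", "masklet_id": 3},   # spatial reference
        "time": {"start_s": 5.0, "end_s": 12.0},            # temporal reference
    },
    # Language-based references are replaced by pronouns or generic terms
    # ("it", "this person") so the model must rely on the masklet, not text cues.
    "answer": "It picks up the cup from the table and hands it to the child.",
}
```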

Model Integration and Training

Strefer data can be used to instruction-tune general-purpose Video LLMs with or without architectural modifications:

  • With Architectural Enhancements: Plug-and-play modules (Region-Language Connector, Timestamp Conversion) are added to support fine-grained masklet and timestamp comprehension. The spatiotemporal object encoder from VideoRefer and temporal token learning from GroundedLLM are incorporated.
  • Without Architectural Changes: Visual prompting methods such as SoM (mask overlay) and NumberIt (frame ID overlay) are explored for training-free adaptation.
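
As a rough illustration of the training-free route, the sketch below overlays numbered, color-tinted masks on a frame in the spirit of SoM-style visual prompting. It uses OpenCV and NumPy; the mask layout (H x W boolean arrays) and label placement at mask centroids are assumptions, not the paper's prompting code.

```python
from typing import List

import cv2
import numpy as np


def overlay_set_of_marks(frame: np.ndarray, masks: List[np.ndarray],
                         alpha: float = 0.45) -> np.ndarray:
    """SoM-style visual prompt: tint each entity's mask and stamp its numeric ID.

    `frame` is an H x W x 3 BGR image; each mask is an H x W boolean array.
    Illustrative sketch only.
    """
    out = frame.copy()
    rng = np.random.default_rng(0)
    for idx, mask in enumerate(masks, start=1):
        color = rng.integers(60, 255, size=3).tolist()        # per-entity color
        tint = np.zeros_like(out)
        tint[mask] = color
        blended = (alpha * tint + (1 - alpha) * out).astype(out.dtype)
        out = np.where(mask[..., None], blended, out)          # tint masked pixels only
        ys, xs = np.nonzero(mask)
        if len(xs):                                            # label at mask centroid
            cx, cy = int(xs.mean()), int(ys.mean())
            cv2.putText(out, str(idx), (cx, cy), cv2.FONT_HERSHEY_SIMPLEX,
                        1.0, (255, 255, 255), 2, cv2.LINE_AA)
    return out
```

NumberIt-style prompting would analogously stamp a frame index onto each sampled frame instead of marking regions.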

The model is fine-tuned on a base recipe (BLIP-3-Video + VideoRefer-700K) with Strefer data added; the visual encoder is kept frozen due to data limitations. Training uses 32 frames per video and 32 temporal tokens, and the 4B-parameter model is trained on 3×8 H200 GPUs for one day.
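The summary does not detail the Timestamp Conversion scheme; one plausible mapping, assuming temporal tokens correspond to uniformly spaced segments matching the 32-frame / 32-token setup above, is sketched here for illustration:

```python
def timestamp_to_token_index(t_seconds: float, duration_s: float,
                             num_temporal_tokens: int = 32) -> int:
    """Map an absolute timestamp to the index of the nearest temporal token.

    Assumes each token covers duration_s / num_temporal_tokens seconds of video.
    Illustrative only; not the paper's actual Timestamp Conversion module.
    """
    t = min(max(t_seconds, 0.0), duration_s)                 # clamp to video bounds
    idx = int(t / duration_s * num_temporal_tokens) if duration_s > 0 else 0
    return min(idx, num_temporal_tokens - 1)


# Example: "00:01:23" (83 s) in a 180 s video falls on temporal token 14.
assert timestamp_to_token_index(83.0, 180.0) == 14
```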

Quantitative and Qualitative Results

Strefer-trained models consistently outperform baselines on benchmarks requiring spatial and temporal disambiguation:

  • VideoRefer-BenchD (Mask-Referred Regional Description): Average score increases from 3.28 (baseline) to 3.39 (Strefer final recipe).
  • VideoRefer-BenchQ (Mask-Referred Regional QA): Accuracy improves from 0.665 to 0.688.
  • QVHighlights (Timestamp-Referred QA): Accuracy rises from 0.529 to 0.603.
  • TempCompass and VideoMME (Temporal Reasoning): Notable improvements, even though Strefer data is from short videos.

Ablation studies reveal that the inclusion of mask/timestamp-referred queries and negative questions is critical for performance. However, adding event sequencing data can trade off fine-grained spatial-temporal understanding for higher-level temporal abstraction.

Figure 4: Qualitative comparison on VideoRefer-BenchD. The Strefer-trained model correctly identifies the table, while the baseline misinterprets the mask as a person.

Figure 5: The Strefer-trained model demonstrates superior timestamp-based reasoning on QVHighlights, accurately localizing and describing events.

Additional qualitative results show that Strefer-trained models mitigate foreground and center bias, handle multi-entity disambiguation, and exhibit improved fine-grained action understanding.

Limitations and Future Directions

Strefer's synthetic data is not error-free; pseudo-annotations may miss occluded entities or contain hallucinations from underlying models. The masklet generator, while robust, still struggles with heavy motion blur and long-range dependencies. The modular pipeline's reliance on multiple models may pose reproducibility challenges for resource-constrained groups, but components can be swapped for efficiency.

Current models are limited to mask-based spatial referring; future work should extend to points, boxes, and scribbles, which can be derived from masks. Output-level spatiotemporal grounding remains an open challenge, requiring higher-fidelity annotations. Systematic optimization of data composition and leveraging larger LLM backbones are promising directions.

Implications and Outlook

Strefer establishes a scalable methodology for generating high-utility, spatiotemporally grounded instruction data without manual annotation or proprietary models. This enables Video LLMs to resolve complex, real-world queries involving object-centric events and temporal references, a prerequisite for advanced applications in navigation, surveillance, and interactive robotics.

The framework's modularity ensures that as vision and language foundation models improve, Strefer's data quality and coverage will correspondingly advance. The approach also highlights the importance of data specificity and task diversity over sheer volume for effective instruction tuning.

Conclusion

Strefer provides a principled, scalable solution for equipping Video LLMs with fine-grained space-time referring and reasoning capabilities. By synthesizing instruction-response pairs grounded in dense, object-centric video metadata, Strefer-trained models achieve superior performance on spatial and temporal disambiguation tasks. This work lays the foundation for perceptually grounded, instruction-tuned Video LLMs capable of sophisticated, context-aware reasoning in dynamic environments. Future research should focus on output-level grounding, broader spatial reference modalities, and systematic data mixture optimization to further enhance generalization and robustness.
