- The paper introduces explicit spatio-temporal grounding to address video reasoning limitations by pinpointing precise timestamps and spatial regions.
- It details a two-stage methodology: supervised fine-tuning on curated spatio-temporal datasets, followed by reinforcement learning with GSPO under a composite reward covering answer accuracy, temporal alignment, and spatial precision.
- Experimental results reveal significant gains in temporal and spatial grounding, boosting interpretability and overall performance on video QA benchmarks.
Open-o3 Video: Explicit Spatio-Temporal Grounding for Video Reasoning
Motivation and Problem Statement
The Open-o3 Video framework addresses a critical limitation in contemporary video reasoning models: the lack of explicit spatio-temporal grounding in their reasoning outputs. While prior models typically generate textual rationales, they fail to indicate the precise timestamps and spatial regions that constitute the supporting evidence for their predictions. This deficiency impedes interpretability, verifiability, and fine-grained reasoning, especially in dynamic and cluttered video scenes. The paper identifies two principal obstacles: (1) the absence of high-quality datasets providing joint spatio-temporal supervision, and (2) the challenge of training models to localize objects in both time and space with high precision.
Figure 1: Open-o3 Video integrates explicit spatio-temporal grounding into the reasoning process, highlighting key timestamps and object regions that directly support the answer.
Data Construction and Annotation Pipeline
To overcome the data bottleneck, the authors curate two complementary datasets: STGR-CoT-30k for supervised fine-tuning (SFT) and STGR-RL-36k for reinforcement learning (RL). These datasets combine existing temporal-only and spatial-only resources with 5.9k newly annotated high-quality spatio-temporal samples. Each instance includes a question-answer pair, timestamped key frames, localized bounding boxes, and a structured chain of thought linking visual evidence to reasoning steps.
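The released data format is not reproduced in this summary, but a single spatio-temporal instance can be pictured roughly as the Python record below; all field names and values are illustrative placeholders, not the datasets' actual schema.

```python
# Hypothetical sketch of one STGR training instance; keys and values are
# illustrative only and do not reflect the released dataset schema.
stgr_example = {
    "question": "What does the person in the red jacket pick up?",
    "answer": "A blue umbrella.",
    "evidence": [
        {
            "timestamp": 12.4,                 # key-frame time in seconds
            "object": "person in red jacket",  # entity named in the reasoning
            "bbox": [0.32, 0.18, 0.55, 0.79],  # normalized [x1, y1, x2, y2]
        },
        {
            "timestamp": 13.1,
            "object": "blue umbrella",
            "bbox": [0.48, 0.40, 0.61, 0.66],
        },
    ],
    # Structured chain of thought linking each reasoning step to the
    # timestamps and boxes above.
    "chain_of_thought": (
        "At 12.4s the person in the red jacket reaches toward the stand; "
        "at 13.1s they lift the blue umbrella, so the answer is a blue umbrella."
    ),
}
```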
The annotation pipeline leverages Gemini 2.5 Pro for initial annotation, followed by bounding box filtering and self-consistency checking. Noisy or uninformative boxes are removed, and object-box associations are validated using Qwen2.5-VL-7B. Consistency checks ensure that every entity mentioned in the reasoning chain is matched with a valid box and timestamp.
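A minimal sketch of what the box-filtering and consistency pass might look like is shown below, assuming the record layout above. The function, thresholds, and the `cot_entities` field are assumptions for illustration; the `verifier` callable stands in for the Qwen2.5-VL-7B association check described above.

```python
def filter_evidence(sample, verifier, min_area=1e-4):
    """Sketch of the filtering/consistency pass (not the authors' code):
    drop degenerate boxes, keep only object-box pairs the verifier confirms,
    and reject the sample if any reasoning-chain entity loses its evidence."""
    kept = []
    for ev in sample["evidence"]:
        x1, y1, x2, y2 = ev["bbox"]
        if (x2 - x1) * (y2 - y1) < min_area:
            continue  # collapsed or uninformative box
        if not verifier(ev["object"], ev["timestamp"], ev["bbox"]):
            continue  # object-box association rejected by the checker model
        kept.append(ev)

    # Consistency check: every entity mentioned in the chain of thought must
    # still be matched with at least one validated box and timestamp.
    entities = {ev["object"] for ev in kept}
    missing = [e for e in sample.get("cot_entities", []) if e not in entities]
    return dict(sample, evidence=kept) if kept and not missing else None
```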
Figure 2: The data construction pipeline combines automated annotation, filtering, and consistency checking to produce balanced spatio-temporal supervision across temporal, spatial, spatio-temporal, and general QA categories.
Model Architecture and Training Paradigm
Open-o3 Video builds upon Qwen2.5-VL-7B and employs a two-stage training paradigm:
- Cold-Start Initialization: Supervised fine-tuning on STGR-CoT-30k equips the model with basic spatio-temporal grounding and structured reasoning output. This stage mitigates reward sparsity and stabilizes subsequent RL optimization.
- Reinforcement Learning with GSPO: The model is further optimized using Group Sequence Policy Optimization (GSPO), which operates at the sequence level, aligning optimization with sequence-level rewards and stabilizing long-horizon training. The composite reward function incorporates answer accuracy, temporal alignment, and spatial precision (a minimal sketch of the sequence-level objective is given below).
Figure 3: The two-stage training paradigm: (a) cold-start initialization for structured outputs; (b) RL with composite rewards, adaptive temporal proximity, and temporal gating.
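The paper's training code is not shown in this summary; the sketch below illustrates a GSPO-style sequence-level objective with a group-normalized reward and a length-normalized, clipped importance ratio. It assumes summed log-probabilities per sampled response are already available and glosses over batching details.

```python
import torch

def gspo_loss(logp_new, logp_old, rewards, lengths, clip_eps=0.2):
    """Sketch of a GSPO-style sequence-level objective (not the paper's code).

    logp_new, logp_old: summed log-probs of each sampled response under the
        current and behaviour policies, shape [G] for a group of G responses.
    rewards: composite reward per response (accuracy + thinking + format), [G].
    lengths: response lengths in tokens, [G], used for length normalization.
    """
    # Group-normalized advantage: each response is scored against its group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Sequence-level, length-normalized importance ratio:
    # s_i = exp((log pi_new(y_i) - log pi_old(y_i)) / |y_i|).
    ratio = torch.exp((logp_new - logp_old) / lengths)

    # Clipped surrogate applied at the sequence level.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```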
Reward Design
- Accuracy Reward: Task-specific, using exact match for MCQ, ROUGE for free-form QA, visual IoU for spatial grounding, and temporal IoU for temporal grounding.
- Thinking Reward: Sum of temporal and spatial terms. The temporal reward uses adaptive temporal proximity, annealing the precision requirement over training to avoid reward sparsity. The spatial reward employs temporal gating, computing IoU only when the temporal prediction is sufficiently accurate (both mechanisms are sketched after this list).
- Format Reward: Enforces strict usage of structured reasoning formats.
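The exact reward formulas are not reproduced here. The sketch below shows one plausible shape for the thinking reward: a Gaussian temporal-proximity term whose width is annealed over training, plus a spatial IoU term gated on temporal accuracy. The Gaussian form, thresholds, and annealing schedule are assumptions, not the paper's definitions.

```python
import math

def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def thinking_reward(t_pred, t_gt, box_pred, box_gt, step, total_steps,
                    sigma_start=3.0, sigma_end=0.5, gate=1.0):
    """Illustrative thinking reward (temporal + gated spatial terms).

    Temporal term: Gaussian proximity whose width sigma anneals from
    sigma_start to sigma_end seconds, so early training tolerates coarse
    timestamps and later training demands precision.
    Spatial term: box IoU, computed only when the predicted timestamp lies
    within `gate` seconds of the ground truth (temporal gating).
    """
    frac = min(step / max(total_steps, 1), 1.0)
    sigma = sigma_start + (sigma_end - sigma_start) * frac
    r_temporal = math.exp(-((t_pred - t_gt) ** 2) / (2 * sigma ** 2))
    r_spatial = box_iou(box_pred, box_gt) if abs(t_pred - t_gt) <= gate else 0.0
    return r_temporal + r_spatial
```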
Experimental Results
Open-o3 Video is evaluated on V-STAR, VideoMME, WorldSense, VideoMMMU, and TVGBench. On V-STAR, it achieves state-of-the-art performance, surpassing GPT-4o and improving over its Qwen2.5-VL-7B baseline by +14.4% mAM and +24.2% mLGM. The model demonstrates robust gains in temporal and spatial grounding, as well as general video QA and perception-oriented tasks.
Qualitative Analysis
Open-o3 Video provides explicit spatio-temporal evidence in its reasoning traces, enabling confidence-aware test-time scaling. By generating multiple responses with spatio-temporal traces and scoring the relevance of visual regions, the model supports confidence-weighted voting, which outperforms naive majority voting and improves inference reliability.
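How each trace's confidence is scored is model-specific and omitted here; the sketch below only illustrates the confidence-weighted voting step, assuming every sampled answer already carries a relevance score derived from its spatio-temporal evidence.

```python
from collections import defaultdict

def confidence_weighted_vote(candidates):
    """Pick an answer from multiple sampled responses by weighting each vote
    with a confidence score derived from its spatio-temporal trace.

    candidates: list of (answer, confidence) pairs, where `confidence` is,
    e.g., the scored relevance of the cited regions/timestamps.
    """
    scores = defaultdict(float)
    for answer, confidence in candidates:
        scores[answer] += confidence  # weighted vote, not a plain count
    return max(scores, key=scores.get)

# Toy example: the grounded evidence behind "umbrella" is judged far more
# relevant, so it wins despite losing the 2-vs-1 raw majority.
print(confidence_weighted_vote([("bag", 0.31), ("bag", 0.28), ("umbrella", 0.92)]))
```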
Figure 4: Confidence-aware test-time scaling leverages spatio-temporal traces for evidence relevance scoring and robust prediction selection.
Figure 5: On appearance perception tasks, Open-o3 Video provides correct answers and explicit spatio-temporal evidence, enhancing interpretability.
Figure 6: For action recognition, the model precisely localizes both the time and location of the action, outperforming Video-R1.
Figure 7: In weather reasoning, Open-o3 Video identifies more effective supporting evidence compared to related models.
Ablation Studies
Ablation experiments confirm the effectiveness of the two-stage training strategy, GSPO over GRPO, and the necessity of high-quality spatio-temporal annotations. Adaptive temporal proximity and temporal gating in the reward function are shown to be critical for coupling temporal and spatial grounding. The inclusion of curated spatio-temporal data yields substantial improvements in grounding metrics.
Implications and Future Directions
Open-o3 Video demonstrates that explicit spatio-temporal grounding is feasible and beneficial for video reasoning models. The framework enhances interpretability, verifiability, and robustness, providing a foundation for evidence-aware inference and scalable test-time strategies. The results suggest that future research should focus on extending spatio-temporal grounding to longer and more complex videos, integrating audio modalities, and developing richer supervision for fine-grained object localization.
Conclusion
Open-o3 Video presents a unified framework for grounded video reasoning, generating explicit spatio-temporal evidence through timestamped frames and localized bounding boxes. The combination of curated data, structured annotation, and GSPO-based RL yields substantial improvements in answer accuracy, temporal alignment, and spatial grounding. The approach sets a new standard for interpretable and verifiable video reasoning, with broad implications for multimodal AI systems.