ST-Bench: Spatio-Temporal AI Benchmark
- ST-Bench is a comprehensive benchmark suite assessing spatio-temporal reasoning in AI across video, text, and 4D egocentric data.
- It evaluates tasks including kinematic motion, temporal sequencing, and geospatial QA with metrics like accuracy, MAE, and IoU.
- The benchmarks reveal both significant performance improvements and ongoing challenges in integrating spatial-temporal priors into modern models.
ST-Bench refers to a class of benchmarks that rigorously evaluate the spatio-temporal reasoning capability of AI models across various modalities, domains, and tasks. These benchmarks probe the extent to which models—particularly LLMs and Vision-LLMs (VLMs)—can understand, reason about, and compute over spatial and temporal structures in real-world data, with task formulations ranging from high-level motion analysis in videos to precision spatio-temporal data mining in text-based question answering. The term encompasses several prominent recent efforts, including ST-Bench in the context of kinematic video reasoning (Ko et al., 25 Mar 2025), Ego-ST Bench for egocentric 4D reasoning (Wu et al., 16 Mar 2025), and STBench for comprehensive spatio-temporal QA over geospatial data (Li et al., 27 Jun 2024).
1. Scope and Motivation
ST-Bench fills critical gaps left by prior evaluation suites that either lack temporal dynamics, restrict themselves to static or synthetic settings, or narrowly assess memorized spatial-temporal facts. Its central rationale is to provide a controlled, multi-faceted suite of question-answering (QA) tasks that necessitate integrated reasoning about the evolution of states or entities over time and space. By covering kinematic reasoning from realistic videos (including vehicle and human trajectories), egocentric 4D understanding, and large-scale spatio-temporal QA in tabular or textual form, ST-Bench supports a spectrum of research directions in dynamic scene understanding, AI for physical reasoning, and spatio-temporal data mining (Ko et al., 25 Mar 2025, Wu et al., 16 Mar 2025, Li et al., 27 Jun 2024).
2. Dataset Formalisms and Task Design
Kinematic ST-Bench for Video-Based VLMs
The ST-Bench developed in (Ko et al., 25 Mar 2025), sometimes referred to as STKit-Bench, is the first benchmark targeting explicit kinematic spatio-temporal reasoning in VLMs. It comprises:
- Videos and Annotations: 1,400 QA evaluation pairs (200 per task) drawn from real-world, multi-view RGB+3D datasets—covering autonomous driving (NuPlan, NuScenes, Argoverse2) and multi-agent sports (Ego-Exo4D), as well as pseudo-labeled video sources without native 3D.
- Sampling and Labeling: Videos sampled at 2 Hz across 20 s, producing sequences of 3D object centers, bounding boxes, and timestamps; both LiDAR/VIO-annotated and pseudo-labeled (via 4D geometric-sensor fusion).
- Kinematic Tasks (a minimal computation sketch follows this list)
- Traveled Distance: $d = \sum_{t=1}^{T-1} \lVert c_{t+1} - c_t \rVert_2$ over the object's 3D centers $c_t$ (meters)
- Speed: $v = d / \Delta t$, with $\Delta t = t_T - t_1$ the elapsed clip time
- Movement Direction: the heading of the net displacement, $\theta = \operatorname{atan2}(\Delta x, \Delta y)$, discretized into 12 clock sectors
- Comparisons: Pairwise queries on which object traveled farther/faster, or whether objects share a direction sector
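The following minimal sketch shows how these quantities could be computed from a per-object trajectory of 3D centers and timestamps. The array layout, the ground-plane axis convention, and the sector rounding rule are illustrative assumptions, not the benchmark's released tooling.

```python
import numpy as np

def kinematic_quantities(centers: np.ndarray, timestamps: np.ndarray):
    """Traveled distance (m), mean speed (m/s), and movement direction
    (clock sector 1-12) from a sequence of 3D object centers.

    centers: (T, 3) object centers in metric world coordinates (assumed layout).
    timestamps: (T,) frame timestamps in seconds (2 Hz sampling).
    """
    # Traveled distance: sum of Euclidean steps between consecutive centers.
    steps = np.linalg.norm(np.diff(centers, axis=0), axis=1)
    distance = float(steps.sum())

    # Mean speed over the clip: distance divided by elapsed time.
    speed = distance / float(timestamps[-1] - timestamps[0])

    # Movement direction: heading of the net ground-plane displacement,
    # discretized into twelve 30-degree clock sectors (12 = straight ahead).
    dx = centers[-1, 0] - centers[0, 0]
    dy = centers[-1, 1] - centers[0, 1]
    heading = np.degrees(np.arctan2(dx, dy)) % 360.0  # 0 deg = +y (assumed forward)
    sector = int(round(heading / 30.0)) % 12 or 12
    return distance, speed, sector
```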
Ego-centric ST-Bench for Multimodal LLM Reasoning
Ego-ST Bench (Wu et al., 16 Mar 2025) evaluates multimodal LLMs’ ability to reason about dynamic 4D scenes as experienced in egocentric video:
- Composition: 789 video clips (5–20 s each) drawn from a variety of public and self-collected datasets, with 5,000+ QA pairs spanning eight sub-tasks (forward/reverse, spatial, temporal, integrated).
- Task Types
- Landmark Description: Identifying static elements and sequence order (multiple choice)
- Action Description: Temporal sequencing of actions
- Direction Change: Multiple-choice directional inference
- Route Description: Open-ended path narration integrating turns, landmarks, and timing
Textual STBench for Spatio-Temporal Data Mining
STBench (Li et al., 27 Jun 2024) interrogates LLMs’ spatio-temporal QA and computational capabilities:
- Composition: Over 60,000 QA pairs across 13 diverse tasks, encompassing knowledge comprehension (e.g., POI categorization), spatio-temporal reasoning (e.g., point-region-trajectory inference), computation (direction, encounter counting), and non-trivial downstream applications (anomaly detection, trajectory classification, next-point prediction).
- Formal Definitions: All tasks precisely specify data structures (trajectories, points, polygons) and, where applicable, refer to explicit mathematical operations (e.g., azimuth, spatial-temporal intersection; an azimuth sketch follows this list).
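As a concrete instance of such an operation, the initial bearing (azimuth) between two geographic points can be computed with the standard spherical-Earth formula. This is generic geodesy code, not code from STBench itself.

```python
import math

def azimuth_deg(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Initial bearing from point 1 to point 2, in degrees clockwise from
    north, using the standard spherical-Earth formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(x, y)) % 360.0
```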
3. Annotation, Pseudo-Labeling, and Data Generation
Video-based ST-Bench
- 3D Label Sources: Real-world datasets employ ground-truth 3D annotation via LiDAR or VIO/SLAM systems.
- Pseudo-Labeling: For datasets lacking 3D, a multi-stage pipeline is applied:
- Geometric Reconstruction: MonST3R provides per-frame depth and poses (scale-ambiguous), refined by Metric3Dv2 for metric depth and canonical scaling.
- Semantic Understanding: Grounded-SAM2 segments and tracks moving objects.
- 2D→3D Lifting: 2D segmented masks are lifted into canonicalized 4D point clouds for barycenter extraction, with trajectories smoothed and filtered by mask confidence and bounding box size.
- Validity Constraints: Only trajectories passing mask confidence and minimum visible size thresholds are retained.
- Distance Metric: traveled distance is computed as $d = \sum_{t=1}^{T-1} \lVert b_{t+1} - b_t \rVert_2$ over the per-frame barycenters $b_t$ (a filtering-and-distance sketch follows).
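A minimal sketch of the validity filtering and the distance computation over surviving barycenters is shown below; the threshold values and argument names are hypothetical, chosen only to illustrate the constraints described above.

```python
import numpy as np

def filter_and_measure(barycenters, mask_confidences, bbox_areas,
                       conf_thresh=0.5, min_area=64):
    """Keep only trajectory points whose segmentation mask is confident and
    whose 2D bounding box is large enough to be reliably visible, then
    measure traveled distance over the surviving barycenters.

    barycenters: (T, 3) per-frame 3D barycenters of the lifted point cloud.
    mask_confidences / bbox_areas: (T,) per-frame quality signals.
    conf_thresh, min_area: hypothetical thresholds, not the paper's values.
    """
    keep = (np.asarray(mask_confidences) >= conf_thresh) & \
           (np.asarray(bbox_areas) >= min_area)
    traj = np.asarray(barycenters)[keep]
    if len(traj) < 2:
        return None  # trajectory fails the validity constraints
    # Traveled distance: sum of Euclidean steps between consecutive barycenters.
    return float(np.linalg.norm(np.diff(traj, axis=0), axis=1).sum())
```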
QA Generation for LLM Benchmarks
- Ego-ST Bench: Annotations are curated in both forward (video→answer) and reverse (end-state→path reconstruction) form; open-ended tasks are scored following structured rubrics.
- STBench (Textual): Most samples are generated through template instantiation combined with automation and crowdsourcing; system prompts structure the chat-based LLM evaluation (an illustrative template sketch follows this list).
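The sketch below illustrates template-based QA generation in the spirit described above; the template wording, field names, and task framing are invented for illustration and do not reproduce STBench's actual templates.

```python
# Hypothetical template instantiation for textual spatio-temporal QA;
# the template text and fields are illustrative, not STBench's own.
TEMPLATE = ("Trajectory A passes through the points {pts}. Did it enter "
            "region {region} between {t0} and {t1}? Answer yes or no.")

def make_qa(trajectory, region, t0, t1, label):
    """Instantiate one yes/no QA pair from structured spatio-temporal data."""
    pts = "; ".join(f"({lat:.4f}, {lon:.4f})" for lat, lon in trajectory)
    question = TEMPLATE.format(pts=pts, region=region, t0=t0, t1=t1)
    return {"question": question, "answer": "yes" if label else "no"}
```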
4. Evaluation Metrics and Experimental Protocols
Kinematic Video ST-Bench
- Traveled Distance & Speed: Accuracy (fraction of predictions falling within a fixed relative tolerance of the ground-truth value); MAE ($\frac{1}{N}\sum_i |\hat{y}_i - y_i|$, in m or km/h)
- Movement Direction: Accuracy (exact clock sector match); MAE in sector units
- Pairwise Comparisons: Yes/no accuracy
- Direction Timestamp: IoU-based scoring over predicted vs. ground-truth time intervals; a prediction counts as accurate when its temporal IoU exceeds a fixed threshold (minimal metric implementations follow this list)
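A minimal sketch of these three metrics, assuming scalar regression targets and [start, end] time intervals; the relative tolerance of 0.2 is an assumed placeholder, not the benchmark's published threshold.

```python
import numpy as np

def tolerance_accuracy(pred, gt, rel_tol=0.2):
    """Fraction of predictions within a relative tolerance of ground truth
    (rel_tol=0.2 is an assumed value, not necessarily the paper's)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.mean(np.abs(pred - gt) <= rel_tol * np.abs(gt)))

def mae(pred, gt):
    """Mean absolute error, in the task's native units (m or km/h)."""
    return float(np.mean(np.abs(np.asarray(pred, float) - np.asarray(gt, float))))

def temporal_iou(pred_span, gt_span):
    """Intersection-over-union of two [start, end] time intervals."""
    inter = max(0.0, min(pred_span[1], gt_span[1]) - max(pred_span[0], gt_span[0]))
    union = max(pred_span[1], gt_span[1]) - min(pred_span[0], gt_span[0])
    return inter / union if union > 0 else 0.0
```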
Ego-ST Bench
- Multiple Choice: Accuracy over the candidate answer options
- Open-Ended: Three sub-scores (directional, landmark, logical/semantic, each 0–5), with overall rubric-based aggregation
- Aggregated Metrics: one composite score averages the route-description and direction-change results; another averages across all eight subtasks (an illustrative aggregation sketch follows)
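The aggregation below is an illustrative reading of that scheme; the field names, the 0–5 rubric rescaling, and the exact weighting are assumptions rather than the benchmark's official formulas.

```python
def ego_st_scores(mc_accuracies, rubric_scores):
    """Aggregate Ego-ST-style scores (illustrative schema, not official).

    mc_accuracies: dict of per-subtask multiple-choice accuracies in [0, 1],
                   assumed to include a "direction_change" entry.
    rubric_scores: dict with 'directional', 'landmark', 'logical' sub-scores
                   (each 0-5) for the open-ended route-description task.
    """
    # Open-ended score: mean of the three rubric dimensions, rescaled to [0, 1].
    open_ended = sum(rubric_scores.values()) / (5.0 * len(rubric_scores))
    # One composite averages route description with direction change;
    # another averages across all subtasks.
    route_dir = (open_ended + mc_accuracies["direction_change"]) / 2.0
    overall = (sum(mc_accuracies.values()) + open_ended) / (len(mc_accuracies) + 1)
    return {"route_dir": route_dir, "overall": overall}
```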
Textual STBench
- Accuracy: Primary metric for all categorical/multiple-choice tasks
- Absolute Error: Mean haversine distance for trajectory prediction (TP); a standard haversine implementation is sketched after this list
- Experimental Methodology: Zero-shot, few-shot in-context learning, chain-of-thought prompting, and fine-tuning (QLoRA); model comparisons span both closed-source (GPT-4o, ChatGPT) and open-source (Llama, Gemma, Mistral, Qwen, etc.) systems
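Since trajectory-prediction error is reported as mean haversine distance, the standard formula is reproduced below; this is generic geodesy code, not taken from STBench.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in meters between two (lat, lon) points,
    via the standard haversine formula on a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlon / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))
```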
5. Model Performance and Insights
| Model | Kinematic ST-Bench Acc. | Ego-ST Bench | Textual STBench: Best Acc. |
|---|---|---|---|
| ST-VLM-7B (Ko et al., 25 Mar 2025) | 59.8% | NA | NA |
| GPT-4V (Ko et al., 25 Mar 2025) | 28.5% | NA | NA |
| Qwen2.5-VL-72B (Wu et al., 16 Mar 2025) | NA | 57.4 | NA |
| OpenAI-o3-mini (Wu et al., 16 Mar 2025) | NA | 41.8 | NA |
| GPT-4o (Li et al., 27 Jun 2024) | NA | NA | 95.9% (PCR), 91.9% (PRRD) |
| Gemma-7B (Li et al., 27 Jun 2024) | NA | NA | Up to 50% (PCR/PI), 90% (PRRD, r=2) |
- Kinematic ST-Bench: ST-VLM-7B achieves 59.8% accuracy, outperforming GPT-4V and other baselines by over 30 percentage points and reducing MAE in distance tasks (from ~33 m to 25 m). Improvements are broad across single-object and multi-object reasoning tasks.
- Ego-ST Bench: Open-source models can achieve 57.4% on integrated spatio-temporal tasks, with ST-R1 (CoT+GRPO) raising performance to 86.3%. Open-ended 4D reasoning is notably challenging.
- Textual STBench: Closed-source models dominate in knowledge comprehension and single-hop reasoning but underperform (often <15%) on multi-step spatial-temporal relations, accurate numeric computation, and downstream inference. Few-shot and chain-of-thought prompting yield significant gains for larger models; open-source models require fine-tuning to approach competitive performance.
6. Broader Impact and Future Directions
ST-Bench has catalyzed progress in spatio-temporal reasoning by identifying specific deficiencies and bottlenecks in contemporary models:
- Kinematic Comprehension: Integrating real 3D labels and pseudo-labels enables robust learning and transfer to dynamic scene understanding; ST-VLM’s broad generalization suggests that spatio-temporal priors enhance video reasoning well beyond kinematics (Ko et al., 25 Mar 2025).
- Integrated Multimodal Reasoning: Ego-centric, open-ended, and reverse reasoning tasks expose the limitations of LLMs and VLMs, driving the adoption of chain-of-thought supervision and specialized RL objectives (e.g., GRPO in ST-R1) (Wu et al., 16 Mar 2025).
- Spatio-Temporal Data Mining: Precise downstream applications—urban/safety analysis, anomaly detection, epidemiological spread—highlight the applied relevance of these benchmarks, while also revealing the arithmetic and logical reasoning limits of foundation models (Li et al., 27 Jun 2024).
Persistent challenges include multi-step spatial-temporal inference, precise geometric or arithmetic computation, and multimodal integration in unconstrained real-world environments. Recommended advancements include symbolic or differentiable augmentation, richer corpora, and balanced forward/reverse annotation strategies. A plausible implication is that future foundation models will require native integration of spatial-temporal structural priors and advanced curriculum learning to meet the holistic demands posed by ST-Bench.
7. References
- "ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-LLMs" (Ko et al., 25 Mar 2025)
- "ST-Think: How Multimodal LLMs Reason About 4D Worlds from Ego-Centric Videos" (Wu et al., 16 Mar 2025)
- "STBench: Assessing the Ability of LLMs in Spatio-Temporal Analysis" (Li et al., 27 Jun 2024)