GenTel-Bench: Geo-Temporal Benchmark for VLMs
- GenTel-Bench is a unified benchmark for geo-temporal reasoning in VLMs, integrating spatial map and temporal video cues.
- Built on 420 curated tasks from multi-camera datasets, it rigorously evaluates models’ performance in tracking, forecasting, and event ordering.
- The evaluation reveals significant challenges in temporal forecasting and map-video alignment, highlighting opportunities for advanced model improvements.
GenTel-Bench is a unified benchmark for geo-temporal reasoning in vision-language models (VLMs). It is designed to assess how well models jointly reason over spatial and temporal cues across multi-camera, multi-modal inputs, and it targets core real-world tasks such as tracking, forecasting, and event ordering in smart-city and surveillance applications. Its construction bridges the historically separate domains of map-based geographic intelligence and egocentric video/event analysis, providing a test bed for evaluating the spatial-temporal generalization capabilities of advanced VLMs (Xie et al., 9 Oct 2025).
1. Motivation and Problem Scope
Traditional benchmarks isolate either static geographic reasoning (e.g., transit maps) or sequential/event-centric video understanding. However, modern applications—ranging from autonomous traffic management to emergency response—require fusion of graphical map data with visual input from non-overlapping camera networks. GenTel-Bench is constructed to fill this critical evaluation gap: models must (a) switch between 2D cartographic and video perspectives, (b) integrate partial views from spatially distributed cameras, and (c) make inferences about spatial regions and temporal intervals not directly observed in any modality. The suite's central ambition is to support precise, multi-perspective benchmarking for next-generation city-scale spatial-temporal intelligence (Xie et al., 9 Oct 2025).
2. Dataset Construction and Annotation
GenTel-Bench comprises 420 curated tasks derived from 364 video clips sampled from two multi-camera tracking datasets:
- CityFlow (Outdoor): 31 cameras deployed across an urban landscape, separated by an average of ~985 meters; 40 unique vehicle identities form the basis for cross-scene queries.
- MTMMC (Indoor): 16 cameras mapped onto three building floors, inter-camera distance ~31 meters; 3,669 pedestrian identities.
For each item, the input bundle includes:
- A top-down map visualizing camera nodes, field-of-view (FoV) cones, connectivity, spatial scale, and orientation.
- Up to three short video clips (total ≤ 20 frames), each annotated with bounding boxes and nanosecond timestamps.
- Metadata from homography-based camera calibration (3×3 projective transform matrix per camera), enabling backprojection of 2D bounding box coordinates into real-world (map) space.
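The homography step above can be illustrated with a minimal sketch. It assumes the 3×3 matrix maps image pixels to map coordinates and that the bottom-center of a bounding box touches the ground plane; neither convention is confirmed by the source. The function names and the matrix `H_example` are hypothetical.

```python
from typing import List, Tuple

Matrix3 = List[List[float]]


def project_to_map(H: Matrix3, u: float, v: float) -> Tuple[float, float]:
    """Apply a 3x3 projective transform to pixel (u, v) and dehomogenize."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    return x / w, y / w


def bbox_ground_point(x_min: float, y_min: float,
                      x_max: float, y_max: float) -> Tuple[float, float]:
    """Bottom-center of the box (assumption: this pixel lies on the ground)."""
    return (x_min + x_max) / 2.0, y_max


# Hypothetical calibration: 0.05 map units per pixel plus a translation.
H_example: Matrix3 = [
    [0.05, 0.0, 10.0],
    [0.0, 0.05, 20.0],
    [0.0, 0.0, 1.0],
]

u, v = bbox_ground_point(100, 50, 140, 200)   # pixel (120.0, 200)
map_xy = project_to_map(H_example, u, v)      # approximately (16.0, 30.0)
```

A real calibration matrix generally has a non-trivial bottom row, which is why the division by `w` is kept even though it equals 1 in this toy case.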
Trajectory preprocessing includes outlier removal and linear interpolation. Each query is paired with a textual "Motion Summary" synthesized by an LLM. Distractor options for multiple-choice tasks are generated by randomized sampling of alternate cameras or path/room choices, which ensures an even distribution of options and avoids spurious patterns.
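The preprocessing described above can be sketched as follows. The source does not say which outlier criterion is used, so this example assumes a simple speed threshold between consecutive map-space samples; `remove_speed_outliers`, `interpolate_position`, and the 5 m/s limit are illustrative choices, not the benchmark's actual pipeline.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # (t_seconds, x_m, y_m)


def remove_speed_outliers(track: List[Point], max_speed: float) -> List[Point]:
    """Drop samples whose implied speed from the last kept sample is too high.

    Assumes the first sample is trustworthy and timestamps are sorted.
    """
    if not track:
        return []
    kept = [track[0]]
    for t, x, y in track[1:]:
        t0, x0, y0 = kept[-1]
        dt = t - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        speed = ((x - x0) ** 2 + (y - y0) ** 2) ** 0.5 / dt
        if speed <= max_speed:
            kept.append((t, x, y))
    return kept


def interpolate_position(track: List[Point], t_query: float) -> Tuple[float, float]:
    """Linearly interpolate (x, y) at t_query; timestamps must be increasing."""
    if not track or not (track[0][0] <= t_query <= track[-1][0]):
        raise ValueError("t_query lies outside the observed time span")
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        if t0 <= t_query <= t1:
            alpha = (t_query - t0) / (t1 - t0)
            return x0 + alpha * (x1 - x0), y0 + alpha * (y1 - y0)
    return track[-1][1], track[-1][2]  # single-sample track


raw = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (2.0, 100.0, 0.0), (3.0, 6.0, 0.0)]
cleaned = remove_speed_outliers(raw, max_speed=5.0)  # drops the jump at t=2
filled = interpolate_position(cleaned, 2.0)          # (4.0, 0.0)
```

Interpolation runs on the cleaned track, so the removed sample at t=2 is replaced by a point on the straight line between its neighbors.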
3. Task Suite and Formal Definitions
GenTel-Bench organizes queries into seven well-defined tasks split between basic and combinatorial reasoning:
| Level | Task Name | Description |
|---|---|---|
| Basic (MCQ) | Geo-Location (GL) | Identify intermediate camera(s) traversed by a target |
| Basic (MCQ) | Arrival Time-Interval (ATI) | Infer time interval for target arrival at a camera |
| Basic (MCQ) | Motion-State (MS) | Describe speed and direction over an intermediate segment |
| Combinatorial | Causal Reordering (CR) | Chronologically order a shuffled sequence of video clips |
| Combinatorial | Next Spot Forecasting (NSF) | Predict next camera and timestamp for subsequent observation |
| Combinatorial | Trajectory Forecasting (TF) | Forecast subsequent cameras and time intervals |
| Combinatorial | Multi-Target Trajectory Forecasting (MTTF) | Predict meeting point and time for two targets |
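Let a query be $q = (V, M, T)$, where $V$ is a set of video clips (with per-frame timestamps), $M$ the map context, and $T$ a global interval. The answer to an MCQ task is a single option $a$ from the candidate set, and MCQ accuracy over $N$ queries is

$$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{a}_i = a_i^{*}\right].$$

For predictive (spatio-temporal) tasks, spatial-temporal Intersection over Union (ST-IoU) is used:

$$\mathrm{ST\text{-}IoU} = \mathbb{1}\left[\hat{c} = c^{*}\right] \cdot \frac{|\hat{I} \cap I^{*}|}{|\hat{I} \cup I^{*}|},$$

where $\hat{c}$, $\hat{I}$ are the predicted camera and interval, and $c^{*}$, $I^{*}$ are the ground truth (Xie et al., 9 Oct 2025).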
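A minimal sketch of the two metrics follows. It assumes that ST-IoU is the temporal IoU of the predicted and ground-truth intervals, counted only when the predicted camera matches the ground-truth camera; this reading fits the definitions above but is not confirmed by the source. All function names and camera identifiers are hypothetical.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start_s, end_s), with start <= end


def mcq_accuracy(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of queries whose predicted option equals the gold option."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("pred and gold must be non-empty and of equal length")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection over union of two closed time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def st_iou(pred_cam: str, pred_int: Interval,
           gold_cam: str, gold_int: Interval) -> float:
    """Temporal IoU gated by a camera match (assumed form of ST-IoU)."""
    if pred_cam != gold_cam:
        return 0.0
    return temporal_iou(pred_int, gold_int)


acc = mcq_accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"])  # 0.75
hit = st_iou("c12", (10.0, 20.0), "c12", (15.0, 25.0))         # 5 / 15
miss = st_iou("c07", (10.0, 20.0), "c12", (10.0, 20.0))        # 0.0
```

Under this gating, a perfectly timed prediction at the wrong camera scores zero, which is one way a joint spatial-temporal metric can be much harsher than spatial accuracy alone.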
4. Experimental Design and Protocol
All model evaluations in GenTel-Bench are performed under strict zero-shot conditions—no fine-tuning on benchmark queries or related context. The 420 items are stratified evenly by environment (210 outdoor, 210 indoor) and by task (60 queries per task).
Model prompts include the full map graphic and clipped video frames; total frame count per query is capped at 20. Evaluation involves both accuracy-based and ST-IoU-based metrics, as delineated above.
Twelve state-of-the-art VLMs are included in the baseline runs, both proprietary (e.g., GPT-4o, GPT-5, Gemini-2.5-Pro) and open-source (InternVL3-{2B,8B,38B}, Qwen2-VL and Qwen2.5-VL in three sizes, GLM-4.1V-9B). Decoding is near-deterministic (temperature 0.1), with a maximum sequence length of 16,384 tokens.
5. Baseline Performance and Observed Patterns
Aggregate results show that even the strongest proprietary VLMs fall far short of human performance. Gemini-2.5-Pro, the best-performing proprietary model, achieves an average score of only 34.9% (across MCQ accuracy and ST-IoU), whereas human experts reach 78.61%. The strongest open-source model, InternVL3-38B, scores 30.76%. Outdoor tasks generally yield higher scores, which is attributed to the regularity of road connectivity; however, certain combinatorial and forecasting tasks favor the indoor floor-plan context, presumably because of stronger semantic anchoring.
Performance degrades sharply from basic MCQ tasks to combinatorial/predictive tasks, indicating substantial difficulty for current models in multi-step, cross-modal spatial-temporal inference.
6. Diagnosis of Model Deficiencies
Systematic ablations and prompt-level self-analysis identify three categories of weakness:
- Context Imbalance: Models inconsistently leverage available modalities. For example, InternVL3-38B shows strong spatial reasoning but deficient temporal inference; overall, spatial cues are over-utilized relative to temporal markers.
- Temporal Forecasting Deficiency: Accuracy on spatial-only forecasting (e.g., next camera prediction) remains adequate, but performance drops 20–60 points when temporal intervals must also be jointly predicted, as measured by ST-IoU.
- Map–Video Alignment Errors: Among 177 annotated failure cases:
  - Topology errors (28.6%) arise from ignoring real-world connectivity (e.g., one-way paths).
  - FoV alignment errors (16.8%) reflect mis-mapping of view direction to map orientation.
  - Time-interval estimation errors (12.4%) and motion-state inference errors (20.4%) result from deficient metric conversion from pixel to world coordinates.
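The topology error category can be made concrete with a small sketch that checks a predicted camera sequence against directed connectivity. The graph, camera names, and the `invalid_hops` helper are hypothetical and only illustrate the kind of check implied by "ignoring one-way paths"; the source does not describe how failure cases were annotated.

```python
from typing import Dict, List, Set, Tuple


def invalid_hops(path: List[str],
                 edges: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return consecutive camera pairs that are not directed edges in the map."""
    return [
        (src, dst)
        for src, dst in zip(path, path[1:])
        if dst not in edges.get(src, set())
    ]


# Hypothetical one-way corridor: c1 -> c2 -> c3, with no reverse travel.
one_way: Dict[str, Set[str]] = {"c1": {"c2"}, "c2": {"c3"}, "c3": set()}

ok = invalid_hops(["c1", "c2", "c3"], one_way)   # []
bad = invalid_hops(["c3", "c2", "c1"], one_way)  # both hops violate direction
```

A prediction that looks plausible on an undirected map, such as the reversed path, fails as soon as edge direction is taken into account.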
7. Future Research Directions
GenTel-Bench highlights multiple avenues for architectural and training improvements in spatial-temporal reasoning:
- Incorporation of explicit spatio-temporal graph modules encoding both connectivity and semantic topology.
- Map–video alignment layers leveraging homography-based features for joint embedding.
- Temporal forecasting heads attentive to motion summaries and timestamped histories.
- Pretraining curricula utilizing synthetic multi-camera, map-augmented scenarios to ground physical reasoning.
- Multi-camera consistency losses for finetuning, enforcing non-teleportation and real-world kinematic constraints (Xie et al., 9 Oct 2025).
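As an illustration of the non-teleportation idea in the last item, the sketch below computes a hinge-style penalty whenever a predicted sequence of camera sightings implies a speed above a chosen limit. This is a plain-Python check rather than a differentiable training loss, and the function name, speed limit, and camera identifiers are assumptions. The 985 m distance reuses the average CityFlow inter-camera spacing quoted earlier purely as an example value.

```python
from typing import Dict, FrozenSet, List, Tuple

Observation = Tuple[str, float]  # (camera_id, timestamp_s), sorted by time


def teleport_penalty(observations: List[Observation],
                     distance_m: Dict[FrozenSet[str], float],
                     max_speed_mps: float) -> float:
    """Sum of (implied speed - max speed) over hops that exceed the limit."""
    penalty = 0.0
    for (cam_a, t_a), (cam_b, t_b) in zip(observations, observations[1:]):
        if cam_a == cam_b:
            continue  # staying at one camera needs no travel
        dist = distance_m[frozenset((cam_a, cam_b))]
        dt = t_b - t_a
        if dt <= 0:
            return float("inf")  # a camera change in zero time is impossible
        penalty += max(0.0, dist / dt - max_speed_mps)
    return penalty


distances = {frozenset(("cam_a", "cam_b")): 985.0}

# 985 m in 10 s implies 98.5 m/s, which exceeds a 30 m/s limit by 68.5.
fast = teleport_penalty([("cam_a", 0.0), ("cam_b", 10.0)], distances, 30.0)
# 985 m in 60 s implies about 16.4 m/s, which is within the limit.
slow = teleport_penalty([("cam_a", 0.0), ("cam_b", 60.0)], distances, 30.0)
```

In a training setting the same hinge would be expressed over predicted timestamps as tensors so that gradients can flow; the version here only shows the constraint being enforced.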
GenTel-Bench thus establishes a rigorous, multimodal, and diagnostically rich evaluation framework for spatial-temporal intelligence in VLMs, with direct applicability to urban analytics, advanced surveillance, and embodied AI systems.