Geo-Attraction Landmark Probing
- GAP is a structured evaluation framework that measures text-to-video models' ability to synthesize geographically grounded tourist attractions with distinct architectural and cultural traits.
- It employs three automatic metrics—Patch-CLIP, Keypoint-Match, and VLM-as-a-Judge—combined with human validation to isolate model fidelity from regional biases.
- Large-scale benchmarking on the GEOATTRACTION-500 dataset reveals minimal regional bias while highlighting opportunities to enhance absolute visual synthesis fidelity.
Geo-Attraction Landmark Probing (GAP) is a structured evaluation framework that quantifies the fidelity with which text-to-video generation models synthesize geographically grounded visual knowledge. By systematically probing models on their ability to recreate tourist attractions—objects characterized by globally recognized architectural and environmental features—GAP disentangles general generative quality from attraction-specific, regionally encoded visual knowledge. Through large-scale benchmarking, knowledge-oriented metrics, and human validation, GAP enables rigorous, interpretable measurement of how geographic information is represented and synthesized by state-of-the-art generative models (Liu et al., 26 Jan 2026).
1. Conceptual Foundation and Motivation
GAP targets the open problem of geo-equity in multimodal generative models, where qualitative and quantitative disparities in the fidelity of synthesized visual content could potentially reflect or perpetuate global biases. Tourist attractions are selected as proxies for regional visual identity owing to their well-documented, distinct spatial layouts and culturally salient characteristics. The framework aims to determine whether models trained on broad, predominantly Western, internet-scale datasets encode geographically uniform visual knowledge or instead manifest region-, development-, or popularity-based disparities.
This approach directly addresses a gap in the literature concerning geographically grounded evaluation, in contrast to global benchmarks that may conflate model errors across diverse visual and semantic domains.
2. Benchmark Dataset Construction: GEOATTRACTION-500
GEOATTRACTION-500, the core evaluation set in GAP, comprises 500 tourist attractions globally sampled from Google Landmarks Dataset v2. Attractions are stratified by continent to ensure representation: Africa (8), Americas (67), Asia (52), Europe (249), Oceania (5). Socioeconomic stratification is introduced via Global North (271) versus Global South (178) and Global West (302) versus Global East (147) axes. Popularity is proxied by Wikipedia page-views, yielding a long-tailed, log-normal distribution.
Each attraction is categorized into one of ten major types, primarily covering architectural and cultural landmarks to ensure semantic diversity. Reference images, selected by human annotators to capture canonical appearances, serve as the basis for conditioning and evaluation, while two textual prompts per attraction (generated by GPT-5.1)—a concise one-sentence caption and a detailed cinematic description spanning composition, camera, lighting, and style—act as the model’s conditioning input.
| Attribute | Categories/Splits | Quantity/Range |
|---|---|---|
| Continent | Africa, Americas, Asia, Europe, Oceania | 500 total |
| Development Level | Global North/South, West/East | North:271, South:178; West:302, East:147 |
| Popularity | Wikipedia views (log-normal) | Orders of magnitude, long tail |
| Semantic Category | 10 major classes | Dominated by architecture/culture |
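The per-attraction metadata implied by the stratification above can be captured in a simple record type. The following sketch is illustrative only; field names and the example values are hypothetical and do not reflect the released GEOATTRACTION-500 schema.

```python
from dataclasses import dataclass

@dataclass
class Attraction:
    """One hypothetical GEOATTRACTION-500 entry (illustrative schema)."""
    name: str
    continent: str          # Africa, Americas, Asia, Europe, Oceania
    global_north: bool      # Global North vs. Global South axis
    global_west: bool       # Global West vs. Global East axis
    wiki_pageviews: int     # popularity proxy (long-tailed distribution)
    category: str           # one of ten major semantic types
    concise_prompt: str     # one-sentence caption (GPT-5.1-generated)
    detailed_prompt: str    # cinematic description (GPT-5.1-generated)

# Example values are placeholders, not dataset contents.
entry = Attraction(
    name="Eiffel Tower", continent="Europe",
    global_north=True, global_west=True,
    wiki_pageviews=1_000_000, category="architecture",
    concise_prompt="A video of the Eiffel Tower at dusk.",
    detailed_prompt="A slow aerial pan over the Eiffel Tower at golden hour.",
)
print(entry.continent)
```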
3. Evaluation Metrics and Knowledge Probing
GAP incorporates three complementary, automatic knowledge-oriented metrics, each designed to isolate distinct facets of attraction-specific visual knowledge, and one video quality metric. These are validated against independent human judgment.
Patch-Level CLIP (Global Structural Alignment):
Measures the correspondence between local patch embeddings in the ground-truth image and sampled generated frames using CLIP or DINOv2 representations. The metric is defined as:
$\mathrm{Patch\mbox{-}CLIP}(f) =\frac12\Bigl[\frac1n\sum_{i=1}^n\max_j\,\cos(\mathbf{t}_i^{\rm gt},\mathbf{t}_j^f) +\frac1m\sum_{j=1}^m\max_i\,\cos(\mathbf{t}_j^f,\mathbf{t}_i^{\rm gt})\Bigr]$
averaged over frames.
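The bidirectional mean-max cosine aggregation above can be sketched in a few lines of NumPy. This is a minimal illustration of the formula, not the GAP implementation, which extracts patch embeddings with DINOv2-Large.

```python
import numpy as np

def patch_clip(gt: np.ndarray, gen: np.ndarray) -> float:
    """Bidirectional mean-max cosine similarity between patch embeddings.

    gt:  (n, d) ground-truth image patch embeddings
    gen: (m, d) generated-frame patch embeddings
    """
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    gen = gen / np.linalg.norm(gen, axis=1, keepdims=True)
    sim = gt @ gen.T                     # (n, m) cosine similarity matrix
    gt_to_gen = sim.max(axis=1).mean()   # each gt patch -> best gen patch
    gen_to_gt = sim.max(axis=0).mean()   # each gen patch -> best gt patch
    return 0.5 * (gt_to_gen + gen_to_gt)

# Sanity check: identical embedding sets score 1.0.
x = np.random.default_rng(0).normal(size=(4, 8))
print(round(patch_clip(x, x), 6))  # -> 1.0
```

Per-video scores are then obtained by averaging this quantity over the sampled frames.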
Keypoint-Based Local Alignment (Fine-Grained Fidelity):
Extracts landmark-relevant regions using GroundingDINO and SAM, matches dense keypoints with LoFTR, and computes normalized match density and geometric consistency per region. Fine-grained fidelity is aggregated as:
$\mathrm{Keypoint\mbox{-}Match}(f) = D(f) + G(f)$
with $D(f)$ and $G(f)$ representing the detail-normalized match density and geometric consistency, respectively.
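The aggregation $D(f)+G(f)$ can be sketched as follows. The specific definitions here (match density normalized by region area and a detail constant; consistency as an inlier ratio under a fitted geometric model) are plausible readings of the metric, not the paper's exact formulation.

```python
def keypoint_match(matches: int, inliers: int, area: float,
                   detail: float = 1.0) -> float:
    """Sketch of the Keypoint-Match aggregation D(f) + G(f).

    matches: dense LoFTR correspondences found in a landmark region
    inliers: matches consistent with a fitted geometric model (assumption)
    area:    region area, for density normalization
    detail:  detail-normalization constant (hypothetical; the paper's
             constants are not reproduced here)
    """
    density = matches / (area * detail)       # D(f): normalized match density
    consistency = inliers / max(matches, 1)   # G(f): geometric consistency
    return density + consistency
```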
VLM-as-a-Judge (Semantic Alignment):
GPT-5.1, used as a zero-shot vision-language judge, rates both global and fine-grained alignment between instruction, reference, and generated frames on a $0$–$5$ scale. These outputs are averaged per video.
Human Evaluation:
Independent annotators provide global and local alignment ratings on a 5-point Likert scale for a subset of outputs. Spearman correlations against these human ratings—$0.44$ for Keypoint-Match and $0.60$ for VLM, with Patch-CLIP also positively correlated—confirm metric complementarity. The AIGVE-MACS quality metric is provided as a baseline and is uncorrelated with these knowledge scores.
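The rank correlation used for this validation can be computed directly. The sketch below implements Spearman's coefficient for the tie-free case via average-free rank covariance; real evaluations with tied Likert ratings would need tie-aware ranking (e.g., `scipy.stats.spearmanr`).

```python
import numpy as np

def spearman(x, y) -> float:
    """Spearman rank correlation (no-tie sketch): Pearson correlation
    of the rank transforms of x and y."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Monotone agreement gives +1, reversal gives -1.
print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))  # -> 1.0
```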
4. Experimental Protocol and Implementation
GAP is applied to the Sora 2 text-to-video generation model, with each GEOATTRACTION-500 item yielding a single 4-second portrait-mode video, prompted by the detailed and concise GPT-5.1-generated instructions. Five uniformly sampled frames per video are used for metric computation. DINOv2-Large yields patch embeddings for Patch-CLIP; GroundingDINO and SAM segment regions for Keypoint-Based Local Alignment; LoFTR (outdoor mode) supplies dense correspondences, with fixed normalization constants applied for detail normalization and match-density scaling.
Prompts and sampling are standardized to ensure cross-region comparability, with both prompt styles (detailed, concise) included to assess the model’s sensitivity to conditioning detail.
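Uniform frame sampling for metric computation can be sketched as below. The exact sampling scheme (endpoint handling, rounding) is an assumption; the source states only that five uniformly sampled frames are used.

```python
import numpy as np

def uniform_frame_indices(n_frames: int, k: int = 5) -> list[int]:
    """Indices of k uniformly spaced frames from an n_frames-long video,
    including the first and last frame (assumed convention)."""
    return np.linspace(0, n_frames - 1, k).round().astype(int).tolist()

# A 4 s clip at 30 fps has 120 frames.
print(uniform_frame_indices(120))  # -> [0, 30, 60, 89, 119]
```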
5. Core Findings and Interpretations
Contrary to prevailing expectations of Global North or high-popularity bias, Sora 2's per-region alignment and knowledge scores are remarkably uniform. Correlation coefficients between attraction popularity and metric scores are weak: $0.10$ (Keypoint-Match), $0.19$ (VLM), and $0.23$ (human), with Patch-CLIP similarly low. Developmental stratifications (North vs. South, West vs. East) yield only marginal differences on the $0$–$5$ scale.
The impact of prompt detail is modest: moving from concise to detailed prompts produces only small mean increases in Patch-CLIP, Keypoint-Match, and VLM scores. This suggests that the model's visually grounded knowledge is robust to prompt elaboration, and, by extension, that dataset-driven regional disparities are less pronounced than anticipated.
Qualitative analyses indicate correct silhouette and compositional synthesis for attractions across regions, though occasional inaccuracies occur at the level of structural or textural details.
6. Significance, Limitations, and Prospective Extensions
GAP presents evidence that diffusion-based video generation models can encode geographically uniform, attraction-specific visual knowledge, diverging from trends observed in LLMs, potentially due to the robustness of iterative denoising to training data imbalance. However, average alignment scores (approximately midpoint on $0$–$5$ Likert scales) highlight substantive headroom for absolute fidelity improvement.
A plausible implication is that as generative model scale and deployment expand, systematic frameworks such as GAP will be requisite for early detection of emergent geo-equity gaps and for quantifying the impact of fairness interventions at architectural and dataset levels. Future work may expand coverage beyond static attractions to dynamic, cultural, and environmental phenomena, refine regional and semantic taxonomies, and benchmark heterogeneous video generation architectures to monitor evolving trends in global visual knowledge representation (Liu et al., 26 Jan 2026).