YearGuessr Benchmark: Building-Age Estimation

Updated 27 December 2025

YearGuessr is an open benchmark featuring 55,546 building façade images with detailed metadata for precise building-age estimation across 157 countries.
It employs strict ordinal regression using CORAL-style threshold and FCRC ranking losses to predict construction years from combined visual and geographic inputs.
The benchmark exposes popularity bias by revealing performance gaps between models on frequently recognized landmarks versus more ordinary, long-tail structures.

YearGuessr is the first large-scale, open benchmark explicitly designed to assess building-age estimation and expose “popularity bias” in contemporary vision–LLMs (VLMs). By providing a multi-modal dataset and rigorous ordinal regression protocols, YearGuessr enables systematic evaluation of models’ generalization capabilities beyond the memorization of famous architectural landmarks (Szu-Tu et al., 24 Dec 2025).

1. Dataset Composition

YearGuessr consists of 55,546 unique building façade images sourced from Wikipedia/Wikimedia Commons (CC BY-SA 4.0), spanning 157 countries. Each sample $i$ is annotated with:

Image $I_i$ : a $224 \times 224$ pixel façade crop.
GPS coordinates $g_i = (\phi_i, \lambda_i)$ , with 100% coverage.
Wikipedia page-view count $p_i$ (summed 01 Jul 2023–01 Jul 2024), serving as a quantification of "popularity."
Complete textual description (median length: 2,240 characters) and country via reverse geocoding.
Construction year label $y_i \in \{1001, \ldots, 2024\}$ .

Geographically, the dataset is 63.3% from the Americas, 22.5% Europe, 6.3% Asia, and the remainder from Africa/Oceania. Temporally, labels are continuous between 1001 and 2024 CE, with a long-tailed log-scale distribution (notably, >10% predate 1600 CE; major peaks in the 18th–20th centuries).

2. Ordinal Regression Task Formulation

YearGuessr frames construction year prediction as strict ordinal regression. Each model $f_\theta$ processes $(I_i, g_i)$ and outputs a scalar year estimate $\hat y_i = f_\theta(I_i, g_i)$ . Evaluation uses:

Mean Absolute Error (MAE):

$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - \hat y_i \rvert$

Interval Accuracy (IA), for $k \in \{5, 20, 50, 100\}$ years:

$\mathrm{IA}_k = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\bigl[\,\lvert y_i-\hat y_i \rvert \leq k \bigr]$

Two complementary training paradigms are used:

(a) CORAL/CORN-style Threshold Loss:

For each $i$ , real-valued scores $o_{i,k}$ are predicted for binary predicates (year $>k$ ), with a sigmoid cross-entropy across thresholds:

$\mathcal{L}_{\mathrm{ord}} = -\sum_{i=1}^N\sum_{k=1001}^{2023} \Bigl[ \mathbf{1}(y_i>k)\log\sigma(o_{i,k}) + (1-\mathbf{1}(y_i>k))\log(1-\sigma(o_{i,k})) \Bigr]$

The final prediction is

$\hat y_i = 1001 + \sum_{k=1001}^{2023}\mathbf{1}\bigl(\sigma(o_{i,k})>0.5\bigr).$

(b) FCRC Ranking-based Contrastive Loss (from NumCLIP):

This loss penalizes out-of-order image/text embeddings $(z_i, w_i)$ , weighting negatives with $\lambda_{ij} \propto |y_i-y_j|$ :

$\mathcal{L}_{\mathrm{FCRC}} = -\frac{1}{M}\sum_{i=1}^M \log \frac{ \exp[\cos(z_i, w_i)/\tau] }{ \exp[\cos(z_i, w_i)/\tau] + \sum_{j\neq i}\lambda_{ij}\exp[\cos(z_i, w_j)/\tau] }$

3. Popularity-Aware Evaluation Protocols

To quantify memorization effects tied to landmark popularity, YearGuessr stratifies test samples by $p_i$ into popularity bins:

"Ordinary" ( $p_i < 10^2$ )
"Popular" ( $p_i > 10^5$ )
Finer: $[<10^2, 10^2$ – $10^3, 10^3$ – $10^4, 10^4$ – $10^5, >10^5]$

For any subset $S$ (e.g., "Popular"), interval accuracy is

$\mathrm{IA}_k(S) = \frac{1}{|S|}\sum_{i \in S} \mathbf{1}[|y_i - \hat y_i| \leq k]$

The "popularity gain" metric is defined as

$\Delta_k = \mathrm{IA}_k(\text{Popular}) - \mathrm{IA}_k(\text{Ordinary})$

A continuous variant $\mathrm{IA}_k^p$ weights by $p_i$ , but the principal focus is on interpretable bin-wise splits. This protocol exposes models’ tendency to perform disproportionately well on highly viewed, frequently pre-trained landmarks.

4. Benchmarked Models and YearCLIP Design

Evaluation covers 43 methods across several model families:

Model Family	Notable Examples	Distinguishing Properties
CNNs	ResNet-50/152, ConvNeXt-baseline/large	Pure vision; no language prior
Pure Transformers	ViT-B/16, Swin-B	Token-mixing, no explicit language modeling
CLIP-based	Zero-shot CLIP, GeoCLIP, NumCLIP, YearCLIP	Joint vision-language, varying geo/numeric features
Closed-source VLMs	GPT-4-Vision-mini, Gemini, Claude 3, Grok 2	Large, often proprietary, pre-trained on web-scale
Open-source VLMs	CogVLM2, Gemma3, GLM-4V-9B, LLaVA variants	Community-accessible, hybrid language/vision

YearCLIP extends CLIP by using frozen image and text encoders, with additional architectural innovations including:

Location Conditioning: Random Fourier Features (RFF) with MLP and a learnable "zero convolution" fuse geographic and visual information.
Coarse-to-Fine Style Classification: Seven historical style tokens (e.g., Roman, Gothic, Contemporary) provide architectural cues.
Reasoning Prompts: Approximately 20 architectural sub-tokens (roof, wall, height) enrich predictions and enable post-hoc rationalization.
Trainable Regressor: Ingests similarity scores from style and reasoning prompts for ordinal-regressed year output.

Training incorporates FCRC loss for ordering, cross-entropy for style classes, and an optional regression penalty.

5. Empirical Findings on Popularity Bias

Significant empirical findings are as follows:

Standard Accuracy: YearCLIP attains MAE ≈ 39.5 years, IA₅ ≈ 19%, IA₁₀₀ ≈ 91.6%, a 13.5% MAE improvement over GeoCLIP (45.7 years). Gemini 1.5-Pro (MAE 33.1, IA₅ 28.2%), Gemini 2.0-Flash, and Grok 2 are the strongest closed models.
Popularity Gap: Pure vision models generalize better on “ordinary” buildings but exhibit negative $\Delta_5$ $Δ_{5}$ :
- ConvNeXt-B: IA₅ drops from 16.6 (Ordinary) to 12.7 (Popular), $\Delta_5 = -3.9$
- Swin-B: 15.8→6.8, $\Delta_5 = -9.0$
- Closed-source VLMs display strong memorization with large positive $\Delta_5$ :
- Gemini 2.0-Flash: IA₅ jumps from 24.2 to 58.4, $\Delta_5 = +34.2$
- Gemini 1.5-Pro: $\Delta_5 = +16.6$ , Grok 2: $+16.7$
- Open-source LLM/VLM hybrids yield positive but smaller $\Delta_5$ .
Regional and Temporal Patterns: All models are most accurate in the Americas (lowest MAE), less so in Africa/Europe; performance increases with proximity to the 19th–20th centuries, with MAE >400 years in the 1000–1150 CE interval.
The results collectively demonstrate that current VLMs often rely on memorized entries from pre-training, particularly for high-popularity buildings, rather than genuinely learning to interpret architectural signals; vision-only models avoid this but at the cost of lower peak accuracy.

6. Real-World and Methodological Implications

YearGuessr surfaces an over-reliance on landmark memorization in building age estimation, undermining claims of deep architectural understanding in VLMs. In applied contexts—such as heritage preservation, sustainability audits, and disaster assessment—this bias risks systematic neglect of data from marginalized regions and rare styles, with attendant fairness concerns.

A plausible implication is that benchmarking must integrate popularity-aware stratification to avoid overestimating VLM performance due to web-crawled data’s inherent visibility skew. Furthermore, pure vision models’ relatively better generalization to the "long tail" suggests ensemble or hybrid approaches may mitigate the identified bias, albeit with trade-offs in peak accuracy.

7. Future Directions and Recommendations

Suggested avenues for further research and dataset enrichment include:

Expansion with additional non-Western and early-period samples (e.g., using CMAB or bespoke low-resource collection).
Annotation of renovation/rebuilding events to distinguish original from reconstructed features.
Incorporation of fairness-aware or adversarial loss functions to penalize shortcut learning tied to popularity.
Utilization of synthetic data (e.g., via diffusion priors) and active learning to balance class representation.
Development of debiased prompting or retrieval strategies to reduce memorization.

The YearGuessr benchmark (Szu-Tu et al., 24 Dec 2025) thus establishes a multi-modal, ordinal regression standard, holding future VLMs to higher standards of generalization and providing a framework to address memorization-driven artifacts in model evaluation.

PDF Markdown Chat (Pro)

References (1)

Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Topic to Video (Beta)

Generate a video overview of this topic.

Follow Topic

Get notified by email when new papers are published related to YearGuessr Benchmark.