YearCLIP: Multi-Modal Year Estimation Model

Updated 25 December 2025
  • The paper introduces YearCLIP, a vision-language model that recasts building construction year prediction as a multi-modal ordinal regression problem using a frozen CLIP ViT-B/16 backbone.
  • It integrates image, geographic, and textual priors through a learned fusion pipeline that employs coarse-to-fine contrastive learning and style-specific reasoning prompts.
  • Evaluation on the YearGuessr benchmark shows that YearCLIP reduces mean absolute error and mitigates popularity bias compared to established baseline models.

YearCLIP is a vision-language model specifically designed for building construction year estimation on the large-scale, multi-modal YearGuessr benchmark. It leverages a frozen CLIP ViT-B/16 backbone for both image and text encoding, integrates structured geographic and textual priors, and recasts year prediction as a multi-modal ordinal regression problem, allowing it to characterize both architectural history and popularity bias in models. Its architecture combines coarse-to-fine ordinal contrastive learning, explainable architectural-style and reasoning prompts, and learned multi-modal fusion, achieving state-of-the-art open-model performance without heavy reliance on building memorization (Szu-Tu et al., 24 Dec 2025).

1. Model Architecture and Multi-Modal Fusion

YearCLIP’s pipeline is constructed upon five core modules:

  • Image Encoder ($f_v$): Each façade image $I$ is resized to $224 \times 224$ and processed by CLIP's visual encoder $f_v(\cdot)$, producing a $1 \times d$ raw feature $z_v^\text{raw}$. A trainable MLP then projects this to a $d$-dimensional embedding $z_v = \text{MLP}_v(z_v^\text{raw})$.
  • Location Encoder ($f_l$): GPS coordinates $g = (\phi, \lambda)$ undergo a random-Fourier-feature (RFF) mapping, followed by an MLP, to yield $z_l^\text{raw} = f_l(g)$. This passes through a "zero convolution" layer (a 1D convolution with zero-initialized weights) to yield $z_l = \text{ZeroConv}(z_l^\text{raw})$. This mechanism lets the network learn a residual fusion weight for geographic priors.
  • Text Branches:
    • Style-class Encoder ($f_c$): Seven coarse architectural style tokens (Roman, Gothic, Renaissance, Baroque, Neoclassical, Modern, Contemporary) are encoded via CLIP's text encoder to $z_{c_i} = f_c(s_i)$.
    • Reason-prompt Encoder ($f_r$): Manually defined prompts for architectural cues (e.g., roof, material, height) are encoded similarly as $z_{r_j} = f_r(r_j)$.
  • Coarse-to-Fine Regressor ($g(\cdot)$): The fused image-plus-location embedding ($z_v + z_l$) and the text embeddings are fed to a trainable network $g(\cdot)$, which computes cosine similarities with each style and reasoning prompt. The final similarity vector $s \in \mathbb{R}^{7+M}$ (for $M$ prompts) is mapped to a scalar construction year prediction $\hat{y}$.

Fusion is fully learned: GPS fusion via zero-conv can be omitted at inference if location data is unavailable. The image and location representations are summed after learnable scaling, without manual weighting.
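A minimal PyTorch sketch of this fusion path is given below; the module dimensions, the RFF frequency count, and all attribute names are illustrative assumptions rather than the authors' released implementation.

```python
import math
import torch
import torch.nn as nn

class LocationEncoder(nn.Module):
    """GPS branch: random-Fourier-feature mapping -> MLP -> zero-initialized 1D conv."""
    def __init__(self, d=512, n_freqs=64):
        super().__init__()
        # Fixed random projection matrix for the RFF mapping of (lat, lon).
        self.register_buffer("B", torch.randn(2, n_freqs))
        self.mlp = nn.Sequential(nn.Linear(2 * n_freqs, d), nn.ReLU(), nn.Linear(d, d))
        # "Zero convolution": weights start at zero, so the location prior is learned as a residual.
        self.zero_conv = nn.Conv1d(1, 1, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, gps):                            # gps: (B, 2) = (phi, lambda)
        proj = 2 * math.pi * gps @ self.B              # (B, n_freqs)
        rff = torch.cat([proj.sin(), proj.cos()], dim=-1)
        z_l_raw = self.mlp(rff)                        # (B, d)
        return self.zero_conv(z_l_raw.unsqueeze(1)).squeeze(1)

def fuse(z_v, z_l=None):
    """z_v = MLP_v(f_v(I)); z_l may be omitted when no GPS is available."""
    return z_v if z_l is None else z_v + z_l
```

Because the zero-conv weights start at zero, the model initially behaves as image-only and gradually learns how strongly to inject the geographic prior.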

2. Ordinal Regression Formulation and Losses

YearCLIP frames year prediction as an ordinal problem with two levels of supervision:

  • Coarse Style Classification: Each true year $y$ is mapped to one of $K=7$ architectural style bins. For each input, the model computes $\mathrm{sim}_{c_i} = \cos(z_{\text{input}}, z_{c_i})$ and produces probabilities $p_c = \mathrm{softmax}(\mathrm{sim}_c / \tau)$. A cross-entropy loss is applied:

$$L_{\text{coarse}} = -\sum_{i=1}^{K} y^{(i)}_{\text{style}} \log p_{c_i}$$

  • Fine-grained Ranking-based Contrastive Loss (FCRC): Inspired by NumCLIP, this approach treats year labels as free-form text tokens ("the year 2024") and applies a batchwise ranking contrastive loss. For embedding pairs $(z^i, w^j)$, where $w^j$ encodes the year of sample $j$,

$$f(z^i, w^j) = \exp\!\left( \frac{\cos(z^i, w^j)}{\tau} \right)$$

Negatives are weighted by their ordinal label distance $d_{i,j} = |y_i - y_j|$, so

$$\lambda_{i,j} = \text{Norm}(\beta \cdot d_{i,j})$$

The FCRC loss is:

$$L_{\text{FCRC}} = -\frac{1}{M}\sum_{i=1}^{M} \log \left[ \frac{f(z^i, w^i)}{f(z^i, w^i) + \sum_{j \neq i} \lambda_{i,j}\, f(z^i, w^j)} \right]$$

  • Regression Penalty: An $\ell_1$ penalty ensures final predictions are numerically close to the ground truth:

$$L_{\text{reg}} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$

  • Combined Loss: The total loss is an unweighted sum:

$$L = L_{\text{coarse}} + L_{\text{FCRC}} + L_{\text{reg}}$$

This structured approach robustly supervises both high-level style and fine, ordinal year distinctions, while permitting gradient flow through the multi-modal fusion pipeline.
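A compact PyTorch sketch of these loss terms under the definitions above follows; the row-wise normalization used for $\text{Norm}(\cdot)$ and the batch conventions are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def fcrc_loss(z, w, years, tau=0.07, beta=1.0):
    """Fine-grained ranking contrastive loss over a batch of M samples.
    z: (M, d) fused input embeddings; w: (M, d) year-text embeddings; years: (M,)."""
    z, w = F.normalize(z, dim=-1), F.normalize(w, dim=-1)
    f = torch.exp(z @ w.t() / tau)                   # f(z^i, w^j), shape (M, M)

    # lambda_{i,j} = Norm(beta * |y_i - y_j|); the diagonal is zero, so only j != i contributes.
    d = (years[:, None] - years[None, :]).abs().float()
    lam = beta * d
    lam = lam / lam.sum(dim=1, keepdim=True).clamp_min(1e-8)   # assumed row-wise normalization

    pos = f.diagonal()                               # f(z^i, w^i)
    neg = (lam * f).sum(dim=1)                       # distance-weighted negatives
    return -(pos / (pos + neg)).log().mean()

def total_loss(style_logits, style_targets, y_pred, y_true, z, w, years):
    """Unweighted sum L = L_coarse + L_FCRC + L_reg."""
    l_coarse = F.cross_entropy(style_logits, style_targets)   # 7-way style bins (class indices)
    l_reg = F.l1_loss(y_pred, y_true)                          # |y - y_hat|
    return l_coarse + fcrc_loss(z, w, years) + l_reg
```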

3. Popularity-Aware Evaluation Metrics

YearCLIP introduces metrics explicitly quantifying popularity bias in construction year estimation.

  • Interval Accuracy ($\mathrm{IA}_k$): For a tolerance $k$,

$$\mathrm{IA}_k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\,|y_i - \hat{y}_i| \leq k\,\right]$$

with $k \in \{5, 20, 50, 100\}$.

  • Popularity Stratification: Each image is bucketed by its Wikipedia pageview count $v_i$ into five bins ranging from $<10^2$ to $>10^5$ views. Separate $\mathrm{IA}_5^{(b)}$ scores are reported per bin, revealing the effect of fame on model accuracy.
  • Popularity Bias Gap (“Gain”):

$$\text{Gain} = \mathrm{IA}_5^{\text{high-pop}} - \mathrm{IA}_5^{\text{low-pop}}$$

A positive Gain signifies accuracy is skewed towards famous (high-popularity) buildings, evidencing memorization tendencies in VLMs.

These metrics, along with stratification by continent and period, provide granular insight into both model generalization and memorization failure modes.
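The sketch below shows one way these metrics could be computed; the exact pageview cut points (decades from $10^2$ to $10^5$ views) are an assumption consistent with the stated bin range.

```python
import numpy as np

def interval_accuracy(y_true, y_pred, k):
    """IA_k: fraction of predictions within k years of the ground truth."""
    return float(np.mean(np.abs(y_true - y_pred) <= k))

def popularity_gain(y_true, y_pred, views, k=5):
    """Gain = IA_k on the highest-popularity bin minus IA_k on the lowest-popularity bin."""
    edges = np.array([1e2, 1e3, 1e4, 1e5])   # assumed boundaries: <1e2, ..., >1e5 views
    bins = np.digitize(views, edges)          # bin index 0..4 per sample
    ia_per_bin = [interval_accuracy(y_true[bins == b], y_pred[bins == b], k) for b in range(5)]
    return ia_per_bin[-1] - ia_per_bin[0], ia_per_bin
```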

4. Training Workflow and Data Preprocessing

The YearGuessr dataset, curated for this task, comprises 55,546 unique façade images from Wikipedia's "Buildings_and_structures_by_year_of_completion" category, with GPS, textual, and view-count metadata.

  • Preprocessing: Images are deduplicated by title; non-façade images are filtered by CLIP ViT-B/32 similarity to the phrase "a building façade"; anomalous test samples are manually audited.
  • Split Strategy: Data is stratified by decade and continent into 60%/20%/20% train/val/test partitions (33,337/11,122/11,087 samples), ensuring no overlap in images or titles across splits.
  • Augmentation: Only baseline CLIP preprocessing (224×224 center crop, normalization) is applied; no additional geometric or photometric augmentation is used.
  • Optimization: All CLIP encoders are frozen; only the adapter MLPs, the zero convolution, and the regressor network are updated. RAdam is used with learning rates $1 \times 10^{-5}$ (MLPs/ZeroConv) and $1 \times 10^{-4}$ (regressor), with multi-step scheduling (factor $\gamma = 0.1$ at epoch 60). Models are trained for 50 epochs with batch size 64, FP16 precision, loss weights of 1.0, temperature $\tau = 0.07$, and ranking scale $\beta = 1.0$.
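As a rough sketch, the stated optimization setup might map to PyTorch as follows; module names such as `model.mlp_v` and `model.location_encoder` are placeholders, not the released code.

```python
import torch

def build_optimizer(model):
    """Assumed handles: model.clip (frozen), model.mlp_v, model.location_encoder, model.regressor."""
    # Freeze the CLIP image/text encoders; only the adapters and the regressor receive gradients.
    for p in model.clip.parameters():
        p.requires_grad_(False)

    optimizer = torch.optim.RAdam([
        {"params": list(model.mlp_v.parameters())
                 + list(model.location_encoder.parameters()), "lr": 1e-5},
        {"params": model.regressor.parameters(), "lr": 1e-4},
    ])
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60], gamma=0.1)
    scaler = torch.cuda.amp.GradScaler()   # FP16 mixed-precision training
    return optimizer, scheduler, scaler
```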

5. Inference Pipeline and Explainability

Inference proceeds as follows for an input façade image $I$ (optionally with GPS coordinates $g$):

  1. Images are resized, normalized, and encoded as $z_v = \text{MLP}_v(f_v(I))$.
  2. If GPS is available, $z_l = \text{ZeroConv}(f_l(g))$ is computed and $z_{\text{input}} = z_v + z_l$; otherwise, $z_{\text{input}} = z_v$.
  3. Cosine similarities to the style embeddings $\{z_{c_i}\}$ and reason embeddings $\{z_{r_j}\}$ are computed.
  4. The concatenated similarity vector $s = [\mathrm{sim}_{c_1}, \ldots, \mathrm{sim}_{c_7}, \mathrm{sim}_{r_1}, \ldots, \mathrm{sim}_{r_M}]$ passes through the regressor $g$ to output the year prediction $\hat{y}$, which is clamped to $[1001, 2024]$.
  5. For transparent post-hoc rationales, top style and top reason cues are identified by their respective maximal similarities.

No smoothing or model ensembling is incorporated in the standard pipeline.
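A sketch of inference steps 1–5 above, again with illustrative attribute names on a hypothetical `model` object:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_year(model, image, gps=None):
    """Returns the clamped year prediction plus the top style/reason indices for explanation."""
    z = model.mlp_v(model.clip.encode_image(image))              # step 1: z_v
    if gps is not None:                                          # step 2: optional GPS fusion
        z = z + model.location_encoder(gps)

    # Step 3: cosine similarities to the 7 style embeddings and M reasoning-prompt embeddings.
    prompts = torch.cat([model.style_embeds, model.reason_embeds], dim=0)        # (7 + M, d)
    sims = F.cosine_similarity(z.unsqueeze(1), prompts.unsqueeze(0), dim=-1)     # (B, 7 + M)

    # Step 4: regressor maps the similarity vector to a year, clamped to the dataset range.
    year = model.regressor(sims).squeeze(-1).clamp(1001, 2024)

    # Step 5: post-hoc rationale from the most similar style and reasoning cue.
    top_style = sims[:, :7].argmax(dim=-1)
    top_reason = sims[:, 7:].argmax(dim=-1)
    return year, top_style, top_reason
```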

6. Quantitative Performance and Analysis

The following table summarizes YearCLIP's test performance versus established baselines, as reported on the 11,087-sample YearGuessr test split (means over three seeds):

| Method | MAE (yr) ↓ | IA_5 ↑ | IA_100 ↑ | IA_5 (low pop) ↑ | IA_5 (high pop) ↑ | Gain |
|---|---|---|---|---|---|---|
| ResNet-50 | 54.14 | 10.44 | 88.68 | 12.39 | 9.14 | –3.25 |
| ConvNeXt-B | 44.42 | 14.01 | 90.72 | 16.57 | 12.68 | –3.89 |
| ViT-B/16 | 49.16 | 12.50 | 89.52 | 15.82 | 6.78 | –9.04 |
| GeoCLIP | 45.69 | 23.79 | 89.54 | 24.37 | 19.17 | –5.19 |
| NumCLIP | 40.01 | 18.15 | 91.76 | 21.69 | 11.80 | –9.89 |
| YearCLIP | 39.52 | 18.93 | 91.63 | 20.19 | 12.39 | –7.80 |
| Gemini 2.0 Flash | 33.91 | 29.71 | 92.75 | 24.23 | 58.41 | +34.18 |

Key findings are as follows:

  • YearCLIP achieves a mean absolute error (MAE) reduction of ~13.5% over GeoCLIP and ~11.0% over ConvNeXt-B, underscoring the impact of multi-modal ordinal-contrastive training.
  • Open CLIP-based models, including YearCLIP and GeoCLIP, exhibit negative Gain, signifying better generalization to rare (low-popularity) buildings rather than mere memorization of famous structures.
  • Closed-source VLMs (e.g., Gemini 2.0 Flash) display strongly positive Gain, indicative of increased accuracy for widely known buildings and substantial memorization bias.
  • By continent, YearCLIP attains its lowest MAE on buildings in the Americas (26.10 yr) and its highest in Africa (85.85 yr). Performance improves markedly on post-1800 buildings (MAE ≈ 27 yr) versus pre-1400 buildings (MAE > 280 yr), partly reflecting the dataset's skewed year distribution.

7. Context, Implications, and Reproducibility

YearCLIP demonstrates that multi-modal ordinal regression with learned geographic fusion and explicit reasoning prompts mitigates popularity bias observed in leading VLMs for temporal-attribute estimation. All code, data splits, and curated prompts are pledged for release under CC BY-SA 4.0 and MIT licenses, ensuring both reproducibility and extensibility. This framework offers a rigorous benchmark for analyzing memorization vs. generalization in vision-language pretraining, salient for the development of robust, explainable temporal-attribute prediction systems (Szu-Tu et al., 24 Dec 2025).
