YearCLIP: Multi-Modal Year Estimation Model
- The paper introduces YearCLIP, a vision-language model that recasts building construction year prediction as a multi-modal ordinal regression problem using a frozen CLIP ViT-B/16 backbone.
- It integrates image, geographic, and textual priors through a learned fusion pipeline that employs coarse-to-fine contrastive learning and style-specific reasoning prompts.
- Evaluation on the YearGuessr benchmark shows that YearCLIP reduces mean absolute error and mitigates popularity bias compared to established baseline models.
YearCLIP is a vision-language model designed for building construction year estimation on the large-scale, multi-modal YearGuessr benchmark. It leverages a frozen CLIP ViT-B/16 backbone for both image and text encoding, integrates structured geographic and textual priors, and recasts year prediction as a multi-modal ordinal regression problem, which lets it characterize both architectural history and popularity bias in model behavior. Its architecture combines coarse-to-fine ordinal contrastive learning, explainable architectural-style and reasoning prompts, and learned multi-modal fusion, achieving state-of-the-art performance among open models without heavy reliance on memorization of famous buildings (Szu-Tu et al., 24 Dec 2025).
1. Model Architecture and Multi-Modal Fusion
YearCLIP’s pipeline is constructed upon five core modules:
- Image Encoder ($E_I$): Each façade image is resized to $224 \times 224$ and processed by CLIP's frozen visual encoder, producing a raw feature $v_{\text{raw}}$. A trainable MLP then projects this to a $d$-dimensional embedding $v$.
- Location Encoder ($E_L$): GPS coordinates (latitude, longitude) undergo a random-Fourier-feature (RFF) mapping, followed by an MLP, to yield a $d$-dimensional location feature. This passes through a 'zero convolution' layer (a 1D convolution with zero-initialized weights) to yield the location embedding $\ell$. Because the layer starts at zero, the network learns a residual fusion weight for geographic priors from scratch.
- Text Branches:
- Style-class Encoder ($E_S$): Seven coarse architectural style tokens (Roman, Gothic, Renaissance, Baroque, Neoclassical, Modern, Contemporary) are encoded by CLIP's frozen text encoder into style embeddings $\{t^{\text{style}}_k\}_{k=1}^{7}$.
- Reason-prompt Encoder ($E_R$): Manually defined prompts describing architectural cues (e.g., roof form, material, height) are encoded in the same way into reasoning embeddings $\{t^{\text{reason}}_j\}$.
- Coarse-to-Fine Regressor ($R$): The fused image-plus-location embedding $z = v + \ell$ and the text embeddings are fed to a trainable network that computes cosine similarities between $z$ and each style and reasoning prompt. The resulting similarity vector $s \in \mathbb{R}^{K}$ (for $K$ prompts in total) is mapped to a scalar construction year prediction $\hat{y}$.
Fusion is fully learned: the image and location representations are summed after learnable scaling, without manual weighting, and the GPS branch (fused via the zero-conv) can simply be omitted at inference if location data is unavailable.
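A minimal PyTorch sketch of this fusion pipeline, under stated assumptions, is given below. The module names (ZeroConv1d, LocationEncoder, YearCLIPHead), layer widths, RFF parameters, and the count of reasoning prompts are illustrative choices rather than values from the paper; only the overall structure (frozen CLIP features, trainable adapters, zero-initialized location fusion, prompt similarities feeding a regressor) follows the description above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroConv1d(nn.Module):
    """1-D convolution with zero-initialized weights: the location branch
    contributes nothing at the start of training and its influence is learned."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):                                   # x: (B, d)
        return self.conv(x.unsqueeze(-1)).squeeze(-1)

class LocationEncoder(nn.Module):
    """Random-Fourier-feature mapping of (lat, lon), then an MLP and zero-conv.
    The RFF bandwidth and hidden widths are assumptions."""
    def __init__(self, dim=512, n_freq=128, sigma=1.0):
        super().__init__()
        self.register_buffer("freqs", torch.randn(2, n_freq) * sigma)
        self.mlp = nn.Sequential(nn.Linear(2 * n_freq, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.zero_conv = ZeroConv1d(dim)

    def forward(self, gps):                                 # gps: (B, 2)
        proj = 2 * math.pi * gps @ self.freqs
        rff = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return self.zero_conv(self.mlp(rff))                # residual location term

class YearCLIPHead(nn.Module):
    """Image adapter + location fusion + prompt similarities + coarse-to-fine regressor."""
    def __init__(self, dim=512, n_prompts=7 + 5):           # 7 styles + (assumed) 5 reason prompts
        super().__init__()
        self.img_adapter = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.loc_encoder = LocationEncoder(dim)
        self.regressor = nn.Sequential(nn.Linear(n_prompts, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, img_feat, prompt_embs, gps=None):
        v = self.img_adapter(img_feat)                      # projected frozen-CLIP image feature
        z = v + self.loc_encoder(gps) if gps is not None else v
        sims = F.cosine_similarity(z.unsqueeze(1), prompt_embs.unsqueeze(0), dim=-1)
        return self.regressor(sims).squeeze(-1), sims       # (predicted year, similarity vector)
```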
2. Ordinal Regression Formulation and Losses
YearCLIP frames year prediction as an ordinal problem with two levels of supervision:
- Coarse Style Classification: Each true year is mapped to one of the seven architectural style bins. For each input, the model computes similarities between the fused embedding $z$ and the style embeddings, converts them to probabilities $p_c$ via a softmax, and applies a cross-entropy loss $\mathcal{L}_{\text{style}} = -\sum_{c} y_c \log p_c$, where $y_c$ is the one-hot style label.
- Fine-grained Ranking-based Contrastive Loss (FCRC): Inspired by NumCLIP, this component treats year labels as free-form text tokens (“the year 2024”) and applies a batch-wise ranking contrastive loss. For each sample $i$ in a batch of size $B$, the fused embedding $z_i$ is pulled toward the text embedding $t_{y_i}$ of its own year and pushed away from the year embeddings of the other samples. Negatives are weighted by their ordinal label distance $|y_i - y_j|$, so that candidate years far from the ground truth are penalized more strongly. The FCRC loss takes the form
$\mathcal{L}_{\text{FCRC}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(\mathrm{sim}(z_i, t_{y_i})/\tau)}{\sum_{j=1}^{B} w_{ij}\,\exp(\mathrm{sim}(z_i, t_{y_j})/\tau)},$
where $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, $\tau$ is a temperature, $w_{ii}=1$, and $w_{ij}$ increases with $|y_i - y_j|$ for $j \neq i$.
- Regression Penalty: A norm penalty on the scalar output ensures final predictions are numerically close to the ground-truth year: $\mathcal{L}_{\text{reg}} = \frac{1}{B}\sum_{i=1}^{B} \lVert \hat{y}_i - y_i \rVert$.
- Combined Loss: The total loss is an unweighted sum of the three terms: $\mathcal{L} = \mathcal{L}_{\text{style}} + \mathcal{L}_{\text{FCRC}} + \mathcal{L}_{\text{reg}}$.
This structured approach jointly supervises high-level style and fine-grained ordinal year distinctions, while permitting gradient flow through the multi-modal fusion pipeline.
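The following is a schematic PyTorch rendering of the three loss terms, assuming the reconstructed formulation above; the exact negative-weighting scheme, the values of the temperature `tau` and ranking scale `alpha`, and the choice of an absolute-error regression penalty are assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def yearclip_loss(sims_style, style_labels, z, year_text_embs, years, y_hat,
                  tau=0.1, alpha=1.0):
    """Combined loss: coarse style CE + fine ranking contrastive (FCRC) + regression.
    tau (temperature) and alpha (ranking scale) are unspecified hyperparameters."""
    # 1. Coarse style classification over the 7 style prompts.
    l_style = F.cross_entropy(sims_style / tau, style_labels)

    # 2. Fine-grained ranking contrastive loss: the year text of each sample is the
    #    positive; other samples' year texts are negatives, re-weighted by the
    #    ordinal distance |y_i - y_j| within the batch (one plausible weighting).
    logits = F.cosine_similarity(z.unsqueeze(1), year_text_embs.unsqueeze(0), dim=-1) / tau
    dist = (years.unsqueeze(1) - years.unsqueeze(0)).abs().float()          # (B, B)
    weights = 1.0 + alpha * dist / dist.max().clamp(min=1.0)                # w_ii = 1, grows with distance
    log_prob = logits.diagonal() - (weights * logits.exp()).sum(dim=1).log()
    l_fcrc = -log_prob.mean()

    # 3. Regression penalty keeping the scalar prediction near the true year.
    l_reg = (y_hat - years.float()).abs().mean()

    return l_style + l_fcrc + l_reg
```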
3. Popularity-Aware Evaluation Metrics
YearCLIP introduces metrics explicitly quantifying popularity bias in construction year estimation.
- Interval Accuracy ($\text{IA}_\delta$): For a tolerance of $\delta$ years, $\text{IA}_\delta = \frac{100}{N}\sum_{i=1}^{N}\mathbb{1}\!\left[\,|\hat{y}_i - y_i| \le \delta\,\right]$ (expressed as a percentage), with $\delta \in \{5, 100\}$ reported as $\text{IA}_5$ and $\text{IA}_{100}$.
- Popularity Stratification: Each image is bucketed into one of five popularity bins by its Wikipedia pageview count. Separate $\text{IA}_\delta$ scores are reported per bin, revealing the effect of fame on model accuracy.
- Popularity Bias Gap (“Gain”): The difference in interval accuracy between the most and least popular bins, $\text{Gain} = \text{IA}_\delta(\text{high popularity}) - \text{IA}_\delta(\text{low popularity})$. A positive Gain signifies that accuracy is skewed toward famous (high-popularity) buildings, evidencing memorization tendencies in VLMs.
These metrics, along with stratification by continent and period, provide granular insight into both model generalization and memorization failure modes.
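A compact NumPy sketch of these metrics is given below. The interval-accuracy and Gain computations follow the definitions above; the quantile thresholds used here to pick the high- and low-popularity subsets are placeholders for the paper's five pageview bins.

```python
import numpy as np

def interval_accuracy(y_pred, y_true, delta):
    """IA_delta: percentage of predictions within +/- delta years of the truth."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return 100.0 * np.mean(np.abs(y_pred - y_true) <= delta)

def popularity_gain(y_pred, y_true, pageviews, delta=5, hi_q=0.8, lo_q=0.2):
    """Gain = IA_delta on high-popularity buildings minus IA_delta on low-popularity ones.
    The quantile cutoffs stand in for the paper's pageview bins (assumption)."""
    y_pred, y_true, pv = map(np.asarray, (y_pred, y_true, pageviews))
    hi, lo = pv >= np.quantile(pv, hi_q), pv <= np.quantile(pv, lo_q)
    return (interval_accuracy(y_pred[hi], y_true[hi], delta)
            - interval_accuracy(y_pred[lo], y_true[lo], delta))
```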
4. Training Workflow and Data Preprocessing
The YearGuessr dataset, curated for this task, comprises 55,546 unique façade images from Wikipedia’s “Buildings_and_structures_by_year_of_completion” category, with GPS, textual, and view-count metadata.
- Preprocessing: Images are deduplicated by title; non-façade images are filtered out using CLIP ViT-B/32 similarity to the phrase “a building façade” (one way to implement such a filter is sketched after this list); anomalous test samples are manually audited.
- Split Strategy: Data is stratified by decade and continent into 60%/20%/20% train/val/test partitions (33,337/11,122/11,087 samples), ensuring no overlap in images or titles across splits.
- Augmentation: Only baseline CLIP preprocessing (224×224 center crop, normalization) is applied; no additional geometric or photometric augmentation is used.
- Optimization: All CLIP encoders are frozen; only the adapter MLPs, the zero-conv layer, and the regressor network are updated. RAdam is used, with separate learning rates for the adapter MLPs/zero-conv and for the regressor, and multi-step learning-rate scheduling (decay milestone at epoch 60). Models are trained for 50 epochs at batch size 64 in FP16, with all loss weights set to 1.0 and fixed values for the contrastive temperature and ranking scale.
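As a rough illustration of the façade-filtering step referenced above, the sketch below scores each image against the phrase “a building façade” with an off-the-shelf CLIP ViT-B/32 via Hugging Face transformers. The contrast prompts and the acceptance threshold are assumptions; the paper specifies only the positive phrase, not its filtering code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def looks_like_facade(image_path, threshold=0.5):
    """Keep an image if 'a building façade' wins against a few contrast prompts.
    Both the contrast prompts and the threshold are illustrative choices."""
    image = Image.open(image_path).convert("RGB")
    prompts = ["a building façade", "an interior room",
               "a map or floor plan", "a portrait of a person"]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return probs[0].item() > threshold
```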
5. Inference Pipeline and Explainability
Inference proceeds as follows for an input façade image (optionally accompanied by GPS coordinates):
- The image is resized, normalized, and encoded into the projected embedding $v$.
- If GPS is available, the location embedding $\ell$ is computed and $z = v + \ell$; otherwise, $z = v$.
- Cosine similarities between $z$ and the style embeddings $\{t^{\text{style}}_k\}$ and reasoning embeddings $\{t^{\text{reason}}_j\}$ are computed.
- The concatenated similarity vector is passed through the regressor to output the year prediction $\hat{y}$, which is clamped to the valid year range of the dataset.
- For transparent post-hoc rationales, the top style and top reasoning cue are identified by their respective maximal similarities.
No smoothing or model ensembling is incorporated in the standard pipeline.
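A small inference sketch building on the hypothetical YearCLIPHead from Section 1 is shown below; the reasoning-cue labels and the clamping range are placeholders, since the paper does not list them explicitly.

```python
import torch

STYLES = ["Roman", "Gothic", "Renaissance", "Baroque",
          "Neoclassical", "Modern", "Contemporary"]
REASONS = ["roof form", "façade material", "window style",
           "building height", "ornamentation"]            # illustrative cue labels

@torch.no_grad()
def predict_with_rationale(head, img_feat, style_embs, reason_embs, gps=None,
                           year_range=(1000.0, 2024.0)):
    """Run the fusion head on a single-image batch, clamp the year, and report the
    top style/reason cues by cosine similarity. `head` is the YearCLIPHead sketched
    in Section 1; `year_range` is a placeholder for the dataset's actual span."""
    prompt_embs = torch.cat([style_embs, reason_embs], dim=0)   # (7 + 5, d)
    y_hat, sims = head(img_feat, prompt_embs, gps=gps)
    y_hat = y_hat.clamp(*year_range)
    top_style = STYLES[sims[0, :len(STYLES)].argmax().item()]
    top_reason = REASONS[sims[0, len(STYLES):].argmax().item()]
    return y_hat.item(), top_style, top_reason
```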
6. Quantitative Performance and Analysis
The following table summarizes YearCLIP’s test performance versus established baselines on the 11,087-sample YearGuessr test split (values averaged over three seeds):
| Method | MAE (years) ↓ | IA_5 (%) ↑ | IA_100 (%) ↑ | IA_5 low-pop (%) ↑ | IA_5 high-pop (%) ↑ | Gain (pp) |
|---|---|---|---|---|---|---|
| ResNet-50 | 54.14 | 10.44 | 88.68 | 12.39 | 9.14 | –3.25 |
| ConvNeXt-B | 44.42 | 14.01 | 90.72 | 16.57 | 12.68 | –3.89 |
| ViT-B/16 | 49.16 | 12.50 | 89.52 | 15.82 | 6.78 | –9.04 |
| GeoCLIP | 45.69 | 23.79 | 89.54 | 24.37 | 19.17 | –5.19 |
| NumCLIP | 40.01 | 18.15 | 91.76 | 21.69 | 11.80 | –9.89 |
| YearCLIP | 39.52 | 18.93 | 91.63 | 20.19 | 12.39 | –7.80 |
| Gemini 2.0–flash | 33.91 | 29.71 | 92.75 | 24.23 | 58.41 | +34.18 |
Key findings are as follows:
- YearCLIP achieves a mean absolute error (MAE) reduction of ~13.5% over GeoCLIP and ~11.0% over ConvNeXt-B, underscoring the impact of multi-modal ordinal-contrastive training.
- Open CLIP-based models, including YearCLIP and GeoCLIP, exhibit negative Gain, signifying better generalization to rare (low-popularity) buildings rather than mere memorization of famous structures.
- Closed-source VLMs (e.g., Gemini 2.0–flash) display strongly positive Gain, indicative of increased accuracy for widely-known buildings and substantial memorization bias.
- By continent, YearCLIP attains its lowest MAE on buildings in the Americas (26.10 yr) and its highest in Africa (85.85 yr). Performance improves markedly on post-1800 buildings (MAE ≈ 27 yr) versus pre-1400 buildings (MAE > 280 yr), partially as a function of the dataset's year distribution.
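As a concrete check of the Gain definition from Section 3 against the table above: $\text{Gain}_{\text{YearCLIP}} = 12.39 - 20.19 = -7.80$, while $\text{Gain}_{\text{Gemini 2.0-flash}} = 58.41 - 24.23 = +34.18$, confirming the opposite bias directions of the open model and the closed-source VLM.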
7. Context, Implications, and Reproducibility
YearCLIP demonstrates that multi-modal ordinal regression with learned geographic fusion and explicit reasoning prompts mitigates popularity bias observed in leading VLMs for temporal-attribute estimation. All code, data splits, and curated prompts are pledged for release under CC BY-SA 4.0 and MIT licenses, ensuring both reproducibility and extensibility. This framework offers a rigorous benchmark for analyzing memorization vs. generalization in vision-language pretraining, salient for the development of robust, explainable temporal-attribute prediction systems (Szu-Tu et al., 24 Dec 2025).