YearCLIP: Multi-Modal Year Estimation Model
- The paper introduces YearCLIP, a vision-language model that recasts building construction year prediction as a multi-modal ordinal regression problem using a frozen CLIP ViT-B/16 backbone.
- It integrates image, geographic, and textual priors through a learned fusion pipeline that employs coarse-to-fine contrastive learning and style-specific reasoning prompts.
- Evaluation on the YearGuessr benchmark shows that YearCLIP reduces mean absolute error and mitigates popularity bias compared to established baseline models.
YearCLIP is a vision-language model designed for building construction year estimation on the large-scale, multi-modal YearGuessr benchmark. It leverages a frozen CLIP ViT-B/16 backbone for both image and text encoding, integrates structured geographic and textual priors, and recasts year prediction as a multi-modal ordinal regression problem, which lets it characterize both architectural history and popularity bias in model behavior. Its architecture combines coarse-to-fine ordinal contrastive learning, explainable architectural-style and reasoning prompts, and learned multi-modal fusion, achieving state-of-the-art performance among open models without heavy reliance on memorization of famous buildings (Szu-Tu et al., 24 Dec 2025).
1. Model Architecture and Multi-Modal Fusion
YearCLIP’s pipeline is constructed upon five core modules:
- Image Encoder ($E_I$): Each façade image is resized to $224 \times 224$ and processed by CLIP's frozen visual encoder, producing a raw feature $v_{\text{raw}}$. A trainable MLP then projects this to a $d$-dimensional embedding $v$.
- Location Encoder ($E_L$): GPS coordinates (latitude, longitude) undergo a random-Fourier-feature (RFF) mapping, followed by an MLP, to yield a $d$-dimensional location feature. This passes through a 'zero convolution' layer (a 1D convolution with zero-initialized weights) to yield the location embedding $\ell$. Because the layer starts at zero, the network learns a residual fusion weight for geographic priors from scratch.
- Text Branches:
- Style-class Encoder ($E_S$): Seven coarse architectural style tokens (Roman, Gothic, Renaissance, Baroque, Neoclassical, Modern, Contemporary) are encoded by CLIP's frozen text encoder into style embeddings $\{t^{\text{style}}_k\}_{k=1}^{7}$.
- Reason-prompt Encoder ($E_R$): Manually defined prompts describing architectural cues (e.g., roof form, material, height) are encoded in the same way into reasoning embeddings $\{t^{\text{reason}}_j\}$.
- Coarse-to-Fine Regressor ($R$): The fused image-plus-location embedding $z = v + \ell$ and the text embeddings are fed to a trainable network that computes cosine similarities between $z$ and each style and reasoning prompt. The resulting similarity vector $s \in \mathbb{R}^{K}$ (for $K$ prompts in total) is mapped to a scalar construction year prediction $\hat{y}$.
Fusion is fully learned: the image and location representations are summed after learnable scaling, without manual weighting, and the GPS branch (fused via the zero-conv) can simply be omitted at inference if location data is unavailable.
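A minimal PyTorch sketch of this fusion pipeline, under stated assumptions, is given below. The module names (ZeroConv1d, LocationEncoder, YearCLIPHead), layer widths, RFF parameters, and the count of reasoning prompts are illustrative choices rather than values from the paper; only the overall structure (frozen CLIP features, trainable adapters, zero-initialized location fusion, prompt similarities feeding a regressor) follows the description above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroConv1d(nn.Module):
    """1-D convolution with zero-initialized weights: the location branch
    contributes nothing at the start of training and its influence is learned."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):                                   # x: (B, d)
        return self.conv(x.unsqueeze(-1)).squeeze(-1)

class LocationEncoder(nn.Module):
    """Random-Fourier-feature mapping of (lat, lon), then an MLP and zero-conv.
    The RFF bandwidth and hidden widths are assumptions."""
    def __init__(self, dim=512, n_freq=128, sigma=1.0):
        super().__init__()
        self.register_buffer("freqs", torch.randn(2, n_freq) * sigma)
        self.mlp = nn.Sequential(nn.Linear(2 * n_freq, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.zero_conv = ZeroConv1d(dim)

    def forward(self, gps):                                 # gps: (B, 2)
        proj = 2 * math.pi * gps @ self.freqs
        rff = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return self.zero_conv(self.mlp(rff))                # residual location term

class YearCLIPHead(nn.Module):
    """Image adapter + location fusion + prompt similarities + coarse-to-fine regressor."""
    def __init__(self, dim=512, n_prompts=7 + 5):           # 7 styles + (assumed) 5 reason prompts
        super().__init__()
        self.img_adapter = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.loc_encoder = LocationEncoder(dim)
        self.regressor = nn.Sequential(nn.Linear(n_prompts, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, img_feat, prompt_embs, gps=None):
        v = self.img_adapter(img_feat)                      # projected frozen-CLIP image feature
        z = v + self.loc_encoder(gps) if gps is not None else v
        sims = F.cosine_similarity(z.unsqueeze(1), prompt_embs.unsqueeze(0), dim=-1)
        return self.regressor(sims).squeeze(-1), sims       # (predicted year, similarity vector)
```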
2. Ordinal Regression Formulation and Losses
YearCLIP frames year prediction as an ordinal problem with two levels of supervision:
- Coarse Style Classification: Each true year is mapped to one of the seven architectural style bins. For each input, the model computes similarities between the fused embedding $z$ and the style embeddings, converts them to probabilities $p_c$ via a softmax, and applies a cross-entropy loss $\mathcal{L}_{\text{style}} = -\sum_{c} y_c \log p_c$, where $y_c$ is the one-hot style label.
- Fine-grained Ranking-based Contrastive Loss (FCRC): Inspired by NumCLIP, this component treats year labels as free-form text tokens (“the year 2024”) and applies a batch-wise ranking contrastive loss. For each sample $i$ in a batch of size $B$, the fused embedding $z_i$ is pulled toward the text embedding $t_{y_i}$ of its own year and pushed away from the year embeddings of the other samples. Negatives are weighted by their ordinal label distance $|y_i - y_j|$, so that candidate years far from the ground truth are penalized more strongly. The FCRC loss takes the form
$\mathcal{L}_{\text{FCRC}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(\mathrm{sim}(z_i, t_{y_i})/\tau)}{\sum_{j=1}^{B} w_{ij}\,\exp(\mathrm{sim}(z_i, t_{y_j})/\tau)},$
where $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, $\tau$ is a temperature, $w_{ii}=1$, and $w_{ij}$ increases with $|y_i - y_j|$ for $j \neq i$.
- Regression Penalty: A norm penalty on the scalar output ensures final predictions are numerically close to the ground-truth year: $\mathcal{L}_{\text{reg}} = \frac{1}{B}\sum_{i=1}^{B} \lVert \hat{y}_i - y_i \rVert$.
- Combined Loss: The total loss is an unweighted sum of the three terms: $\mathcal{L} = \mathcal{L}_{\text{style}} + \mathcal{L}_{\text{FCRC}} + \mathcal{L}_{\text{reg}}$.
This structured approach jointly supervises high-level style and fine-grained ordinal year distinctions, while permitting gradient flow through the multi-modal fusion pipeline.
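The following is a schematic PyTorch rendering of the three loss terms, assuming the reconstructed formulation above; the exact negative-weighting scheme, the values of the temperature `tau` and ranking scale `alpha`, and the choice of an absolute-error regression penalty are assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def yearclip_loss(sims_style, style_labels, z, year_text_embs, years, y_hat,
                  tau=0.1, alpha=1.0):
    """Combined loss: coarse style CE + fine ranking contrastive (FCRC) + regression.
    tau (temperature) and alpha (ranking scale) are unspecified hyperparameters."""
    # 1. Coarse style classification over the 7 style prompts.
    l_style = F.cross_entropy(sims_style / tau, style_labels)

    # 2. Fine-grained ranking contrastive loss: the year text of each sample is the
    #    positive; other samples' year texts are negatives, re-weighted by the
    #    ordinal distance |y_i - y_j| within the batch (one plausible weighting).
    logits = F.cosine_similarity(z.unsqueeze(1), year_text_embs.unsqueeze(0), dim=-1) / tau
    dist = (years.unsqueeze(1) - years.unsqueeze(0)).abs().float()          # (B, B)
    weights = 1.0 + alpha * dist / dist.max().clamp(min=1.0)                # w_ii = 1, grows with distance
    log_prob = logits.diagonal() - (weights * logits.exp()).sum(dim=1).log()
    l_fcrc = -log_prob.mean()

    # 3. Regression penalty keeping the scalar prediction near the true year.
    l_reg = (y_hat - years.float()).abs().mean()

    return l_style + l_fcrc + l_reg
```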
3. Popularity-Aware Evaluation Metrics
YearCLIP introduces metrics explicitly quantifying popularity bias in construction year estimation.
- Interval Accuracy ($\text{IA}_\delta$): For a tolerance of $\delta$ years, $\text{IA}_\delta = \frac{100}{N}\sum_{i=1}^{N}\mathbb{1}\!\left[\,|\hat{y}_i - y_i| \le \delta\,\right]$ (expressed as a percentage), with $\delta \in \{5, 100\}$ reported as $\text{IA}_5$ and $\text{IA}_{100}$.
- Popularity Stratification: Each image is bucketed into one of five popularity bins by its Wikipedia pageview count. Separate $\text{IA}_\delta$ scores are reported per bin, revealing the effect of fame on model accuracy.
- Popularity Bias Gap (“Gain”): The difference in interval accuracy between the most and least popular bins, $\text{Gain} = \text{IA}_\delta(\text{high popularity}) - \text{IA}_\delta(\text{low popularity})$. A positive Gain signifies that accuracy is skewed toward famous (high-popularity) buildings, evidencing memorization tendencies in VLMs.
These metrics, along with stratification by continent and period, provide granular insight into both model generalization and memorization failure modes.
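A compact NumPy sketch of these metrics is given below. The interval-accuracy and Gain computations follow the definitions above; the quantile thresholds used here to pick the high- and low-popularity subsets are placeholders for the paper's five pageview bins.

```python
import numpy as np

def interval_accuracy(y_pred, y_true, delta):
    """IA_delta: percentage of predictions within +/- delta years of the truth."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return 100.0 * np.mean(np.abs(y_pred - y_true) <= delta)

def popularity_gain(y_pred, y_true, pageviews, delta=5, hi_q=0.8, lo_q=0.2):
    """Gain = IA_delta on high-popularity buildings minus IA_delta on low-popularity ones.
    The quantile cutoffs stand in for the paper's pageview bins (assumption)."""
    y_pred, y_true, pv = map(np.asarray, (y_pred, y_true, pageviews))
    hi, lo = pv >= np.quantile(pv, hi_q), pv <= np.quantile(pv, lo_q)
    return (interval_accuracy(y_pred[hi], y_true[hi], delta)
            - interval_accuracy(y_pred[lo], y_true[lo], delta))
```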
4. Training Workflow and Data Preprocessing
The YearGuessr dataset, curated for this task, comprises 55,546 unique façade images from Wikipedia’s “Buildings_and_structures_by_year_of_completion” category, with GPS, textual, and view-count metadata.
- Preprocessing: Images are deduplicated by title; non-façade images are filtered out using CLIP ViT-B/32 similarity to the phrase “a building façade” (one way to implement such a filter is sketched after this list); anomalous test samples are manually audited.
- Split Strategy: Data is stratified by decade and continent into 60%/20%/20% train/val/test partitions (33,337/11,122/11,087 samples), ensuring no overlap in images or titles across splits.
- Augmentation: Only baseline CLIP preprocessing (224×224 center crop, normalization) is applied; no additional geometric or photometric augmentation is used.
- Optimization: All CLIP encoders are frozen; only the adapter MLPs, the zero-conv layer, and the regressor network are updated. RAdam is used, with separate learning rates for the adapter MLPs/zero-conv and for the regressor, and multi-step learning-rate scheduling (decay milestone at epoch 60). Models are trained for 50 epochs at batch size 64 in FP16, with all loss weights set to 1.0 and fixed values for the contrastive temperature and ranking scale.
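As a rough illustration of the façade-filtering step referenced above, the sketch below scores each image against the phrase “a building façade” with an off-the-shelf CLIP ViT-B/32 via Hugging Face transformers. The contrast prompts and the acceptance threshold are assumptions; the paper specifies only the positive phrase, not its filtering code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def looks_like_facade(image_path, threshold=0.5):
    """Keep an image if 'a building façade' wins against a few contrast prompts.
    Both the contrast prompts and the threshold are illustrative choices."""
    image = Image.open(image_path).convert("RGB")
    prompts = ["a building façade", "an interior room",
               "a map or floor plan", "a portrait of a person"]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return probs[0].item() > threshold
```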
5. Inference Pipeline and Explainability
Inference proceeds as follows for an input façade image (optionally accompanied by GPS coordinates):
- The image is resized, normalized, and encoded into the projected embedding $v$.
- If GPS is available, the location embedding $\ell$ is computed and $z = v + \ell$; otherwise, $z = v$.
- Cosine similarities between $z$ and the style embeddings $\{t^{\text{style}}_k\}$ and reasoning embeddings $\{t^{\text{reason}}_j\}$ are computed.
- The concatenated similarity vector is passed through the regressor to output the year prediction $\hat{y}$, which is clamped to the valid year range of the dataset.
- For transparent post-hoc rationales, the top style and top reasoning cue are identified by their respective maximal similarities.
No smoothing or model ensembling is incorporated in the standard pipeline.
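A small inference sketch building on the hypothetical YearCLIPHead from Section 1 is shown below; the reasoning-cue labels and the clamping range are placeholders, since the paper does not list them explicitly.

```python
import torch

STYLES = ["Roman", "Gothic", "Renaissance", "Baroque",
          "Neoclassical", "Modern", "Contemporary"]
REASONS = ["roof form", "façade material", "window style",
           "building height", "ornamentation"]            # illustrative cue labels

@torch.no_grad()
def predict_with_rationale(head, img_feat, style_embs, reason_embs, gps=None,
                           year_range=(1000.0, 2024.0)):
    """Run the fusion head on a single-image batch, clamp the year, and report the
    top style/reason cues by cosine similarity. `head` is the YearCLIPHead sketched
    in Section 1; `year_range` is a placeholder for the dataset's actual span."""
    prompt_embs = torch.cat([style_embs, reason_embs], dim=0)   # (7 + 5, d)
    y_hat, sims = head(img_feat, prompt_embs, gps=gps)
    y_hat = y_hat.clamp(*year_range)
    top_style = STYLES[sims[0, :len(STYLES)].argmax().item()]
    top_reason = REASONS[sims[0, len(STYLES):].argmax().item()]
    return y_hat.item(), top_style, top_reason
```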
6. Quantitative Performance and Analysis
The following table summarizes YearCLIP’s test performance versus established baselines on the 11,087-sample YearGuessr test split (values averaged over three seeds):
| Method | MAE (years) ↓ | IA_5 (%) ↑ | IA_100 (%) ↑ | IA_5 low-pop (%) ↑ | IA_5 high-pop (%) ↑ | Gain (pp) |
|---|---|---|---|---|---|---|
| ResNet-50 | 54.14 | 10.44 | 88.68 | 12.39 | 9.14 | –3.25 |
| ConvNeXt-B | 44.42 | 14.01 | 90.72 | 16.57 | 12.68 | –3.89 |
| ViT-B/16 | 49.16 | 12.50 | 89.52 | 15.82 | 6.78 | –9.04 |
| GeoCLIP | 45.69 | 23.79 | 89.54 | 24.37 | 19.17 | –5.19 |
| NumCLIP | 40.01 | 18.15 | 91.76 | 21.69 | 11.80 | –9.89 |
| YearCLIP | 39.52 | 18.93 | 91.63 | 20.19 | 12.39 | –7.80 |
| Gemini 2.0–flash | 33.91 | 29.71 | 92.75 | 24.23 | 58.41 | +34.18 |
Key findings are as follows:
- YearCLIP achieves a mean absolute error (MAE) reduction of ~13.5% over GeoCLIP and ~11.0% over ConvNeXt-B, underscoring the impact of multi-modal ordinal-contrastive training.
- Open CLIP-based models, including YearCLIP and GeoCLIP, exhibit negative Gain, signifying better generalization to rare (low-popularity) buildings rather than mere memorization of famous structures.
- Closed-source VLMs (e.g., Gemini 2.0–flash) display strongly positive Gain, indicative of increased accuracy for widely-known buildings and substantial memorization bias.
- By continent, YearCLIP attains its lowest MAE on buildings in the Americas (26.10 yr) and its highest in Africa (85.85 yr). Performance improves markedly on post-1800 buildings (MAE ≈ 27 yr) versus pre-1400 buildings (MAE > 280 yr), partially as a function of the dataset's year distribution.
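As a concrete check of the Gain definition from Section 3 against the table above: $\text{Gain}_{\text{YearCLIP}} = 12.39 - 20.19 = -7.80$, while $\text{Gain}_{\text{Gemini 2.0-flash}} = 58.41 - 24.23 = +34.18$, confirming the opposite bias directions of the open model and the closed-source VLM.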
7. Context, Implications, and Reproducibility
YearCLIP demonstrates that multi-modal ordinal regression with learned geographic fusion and explicit reasoning prompts mitigates popularity bias observed in leading VLMs for temporal-attribute estimation. All code, data splits, and curated prompts are pledged for release under CC BY-SA 4.0 and MIT licenses, ensuring both reproducibility and extensibility. This framework offers a rigorous benchmark for analyzing memorization vs. generalization in vision-language pretraining, salient for the development of robust, explainable temporal-attribute prediction systems (Szu-Tu et al., 24 Dec 2025).