YearCLIP: Multi-Modal Year Prediction
- The paper introduces an innovative method that fuses visual, textual, and geographical data with prompt-based similarity extraction and a coarse-to-fine ordinal regression head.
- The methodology leverages frozen CLIP backbones with learnable MLP adapters, location encoders, and parallel prompt branches to capture nuanced architectural cues.
- The design notably mitigates popularity bias by calibrating ordinal distances between construction years, enhancing temporal reasoning beyond landmark memorization.
YearCLIP is a multi-modal ordinal regression architecture designed to predict the construction year of buildings from photographs, with the explicit goal of mitigating popularity bias in vision-language models (VLMs). Developed for the YearGuessr benchmark, YearCLIP fuses visual, textual, and optional geographic cues using a combination of pre-trained and learnable modules. The architecture is characterized by its prompt-based similarity feature extraction, geographically-aware fusion layer, and a coarse-to-fine ordinal regression head, which collectively enable nuanced temporal reasoning beyond simple memorization of popular landmarks (Szu-Tu et al., 24 Dec 2025).
1. Multi-Modal Pipeline Composition
YearCLIP ingests multi-modal inputs: a 224×224 building façade image $I$ and optional GPS coordinates. The architecture is organized into three main branches (a code sketch follows the list):
- Visual branch: The frozen CLIP ViT-B/16 image encoder converts $I$ to a 512-dimensional visual feature $v$, which passes through a small, learnable MLP adapter to yield the refined feature $f_v$.
- Location branch (optional): GPS coordinates are mapped to random Fourier features and passed through an MLP to produce a 512-dimensional location feature $f_{loc}$, which a learnable zero-initialized 1×1 convolution ("ZeroConv") layer transforms before fusion. Fusion with vision occurs by element-wise addition: $f = f_v$ (no location) or $f = f_v + \mathrm{ZeroConv}(f_{loc})$ (with location).
- Textual prompt branches: Both branches use the frozen CLIP text encoder:
- Style prompts: Seven coarse architectural style tokens produce embeddings $\{t^{\mathrm{sty}}_k\}_{k=1}^{7}$.
- Reasoning prompts: A bank of 20 fine-grained reasoning cues (e.g., roof type, wall material) yields embeddings $\{t^{\mathrm{rsn}}_j\}_{j=1}^{20}$.
Cosine similarities between the fused feature $f$ and the style embeddings ($s^{\mathrm{sty}} \in \mathbb{R}^{7}$) and reasoning embeddings ($s^{\mathrm{rsn}} \in \mathbb{R}^{20}$) are concatenated to form the prompt-based similarity vector $s = [s^{\mathrm{sty}}; s^{\mathrm{rsn}}] \in \mathbb{R}^{27}$.
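A minimal PyTorch sketch of this prompt-based similarity extraction is given below. It assumes OpenAI's `clip` package; the prompt wordings, the adapter layout, and the truncated reasoning bank are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of YearCLIP-style prompt-based similarity extraction.
# Assumptions: OpenAI's `clip` package, illustrative prompt wordings, and a
# truncated reasoning bank (the paper uses 20 subcategories and 7 styles).
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/16", device=device)
clip_model.eval()
for p in clip_model.parameters():          # backbones stay frozen
    p.requires_grad_(False)

# Hypothetical prompt banks (wordings are assumptions, not the paper's prompts).
STYLE_PROMPTS = [f"a photo of a building in {s} style" for s in
                 ["gothic", "baroque", "victorian", "art deco",
                  "modernist", "brutalist", "contemporary"]]
REASONING_PROMPTS = [f"a building with {c}" for c in
                     ["a flat roof", "a gabled roof", "brick walls",
                      "concrete walls", "large glass facades"]]  # ...20 in the paper

with torch.no_grad():
    style_emb = clip_model.encode_text(clip.tokenize(STYLE_PROMPTS).to(device)).float()
    reason_emb = clip_model.encode_text(clip.tokenize(REASONING_PROMPTS).to(device)).float()

# Learnable MLP adapter on top of the frozen 512-d CLIP image feature.
adapter = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).to(device)

def prompt_similarity(image: torch.Tensor) -> torch.Tensor:
    """image: preprocessed batch (B, 3, 224, 224) -> similarity vector (B, n_style + n_reason)."""
    with torch.no_grad():
        v = clip_model.encode_image(image.to(device)).float()    # frozen 512-d feature
    f = F.normalize(adapter(v), dim=-1)   # adapted feature (optional location fusion would add here)
    sims_style = f @ F.normalize(style_emb, dim=-1).T            # (B, 7)
    sims_reason = f @ F.normalize(reason_emb, dim=-1).T          # (B, n_reason)
    return torch.cat([sims_style, sims_reason], dim=-1)
```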
2. Adaptations and Extensions Beyond CLIP
YearCLIP departs from the standard CLIP pipeline through several targeted architectural innovations:
- All CLIP backbones (the image encoder and the text encoders used for the style and reasoning prompt branches) remain frozen to preserve broad visual-textual knowledge while enabling task-specific adaptation via lightweight modules.
- A vision MLP adapter is introduced immediately after the CLIP vision tower.
- A specialized location encoder combines random Fourier feature mapping with an MLP, followed by a zero-initialized convolution ("ZeroConv") layer, enabling location-based disambiguation without overwhelming the visual signal (sketched in code after this list).
- Parallel prompt branches inject explicit architectural priors, encouraging the extraction of style and construction reasoning cues otherwise underrepresented in generic VLM pretraining.
- Contrastive head replacement: The traditional CLIP contrastive head is supplanted by a coarse-to-fine ordinal regression head $g(\cdot)$, optimized for the temporally structured regression target.
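The location branch can be sketched as follows. The dimensions follow Section 5, and the zero initialization follows the description above; the random-Fourier-feature frequency scale `sigma` and the MLP depth are assumptions.

```python
# Sketch of the location branch: random Fourier features -> MLP -> zero-initialized
# 1x1 convolution ("ZeroConv"). Dimensions follow Section 5; the frequency scale
# `sigma` and the MLP depth are assumptions.
import math
import torch
import torch.nn as nn

class LocationEncoder(nn.Module):
    def __init__(self, rff_dim: int = 60, feat_dim: int = 512, sigma: float = 1.0):
        super().__init__()
        # Fixed random projection matrix for Fourier features of (lat, lon).
        self.register_buffer("freqs", torch.randn(2, rff_dim // 2) * sigma)
        self.mlp = nn.Sequential(nn.Linear(rff_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))
        # ZeroConv: weights and bias start at zero, so the location contribution
        # is exactly zero at initialization and is learned gradually.
        self.zero_conv = nn.Conv1d(feat_dim, feat_dim, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        """coords: (B, 2) normalized GPS coordinates -> (B, 512) location feature."""
        proj = 2 * math.pi * coords @ self.freqs                     # (B, rff_dim/2)
        rff = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)  # (B, rff_dim)
        f_loc = self.mlp(rff)                                        # (B, 512)
        return self.zero_conv(f_loc.unsqueeze(-1)).squeeze(-1)       # all zeros at init

loc_encoder = LocationEncoder()
f_loc = loc_encoder(torch.tensor([[0.542, 0.026]]))  # fused later as f = f_v + f_loc
```

Because fusion is element-wise addition, the model behaves identically to the image-only branch until the ZeroConv weights move away from zero.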
3. Ordinal Regression Head: Design and Output
The regression head receives the prompt-based similarity vector $s$ as input, with dimensionality $7 + 20 = 27$ (7 style similarities plus 20 reasoning similarities, so $s \in \mathbb{R}^{27}$). Its structure is as follows (sketched in code after this list):
- Hidden layer: Fully connected, 512 units, ReLU activation, optional dropout (0.1).
- Output layer: Fully connected, producing 7 logits corresponding to 7 coarse temporal bins (aligned with major architectural periods).
- Prediction: A softmax layer yields a distribution $p = (p_1, \ldots, p_7)$ over the bins; the continuous prediction is generated as the probability-weighted average of the bin midpoints $m_k$: $\hat{y} = \sum_{k=1}^{7} p_k\, m_k$, with $\sum_{k} p_k = 1$.
- Rationale output: Returns the argmax style token and most influential reasoning subcategory per group, supporting interpretability.
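A minimal sketch of this head is shown below. The bin-midpoint values are placeholders, not the benchmark's actual period boundaries.

```python
# Sketch of the coarse-to-fine ordinal head: 27-d similarity vector -> 7 bin logits ->
# continuous year as the softmax-weighted average of bin midpoints. The midpoint
# values below are placeholders, not the benchmark's actual period boundaries.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrdinalYearHead(nn.Module):
    def __init__(self, in_dim: int = 27, hidden: int = 512, n_bins: int = 7,
                 bin_midpoints=None, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Dropout(dropout), nn.Linear(hidden, n_bins))
        mids = bin_midpoints if bin_midpoints is not None else \
            [1825.0, 1875.0, 1910.0, 1935.0, 1960.0, 1985.0, 2010.0]  # placeholder midpoints
        self.register_buffer("midpoints", torch.tensor(mids))

    def forward(self, s: torch.Tensor):
        logits = self.net(s)                      # (B, 7) coarse-bin logits
        probs = F.softmax(logits, dim=-1)         # distribution p over bins
        year = (probs * self.midpoints).sum(-1)   # y_hat = sum_k p_k * m_k
        return logits, year

head = OrdinalYearHead()
logits, year_hat = head(torch.randn(4, 27))       # 4 similarity vectors -> 4 year estimates
```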
4. Training Objectives and Evaluation Metrics
YearCLIP is trained with multi-term objectives designed to enforce ordinal structure and calibrate uncertainty:
- Fine-grained Cross-modal Ranking-based Contrastive Loss (FCRC): a contrastive objective with label-dependent weights $w_{ij}$ that increase the penalty for negatives whose construction years are close to the anchor's.
- Auxiliary objectives: Equally weighted cross-entropy for bin classification, KL-divergence from smoothed targets, and (optionally) a regression loss on the continuous prediction $\hat{y}$.
- Metrics (sketched in code after this list):
- MAE: mean absolute error between the predicted year $\hat{y}$ and the ground-truth year $y$.
- Interval Accuracy (IA): fraction of predictions with $|\hat{y} - y| \le \delta$ for a tolerance $\delta$ (in years).
- Popularity-aware interval accuracy and “popularity gain”: IA computed separately for popular and rare buildings, with the gain defined as the difference between the two.
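The evaluation metrics can be sketched as follows; the `is_popular` mask is assumed to come from the benchmark's popularity annotation and is not derived here.

```python
# Sketch of the evaluation metrics described above. The `is_popular` mask is assumed
# to come from the benchmark's popularity annotation and is not derived here.
import torch

def mae(pred: torch.Tensor, target: torch.Tensor) -> float:
    """Mean absolute error between predicted and ground-truth construction years."""
    return (pred - target).abs().mean().item()

def interval_accuracy(pred: torch.Tensor, target: torch.Tensor, delta: float) -> float:
    """Fraction of predictions within +/- delta years of the ground truth."""
    return ((pred - target).abs() <= delta).float().mean().item()

def popularity_gain(pred, target, is_popular: torch.Tensor, delta: float) -> float:
    """IA on popular buildings minus IA on rare ones; smaller values suggest less bias."""
    ia_popular = interval_accuracy(pred[is_popular], target[is_popular], delta)
    ia_rare = interval_accuracy(pred[~is_popular], target[~is_popular], delta)
    return ia_popular - ia_rare
```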
5. Hyperparameter Choices and Optimization
Component configurations and training recipes are as follows:
| Module | Architecture | Hyperparameters |
|---|---|---|
| CLIP encoders | ViT-B/16, 512-dim output | Frozen; no fine-tuning |
| Vision adapter | MLP (512 → 512) | RAdam, lr=1e–5 |
| Ordinal regression | MLP (27 → 512 → 7) | Adam, lr=1e–4 |
| Reasoning bank | 20 subcategories, 7 styles | Frozen prompts |
| Location encoder | RFF (dim=60) → MLP (512) → ZeroConv | – |
| Training schedule | – | Batch 64, 50 epochs, dropout 0.1, weight decay 1e–4, mixed precision (16-bit) |
All loss terms are weighted equally.
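A sketch of the corresponding optimization setup, with placeholder modules standing in for the learnable parts described above:

```python
# Sketch of the optimization setup from the table above, with placeholder modules
# standing in for the learnable parts (adapter, ordinal head) sketched earlier.
import torch
import torch.nn as nn

adapter = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
ordinal_head = nn.Sequential(nn.Linear(27, 512), nn.ReLU(), nn.Dropout(0.1), nn.Linear(512, 7))

opt_adapter = torch.optim.RAdam(adapter.parameters(), lr=1e-5, weight_decay=1e-4)
opt_head = torch.optim.Adam(ordinal_head.parameters(), lr=1e-4, weight_decay=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())  # 16-bit mixed precision

# Training outline (batch size 64, 50 epochs): the total loss is the equally weighted
# sum of the FCRC, cross-entropy, and KL terms described in Section 4.
```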
6. Novel Components and Equations
YearCLIP introduces several notable mathematical and architectural innovations:
- ZeroConv fusion: $f = f_v + \mathrm{ZeroConv}(f_{loc})$, with the 1×1 convolution weights initialized to zero so that $\mathrm{ZeroConv}(f_{loc}) = 0$ and $f = f_v$ initially, allowing the network to learn selective location conditioning.
- Prompt-based similarity composite: $s = [\cos(f, t^{\mathrm{sty}}_1), \ldots, \cos(f, t^{\mathrm{sty}}_7), \cos(f, t^{\mathrm{rsn}}_1), \ldots, \cos(f, t^{\mathrm{rsn}}_{20})] \in \mathbb{R}^{27}$.
- Coarse-to-fine year regression: $\hat{y} = \sum_{k=1}^{7} p_k\, m_k$ with $p = \mathrm{softmax}(g(s))$.
- Distance-weighted negatives in FCRC: label-dependent weights $w_{ij}$, decreasing in the year gap $|y_i - y_j|$, modulate the magnitude of the contrastive penalty so that temporally close negatives are penalized more (a hedged sketch follows).
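Since the exact FCRC formulation is not reproduced here, the following is only a hedged sketch: it assumes an InfoNCE-style image-to-text objective and exponential weights $w_{ij} = \exp(-|y_i - y_j|/\tau_{\mathrm{year}})$, chosen purely to illustrate how label-dependent weights can emphasize temporally close negatives.

```python
# Hedged sketch of distance-weighted negatives in a ranking-based contrastive loss.
# Assumptions: an InfoNCE-style image-to-text objective and exponential weights
# w_ij = exp(-|y_i - y_j| / tau_year). This is NOT the paper's exact FCRC definition;
# it only illustrates how label-dependent weights emphasize temporally close negatives.
import torch
import torch.nn.functional as F

def fcrc_like_loss(img_feat, txt_feat, years, tau_sim: float = 0.07, tau_year: float = 25.0):
    """img_feat, txt_feat: (B, D) embeddings of matched pairs; years: (B,) construction years."""
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    sim = img @ txt.T / tau_sim                                   # (B, B) cross-modal similarities
    year_gap = (years[:, None] - years[None, :]).abs().float()    # pairwise |y_i - y_j|
    w = torch.exp(-year_gap / tau_year)                           # close years -> weight near 1
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    pos = sim.diagonal()                                          # matched image-text pairs
    # Weighted InfoNCE denominator: temporally close negatives contribute more.
    neg = (w.masked_fill(eye, 0.0) * sim.exp()).sum(dim=-1)
    return (torch.log(pos.exp() + neg) - pos).mean()

loss = fcrc_like_loss(torch.randn(8, 512), torch.randn(8, 512),
                      torch.randint(1850, 2020, (8,)).float())
```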
7. Mitigation of Popularity Bias
YearCLIP's design explicitly addresses the over-reliance of large VLMs on memorization of high-profile, widely-photographed structures. Critical mitigation mechanisms include:
- Ordinal contrastive loss enforces respect for numeric distances between construction years, limiting superficial text-match memorization.
- Low-level architectural cues from the reasoning prompts ensure the model leverages construction-related visual properties rather than relying only on global or superficial patterns.
- Geographically-aware fusion enables local disambiguation without learning to "shortcut" via memorized geolocated images.
- YearCLIP shows only a modest drop in IA from popular to rare buildings (–7.8%), compared with much larger popularity gains (16–34%) for closed-source VLMs, providing evidence of reduced reliance on landmark memorization and greater capacity for generalizable temporal reasoning (Szu-Tu et al., 24 Dec 2025).