
YearCLIP: Multi-Modal Year Prediction

Updated 27 December 2025
  • The paper introduces an innovative method that fuses visual, textual, and geographical data with prompt-based similarity extraction and a coarse-to-fine ordinal regression head.
  • The methodology leverages frozen CLIP backbones with learnable MLP adapters, location encoders, and parallel prompt branches to capture nuanced architectural cues.
  • The design notably mitigates popularity bias by calibrating ordinal distances between construction years, enhancing temporal reasoning beyond landmark memorization.

YearCLIP is a multi-modal ordinal regression architecture designed to predict the construction year of buildings from photographs, with the explicit goal of mitigating popularity bias in vision-language models (VLMs). Developed for the YearGuessr benchmark, YearCLIP fuses visual, textual, and optional geographic cues using a combination of pre-trained and learnable modules. The architecture is characterized by its distinctive prompt-based similarity feature extraction, geographically-aware fusion layer, and a coarse-to-fine ordinal regression head that collectively enable nuanced temporal reasoning beyond simple memorization of popular landmarks (Szu-Tu et al., 24 Dec 2025).

1. Multi-Modal Pipeline Composition

YearCLIP ingests multi-modal inputs, including a 224×224 building façade image $I$ and optional GPS coordinates $g = (\phi, \lambda)$. The architecture is organized into three main branches:

  • Visual branch: The frozen CLIP ViT-B/16 image encoder $f_v$ converts $I$ to a 512-dimensional visual feature $z_v^{raw}$. This passes through a small, learnable MLP adapter to yield $z_v \in \mathbb{R}^{512}$.
  • Location branch (optional): GPS coordinates are mapped to random Fourier features, passed through an MLP to produce $z_l^{raw} \in \mathbb{R}^{512}$, and further transformed by a learnable zero-initialized 1×1 convolution ("ZeroConv") layer to obtain $z_l$. Fusion with vision occurs by element-wise addition: $z_{input} = z_v$ (no location) or $z_{input} = z_v + z_l$ (with location).
  • Textual prompt branches: Both branches use the frozen CLIP text encoder:
    • Style prompts: Seven coarse architectural style tokens $\{s_i\}$ produce embeddings $\{z_{c_i}\}$.
    • Reasoning prompts: A set of fine-grained reasoning cues (e.g., roof type, wall material) $\{r_{jk}\}$ yield embeddings $\{z_{r_{jk}}\}$.

Cosine similarities between $z_{input}$ and the style embeddings ($\cos(z_{input}, z_{c_i})$) and the reasoning embeddings ($\cos(z_{input}, z_{r_{jk}})$) are concatenated to form a prompt-based similarity vector $s \in \mathbb{R}^{7 + \sum_j |\mathrm{subcats}_j|}$.
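To make the feature path concrete, the following is a minimal PyTorch sketch of the adapter, fusion, and prompt-similarity steps, assuming the frozen CLIP image and text features are already computed; the adapter depth, module names, and the random tensors standing in for CLIP outputs are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionAdapter(nn.Module):
    """Learnable MLP adapter applied after the frozen CLIP vision tower."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, z_v_raw: torch.Tensor) -> torch.Tensor:
        return self.net(z_v_raw)

def prompt_similarity_vector(z_input: torch.Tensor,
                             style_embs: torch.Tensor,      # (7, 512) frozen text embeddings
                             reasoning_embs: torch.Tensor   # (R, 512) frozen text embeddings
                             ) -> torch.Tensor:
    """Concatenate cosine similarities to the style and reasoning prompt embeddings."""
    z = F.normalize(z_input, dim=-1)
    s_style = z @ F.normalize(style_embs, dim=-1).t()       # (B, 7)
    s_reason = z @ F.normalize(reasoning_embs, dim=-1).t()  # (B, R)
    return torch.cat([s_style, s_reason], dim=-1)           # (B, 7 + R)

# Shape-only walk-through; random tensors stand in for frozen CLIP outputs.
B, R = 4, 20
z_v = VisionAdapter()(torch.randn(B, 512))   # adapted visual feature
z_l = torch.zeros(B, 512)                    # optional location feature (zero if absent)
z_input = z_v + z_l                          # element-wise fusion
s = prompt_similarity_vector(z_input, torch.randn(7, 512), torch.randn(R, 512))
print(s.shape)  # torch.Size([4, 27])
```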

2. Adaptations and Extensions Beyond CLIP

YearCLIP departs from the standard CLIP pipeline through several targeted architectural innovations:

  • All CLIP backbones (vision, caption, reasoning encoders) remain frozen to preserve broad visual-textual knowledge while enabling task-specific adaptation via lightweight modules.
  • A vision MLP adapter is introduced immediately after the CLIP vision tower.
  • A specialized location encoder combines random Fourier feature mapping with an MLP, followed by a zero-initialized convolution ("ZeroConv") layer, enabling location-based disambiguation without overwhelming the visual signal.
  • Parallel prompt branches inject explicit architectural priors, encouraging the extraction of style and construction reasoning cues otherwise underrepresented in generic VLM pretraining.
  • Contrastive head replacement: The traditional CLIP contrastive head is supplanted by a coarse-to-fine ordinal regression head ($g$), optimized for the temporally structured regression target.
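The location branch described above can be sketched as follows; the RFF bandwidth, the MLP depth, and the use of a zero-initialized linear layer in place of the 1×1 "ZeroConv" are assumptions where the summary does not pin them down.

```python
import math
import torch
import torch.nn as nn

class LocationEncoder(nn.Module):
    """GPS (phi, lambda) -> random Fourier features -> MLP -> zero-initialized projection."""
    def __init__(self, rff_dim: int = 60, out_dim: int = 512, sigma: float = 1.0):
        super().__init__()
        # Fixed (untrained) random projection for the Fourier features.
        self.register_buffer("W", torch.randn(2, rff_dim // 2) * sigma)
        self.mlp = nn.Sequential(nn.Linear(rff_dim, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))
        # "ZeroConv": a 1x1 convolution on a feature vector is a per-feature linear map,
        # so a zero-initialized Linear layer plays the same role here.
        self.zero_proj = nn.Linear(out_dim, out_dim)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (B, 2) latitude/longitude, assumed rescaled to a small range.
        proj = 2 * math.pi * coords @ self.W                # (B, rff_dim // 2)
        rff = torch.cat([proj.sin(), proj.cos()], dim=-1)   # (B, rff_dim)
        return self.zero_proj(self.mlp(rff))                # (B, out_dim)

z_l = LocationEncoder()(torch.tensor([[0.43, -1.29]]))
print(z_l.abs().max())  # exactly zero at initialization, so the visual signal dominates early on
```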

3. Ordinal Regression Head: Design and Output

The regression head receives the prompt-based similarity vector $s$ as input, with dimensionality $D_{in} = 7 + R$ (e.g., $R \approx 20$, so $D_{in} \approx 27$). Its structure is as follows:

  • Hidden layer: Fully connected, 512 units, ReLU activation, optional dropout (0.1).
  • Output layer: Fully connected, $k = 7$ logits corresponding to coarse temporal bins (aligned with major architectural periods).
  • Prediction: A softmax layer yields a distribution $\vec{p}$ over bins; the continuous prediction is a weighted average of the bin midpoints $\{b_i\}$ (see the sketch after this list):

\hat{y} = \sum_{i=1}^k p_i \cdot b_i,

with $p_i = \mathrm{Softmax}(g(s))_i$.

  • Rationale output: Returns the argmax style token and most influential reasoning subcategory per group, supporting interpretability.
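A minimal sketch of this head, using the dimensions stated above; the bin midpoints are placeholders, since the actual period boundaries are defined by the benchmark.

```python
import torch
import torch.nn as nn

class OrdinalHead(nn.Module):
    """Prompt-similarity vector s -> 7 coarse-bin logits -> soft year estimate."""
    def __init__(self, in_dim: int = 27, hidden: int = 512, k: int = 7,
                 bin_midpoints=(1250, 1550, 1750, 1850, 1910, 1960, 2005)):  # placeholder midpoints
        super().__init__()
        self.g = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                               nn.Dropout(0.1), nn.Linear(hidden, k))
        self.register_buffer("b", torch.tensor(bin_midpoints, dtype=torch.float))

    def forward(self, s: torch.Tensor):
        logits = self.g(s)                 # (B, k) coarse-bin scores
        p = logits.softmax(dim=-1)         # distribution over architectural-period bins
        y_hat = (p * self.b).sum(dim=-1)   # continuous year: weighted average of bin midpoints
        return y_hat, logits

y_hat, logits = OrdinalHead()(torch.randn(2, 27))
print(y_hat.shape)  # torch.Size([2])
```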

4. Training Objectives and Evaluation Metrics

YearCLIP is trained with multi-term objectives designed to enforce ordinal structure and calibrate uncertainty:

  • Fine-grained Cross-modal Ranking-based Contrastive Loss (FCRC):

L_{FCRC} = -\frac{1}{M}\sum_{i=1}^M \log \frac{\exp(\mathrm{cos}(z_i, w_i)/\tau)}{\exp(\mathrm{cos}(z_i, w_i)/\tau) + \sum_{j \ne i} \lambda_{i,j} \exp(\mathrm{cos}(z_i, w_j)/\tau)}

with label-dependent weights $\lambda_{i,j} \propto |y_i - y_j|$, so that the contrastive penalty grows with the year gap between a sample and its negatives while near-year negatives are penalized more gently, preserving ordinal structure (see the loss sketch after the metrics below).

  • Auxiliary objectives: Equally weighted cross-entropy for bin classification, KL-divergence from smoothed targets, and (optionally) $\ell_1$ regression on $\hat{y}$.
  • Metrics:
    • MAE: Mean Absolute Error between $\hat{y}$ and $y$
    • Interval Accuracy (IA$_k$): Fraction of samples with $|y - \hat{y}| \leq k$ for $k \in \{5, 20, 50, 100\}$
    • Popularity-aware interval accuracy and “popularity gain”: $IA_5(\mathrm{high}) - IA_5(\mathrm{low})$
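The FCRC term can be sketched directly from the formula above; treating $\mathrm{Norm}(\cdot)$ as a rescaling of the year distances to $[0, \beta]$ is an assumption, and $z_i$, $w_i$ stand for matched image-side and text-side features.

```python
import torch
import torch.nn.functional as F

def fcrc_loss(z: torch.Tensor, w: torch.Tensor, years: torch.Tensor,
              tau: float = 0.07, beta: float = 1.0) -> torch.Tensor:
    """FCRC sketch: matched pairs (z_i, w_i) with year-distance-weighted negatives.

    z: (M, D) image-side features; w: (M, D) matched text-side features;
    years: (M,) ground-truth construction years.
    """
    z, w = F.normalize(z, dim=-1), F.normalize(w, dim=-1)
    sim = z @ w.t() / tau                                   # pairwise cos(z_i, w_j) / tau
    dist = (years[:, None] - years[None, :]).abs().float()  # |y_i - y_j|
    lam = beta * dist / dist.max().clamp(min=1e-8)          # Norm(.): rescale to [0, beta] (assumption)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = sim.diagonal().exp()                              # exp(cos(z_i, w_i) / tau)
    neg = (lam * sim.exp()).masked_fill(eye, 0.0).sum(dim=1)
    return -(pos / (pos + neg)).log().mean()

loss = fcrc_loss(torch.randn(8, 512), torch.randn(8, 512),
                 torch.randint(1500, 2020, (8,)))
print(loss.item())
```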

5. Hyperparameter Choices and Optimization

Component configurations and training recipes are as follows:

| Module | Architecture | Hyperparameters |
| --- | --- | --- |
| CLIP encoders | ViT-B/16, 512-dim output | Frozen; no fine-tuning |
| Vision adapter | MLP (512 → 512) | RAdam, lr = 1e-5 |
| Ordinal regression head | MLP (27 → 512 → 7) | Adam, lr = 1e-4, $\beta = (0.9, 0.999)$ |
| Reasoning bank | ~20 subcategories, 7 styles | Frozen prompts |
| Location encoder | RFF (dim = 60) → MLP (512) → ZeroConv | — |
| Schedule/regularization | Batch 64, epochs 50, dropout 0.1, weight decay 1e-4 | Mixed-precision (16-bit) |

All loss terms are weighted equally.
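In code, the recipe amounts to per-module optimizers and an equally weighted loss sum; the placeholder modules below merely stand in for the adapter and head described earlier.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the vision adapter and ordinal regression head.
adapter = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
head = nn.Sequential(nn.Linear(27, 512), nn.ReLU(), nn.Dropout(0.1), nn.Linear(512, 7))

# Per-module optimizers following the table above; weight decay 1e-4 throughout.
opt_adapter = torch.optim.RAdam(adapter.parameters(), lr=1e-5, weight_decay=1e-4)
opt_head = torch.optim.Adam(head.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-4)

scaler = torch.cuda.amp.GradScaler()   # mixed-precision (16-bit) training
BATCH_SIZE, EPOCHS = 64, 50

# All loss terms are summed with equal weights:
# total_loss = l_fcrc + l_ce + l_kl + l_l1
```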

6. Novel Components and Equations

YearCLIP introduces several notable mathematical and architectural innovations:

  • ZeroConv fusion: $z_l = \mathrm{ZeroConv}(\mathrm{MLP}_{\mathrm{RFF}}(\phi, \lambda))$, initialized so that $z_l \approx 0$ at the start of training, allowing the network to learn selective location conditioning.
  • Prompt-based similarity composite: $s = [\cos(z_{input}, z_{c_1}), \ldots, \cos(z_{input}, z_{c_7}), \cos(z_{input}, z_{r_{11}}), \ldots]^T$
  • Coarse-to-fine year regression: $\hat{y} = \sum_{i=1}^k p_i \cdot b_i$
  • Distance-weighted negatives in FCRC: $\lambda_{i,j} = \mathrm{Norm}(\beta |y_i - y_j|)$, which modulates the magnitude of the contrastive penalty

7. Mitigation of Popularity Bias

YearCLIP's design explicitly addresses the over-reliance of large VLMs on memorization of high-profile, widely-photographed structures. Critical mitigation mechanisms include:

  • Ordinal contrastive loss enforces respect for numeric distances between construction years, limiting superficial text-match memorization.
  • Low-level architectural cues from reasoning prompts ensure the model leverages constructive visual properties rather than only relying on global or superficial patterns.
  • Geographically-aware fusion enables local disambiguation without learning to "shortcut" via memorized geolocated images.
  • YearCLIP shows only a modest gap in IA$_5$ between popular and rare buildings (–7.8%), compared to much larger popularity gains (16–34%) for closed-source VLMs, providing evidence for reduced reliance on landmark memorization and greater capacity for generalizable temporal reasoning (Szu-Tu et al., 24 Dec 2025).
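For reference, the interval-accuracy and popularity-gain metrics reduce to a few lines; how buildings are split into high/low popularity groups (e.g., by page-view counts) is an assumption here, and the numbers below are fabricated for illustration only.

```python
import numpy as np

def interval_accuracy(y_true: np.ndarray, y_pred: np.ndarray, k: int) -> float:
    """IA_k: fraction of predictions within +/- k years of the ground-truth year."""
    return float((np.abs(y_true - y_pred) <= k).mean())

def popularity_gain(y_true, y_pred, is_popular, k: int = 5) -> float:
    """IA_k(high) - IA_k(low); values near zero indicate little popularity bias."""
    hi = interval_accuracy(y_true[is_popular], y_pred[is_popular], k)
    lo = interval_accuracy(y_true[~is_popular], y_pred[~is_popular], k)
    return hi - lo

# Toy example (made-up years and popularity flags).
y_true = np.array([1890, 1960, 1720, 2004])
y_pred = np.array([1893, 1935, 1722, 2001])
pop = np.array([True, True, False, False])
print(interval_accuracy(y_true, y_pred, 5), popularity_gain(y_true, y_pred, pop))  # 0.75 -0.5
```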