SRE-CLIP: Semantic Relation-Enhanced Adapter

Updated 28 October 2025
  • The paper introduces SRE-CLIP, which integrates attention-based image adapters and semantic graph propagation with CLIP for robust domain-adaptive zero-shot learning.
  • It employs novel loss functions, including semantic relation structure loss and alignment retention, to maintain cross-modal alignment during adaptation.
  • Empirical results on I2AwA and I2WebV benchmarks show significant improvements, with up to a 21.3-point increase in unseen accuracy over baseline models.

Semantic Relation-Enhanced CLIP (SRE-CLIP) Adapter

The Semantic Relation-Enhanced CLIP (SRE-CLIP) Adapter is a framework for domain-adaptive zero-shot learning (DAZSL) designed to address the challenge of transferring knowledge from labeled source domains to unlabeled target domains while enabling robust recognition of unseen categories. SRE-CLIP introduces a structured mechanism for integrating inter-class semantic relations while preserving cross-modal alignment, leveraging the representational power of vision-language foundation models such as CLIP.

1. Background and Motivation

Domain-Adaptive Zero-Shot Learning (DAZSL) requires models to generalize across both domain shifts (e.g., web to real-world images) and category shifts (seen to unseen classes). Classical Zero-Shot Learning (ZSL) typically focuses on leveraging class-level semantic information, while Unsupervised Domain Adaptation (UDA) emphasizes transferring knowledge across data domains. DAZSL represents a more challenging setting requiring simultaneous cross-domain and cross-category generalization.

CLIP, pre-trained on large-scale image-text pairs, offers a cross-modal semantic space suitable for such tasks. However, prior work has identified two significant limitations in vanilla and conventional CLIP adapters when applied to DAZSL:

  1. Lack of Semantic Relation Guidance: Standard protocols typically represent each class by a single text prototype (e.g., its name), failing to encode rich hierarchical and relational structure (e.g., subclasses, synonyms, hypernyms).
  2. Degraded Cross-Modal Alignment: Fine-tuning CLIP for target domain adaptation often degrades the crucial initial alignment between image and text representations, undermining zero-shot transfer ability, especially for categories unseen during training.

SRE-CLIP was developed to directly remedy these issues by learning with explicit semantic relation constraints and regularizing cross-modal projections.

2. Model Architecture

SRE-CLIP’s architecture consists of two tightly coupled branches that update both image and class prototype representations:

2.1 Image Encoding and Attention-Based Adapter

  • The CLIP vision encoder outputs image features $f_i$.
  • These are fed through an attention-based adapter defined as:

$$v_i = W_v f_i + \mathrm{softmax}\!\left( \frac{W_q f_i (W_k f_i)^\top}{\sqrt{d_k}} \right) W_v f_i$$

where $W_q, W_k, W_v$ are trainable linear projections and $d_k$ is a dimensionality constant.

  • The adapter enhances the image representation with content-dependent attention and allows efficient fine-tuning while keeping the CLIP backbone largely frozen; a minimal sketch of this module is given below.
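The following PyTorch sketch implements the adapter formula on pooled CLIP image features. The interpretation of the attention as a per-sample outer product over feature dimensions, the module name, and the dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionAdapter(nn.Module):
    """Attention-based image adapter (illustrative sketch).

    Applies the adapter formula to pooled CLIP image features f_i while the
    CLIP backbone stays frozen. Feature-wise attention interpretation and
    dimensions are assumptions.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5  # 1 / sqrt(d_k)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, d) pooled image features from the frozen CLIP vision encoder
        q, k, v = self.w_q(f), self.w_k(f), self.w_v(f)
        # per-sample outer product W_q f_i (W_k f_i)^T, shape (B, d, d)
        attn = F.softmax(q.unsqueeze(2) * k.unsqueeze(1) * self.scale, dim=-1)
        # v_i = W_v f_i + softmax(...) W_v f_i  (residual attention modulation)
        return v + torch.bmm(attn, v.unsqueeze(2)).squeeze(2)

# usage: adapter = AttentionAdapter(512); v = adapter(clip_image_features)
```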

2.2 Class Prototype Learning with Semantic Graph Propagation

  • For each class $C_i$, class names are augmented using WordNet synonyms; concatenated prompts are encoded via CLIP’s text encoder and averaged to $e_i$.
  • A semantic relation graph $G$ is built from WordNet, capturing taxonomic relations among all classes (including ancestors).
  • To propagate semantic knowledge across related classes, a Graph Convolutional Network (GCN) is applied:

$$p_i = \mathrm{GCN}(e_i, G) + W e_i + b$$

where $p_i$ is the enriched prototype for class $i$, combining local lexical semantics and global graph structure.
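One way to realize this prototype head is sketched below: a single symmetrically normalized GCN layer over the WordNet adjacency plus a linear skip on $e_i$. The layer count, normalization, and ReLU are assumptions; the paper's implementation may differ.

```python
import torch
import torch.nn as nn

class GraphPrototypeHead(nn.Module):
    """Class-prototype head combining GCN propagation over the WordNet graph
    with a linear skip on the averaged text embeddings (illustrative sketch)."""
    def __init__(self, dim: int, adj: torch.Tensor):
        super().__init__()
        # symmetrically normalized adjacency with self-loops: D^{-1/2}(A+I)D^{-1/2}
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = a_hat.sum(dim=-1).pow(-0.5)
        self.register_buffer("a_norm", d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :])
        self.gcn = nn.Linear(dim, dim)   # single GCN layer weight (assumption)
        self.skip = nn.Linear(dim, dim)  # W e_i + b residual path

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (N, d) averaged CLIP text embeddings for all graph nodes
        #    (classes plus WordNet ancestors)
        gcn_out = torch.relu(self.gcn(self.a_norm @ e))  # propagate over the graph
        return gcn_out + self.skip(e)                    # p_i = GCN(e_i, G) + W e_i + b
```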

3. Loss Functions and Training Objectives

Training is conducted in two distinct phases (source and target domain) with losses structured as follows:

3.1 Source Domain Objectives (Supervised)

  • Cross-Entropy Loss: Standard per-sample classification loss on labeled, seen classes.
  • Pairwise Ranking Loss: Encourages correct semantic ordering among prototypes, pulling prototypes of semantically related classes closer together.

Combined,

$$\mathcal{L}_{ce+pr} = \text{cross-entropy} + \text{pairwise prototype ranking}$$
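A hedged sketch of this combined source objective is shown below. Cross-entropy over temperature-scaled cosine logits follows from the prototype formulation; the hinge-margin form of the pairwise ranking term (and the margin and temperature values) is an assumed stand-in, since the exact ranking formulation is not reproduced in this summary.

```python
import torch
import torch.nn.functional as F

def source_supervised_loss(v, prototypes, labels, margin=0.2, tau=0.01):
    """Source-domain objective sketch: cross-entropy over cosine logits plus
    a hinge-style pairwise prototype ranking term (margin form and values
    are assumptions)."""
    v = F.normalize(v, dim=-1)                # (B, d) adapted image embeddings
    p = F.normalize(prototypes, dim=-1)       # (C, d) class prototypes
    logits = v @ p.t() / tau
    ce = F.cross_entropy(logits, labels)

    sims = v @ p.t()                          # cosine similarities R(v, p_j)
    pos = sims.gather(1, labels[:, None])     # similarity to the labeled class
    # require the positive prototype to beat every negative by a margin
    rank = F.relu(margin + sims - pos).scatter(1, labels[:, None], 0.0).mean()
    return ce + rank
```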

3.2 Target Domain Objective (Unsupervised)

  • Information Entropy Loss: Derived from mutual information maximization, this regularizes predictions to have high marginal but low conditional entropy for unlabeled target samples.

$$\mathcal{L}_{info} = -H(\mathcal{P}) + \sum_{i=1}^{b} H(\mathcal{P} \mid v_i)$$

where $\mathcal{P}$ is the class-probability distribution (its marginal estimated over the batch) and the sum runs over the $b$ samples in the batch.
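The sketch below computes this objective from batch logits: the marginal entropy comes from the batch-averaged prediction and the conditional entropy is computed per sample; averaging (rather than summing) the conditional term over the batch is an assumption made for scale stability.

```python
import torch
import torch.nn.functional as F

def info_entropy_loss(logits: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Mutual-information style target objective sketch: maximize the entropy
    of the marginal prediction, minimize per-sample conditional entropy."""
    probs = F.softmax(logits, dim=-1)                        # (B, C)
    marginal = probs.mean(dim=0)                             # batch estimate of P
    h_marginal = -(marginal * (marginal + eps).log()).sum()  # H(P)
    h_cond = -(probs * (probs + eps).log()).sum(dim=-1).mean()  # mean_i H(P | v_i)
    return -h_marginal + h_cond
```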

3.3 Semantic Relation Structure Loss

A novel structured loss guides inter-class relational consistency:

$$\mathcal{L}_{srs} = \sum_{i}^{c-1} \left[ R(v, p_{neg}^i) - R(p_{pos}, p_{neg}^i) \right]^2 + \left[ 1 - R(v, p_{pos}) \right]^2$$

  • $v$ is an image embedding, $p_{pos}$ and $p_{neg}^i$ are the positive and negative class prototypes, and $R(\cdot, \cdot)$ is cosine similarity.
  • This enforces that the cosine similarity between the image and its correct class prototype is maximized, while the image’s similarities to negative class prototypes are aligned with each negative’s semantic proximity to the positive class (see the batch-level sketch below).
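A batch-level sketch of $\mathcal{L}_{srs}$ follows, using cosine similarities between adapted image embeddings and the prototype matrix; the mean reduction over the batch is an assumption.

```python
import torch
import torch.nn.functional as F

def semantic_relation_structure_loss(v, prototypes, labels):
    """Batch sketch of L_srs: align each image with its positive prototype and
    match its similarity to each negative prototype to the positive
    prototype's own similarity to that negative."""
    v = F.normalize(v, dim=-1)            # (B, d) image embeddings
    p = F.normalize(prototypes, dim=-1)   # (C, d) class prototypes
    sims_vp = v @ p.t()                   # R(v, p_j) for every class j
    sims_pp = p @ p.t()                   # R(p_k, p_j) between prototypes

    r_pos = sims_vp.gather(1, labels[:, None])   # R(v, p_pos), shape (B, 1)
    r_ref = sims_pp[labels]                      # R(p_pos, p_j) rows, shape (B, C)

    neg_mask = torch.ones_like(sims_vp).scatter(1, labels[:, None], 0.0)
    structural = ((sims_vp - r_ref) ** 2 * neg_mask).sum(dim=1)  # negatives only
    positive = (1.0 - r_pos.squeeze(1)) ** 2                     # [1 - R(v, p_pos)]^2
    return (structural + positive).mean()
```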

3.4 Cross-Modal Alignment Retention

To avoid catastrophic forgetting of initial image-text alignment, SRE-CLIP introduces an explicit alignment retention term:

$$\mathcal{L}_{align} = -\sum_{i=1}^{c} y_i \log \left( \mathrm{Softmax}(g(e) P^\top)_i \right)$$

  • $g(e)$ is a text embedding processed through the visual adapter, $P$ is the prototype matrix, and $y_i$ is the ground-truth label.
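The sketch below realizes $\mathcal{L}_{align}$ as a cross-entropy in which each class text embedding, passed through the visual adapter $g$, must classify to its own class against the prototype matrix $P$; the temperature value is an assumption.

```python
import torch.nn.functional as F

def alignment_retention_loss(text_emb, adapter, prototypes, class_ids, tau=0.01):
    """Sketch of L_align: class text embeddings e, adapted by g, should still
    score highest on their own class prototype."""
    g_e = F.normalize(adapter(text_emb), dim=-1)   # g(e): adapter applied to text features
    p = F.normalize(prototypes, dim=-1)            # prototype matrix P
    logits = g_e @ p.t() / tau                     # scores fed to Softmax(g(e) P^T)
    return F.cross_entropy(logits, class_ids)      # -sum_i y_i log(...)
```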

3.5 Overall Training Objectives

The total loss for each domain:

  • Source:

$$\mathcal{L}_{source} = \mathcal{L}_{ce+pr} + \beta \mathcal{L}_{align} + \gamma \mathcal{L}_{srs}$$

  • Target:

$$\mathcal{L}_{target} = \mathcal{L}_{info} + \beta \mathcal{L}_{align} + \gamma \mathcal{L}_{srs}$$

Empirically, best results are obtained with hyperparameters $\beta = 1$, $\gamma = 0.1$.
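Putting the pieces together, the sketch below composes the per-phase objectives from the illustrative helpers defined above. The use of argmax pseudo-labels to select positive prototypes for $\mathcal{L}_{srs}$ on unlabeled target data is an assumption, not a procedure stated in the paper.

```python
import torch.nn.functional as F

def total_loss(phase, v, prototypes, labels, text_emb, adapter, class_ids,
               beta=1.0, gamma=0.1):
    """Per-phase objective sketch reusing the illustrative helpers above."""
    l_align = alignment_retention_loss(text_emb, adapter, prototypes, class_ids)
    if phase == "source":                          # labeled seen-class source data
        base = source_supervised_loss(v, prototypes, labels)
        l_srs = semantic_relation_structure_loss(v, prototypes, labels)
    else:                                          # unlabeled target data
        logits = F.normalize(v, dim=-1) @ F.normalize(prototypes, dim=-1).t()
        base = info_entropy_loss(logits)
        pseudo = logits.argmax(dim=-1)             # assumed pseudo-labeling step
        l_srs = semantic_relation_structure_loss(v, prototypes, pseudo)
    return base + beta * l_align + gamma * l_srs   # beta = 1, gamma = 0.1 per the paper
```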

4. Experimental Setup

Experiments are conducted on two DAZSL benchmarks designed to rigorously test cross-category and cross-domain generalization:

I2AwA: Adapted AwA2 dataset, comprising 50 animal classes (40 seen, 10 unseen). The source domain consists of web images, the target domain of real-world images.

I2WebV: Transfer from ILSVRC 2012 (1,000 classes, web images) to WebVision (5,000 classes), exhibiting significant domain gap and a large number of unseen classes.

  • SRE-CLIP uses CLIP ViT-B/32 as the backbone.
  • WordNet-derived semantic graphs contain 255 nodes for I2AwA and 21,983 nodes for I2WebV, ensuring comprehensive propagation of relational constraints.

Baselines include state-of-the-art ZSL (dGCN, adGCN, bGCN), UDA (DANN, CMD, MME, OSBP), and DAZSL (pmd-bGCN, SROSDA, UODTN, TSCE) models, as well as adapted CLIP-Adapter and CLIP zero-shot.

5. Results and Ablative Analysis

5.1 State-of-the-Art Performance

| Method | I2AwA (Unseen) | I2AwA (H-score) | I2WebV (Unseen) | I2WebV (H-score) |
|---|---|---|---|---|
| CLIP zero-shot | 90.4 | 87.9 | 25.1 | 28.8 |
| TSCE | 63.0 | 72.2 | 3.7 | 6.9 |
| CLIP-Adapter | 77.1 | 82.9 | 22.9 | 32.6 |
| SRE-CLIP | 98.4 | 96.1 | 28.5 | 38.4 |
  • SRE-CLIP increases unseen accuracy on I2AwA by 21.3 points over CLIP-Adapter and improves H-score by 23.9 over TSCE.
  • On I2WebV, SRE-CLIP yields a 38.4 H-score, outperforming TSCE by 31.5 points.
  • The method maintains strong performance even under extensive domain and category shift.

5.2 Ablation Studies

  • Both the attention adapter and GCN-based prototype learning are necessary; removing either results in a 2–6 point H-score reduction.
  • Exclusion of Lsrs\mathcal{L}_{srs} or Lalign\mathcal{L}_{align} leads to further decreased generalization.
  • Visualization (e.g., t-SNE) demonstrates that SRE-CLIP produces more distinct, semantically coherent clusters for both seen and unseen classes than baselines.

6. Significance and Implications

6.1 Contributions

  • SRE-CLIP is the first framework to combine explicit semantic relation structure and cross-modal alignment in a CLIP-based DAZSL context.
  • The semantic relation structure loss explicitly incorporates external knowledge (WordNet-based graphs) into the visual feature space, enabling nuanced category transfer.
  • Cross-modal alignment retention regularizes adaptation to preserve CLIP’s intrinsic zero-shot capabilities.

6.2 Implications for Research

  • The approach demonstrates the advantages of integrating external ontologies and semantic graphs into foundation model adaptation, suggesting a pathway for further leveraging structured knowledge in vision-language learning.
  • The framework addresses the tradeoff between task-specific adaptation and universal alignment retention, indicating that careful regularization can enhance generalization without sacrificing cross-modal correspondence.
  • The demonstrated scalability on I2WebV implies that this paradigm can extend to open-world and web-scale recognition settings.

This suggests that soft structural constraints and alignment regularization are not limited to DAZSL, but may be generally applicable in robust, domain-aware model adaptation for large pre-trained representations.

7. References to Formulations and Implementation Details

| Component | Mathematical Formulation or Description |
|---|---|
| Attention adapter | $v_i = W_v f_i + \mathrm{softmax}\!\left( \frac{W_q f_i (W_k f_i)^\top}{\sqrt{d_k}} \right) W_v f_i$ |
| Prototype w/ GCN | $p_i = \mathrm{GCN}(e_i, G) + W e_i + b$ |
| Semantic relation loss | $\mathcal{L}_{srs} = \sum_{i}^{c-1}\left[R(v, p_{neg}^i) - R(p_{pos}, p_{neg}^i)\right]^2 + \left[1 - R(v, p_{pos})\right]^2$ |
| Alignment retention | $\mathcal{L}_{align} = -\sum_{i=1}^{c} y_i \log\left(\mathrm{Softmax}(g(e) P^\top)_i\right)$ |
| Training objectives | $\mathcal{L}_{source} = \mathcal{L}_{ce+pr} + \beta \mathcal{L}_{align} + \gamma \mathcal{L}_{srs}$; $\mathcal{L}_{target} = \mathcal{L}_{info} + \beta \mathcal{L}_{align} + \gamma \mathcal{L}_{srs}$ |

The codebase is available at https://github.com/yjainqdc/SRECLIP, supporting reproducibility and application across related DAZSL scenarios.


SRE-CLIP demonstrates that principled injection of semantic structure and systematic cross-modal regularization significantly enhance domain-adaptive generalization in vision-language models, setting a new standard for robust zero-shot transfer across challenging real-world data regimes (Yu et al., 21 Oct 2025).
