SRE-CLIP: Semantic Relation-Enhanced Adapter

Updated 28 October 2025
  • The paper introduces SRE-CLIP, which integrates attention-based image adapters and semantic graph propagation with CLIP for robust domain-adaptive zero-shot learning.
  • It employs novel loss functions, including semantic relation structure loss and alignment retention, to maintain cross-modal alignment during adaptation.
  • Empirical results on I2AwA and I2WebV benchmarks show significant improvements, with up to a 21.3-point increase in unseen accuracy over baseline models.

Semantic Relation-Enhanced CLIP (SRE-CLIP) Adapter

The Semantic Relation-Enhanced CLIP (SRE-CLIP) Adapter is a framework for domain-adaptive zero-shot learning (DAZSL) designed to address the challenge of transferring knowledge from labeled source domains to unlabeled target domains while enabling robust recognition of unseen categories. SRE-CLIP introduces a structured mechanism for integrating inter-class semantic relations while preserving cross-modal alignment, leveraging the representational power of vision-language foundation models such as CLIP.

1. Background and Motivation

Domain-Adaptive Zero-Shot Learning (DAZSL) requires models to generalize across both domain shifts (e.g., web to real-world images) and category shifts (seen to unseen classes). Classical Zero-Shot Learning (ZSL) typically focuses on leveraging class-level semantic information, while Unsupervised Domain Adaptation (UDA) emphasizes transferring knowledge across data domains. DAZSL represents a more challenging setting requiring simultaneous cross-domain and cross-category generalization.

CLIP, pre-trained on large-scale image-text pairs, offers a cross-modal semantic space suitable for such tasks. However, prior work has identified two significant limitations in vanilla and conventional CLIP adapters when applied to DAZSL:

  1. Lack of Semantic Relation Guidance: Standard protocols typically represent each class by a single text prototype (e.g., its name), failing to encode rich hierarchical and relational structure (e.g., subclasses, synonyms, hypernyms).
  2. Degraded Cross-Modal Alignment: Fine-tuning CLIP for target domain adaptation often degrades the crucial initial alignment between image and text representations, undermining zero-shot transfer ability, especially for categories unseen during training.

SRE-CLIP was developed to directly remedy these issues by learning with explicit semantic relation constraints and regularizing cross-modal projections.

2. Model Architecture

SRE-CLIP’s architecture consists of two tightly coupled branches that update both image and class prototype representations:

2.1 Image Encoding and Attention-Based Adapter

  • The CLIP vision encoder outputs image features $f_i$.
  • These are fed through an attention-based adapter defined as:

$$v_i = W_v f_i + \mathrm{softmax}\!\left( \frac{W_q f_i (W_k f_i)^\top}{\sqrt{d_k}} \right) W_v f_i$$

where $W_q, W_k, W_v$ are trainable linear projections and $d_k$ is a dimensionality constant.

  • The adapter enhances the image representation with content-dependent attention and allows efficient fine-tuning while keeping the CLIP backbone largely frozen; a minimal sketch of this module is given below.
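The following PyTorch sketch implements the adapter formula on pooled CLIP image features. The interpretation of the attention as a per-sample outer product over feature dimensions, the module name, and the dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionAdapter(nn.Module):
    """Attention-based image adapter (illustrative sketch).

    Applies the adapter formula to pooled CLIP image features f_i while the
    CLIP backbone stays frozen. Feature-wise attention interpretation and
    dimensions are assumptions.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5  # 1 / sqrt(d_k)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, d) pooled image features from the frozen CLIP vision encoder
        q, k, v = self.w_q(f), self.w_k(f), self.w_v(f)
        # per-sample outer product W_q f_i (W_k f_i)^T, shape (B, d, d)
        attn = F.softmax(q.unsqueeze(2) * k.unsqueeze(1) * self.scale, dim=-1)
        # v_i = W_v f_i + softmax(...) W_v f_i  (residual attention modulation)
        return v + torch.bmm(attn, v.unsqueeze(2)).squeeze(2)

# usage: adapter = AttentionAdapter(512); v = adapter(clip_image_features)
```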

2.2 Class Prototype Learning with Semantic Graph Propagation

  • For each class $C_i$, class names are augmented using WordNet synonyms; concatenated prompts are encoded via CLIP’s text encoder and averaged to $e_i$.
  • A semantic relation graph $G$ is built from WordNet, capturing taxonomic relations among all classes (including ancestors).
  • To propagate semantic knowledge across related classes, a Graph Convolutional Network (GCN) is applied:

$$p_i = \mathrm{GCN}(e_i, G) + W e_i + b$$

where $p_i$ is the enriched prototype for class $i$, combining local lexical semantics and global graph structure.
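One way to realize this prototype head is sketched below: a single symmetrically normalized GCN layer over the WordNet adjacency plus a linear skip on $e_i$. The layer count, normalization, and ReLU are assumptions; the paper's implementation may differ.

```python
import torch
import torch.nn as nn

class GraphPrototypeHead(nn.Module):
    """Class-prototype head combining GCN propagation over the WordNet graph
    with a linear skip on the averaged text embeddings (illustrative sketch)."""
    def __init__(self, dim: int, adj: torch.Tensor):
        super().__init__()
        # symmetrically normalized adjacency with self-loops: D^{-1/2}(A+I)D^{-1/2}
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = a_hat.sum(dim=-1).pow(-0.5)
        self.register_buffer("a_norm", d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :])
        self.gcn = nn.Linear(dim, dim)   # single GCN layer weight (assumption)
        self.skip = nn.Linear(dim, dim)  # W e_i + b residual path

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (N, d) averaged CLIP text embeddings for all graph nodes
        #    (classes plus WordNet ancestors)
        gcn_out = torch.relu(self.gcn(self.a_norm @ e))  # propagate over the graph
        return gcn_out + self.skip(e)                    # p_i = GCN(e_i, G) + W e_i + b
```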

3. Loss Functions and Training Objectives

Training is conducted in two distinct phases (source and target domain) with losses structured as follows:

3.1 Source Domain Objectives (Supervised)

  • Cross-Entropy Loss: Standard per-sample classification loss on labeled, seen classes.
  • Pairwise Ranking Loss: Encourages correct semantic ordering among prototypes, pulling prototypes of semantically related classes closer together.

Combined,

$$\mathcal{L}_{ce+pr} = \text{cross-entropy} + \text{pairwise prototype ranking}$$
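A hedged sketch of this combined source objective is shown below. Cross-entropy over temperature-scaled cosine logits follows from the prototype formulation; the hinge-margin form of the pairwise ranking term (and the margin and temperature values) is an assumed stand-in, since the exact ranking formulation is not reproduced in this summary.

```python
import torch
import torch.nn.functional as F

def source_supervised_loss(v, prototypes, labels, margin=0.2, tau=0.01):
    """Source-domain objective sketch: cross-entropy over cosine logits plus
    a hinge-style pairwise prototype ranking term (margin form and values
    are assumptions)."""
    v = F.normalize(v, dim=-1)                # (B, d) adapted image embeddings
    p = F.normalize(prototypes, dim=-1)       # (C, d) class prototypes
    logits = v @ p.t() / tau
    ce = F.cross_entropy(logits, labels)

    sims = v @ p.t()                          # cosine similarities R(v, p_j)
    pos = sims.gather(1, labels[:, None])     # similarity to the labeled class
    # require the positive prototype to beat every negative by a margin
    rank = F.relu(margin + sims - pos).scatter(1, labels[:, None], 0.0).mean()
    return ce + rank
```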

3.2 Target Domain Objective (Unsupervised)

  • Information Entropy Loss: Derived from mutual information maximization, this regularizes predictions to have high marginal but low conditional entropy for unlabeled target samples.

$$\mathcal{L}_{info} = -H(\mathcal{P}) + \sum_{i=1}^{b} H(\mathcal{P} \mid v_i)$$

where $\mathcal{P}$ is the class-probability distribution (its marginal estimated over the batch) and the sum runs over the $b$ samples in the batch.
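The sketch below computes this objective from batch logits: the marginal entropy comes from the batch-averaged prediction and the conditional entropy is computed per sample; averaging (rather than summing) the conditional term over the batch is an assumption made for scale stability.

```python
import torch
import torch.nn.functional as F

def info_entropy_loss(logits: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Mutual-information style target objective sketch: maximize the entropy
    of the marginal prediction, minimize per-sample conditional entropy."""
    probs = F.softmax(logits, dim=-1)                        # (B, C)
    marginal = probs.mean(dim=0)                             # batch estimate of P
    h_marginal = -(marginal * (marginal + eps).log()).sum()  # H(P)
    h_cond = -(probs * (probs + eps).log()).sum(dim=-1).mean()  # mean_i H(P | v_i)
    return -h_marginal + h_cond
```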

3.3 Semantic Relation Structure Loss

A novel structured loss guides inter-class relational consistency:

$$\mathcal{L}_{srs} = \sum_{i}^{c-1} \left[ R(v, p_{neg}^i) - R(p_{pos}, p_{neg}^i) \right]^2 + \left[ 1 - R(v, p_{pos}) \right]^2$$

  • $v$ is an image embedding, $p_{pos}$ and $p_{neg}^i$ are the positive and negative class prototypes, and $R(\cdot, \cdot)$ is cosine similarity.
  • This enforces that the cosine similarity between the image and its correct class prototype is maximized, while the image’s similarities to negative class prototypes are aligned with each negative’s semantic proximity to the positive class (see the batch-level sketch below).
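A batch-level sketch of $\mathcal{L}_{srs}$ follows, using cosine similarities between adapted image embeddings and the prototype matrix; the mean reduction over the batch is an assumption.

```python
import torch
import torch.nn.functional as F

def semantic_relation_structure_loss(v, prototypes, labels):
    """Batch sketch of L_srs: align each image with its positive prototype and
    match its similarity to each negative prototype to the positive
    prototype's own similarity to that negative."""
    v = F.normalize(v, dim=-1)            # (B, d) image embeddings
    p = F.normalize(prototypes, dim=-1)   # (C, d) class prototypes
    sims_vp = v @ p.t()                   # R(v, p_j) for every class j
    sims_pp = p @ p.t()                   # R(p_k, p_j) between prototypes

    r_pos = sims_vp.gather(1, labels[:, None])   # R(v, p_pos), shape (B, 1)
    r_ref = sims_pp[labels]                      # R(p_pos, p_j) rows, shape (B, C)

    neg_mask = torch.ones_like(sims_vp).scatter(1, labels[:, None], 0.0)
    structural = ((sims_vp - r_ref) ** 2 * neg_mask).sum(dim=1)  # negatives only
    positive = (1.0 - r_pos.squeeze(1)) ** 2                     # [1 - R(v, p_pos)]^2
    return (structural + positive).mean()
```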

3.4 Cross-Modal Alignment Retention

To avoid catastrophic forgetting of initial image-text alignment, SRE-CLIP introduces an explicit alignment retention term:

$$\mathcal{L}_{align} = -\sum_{i=1}^{c} y_i \log \left( \mathrm{Softmax}(g(e) P^\top)_i \right)$$

  • $g(e)$ is a text embedding processed through the visual adapter, $P$ is the prototype matrix, and $y_i$ is the ground-truth label.
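The sketch below realizes $\mathcal{L}_{align}$ as a cross-entropy in which each class text embedding, passed through the visual adapter $g$, must classify to its own class against the prototype matrix $P$; the temperature value is an assumption.

```python
import torch.nn.functional as F

def alignment_retention_loss(text_emb, adapter, prototypes, class_ids, tau=0.01):
    """Sketch of L_align: class text embeddings e, adapted by g, should still
    score highest on their own class prototype."""
    g_e = F.normalize(adapter(text_emb), dim=-1)   # g(e): adapter applied to text features
    p = F.normalize(prototypes, dim=-1)            # prototype matrix P
    logits = g_e @ p.t() / tau                     # scores fed to Softmax(g(e) P^T)
    return F.cross_entropy(logits, class_ids)      # -sum_i y_i log(...)
```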

3.5 Overall Training Objectives

The total loss for each domain:

  • Source:

$$\mathcal{L}_{source} = \mathcal{L}_{ce+pr} + \beta \mathcal{L}_{align} + \gamma \mathcal{L}_{srs}$$

  • Target:

$$\mathcal{L}_{target} = \mathcal{L}_{info} + \beta \mathcal{L}_{align} + \gamma \mathcal{L}_{srs}$$

Empirically, best results are obtained with hyperparameters $\beta = 1$, $\gamma = 0.1$.
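Putting the pieces together, the sketch below composes the per-phase objectives from the illustrative helpers defined above. The use of argmax pseudo-labels to select positive prototypes for $\mathcal{L}_{srs}$ on unlabeled target data is an assumption, not a procedure stated in the paper.

```python
import torch.nn.functional as F

def total_loss(phase, v, prototypes, labels, text_emb, adapter, class_ids,
               beta=1.0, gamma=0.1):
    """Per-phase objective sketch reusing the illustrative helpers above."""
    l_align = alignment_retention_loss(text_emb, adapter, prototypes, class_ids)
    if phase == "source":                          # labeled seen-class source data
        base = source_supervised_loss(v, prototypes, labels)
        l_srs = semantic_relation_structure_loss(v, prototypes, labels)
    else:                                          # unlabeled target data
        logits = F.normalize(v, dim=-1) @ F.normalize(prototypes, dim=-1).t()
        base = info_entropy_loss(logits)
        pseudo = logits.argmax(dim=-1)             # assumed pseudo-labeling step
        l_srs = semantic_relation_structure_loss(v, prototypes, pseudo)
    return base + beta * l_align + gamma * l_srs   # beta = 1, gamma = 0.1 per the paper
```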

4. Experimental Setup

Experiments are conducted on two DAZSL benchmarks designed to rigorously test cross-category and cross-domain generalization:

I2AwA: Adapted AwA2 dataset, comprising 50 animal classes (40 seen, 10 unseen). The source domain consists of web images, the target domain of real-world images.

I2WebV: Transfer from ILSVRC 2012 (1,000 classes, web images) to WebVision (5,000 classes), exhibiting significant domain gap and a large number of unseen classes.

  • SRE-CLIP uses CLIP ViT-B/32 as the backbone.
  • WordNet-derived semantic graphs contain 255 nodes for I2AwA and 21,983 nodes for I2WebV, ensuring comprehensive propagation of relational constraints.

Baselines include state-of-the-art ZSL (dGCN, adGCN, bGCN), UDA (DANN, CMD, MME, OSBP), and DAZSL (pmd-bGCN, SROSDA, UODTN, TSCE) models, as well as adapted CLIP-Adapter and CLIP zero-shot.

5. Results and Ablative Analysis

5.1 State-of-the-Art Performance

| Method | I2AwA (Unseen) | I2AwA (H-score) | I2WebV (Unseen) | I2WebV (H-score) |
|---|---|---|---|---|
| CLIP zero-shot | 90.4 | 87.9 | 25.1 | 28.8 |
| TSCE | 63.0 | 72.2 | 3.7 | 6.9 |
| CLIP-Adapter | 77.1 | 82.9 | 22.9 | 32.6 |
| SRE-CLIP | 98.4 | 96.1 | 28.5 | 38.4 |
  • SRE-CLIP increases unseen accuracy on I2AwA by 21.3 points over CLIP-Adapter and improves H-score by 23.9 over TSCE.
  • On I2WebV, SRE-CLIP yields a 38.4 H-score, outperforming TSCE by 31.5 points.
  • The method maintains strong performance even under extensive domain and category shift.

5.2 Ablation Studies

  • Both the attention adapter and GCN-based prototype learning are necessary; removing either results in a 2–6 point H-score reduction.
  • Exclusion of Lsrs\mathcal{L}_{srs} or Lalign\mathcal{L}_{align} leads to further decreased generalization.
  • Visualization (e.g., t-SNE) demonstrates that SRE-CLIP produces more distinct, semantically coherent clusters for both seen and unseen classes than baselines.

6. Significance and Implications

6.1 Contributions

  • SRE-CLIP is the first framework to combine explicit semantic relation structure and cross-modal alignment in a CLIP-based DAZSL context.
  • The semantic relation structure loss explicitly incorporates external knowledge (WordNet-based graphs) into the visual feature space, enabling nuanced category transfer.
  • Cross-modal alignment retention regularizes adaptation to preserve CLIP’s intrinsic zero-shot capabilities.

6.2 Implications for Research

  • The approach demonstrates the advantages of integrating external ontologies and semantic graphs into foundation model adaptation, suggesting a pathway for further leveraging structured knowledge in vision-language learning.
  • The framework addresses the tradeoff between task-specific adaptation and universal alignment retention, indicating that careful regularization can enhance generalization without sacrificing cross-modal correspondence.
  • The demonstrated scalability on I2WebV implies that this paradigm can extend to open-world and web-scale recognition settings.

This suggests that soft structural constraints and alignment regularization are not limited to DAZSL, but may be generally applicable in robust, domain-aware model adaptation for large pre-trained representations.

7. References to Formulations and Implementation Details

| Component | Mathematical Formulation or Description |
|---|---|
| Attention adapter | $v_i = W_v f_i + \mathrm{softmax}\!\left( \frac{W_q f_i (W_k f_i)^\top}{\sqrt{d_k}} \right) W_v f_i$ |
| Prototype w/ GCN | $p_i = \mathrm{GCN}(e_i, G) + W e_i + b$ |
| Semantic relation loss | $\mathcal{L}_{srs} = \sum_{i}^{c-1}\left[R(v, p_{neg}^i) - R(p_{pos}, p_{neg}^i)\right]^2 + \left[1 - R(v, p_{pos})\right]^2$ |
| Alignment retention | $\mathcal{L}_{align} = -\sum_{i=1}^{c} y_i \log\left(\mathrm{Softmax}(g(e) P^\top)_i\right)$ |
| Training objectives | $\mathcal{L}_{source} = \mathcal{L}_{ce+pr} + \beta \mathcal{L}_{align} + \gamma \mathcal{L}_{srs}$; $\mathcal{L}_{target} = \mathcal{L}_{info} + \beta \mathcal{L}_{align} + \gamma \mathcal{L}_{srs}$ |

The codebase is available at https://github.com/yjainqdc/SRECLIP, supporting reproducibility and application across related DAZSL scenarios.


SRE-CLIP demonstrates that principled injection of semantic structure and systematic cross-modal regularization significantly enhance domain-adaptive generalization in vision-language models, setting a new standard for robust zero-shot transfer across challenging real-world data regimes (Yu et al., 21 Oct 2025).
