ACFD: Feature Decomposition for CD-FSS
- The paper demonstrates that decoupling semantic and domain cues via adversarial and contrastive objectives improves cross-domain few-shot segmentation performance.
- ACFD employs distinct private and shared branches with channel and spatial attention to separately extract category-relevant and domain-relevant features from deep backbones.
- Empirical validation shows that combining adversarial, contrastive, and orthogonality losses yields mIoU improvements up to +1.1% on benchmark datasets.
Adversarial-Contrastive Feature Decomposition (ACFD) is a neural architectural module designed for cross-domain few-shot segmentation (CD-FSS), with the specific goal of explicitly decoupling category-relevant and domain-relevant information in deep backbone features. By imposing an adversarial learning regime on domain cues and a contrastive learning regime on semantic cues, ACFD enables robust adaptation to novel domains and classes with minimal annotated data. It is an integral component of the Divide-and-Conquer Decoupled Network (DCDNet) framework, providing a principled mechanism to mitigate the intrinsic entanglement of semantic and domain-dependent information that impedes cross-domain generalization.
1. Problem Motivation and Theoretical Justification
In CD-FSS scenarios, the main challenge arises from the backbone feature representations produced by standard architectures (e.g., ResNet-50), which inherently entangle high-level semantics (category-relevant) and low-level structural or style-dependent cues (domain-relevant). This entanglement restricts generalization capability and adaptation speed in new domains, particularly when labeled data is sparse. ACFD addresses this by disentangling features into two explicit branches: a private branch (emphasizing category-relevant signals) and a shared branch (emphasizing domain-relevant signals).
Theoretically, decoupling these signals minimizes interference between low-level and high-level cues, offering a clearer separation of semantic clusters in the feature space and removing nuisance domain information from semantic processing. An orthogonality regularization further enforces minimal overlap between these decoupled representations.
2. Mathematical Formulation
ACFD operates by extracting and separating features at two stages of the backbone (a minimal extraction sketch follows this list):
- $F_{\text{low}}$: lower-level features, e.g., the output of early backbone layers
- $F_{\text{high}}$: higher-level features, e.g., the output of later backbone layers
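A minimal sketch of the two-stage feature tap, assuming a torchvision ResNet-50 backbone with $F_{\text{low}}$ taken from `layer2` and $F_{\text{high}}$ from `layer4`; the tap points are illustrative assumptions, since the text specifies only "early" and "later" layers.

```python
# Sketch: tapping low- and high-level features from a ResNet-50 backbone.
# The layer2/layer4 tap points are assumptions for illustration only.
import torch
import torchvision

class BackboneTaps(torch.nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.stem = torch.nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.layer1, self.layer2 = resnet.layer1, resnet.layer2
        self.layer3, self.layer4 = resnet.layer3, resnet.layer4

    def forward(self, x):
        x = self.layer1(self.stem(x))
        f_low = self.layer2(x)                    # F_low: lower-level features
        f_high = self.layer4(self.layer3(f_low))  # F_high: higher-level features
        return f_low, f_high

f_low, f_high = BackboneTaps()(torch.randn(1, 3, 224, 224))
print(f_low.shape, f_high.shape)  # [1, 512, 28, 28], [1, 2048, 7, 7]
```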
2.1 Shared Feature Extraction
Shared features, intended to capture domain-relevant information, are produced as follows:
$$F_s = A_{sp}(F_{\text{low}}) \odot F_{\text{low}}$$

where $A_{sp}(\cdot)$ is a spatial-attention map (applied channel-wise) and $\odot$ denotes element-wise multiplication.
2.2 Private Feature Extraction
Private features, encoding category-relevant signals, are extracted as:
$$F_p = A_{ch}(F_{\text{high}}) \odot F_{\text{high}}$$

where $A_{ch}(\cdot)$ is a channel-attention (squeeze-and-excite style) map.
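A minimal sketch of the two extraction paths under stated assumptions: a CBAM-style two-channel (mean/max) spatial-attention map for the shared branch and a squeeze-and-excite channel gate for the private branch. The kernel size and reduction ratio are illustrative, not values from the paper.

```python
# Sketch of F_s = A_sp(F_low) * F_low and F_p = A_ch(F_high) * F_high.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """A_sp: a single HxW attention map, applied to every channel."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f):
        avg = f.mean(dim=1, keepdim=True)       # per-pixel channel mean
        mx = f.max(dim=1, keepdim=True).values  # per-pixel channel max
        a = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return a * f                            # element-wise multiplication

class ChannelAttention(nn.Module):
    """A_ch: squeeze-and-excite style per-channel gate."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, f):
        b, c, _, _ = f.shape
        a = self.fc(f.mean(dim=(2, 3))).view(b, c, 1, 1)  # squeeze, then excite
        return a * f
```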
2.3 Adversarial Loss for Shared Features
An adversarial discriminator $D$ is trained to distinguish the domain of origin (source or target) of the shared features:

$$\mathcal{L}_{adv} = -\,\mathbb{E}\left[\, y_d \log D(F_s) + (1 - y_d)\log\big(1 - D(F_s)\big) \,\right]$$

where $y_d \in \{0, 1\}$ is the domain label. A Gradient Reversal Layer (GRL) inverts the gradients flowing into the shared branch so that it is optimized to fool $D$, decorrelating shared features from category information. $D$ is implemented as global average pooling followed by a multilayer perceptron (MLP).
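A sketch of the GRL and the GAP-plus-MLP discriminator, assuming binary domain labels and an illustrative hidden width.

```python
# Gradient reversal + domain discriminator D (GAP followed by an MLP).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Invert gradients flowing back into the shared branch.
        return -ctx.scale * grad_output, None

class DomainDiscriminator(nn.Module):
    def __init__(self, channels, hidden=256, grl_scale=1.0):
        super().__init__()
        self.grl_scale = grl_scale
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1))  # one logit: source vs. target

    def forward(self, f_s):
        f = GradReverse.apply(f_s, self.grl_scale)
        return self.mlp(f.mean(dim=(2, 3)))  # global average pooling, then MLP
```

With the GRL in place, a single backward pass trains $D$ to classify domains while simultaneously pushing the shared branch to confuse it.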
2.4 Contrastive Loss for Private Features
A contrastive InfoNCE loss is employed on the private features:
$$\mathcal{L}_{con} = -\frac{1}{|\mathcal{V}|}\sum_{i \in \mathcal{V}} \log \frac{\exp\left(z_i \cdot z_i^{+} / \tau\right)}{\exp\left(z_i \cdot z_i^{+} / \tau\right) + \sum_{z^{-} \in \mathcal{N}(i)} \exp\left(z_i \cdot z^{-} / \tau\right)}$$

where $z_i$ is a normalized embedding for each spatial location, $\mathcal{V}$ indexes valid (foreground) pixels, $z_i^{+}$ are positive (same-class) pairs, $\mathcal{N}(i)$ is a set of negatives, and $\tau$ is the contrastive temperature.
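A sketch of a pixel-level InfoNCE, under the assumption that foreground embeddings have already been gathered into a flat tensor with per-pixel class labels; same-class pairs serve as positives and all other pairs as negatives.

```python
# Pixel-level InfoNCE on private-branch embeddings.
import torch
import torch.nn.functional as F

def pixel_infonce(z, labels, tau=0.07):
    """z: (N, D) embeddings of valid (foreground) pixels; labels: (N,) class ids."""
    z = F.normalize(z, dim=1)                        # unit-norm embeddings
    logits = z @ z.t() / tau                         # pairwise similarities
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    logits = logits.masked_fill(eye, float('-inf'))  # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(eye, 0.0)        # keep 0 * (-inf) out of the sum
    n_pos = pos.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos).sum(dim=1) / n_pos      # average over positive pairs
    return loss[pos.any(dim=1)].mean()               # skip anchors with no positive
```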
2.5 Orthogonality Regularization
To discourage overlap between shared and private representations, a covariance-based orthogonality penalty is applied:
$$\mathcal{L}_{orth} = \Big\| \frac{1}{B} \sum_{b=1}^{B} \hat{f}_s^{(b)} \, \hat{f}_p^{(b)\top} \Big\|_F^2$$

where $\hat{f}_s^{(b)}$ and $\hat{f}_p^{(b)}$ are mean-centered descriptors of the shared and private features for sample $b$, and $b$ indexes the batch.
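A sketch of one plausible reading of this penalty, assuming each sample's shared and private features are pooled to channel descriptors, mean-centered over the batch, and penalized via the squared Frobenius norm of their cross-covariance.

```python
# Covariance-based orthogonality penalty between shared and private features.
import torch

def orthogonality_loss(f_s, f_p):
    """f_s: (B, C_s, H, W) shared features; f_p: (B, C_p, H, W) private features."""
    s = f_s.mean(dim=(2, 3))             # (B, C_s) pooled shared descriptors
    p = f_p.mean(dim=(2, 3))             # (B, C_p) pooled private descriptors
    s = s - s.mean(dim=0, keepdim=True)  # center over the batch (index b)
    p = p - p.mean(dim=0, keepdim=True)
    cov = s.t() @ p / s.size(0)          # (C_s, C_p) cross-covariance
    return (cov ** 2).sum()              # squared Frobenius norm
```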
2.6 Composite Loss
The overall training objective is:
$$\mathcal{L} = \mathcal{L}_{seg} + \lambda_{adv}\,\mathcal{L}_{adv} + \lambda_{con}\,\mathcal{L}_{con} + \lambda_{orth}\,\mathcal{L}_{orth}$$

where $\mathcal{L}_{seg}$ is the standard segmentation cross-entropy and the coefficients $\lambda_{adv}$, $\lambda_{con}$, $\lambda_{orth}$ weight the auxiliary terms.
3. Architectural Instantiation
ACFD is realized within a dual-branch architecture where:
| Branch | Input Feature | Attention Mechanism | Downstream Role |
|---|---|---|---|
| Shared | $F_{\text{low}}$ (early layers) | Spatial attention | Domain-relevant; adversarial training |
| Private | $F_{\text{high}}$ (later layers) | Channel attention | Category-relevant; contrastive learning |
The shared branch comprises two convolution-instance-normalization-ReLU blocks followed by a spatial-attention map; its output is adversarially regularized. The private branch mirrors this structure but uses channel attention and adds a projection head for contrastive embeddings.
Data flow during training is as follows (a consolidated code sketch follows this list):
- Extract $F_{\text{low}}$ and $F_{\text{high}}$ from the backbone
- Compute $F_s = A_{sp}(F_{\text{low}}) \odot F_{\text{low}}$
- Compute $F_p = A_{ch}(F_{\text{high}}) \odot F_{\text{high}}$
- Forward the fused features to the downstream segmentation head (including the MGDF module)
- Compute all relevant losses
- Update the network and discriminator parameters in an alternating regime
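Putting the steps together, a consolidated sketch of one training step; `model.backbone`, `model.shared_branch`, `model.private_branch`, `model.mgdf_fuse`, `model.segmentation_head`, and `model.project_foreground` are hypothetical names for components described elsewhere, and the loss weights are placeholders, not the paper's values. It reuses `pixel_infonce` and `orthogonality_loss` from the sketches above.

```python
# One training step: extract, decompose, fuse, and score all losses.
import torch.nn.functional as F

def training_step(batch, model, disc, w_adv=1.0, w_con=1.0, w_orth=1.0):
    f_low, f_high = model.backbone(batch['image'])        # backbone taps
    f_s = model.shared_branch(f_low)                      # spatial-attention path
    f_p = model.private_branch(f_high)                    # channel-attention path
    fused = model.mgdf_fuse(f_high, f_s, f_p)             # MGDF fusion (placeholder)
    loss_seg = F.cross_entropy(model.segmentation_head(fused), batch['mask'])
    loss_adv = F.binary_cross_entropy_with_logits(        # GRL sits inside disc
        disc(f_s), batch['domain'].float().unsqueeze(1))
    z, lbl = model.project_foreground(f_p, batch['mask']) # foreground embeddings
    loss_con = pixel_infonce(z, lbl)
    loss_orth = orthogonality_loss(f_s, f_p)
    return loss_seg + w_adv * loss_adv + w_con * loss_con + w_orth * loss_orth
```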
4. Training Protocol and Hyperparameters
Training employs alternating optimization (sketched after this list):
- `s_steps`: update the segmentation model using the composite loss $\mathcal{L}$
- `d_steps`: update the discriminator using $\mathcal{L}_{adv}$
Key hyperparameters for experiments include:
- Batch size: 8
- Segmentation optimizer: SGD with momentum 0.9
- Discriminator optimizer: Adam with weight decay 0.01
- Contrastive temperature $\tau$: 0.07
- GRL scale: 1.0
- Episodic training: each episode comprises a $K$-shot support set and a query image
- Pre-training: 20 epochs (Pascal-VOC+SBD); fine-tuning: 40 epochs on the target domain
5. Ablation Studies and Empirical Validation
Empirical evaluation demonstrates ACFD's contribution to state-of-the-art CD-FSS performance across four challenging datasets. The following ablation results characterize its impact:
- Integrating ACFD with the baseline (SSP + IFA) yields a consistent mIoU gain on FSS-1000.
- Individual loss contributions were assessed in four configurations:
  - $\mathcal{L}_{adv}$ only
  - $\mathcal{L}_{con}$ only
  - $\mathcal{L}_{adv}$ and $\mathcal{L}_{con}$ combined
  - both plus $\mathcal{L}_{orth}$
- Feature-splitting ablation (with MGDF present) compared:
  - base features only
  - private + shared features only
  - base + private + shared features
These results confirm that:
- Contrastive learning sharply defines semantic clusters in the private branch.
- Adversarial loss removes category bias from the shared branch.
- Orthogonality reduces residual correlation between the two branches.
- Fusion via the MGDF module compensates for any structural detail loss due to decoupling.
6. Integration with Downstream Modules and Broader Context
Following ACFD, the Matrix-Guided Dynamic Fusion (MGDF) module adaptively integrates base, shared, and private features under spatial guidance, ensuring spatial coherence in the segmentation output. During fine-tuning, the Cross-Adaptive Modulation (CAM) module further enhances generalization by modulating private features based on the shared representation. Collectively, ACFD, MGDF, and CAM constitute a robust pipeline for CD-FSS that explicitly models and leverages the separation between semantic and domain representations for improved adaptation with few labels.
A plausible implication is that the explicit decoupling of feature representations via adversarial-contrastive-orthogonal objectives, as demonstrated by ACFD, could generalize to other cross-domain and low-data regimes where category-domain entanglement constrains performance.