CAPNET: Adaptive Correlation for Visual Recognition

Updated 2 December 2025

The paper presents CAPNET, an end-to-end framework that leverages CLIP, learnable soft prompts, and a GCN to model semantic label correlations in long-tailed multi-label recognition.
It employs a distribution-balanced focal loss and parameter-efficient fine-tuning, achieving superior performance over state-of-the-art baselines on VOC-LT, COCO-LT, and NUS-WIDE.
The framework integrates semantic correlation modeling, test-time ensembling, and PEFT to effectively mitigate data imbalance while ensuring robust inference.

The Correlation Adaptation Prompt Network (CAPNET) is an end-to-end framework for long-tailed multi-label visual recognition, leveraging pre-trained vision-LLMs, specifically CLIP, to explicitly model label correlations and address head-tail data imbalance. CAPNET utilizes a graph convolutional network (GCN) for label propagation, learnable soft prompts for refined class embeddings, a distribution-balanced focal loss with class-aware re-weighting, test-time ensembling for robust inference, and parameter-efficient fine-tuning (PEFT) to prevent overfitting on scarce classes. The framework demonstrates substantial improvements over state-of-the-art baselines on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE (Tang et al., 25 Nov 2025).

1. Architecture Overview

CAPNET is built atop the dual-encoder structure of CLIP, integrating both a vision encoder $E_I(\cdot)$ (e.g., ResNet-50, ViT-Base/16) and a Transformer-based text encoder $E_T(\cdot)$ . The system processes an input image $\boldsymbol{x}$ as follows:

Learnable soft prompts $\{\boldsymbol{t}_c\}_{c=1}^C$ for each class are fed into $E_T$ , producing initial per-class textual features $\boldsymbol{F}_t = [\boldsymbol{f}_1,\dots,\boldsymbol{f}_C]^\top$ .
These features are refined via a three-layer GCN over a label correlation graph $\mathcal{G}$ , generating residuals $\boldsymbol{H}_L$ for each class:

$\boldsymbol{F}_t^* = \boldsymbol{F}_t + \boldsymbol{H}_L = [\boldsymbol{f}_1^*,\dots,\boldsymbol{f}_C^*]^\top$

The image encoder outputs a visual latent $\boldsymbol{v} = E_I(\boldsymbol{x})$ . Per-class probabilities are computed via cosine similarity and a sigmoid activation:

$p(y_c\,|\,\boldsymbol{x}) = \sigma \left( \frac{\cos(\boldsymbol{v}, \boldsymbol{f}_c^*)}{\tau}\right)$

where $\tau$ is a trainable temperature parameter.

Training employs a binary cross-entropy-style loss, modified to address label imbalance and correlation.

2. Semantic Label Correlation Modeling

CAPNET departs from conventional reliance on co-occurrence statistics—which are unreliable for tail classes—by explicitly constructing a semantic affinity matrix using CLIP's textual encoder. For each class $c$ , a prompt-encoded representation $\boldsymbol{z}_c$ is generated from a template, such as “a photo of a [CLS]”.

The raw correlation matrix is obtained via cosine similarity:

$\mathcal{A}_{ij} = \cos(\boldsymbol{z}_i, \boldsymbol{z}_j)$

Self-loop and neighbor weights are balanced:

$a'_{ij} = \begin{cases} \frac{s}{ \sum_{k\neq i} \mathcal{A}_{ik} } \mathcal{A}_{ij}, & i \neq j \ 1-s, & i=j \end{cases}$

where $s \in [0,1]$ controls self-loop strength. The final row-normalized correlation matrix, using softmax with temperature $\tau'$ :

$\tilde{\mathcal{A}}_{ij} = \frac{\exp(a'_{ij}/\tau')}{\sum_{k} \exp(a'_{ik}/\tau')}$

3. Graph Convolutional Network and Prompt Learning

Each class serves as a node in a three-layer GCN, with features propagated according to the learned label correlation graph:

$H^{(l+1)} = \sigma \left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right)$

where $\tilde{A} = \tilde{\mathcal{A}} + I$ , $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ , $H^{(0)} = F_t$ , and $\sigma$ is the ReLU activation. Only prompt tokens, GCN, and adapter weights are updated during training; the remainder of the text encoder is frozen.

Class prompt vectors $\boldsymbol{t}_c$ of length $M$ (empirically, $M=4$ ) are initialized as

$\boldsymbol{t}_c = [V, \ldots, V]_{M\text{ tokens},\; \mathrm{[CLS]=class\ name}}$

with token vectors $V \in \mathbb{R}^d$ being learnable.

4. Loss Function and Imbalance Mitigation

CAPNET employs a distribution-balanced focal loss tailored for long-tailed distributions. The class-specific re-weighting factor:

$r_c = \alpha + \sigma \left( \beta \left( \frac{1}{n_c / N} - \theta \right) \right)$

with $n_c$ the number of positives for class $c$ and $N$ the dataset size. A margin term:

$v_c = \kappa \log \left( \frac{1}{n_c / N} - 1 \right)$

is incorporated into the logits.

The per-class loss for an image $i$ and class $c$ :

$\ell_{cls}(y_c^i, z_c^i) = \begin{cases} - r_c (1 - q_c^i)^\gamma \ln(q_c^i), & y_c^i = 1 \ - \frac{r_c}{\zeta} (q_c^i)^\gamma \ln(1 - q_c^i), & y_c^i = 0 \end{cases}$

where $q_c^i$ is the margin-adjusted, sigmoid-modulated logit, and $\zeta$ , $\gamma$ are hyperparameters. The full loss is averaged across all images and classes.

Rare tail classes ( $n_c \ll N$ ) receive higher $r_c$ , amplifying their impact during training, while head class penalization is moderated by the margin and sigmoid mapping.

5. Inference Robustness and Parameter-Efficient Fine-Tuning

For robust test-time prediction, CAPNET employs five-crop ensembling: the input is resized and five spatial crops are generated, with class probabilities averaged:

$\boldsymbol{p}_* = \frac{1}{5} \sum_{cr=1}^5 p\left(y\,|\,\boldsymbol{x}_*^{(cr)}\right)$

Parameter-efficient fine-tuning is realized via the AdaptFormer scheme in each ViT MLP block:

$h_\ell = h'_\ell + s\,\mathrm{ReLU}\left(\mathrm{LN}(h'_\ell) W_{\downarrow}\right) W_{\uparrow} + \mathrm{MLP}\left(\mathrm{LN}(h'_\ell)\right)$

where $W_{\downarrow} \in \mathbb{R}^{d' \times \hat{d}}$ , $W_{\uparrow} \in \mathbb{R}^{\hat{d} \times d'}$ with $\hat{d} \ll d'$ . Only the adapter weights, prompts, and GCN parameters are trained. For ViT-Base/16, this requires approximately 6.6 million parameters versus 91.3 million for full fine-tuning.

6. Empirical Evaluation and Ablation

CAPNET delivers substantial improvements on long-tailed multi-label recognition. Key results:

Benchmark	Backbone	Prior Best (mAP)	CAPNET (mAP)	CAPNET + TTE + PEFT (mAP)
VOC-LT	ResNet-50	84.37 % (MLC-NC)	87.46 %	88.42 %
COCO-LT	ResNet-50	60.52 %	64.44 %	65.75 %
VOC-LT	ViT-Base/16	87.88 % (LMPT)	—	93.03 %
COCO-LT	ViT-Base/16	66.19 % (LMPT)	—	76.36 %
NUS-WIDE	ViT-Base/16	57.29 % (finetune CLIP)	60.34 %	—

Ablation studies reveal:

Distribution-balanced loss $\mathcal{L}_{cls}$ improves mAP by up to 1.47%.
GCN-based correlation propagation yields further gains.
Test-time ensembling (TTE) and PEFT both contribute incrementally, cumulatively resulting in state-of-the-art mAP (e.g., 93.03% / 76.36% on VOC-LT / COCO-LT).
Prompt-based cosine correlation matrices outperform data-derived co-occurrence matrices (+1.2% VOC-LT, +2.3% COCO-LT).
Optimal GCN self-loop weight $s=0.3$ , softmax temperature $\tau' = 0.3$ , and prompt length 4 initialized with “a photo of a [CLS]”.

7. Context and Impact

CAPNET establishes an integrated method for exploiting vision-language priors in long-tailed, multi-label settings by explicitly modeling inter-class label relationships and tailoring the optimization to class imbalance. It unifies prompt tuning, graph-based semantic propagation, distribution-aware losses, robust inference, and resource-efficient adaptation, achieving empirically verified, superior performance on established benchmarks. Contemporary analyses underscore the effectiveness of prompt-based semantic correlations over traditional data-driven label affinity schemes and demonstrate that PEFT with adapters can match or exceed full fine-tuning using an order of magnitude fewer trainable parameters (Tang et al., 25 Nov 2025).

PDF Markdown Chat (Pro)

References (1)

Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Follow Topic

Get notified by email when new papers are published related to Correlation Adaptation Prompt Network (CAPNET).