CAPNET: Adaptive Correlation for Visual Recognition

Updated 2 December 2025
  • The paper presents CAPNET, an end-to-end framework that leverages CLIP, learnable soft prompts, and a GCN to model semantic label correlations in long-tailed multi-label recognition.
  • It employs a distribution-balanced focal loss and parameter-efficient fine-tuning, achieving superior performance over state-of-the-art baselines on VOC-LT, COCO-LT, and NUS-WIDE.
  • The framework integrates semantic correlation modeling, test-time ensembling, and PEFT to effectively mitigate data imbalance while ensuring robust inference.

The Correlation Adaptation Prompt Network (CAPNET) is an end-to-end framework for long-tailed multi-label visual recognition that leverages a pre-trained vision-language model, specifically CLIP, to explicitly model label correlations and address head-tail data imbalance. CAPNET utilizes a graph convolutional network (GCN) for label propagation, learnable soft prompts for refined class embeddings, a distribution-balanced focal loss with class-aware re-weighting, test-time ensembling for robust inference, and parameter-efficient fine-tuning (PEFT) to prevent overfitting on scarce classes. The framework demonstrates substantial improvements over state-of-the-art baselines on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE (Tang et al., 25 Nov 2025).

1. Architecture Overview

CAPNET is built atop the dual-encoder structure of CLIP, integrating both a vision encoder $E_I(\cdot)$ (e.g., ResNet-50, ViT-Base/16) and a Transformer-based text encoder $E_T(\cdot)$. The system processes an input image $\boldsymbol{x}$ as follows:

  • Learnable soft prompts $\{\boldsymbol{t}_c\}_{c=1}^C$ for each class are fed into $E_T$, producing initial per-class textual features $\boldsymbol{F}_t = [\boldsymbol{f}_1,\dots,\boldsymbol{f}_C]^\top$.
  • These features are refined via a three-layer GCN over a label correlation graph $\mathcal{G}$, generating residuals $\boldsymbol{H}_L$ for each class:

$$\boldsymbol{F}_t^* = \boldsymbol{F}_t + \boldsymbol{H}_L = [\boldsymbol{f}_1^*,\dots,\boldsymbol{f}_C^*]^\top$$

  • The image encoder outputs a visual latent $\boldsymbol{v} = E_I(\boldsymbol{x})$. Per-class probabilities are computed via cosine similarity and a sigmoid activation:

$$p(y_c \mid \boldsymbol{x}) = \sigma\left( \frac{\cos(\boldsymbol{v}, \boldsymbol{f}_c^*)}{\tau} \right)$$

where $\tau$ is a trainable temperature parameter.

Training employs a binary cross-entropy-style loss, modified to address label imbalance and correlation.
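
A minimal PyTorch sketch of this forward pass may help fix ideas; the encoder, GCN, and prompt objects are placeholders, and all names below are illustrative rather than taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def capnet_forward(image, prompts, text_encoder, image_encoder, gcn, tau):
    """Hypothetical sketch of the CAPNET forward pass described above.

    `text_encoder`, `image_encoder`, and `gcn` are assumed to be callables
    returning tensors of shape (C, d), (d,), and (C, d) respectively.
    """
    f_t = text_encoder(prompts)            # per-class textual features F_t, (C, d)
    f_star = f_t + gcn(f_t)                # F_t^* = F_t + H_L (GCN residuals)
    v = image_encoder(image)               # visual latent v, (d,)
    # Cosine similarity against each refined class embedding, then a
    # temperature-scaled sigmoid per class (multi-label, not softmax).
    sims = F.cosine_similarity(v.unsqueeze(0), f_star, dim=-1)  # (C,)
    return torch.sigmoid(sims / tau)       # per-class probabilities
```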

2. Semantic Label Correlation Modeling

CAPNET departs from conventional reliance on co-occurrence statistics—which are unreliable for tail classes—by explicitly constructing a semantic affinity matrix using CLIP's textual encoder. For each class $c$, a prompt-encoded representation $\boldsymbol{z}_c$ is generated from a template such as “a photo of a [CLS]”.

The raw correlation matrix is obtained via cosine similarity:

$$\mathcal{A}_{ij} = \cos(\boldsymbol{z}_i, \boldsymbol{z}_j)$$

Self-loop and neighbor weights are balanced:

$$a'_{ij} = \begin{cases} \dfrac{s}{\sum_{k\neq i} \mathcal{A}_{ik}}\,\mathcal{A}_{ij}, & i \neq j \\ 1-s, & i=j \end{cases}$$

where $s \in [0,1]$ controls self-loop strength. The final row-normalized correlation matrix is obtained via a softmax with temperature $\tau'$:

$$\tilde{\mathcal{A}}_{ij} = \frac{\exp(a'_{ij}/\tau')}{\sum_{k} \exp(a'_{ik}/\tau')}$$
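
This construction can be sketched in a few lines of PyTorch. The defaults for `s` and `tau_prime` follow the ablation settings reported in Section 6; the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def correlation_matrix(z, s=0.3, tau_prime=0.3):
    """Builds the normalized semantic correlation matrix from prompt
    embeddings z of shape (C, d). A sketch under assumed shapes."""
    z = F.normalize(z, dim=-1)
    A = z @ z.t()                                    # A_ij = cos(z_i, z_j)
    C = A.size(0)
    off_diag = ~torch.eye(C, dtype=torch.bool)
    # Rescale each row's off-diagonal weights so their total mass is s,
    # and fix the self-loop weight to 1 - s.
    row_sums = (A * off_diag).sum(dim=1, keepdim=True)
    A_prime = torch.where(off_diag, s * A / row_sums,
                          torch.full_like(A, 1 - s))
    # Row-wise softmax with temperature tau'.
    return F.softmax(A_prime / tau_prime, dim=1)
```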

3. Graph Convolutional Network and Prompt Learning

Each class serves as a node in a three-layer GCN, with features propagated according to the learned label correlation graph:

$$H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right)$$

where $\tilde{A} = \tilde{\mathcal{A}} + I$, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $H^{(0)} = \boldsymbol{F}_t$, and $\sigma$ is the ReLU activation. Only prompt tokens, GCN, and adapter weights are updated during training; the remainder of the text encoder is frozen.
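
A single propagation step, sketched under the assumption that the adjacency passed in already includes the self-loop term $\tilde{\mathcal{A}} + I$:

```python
import torch

def gcn_layer(H, A_tilde, W):
    """One step H^{(l+1)} = ReLU(D^{-1/2} A D^{-1/2} H W).
    H: (C, d) node features; A_tilde: (C, C) adjacency with self-loops;
    W: (d, d') learnable weights. Illustrative sketch only."""
    deg = A_tilde.sum(dim=1)                          # D~_ii
    d_inv_sqrt = deg.pow(-0.5)
    # Symmetric normalization D^{-1/2} A D^{-1/2}.
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return torch.relu(A_hat @ H @ W)
```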

Class prompt vectors $\boldsymbol{t}_c$ of length $M$ (empirically, $M=4$) are initialized as

$$\boldsymbol{t}_c = [\underbrace{V, \ldots, V}_{M \text{ tokens}},\ \mathrm{[CLS]}], \qquad \mathrm{[CLS]} = \text{class name}$$

with token vectors $V \in \mathbb{R}^d$ being learnable.
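
A hypothetical initialization in PyTorch; the class-token embeddings below are random stand-ins for rows of CLIP's token embedding table, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

C, M, d = 80, 4, 512                                   # illustrative sizes
# M learnable context tokens [V, ..., V] per class.
context = nn.Parameter(torch.randn(C, M, d) * 0.02)
# Placeholder for the embedded class-name token(s); in practice these
# would come from CLIP's token embedding table.
class_tokens = torch.randn(C, 1, d)
prompts = torch.cat([context, class_tokens], dim=1)    # (C, M + 1, d)
```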

4. Loss Function and Imbalance Mitigation

CAPNET employs a distribution-balanced focal loss tailored for long-tailed distributions. The class-specific re-weighting factor:

$$r_c = \alpha + \sigma\left( \beta\left( \frac{1}{n_c / N} - \theta \right) \right)$$

with $n_c$ the number of positives for class $c$ and $N$ the dataset size. A margin term:

$$v_c = \kappa \log\left( \frac{1}{n_c / N} - 1 \right)$$

is incorporated into the logits.

The per-class loss for an image $i$ and class $c$:

$$\ell_{\mathrm{cls}}(y_c^i, z_c^i) = \begin{cases} -\,r_c \,(1 - q_c^i)^\gamma \ln(q_c^i), & y_c^i = 1 \\ -\,\dfrac{r_c}{\zeta} \,(q_c^i)^\gamma \ln(1 - q_c^i), & y_c^i = 0 \end{cases}$$

where $q_c^i$ is the margin-adjusted, sigmoid-modulated logit, and $\zeta$, $\gamma$ are hyperparameters. The full loss is averaged across all images and classes.

Rare tail classes ($n_c \ll N$) receive a higher $r_c$, amplifying their impact during training, while head-class penalization is moderated by the margin and the sigmoid mapping.
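
A sketch of this loss in PyTorch. The hyperparameter defaults are illustrative rather than the paper's, and one assumed simplification is folding the margin into the logits by subtracting $v_c$ before the sigmoid:

```python
import torch

def db_focal_loss(logits, targets, n_c, N, alpha=0.1, beta=10.0,
                  theta=0.1, kappa=0.05, gamma=2.0, zeta=5.0):
    """Sketch of the distribution-balanced focal loss described above.
    logits, targets: (B, C); n_c: (C,) per-class positive counts."""
    freq = n_c.float() / N                               # n_c / N
    r = alpha + torch.sigmoid(beta * (1.0 / freq - theta))  # re-weight r_c
    v = kappa * torch.log(1.0 / freq - 1.0)              # margin v_c
    q = torch.sigmoid(logits - v)                        # margin-adjusted prob
    eps = 1e-8
    pos = -r * (1 - q).pow(gamma) * torch.log(q + eps)           # y = 1 branch
    neg = -(r / zeta) * q.pow(gamma) * torch.log(1 - q + eps)    # y = 0 branch
    return torch.where(targets == 1, pos, neg).mean()    # average over i, c
```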

5. Inference Robustness and Parameter-Efficient Fine-Tuning

For robust test-time prediction, CAPNET employs five-crop ensembling: the input is resized and five spatial crops are generated, with class probabilities averaged:

$$\boldsymbol{p}_* = \frac{1}{5} \sum_{cr=1}^{5} p\left(y \mid \boldsymbol{x}_*^{(cr)}\right)$$
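
A minimal sketch using torchvision's `five_crop`, assuming `model` maps an image batch to per-class probabilities:

```python
import torch
from torchvision.transforms import functional as TF

def five_crop_predict(model, image, crop_size=224):
    """Five-crop test-time ensembling: average class probabilities over
    the four corner crops and the center crop. image: (C, H, W) tensor,
    already resized; a sketch, not the paper's exact pipeline."""
    crops = TF.five_crop(image, crop_size)        # tuple of 5 crops
    batch = torch.stack(crops)                    # (5, C, crop, crop)
    with torch.no_grad():
        probs = model(batch)                      # (5, num_classes)
    return probs.mean(dim=0)                      # averaged prediction p_*
```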

Parameter-efficient fine-tuning is realized via the AdaptFormer scheme in each ViT MLP block:

$$h_\ell = h'_\ell + s\,\mathrm{ReLU}\left(\mathrm{LN}(h'_\ell)\, W_{\downarrow}\right) W_{\uparrow} + \mathrm{MLP}\left(\mathrm{LN}(h'_\ell)\right)$$

where $W_{\downarrow} \in \mathbb{R}^{d' \times \hat{d}}$, $W_{\uparrow} \in \mathbb{R}^{\hat{d} \times d'}$ with $\hat{d} \ll d'$. Only the adapter weights, prompts, and GCN parameters are trained. For ViT-Base/16, this requires approximately 6.6 million parameters versus 91.3 million for full fine-tuning.
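
A sketch of one such adapter-augmented MLP block; the dimensions, scale $s$, and freezing scheme are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AdaptFormerBlock(nn.Module):
    """Parallel bottleneck adapter alongside a frozen ViT MLP branch,
    in the AdaptFormer style described above. Illustrative sketch."""
    def __init__(self, d_model=768, d_hat=64, scale=0.1):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.down = nn.Linear(d_model, d_hat, bias=False)   # W_down
        self.up = nn.Linear(d_hat, d_model, bias=False)     # W_up
        self.scale = scale
        for p in self.mlp.parameters():       # backbone MLP stays frozen;
            p.requires_grad = False           # only the adapter trains

    def forward(self, h):
        z = self.ln(h)
        # h_l = h'_l + s * ReLU(LN(h'_l) W_down) W_up + MLP(LN(h'_l))
        return h + self.scale * self.up(torch.relu(self.down(z))) + self.mlp(z)
```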

6. Empirical Evaluation and Ablation

CAPNET delivers substantial improvements on long-tailed multi-label recognition. Key results:

| Benchmark | Backbone | Prior Best (mAP) | CAPNET (mAP) | CAPNET + TTE + PEFT (mAP) |
|---|---|---|---|---|
| VOC-LT | ResNet-50 | 84.37% (MLC-NC) | 87.46% | 88.42% |
| COCO-LT | ResNet-50 | 60.52% | 64.44% | 65.75% |
| VOC-LT | ViT-Base/16 | 87.88% (LMPT) | — | 93.03% |
| COCO-LT | ViT-Base/16 | 66.19% (LMPT) | — | 76.36% |
| NUS-WIDE | ViT-Base/16 | 57.29% (fine-tuned CLIP) | — | 60.34% |

Ablation studies reveal:

  • Distribution-balanced loss $\mathcal{L}_{cls}$ improves mAP by up to 1.47%.
  • GCN-based correlation propagation yields further gains.
  • Test-time ensembling (TTE) and PEFT both contribute incrementally, cumulatively resulting in state-of-the-art mAP (e.g., 93.03% / 76.36% on VOC-LT / COCO-LT).
  • Prompt-based cosine correlation matrices outperform data-derived co-occurrence matrices (+1.2% VOC-LT, +2.3% COCO-LT).
  • Optimal settings: GCN self-loop weight $s=0.3$, softmax temperature $\tau' = 0.3$, and prompt length $M=4$ initialized with “a photo of a [CLS]”.

7. Context and Impact

CAPNET establishes an integrated method for exploiting vision-language priors in long-tailed, multi-label settings by explicitly modeling inter-class label relationships and tailoring the optimization to class imbalance. It unifies prompt tuning, graph-based semantic propagation, distribution-aware losses, robust inference, and resource-efficient adaptation, achieving empirically verified, superior performance on established benchmarks. Contemporary analyses underscore the effectiveness of prompt-based semantic correlations over traditional data-driven label affinity schemes and demonstrate that PEFT with adapters can match or exceed full fine-tuning using an order of magnitude fewer trainable parameters (Tang et al., 25 Nov 2025).
