CAPNET: Adaptive Correlation for Visual Recognition
- The paper presents CAPNET, an end-to-end framework that leverages CLIP, learnable soft prompts, and a GCN to model semantic label correlations in long-tailed multi-label recognition.
- It employs a distribution-balanced focal loss and parameter-efficient fine-tuning, achieving superior performance over state-of-the-art baselines on VOC-LT, COCO-LT, and NUS-WIDE.
- The framework integrates semantic correlation modeling, test-time ensembling, and PEFT to effectively mitigate data imbalance while ensuring robust inference.
The Correlation Adaptation Prompt Network (CAPNET) is an end-to-end framework for long-tailed multi-label visual recognition, leveraging pre-trained vision-language models, specifically CLIP, to explicitly model label correlations and address head-tail data imbalance. CAPNET utilizes a graph convolutional network (GCN) for label propagation, learnable soft prompts for refined class embeddings, a distribution-balanced focal loss with class-aware re-weighting, test-time ensembling for robust inference, and parameter-efficient fine-tuning (PEFT) to prevent overfitting on scarce classes. The framework demonstrates substantial improvements over state-of-the-art baselines on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE (Tang et al., 25 Nov 2025).
1. Architecture Overview
CAPNET is built atop the dual-encoder structure of CLIP, integrating a vision encoder $\mathcal{V}$ (e.g., ResNet-50 or ViT-Base/16) and a Transformer-based text encoder $\mathcal{T}$. The system processes an input image $x$ as follows:
- Learnable soft prompts $\mathbf{p}_c$ for each class $c$ are fed into $\mathcal{T}$, producing initial per-class textual features $\mathbf{t}_c = \mathcal{T}(\mathbf{p}_c)$.
- These features are refined via a three-layer GCN over a label correlation graph $\mathbf{A}$, generating residuals for each class:

$$\tilde{\mathbf{t}}_c = \mathbf{t}_c + \Delta \mathbf{t}_c, \qquad \Delta \mathbf{t}_c = \mathrm{GCN}\big(\{\mathbf{t}_1, \dots, \mathbf{t}_C\}, \mathbf{A}\big)_c$$

- The image encoder outputs a visual latent $\mathbf{v} = \mathcal{V}(x)$. Per-class probabilities are computed via cosine similarity and a sigmoid activation:

$$p_c = \sigma\!\left(\frac{\cos(\mathbf{v}, \tilde{\mathbf{t}}_c)}{\tau}\right),$$

where $\tau$ is a trainable temperature parameter.
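As a concrete illustration, the scoring pass can be sketched in PyTorch as below. This is a minimal sketch assuming precomputed prompt features and a correlation matrix; names such as `gcn`, `prompt_feats`, and `logit_scale` are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def capnet_forward(image_encoder, gcn, prompt_feats, corr_matrix, images, logit_scale):
    """Sketch of the CAPNET scoring pass (illustrative names, not the authors' code).

    prompt_feats: (C, d) initial per-class textual features from the text encoder
    corr_matrix:  (C, C) row-normalized label correlation matrix
    """
    # The GCN produces per-class residuals that refine the prompt features.
    refined = prompt_feats + gcn(prompt_feats, corr_matrix)   # (C, d)

    v = F.normalize(image_encoder(images), dim=-1)            # (B, d)
    t = F.normalize(refined, dim=-1)                          # (C, d)

    # Cosine similarity scaled by a trainable temperature, then sigmoid
    # (multi-label: each class is an independent binary decision).
    logits = logit_scale * v @ t.T                            # (B, C)
    return torch.sigmoid(logits)
```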
Training employs a binary cross-entropy-style loss, modified to address label imbalance and correlation.
2. Semantic Label Correlation Modeling
CAPNET departs from conventional reliance on co-occurrence statistics, which are unreliable for tail classes, by explicitly constructing a semantic affinity matrix using CLIP's textual encoder. For each class $c$, a prompt-encoded representation $\mathbf{e}_c$ is generated from a template such as "a photo of a [CLS]".

The raw correlation matrix $\mathbf{S}$ is obtained via cosine similarity:

$$S_{ij} = \frac{\mathbf{e}_i^\top \mathbf{e}_j}{\lVert \mathbf{e}_i \rVert \, \lVert \mathbf{e}_j \rVert}$$

Self-loop and neighbor weights are balanced:

$$\hat{\mathbf{S}} = \alpha \mathbf{I} + (1 - \alpha)\,\mathbf{S},$$

where $\alpha$ controls self-loop strength. The final correlation matrix $\mathbf{A}$ is row-normalized using a softmax with temperature $\tau_A$:

$$A_{ij} = \frac{\exp(\hat{S}_{ij} / \tau_A)}{\sum_{k=1}^{C} \exp(\hat{S}_{ik} / \tau_A)}$$
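A minimal sketch of this graph construction, assuming CLIP text embeddings are already computed; the values of `alpha` and `tau_a` below are placeholders, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def build_correlation_graph(class_embeds, alpha=0.5, tau_a=0.1):
    """class_embeds: (C, d) CLIP text embeddings of "a photo of a [CLS]" prompts.

    Returns a (C, C) row-stochastic label correlation matrix.
    alpha and tau_a are hyperparameters (placeholder values here).
    """
    e = F.normalize(class_embeds, dim=-1)
    S = e @ e.T                                            # cosine similarity, (C, C)

    # Blend self-loops with neighbor affinities.
    C = S.size(0)
    S_hat = alpha * torch.eye(C, device=S.device) + (1 - alpha) * S

    # Row-wise softmax normalization with temperature tau_a.
    return torch.softmax(S_hat / tau_a, dim=-1)
```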
3. Graph Convolutional Network and Prompt Learning
Each class serves as a node in a three-layer GCN, with features propagated according to the learned label correlation graph:

$$\mathbf{H}^{(l+1)} = \delta\!\left(\mathbf{A}\,\mathbf{H}^{(l)}\,\mathbf{W}^{(l)}\right), \quad l = 0, 1, 2,$$

where $\mathbf{H}^{(0)}$ stacks the initial textual features $\mathbf{t}_c$, $\mathbf{W}^{(l)}$ are learnable layer weights, $\mathbf{A}$ is the normalized correlation matrix, and $\delta$ is the ReLU activation. Only prompt tokens, GCN, and adapter weights are updated during training; the remainder of the text encoder is frozen.
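A compact PyTorch sketch of such a propagation stack follows; the hidden width and the choice to drop the activation after the final layer are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LabelGCN(nn.Module):
    """Three-layer GCN over the label correlation graph (illustrative sketch)."""

    def __init__(self, dim, hidden=512):
        super().__init__()
        # Hidden width is an assumption; the paper's sizes may differ.
        self.weights = nn.ModuleList([
            nn.Linear(dim, hidden, bias=False),
            nn.Linear(hidden, hidden, bias=False),
            nn.Linear(hidden, dim, bias=False),
        ])

    def forward(self, H, A):
        # H: (C, d) node features; A: (C, C) normalized correlation matrix.
        for i, W in enumerate(self.weights):
            H = A @ W(H)
            if i < len(self.weights) - 1:   # no activation after the last layer
                H = torch.relu(H)
        return H                             # residuals added to the prompt features
```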
Class prompt vectors of length $M$ (empirically, $M = 4$) are initialized as

$$\mathbf{p}_c = [\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_M, \mathbf{w}_c],$$

with the context token vectors $\mathbf{v}_1, \dots, \mathbf{v}_M$ being learnable and $\mathbf{w}_c$ the embedding of the class name.
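In code, this CoOp-style initialization might look as follows; the shared-context design, the fixed class-name embeddings, and the random initialization are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class SoftPrompts(nn.Module):
    """Learnable context tokens shared across classes (illustrative sketch)."""

    def __init__(self, class_name_embeds, prompt_len=4, dim=512):
        super().__init__()
        # M learnable context vectors; in practice these could be initialized
        # from the token embeddings of "a photo of a".
        self.context = nn.Parameter(0.02 * torch.randn(prompt_len, dim))
        # (C, dim) class-name embeddings w_c, kept fixed.
        self.register_buffer("class_embeds", class_name_embeds)

    def forward(self):
        C = self.class_embeds.size(0)
        ctx = self.context.unsqueeze(0).expand(C, -1, -1)    # (C, M, dim)
        names = self.class_embeds.unsqueeze(1)               # (C, 1, dim)
        return torch.cat([ctx, names], dim=1)                # (C, M+1, dim)
```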
4. Loss Function and Imbalance Mitigation
CAPNET employs a distribution-balanced focal loss tailored for long-tailed distributions. The class-specific re-weighting factor is

$$r_c = \frac{N}{C\,n_c},$$

with $n_c$ the number of positives for class $c$, $N$ the dataset size, and $C$ the number of classes. A class-prior margin term

$$\nu_c = \log\!\left(\frac{n_c}{N - n_c}\right)$$

is incorporated into the logits.

The per-class loss for an image $i$ and class $c$ is

$$\mathcal{L}_{i,c} = -r_c \Big[ y_{i,c}\,(1 - q_{i,c})^{\gamma} \log q_{i,c} + (1 - y_{i,c})\,\tfrac{1}{\lambda}\, q_{i,c}^{\gamma} \log(1 - q_{i,c}) \Big],$$

where $q_{i,c} = \sigma(z_{i,c} - \nu_c)$ is the margin-adjusted, sigmoid-modulated logit, and $\gamma$, $\lambda$ are hyperparameters. The full loss is averaged across all images and classes.

Rare tail classes (small $n_c$) receive higher $r_c$, amplifying their impact during training, while head-class penalization is moderated by the margin and sigmoid mapping.
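The sketch below implements a loss of this shape. It is an illustrative reconstruction of the standard distribution-balanced recipe; the exact re-weighting and margin formulas, and the placeholder values of `gamma` and `lam`, may differ in detail from the paper's:

```python
import torch

def db_focal_loss(logits, targets, n_pos, N, gamma=2.0, lam=2.0):
    """Distribution-balanced focal loss (illustrative reconstruction).

    logits:  (B, C) raw class logits
    targets: (B, C) binary labels
    n_pos:   (C,) float tensor, number of positive examples per class
    N:       dataset size; gamma and lam are placeholder hyperparameter values
    """
    C = logits.size(1)
    r = N / (C * n_pos.clamp(min=1))                            # rare classes weighted up
    nu = torch.log(n_pos.clamp(min=1) / (N - n_pos).clamp(min=1))  # class-prior margin

    q = torch.sigmoid(logits - nu)                              # margin-adjusted probability
    pos = targets * (1 - q).pow(gamma) * torch.log(q.clamp(min=1e-8))
    neg = (1 - targets) * q.pow(gamma) * torch.log((1 - q).clamp(min=1e-8)) / lam

    return -(r * (pos + neg)).mean()
```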
5. Inference Robustness and Parameter-Efficient Fine-Tuning
For robust test-time prediction, CAPNET employs five-crop ensembling: the input is resized and five spatial crops (four corners plus center) are generated, with class probabilities averaged:

$$\bar{p}_c = \frac{1}{5} \sum_{k=1}^{5} p_c^{(k)}$$
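In PyTorch this can be done with the standard `torchvision.transforms.FiveCrop` utility; the resize and crop sizes below are assumptions, and normalization is omitted for brevity:

```python
import torch
from torchvision import transforms

def five_crop_predict(model, image, crop_size=224):
    """Average class probabilities over four corner crops and the center crop."""
    tfm = transforms.Compose([
        transforms.Resize(256),              # resize before cropping (assumed size)
        transforms.FiveCrop(crop_size),      # returns a tuple of 5 PIL crops
        transforms.Lambda(lambda crops: torch.stack(
            [transforms.ToTensor()(c) for c in crops])),
    ])
    crops = tfm(image)                       # (5, 3, H, W)
    with torch.no_grad():
        probs = model(crops)                 # (5, C) per-crop class probabilities
    return probs.mean(dim=0)                 # (C,) ensembled prediction
```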
Parameter-efficient fine-tuning is realized via the AdaptFormer scheme, which places a bottleneck adapter in parallel with the MLP of each ViT block:

$$\mathbf{x}' = \mathrm{MLP}\big(\mathrm{LN}(\mathbf{x})\big) + s \cdot \mathrm{ReLU}\!\left(\mathrm{LN}(\mathbf{x})\,\mathbf{W}_{\text{down}}\right)\mathbf{W}_{\text{up}} + \mathbf{x},$$

where $\mathbf{W}_{\text{down}} \in \mathbb{R}^{d \times r}$ and $\mathbf{W}_{\text{up}} \in \mathbb{R}^{r \times d}$, with bottleneck dimension $r \ll d$ and scaling factor $s$. Only the adapter weights, prompts, and GCN parameters are trained. For ViT-Base/16, this requires approximately 6.6 million parameters versus 91.3 million for full fine-tuning.
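A minimal adapter block in this style, where the bottleneck width and scaling value are placeholders:

```python
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    """AdaptFormer-style bottleneck adapter, parallel to a ViT MLP (sketch)."""

    def __init__(self, dim=768, bottleneck=64, scale=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck)   # W_down: d -> r
        self.up = nn.Linear(bottleneck, dim)     # W_up:   r -> d
        self.scale = scale

    def forward(self, x, mlp):
        """x: (B, L, dim) tokens; mlp: the block's frozen MLP sub-module."""
        h = self.norm(x)
        # Frozen MLP branch + scaled adapter branch + residual connection.
        return mlp(h) + self.scale * self.up(torch.relu(self.down(h))) + x
```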
6. Empirical Evaluation and Ablation
CAPNET delivers substantial improvements on long-tailed multi-label recognition. Key results:
| Benchmark | Backbone | Prior Best (mAP) | CAPNET (mAP) | CAPNET + TTE + PEFT (mAP) |
|---|---|---|---|---|
| VOC-LT | ResNet-50 | 84.37 % (MLC-NC) | 87.46 % | 88.42 % |
| COCO-LT | ResNet-50 | 60.52 % | 64.44 % | 65.75 % |
| VOC-LT | ViT-Base/16 | 87.88 % (LMPT) | — | 93.03 % |
| COCO-LT | ViT-Base/16 | 66.19 % (LMPT) | — | 76.36 % |
| NUS-WIDE | ViT-Base/16 | 57.29 % (fine-tuned CLIP) | 60.34 % | — |
Ablation studies reveal:
- Distribution-balanced loss improves mAP by up to 1.47%.
- GCN-based correlation propagation yields further gains.
- Test-time ensembling (TTE) and PEFT both contribute incrementally, cumulatively resulting in state-of-the-art mAP (e.g., 93.03% / 76.36% on VOC-LT / COCO-LT).
- Prompt-based cosine correlation matrices outperform data-derived co-occurrence matrices (+1.2% VOC-LT, +2.3% COCO-LT).
- The best configuration uses a prompt length of 4, initialized with "a photo of a [CLS]"; results are also sensitive to the GCN self-loop weight $\alpha$ and the softmax temperature $\tau_A$, which require tuning.
7. Context and Impact
CAPNET establishes an integrated method for exploiting vision-language priors in long-tailed, multi-label settings by explicitly modeling inter-class label relationships and tailoring the optimization to class imbalance. It unifies prompt tuning, graph-based semantic propagation, distribution-aware losses, robust inference, and resource-efficient adaptation, achieving empirically verified, superior performance on established benchmarks. Contemporary analyses underscore the effectiveness of prompt-based semantic correlations over traditional data-driven label affinity schemes and demonstrate that PEFT with adapters can match or exceed full fine-tuning using an order of magnitude fewer trainable parameters (Tang et al., 25 Nov 2025).