Dynamic Augmented Multi-focus Pseudo-labeling (DAMP)

Updated 13 March 2026

The paper introduces DAMP, a pseudo-label generation method for SPML that uses dynamic augmentation and multi-focus CLIP aggregation to uncover additional true positives.
It leverages both global and local image views along with targeted negative mining to mitigate label noise and prevent false negatives.
Empirical results show state-of-the-art mAP improvements on benchmarks like VOC, COCO, NUS, and CUB, validating its effectiveness in vision-language tasks.

Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) is a pseudo-label generation method designed to address the challenges of Single-Positive Multi-Label Learning (SPML) in computer vision, where each sample carries only a single positive label while all other class labels are missing. DAMP leverages dynamic augmentation and multi-focus view aggregation with CLIP-based image-text models to robustly infer additional pseudo-labels, thereby mitigating false negatives and label noise. DAMP operates as a component of the Adaptive and Efficient Vision-Language Pseudo-Labeling (AEVLP) framework, and in conjunction with the Generalized Pseudo-Label Robust Loss (GPR Loss), delivers state-of-the-art SPML performance (Tran et al., 28 Aug 2025).

1. Underlying Motivation and Key Principles

SPML settings are characterized by severe label sparsity: for each sample, only one positive label is available and the status of all other classes is unobserved. Standard baselines that treat unannotated labels as negatives inject strong bias toward false negatives, harming both recall and precision. Existing approaches, which generate pseudo-labels for missing classes by fixed or sporadically updated models, are often prone to propagating errors, because incorrect pseudo-labels may be reinforced throughout training.

DAMP tackles these weaknesses by introducing two central innovations:

Dynamic Augmentation: Instead of relying on a single static view, DAMP produces a diverse set of augmented image crops (one global view and multiple local views) at every training epoch. This repeatedly exposes the pseudo-labeling process to varying visual evidence, reducing the entrenchment of label errors.
Multi-focus Pseudo-labeling: CLIP-based soft-predictions are aggregated across global and local views, nonlinearly boosting the detection of true positives visible only in local detail, while rigorously filtering low-confidence or ambiguous classes as negatives or zeros, respectively.

This methodology leverages the observation that local and global crops reveal complementary evidence for multi-label classification and that dynamic refresh of pseudo-labels prevents “label locking,” thus improving recall and controlling false negatives in SPML tasks.

2. Formal Algorithmic and Mathematical Description

Let an SPML dataset $\mathcal D = \{(x_n, \hat y_n)\}_{n=1}^N$ be given, with $\hat y_n\in\{0,1\}^C$ such that $\sum_i \hat y_{n,i}=1$ . The pseudo-labeling and downstream loss computation proceed as follows:

A. Augmented Views Construction

Global View: $x_n^{\text{global}} = T(x_n)$ , where $T(\cdot)$ applies weak augmentation (random flip, color jitter).
Local Views: The image is spatially partitioned into overlapping $g\times g$ grids. Each grid cell is randomly enlarged ( $r\sim U(r_{\min}, r_{\max})$ ), yielding $R=g^2$ local crops: $x_{n,z}^{\text{local}} = T(P_{n,z})$ , $z=1\ldots R$ .
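
The view construction above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names (`grid_crop_boxes`, `weak_augment`, `build_views`) are hypothetical, the enlargement is assumed to be applied around each cell's center and clipped to the image bounds, and the weak augmentation $T$ is reduced to a random horizontal flip (color jitter is omitted).

```python
import numpy as np

def grid_crop_boxes(height, width, g=4, r_min=1.0, r_max=1.2, rng=None):
    """Return g*g boxes (top, left, bottom, right).

    Each box is one grid cell enlarged about its center by r ~ U(r_min, r_max)
    and clipped to the image, so neighbouring boxes overlap when r > 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    cell_h, cell_w = height / g, width / g
    boxes = []
    for row in range(g):
        for col in range(g):
            r = rng.uniform(r_min, r_max)
            cy, cx = (row + 0.5) * cell_h, (col + 0.5) * cell_w
            half_h, half_w = 0.5 * r * cell_h, 0.5 * r * cell_w
            top = max(0, int(round(cy - half_h)))
            left = max(0, int(round(cx - half_w)))
            bottom = min(height, int(round(cy + half_h)))
            right = min(width, int(round(cx + half_w)))
            boxes.append((top, left, bottom, right))
    return boxes

def weak_augment(image, rng):
    """Stand-in for T: random horizontal flip only (assumption)."""
    return image[:, ::-1] if rng.random() < 0.5 else image

def build_views(image, g=4, r_min=1.0, r_max=1.2, rng=None):
    """Return one global view and R = g*g local views of an (H, W, 3) array."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    global_view = weak_augment(image, rng)
    boxes = grid_crop_boxes(h, w, g, r_min, r_max, rng)
    local_views = [weak_augment(image[t:b, l:r], rng) for (t, l, b, r) in boxes]
    return global_view, local_views
```

Calling `build_views` afresh at every epoch gives each image a new set of crops, which is the "dynamic" part of the scheme.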

B. CLIP-based Pseudo-labeling

Each view is encoded with CLIP image encoder $E_v$ and label texts with CLIP text encoder $E_t$ (optionally perturbed with GCN noise).
For class $i$ , the cosine similarity $\hat s_i$ between the view embedding $h=E_v(x)$ and the text embedding $t_i$ is scaled by a temperature $\tau$ and passed through a softmax: $s_i = \mathrm{Softmax}(\hat s_i / \tau)$ .
Collect global ( $S_n^{\mathrm{global}}$ ) and per-local patch ( $S_{n,z}^{\mathrm{local}}$ ) class scores.
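
The scoring step can be illustrated without a real CLIP model by treating the encoder outputs as given arrays. In this sketch, `class_scores` is a hypothetical helper, the embeddings stand in for $E_v(x)$ and $E_t(\cdot)$, the softmax is assumed to run over the $C$ classes, and the optional noise perturbation of the text embeddings is omitted. The source does not state a value for $\tau$, so it is left as a required argument.

```python
import numpy as np

def class_scores(image_emb, text_embs, tau):
    """Softmax over temperature-scaled cosine similarities.

    image_emb: (D,) embedding of one view (stand-in for E_v(x)).
    text_embs: (C, D) embeddings of the C label texts (stand-in for E_t).
    Returns a (C,) vector of class scores that sums to 1.
    """
    h = image_emb / np.linalg.norm(image_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (t @ h) / tau          # cosine similarity divided by temperature
    logits = logits - logits.max()  # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()
```

Applying `class_scores` to the global view gives $S_n^{\mathrm{global}}$, and stacking its output over the $R$ local crops gives an $(R, C)$ array of local scores.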

C. Aggregating and Thresholding

For the known positive index $\hat c$ , set local threshold $\zeta^{\mathrm{local}} = \min(\nu, s_{n,\hat c}^{\mathrm{g}})$ .
For each class $i$ :
- Aggregate local scores: $\omega_i = \max_z s_{n,z,i}^{\mathrm{l}}$ , $\psi_i = \min_z s_{n,z,i}^{\mathrm{l}}$ , then
$s_{n,i}^{\mathrm{agg}} = \begin{cases} \omega_i & \omega_i \geq \zeta^{\mathrm{local}} \\ \psi_i & \omega_i < \zeta^{\mathrm{local}} \end{cases}$
- Final soft score: $s_{n,i}^{\mathrm{final}} = 0.5(s_{n,i}^{\mathrm{g}} + s_{n,i}^{\mathrm{agg}})$ .
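
A small NumPy sketch of this aggregation rule, written directly from the formulas above. The function name `aggregate_scores` and the array layout (global scores of shape `(C,)`, local scores of shape `(R, C)`) are assumptions made for illustration.

```python
import numpy as np

def aggregate_scores(s_global, s_local, known_pos, nu=0.8):
    """Combine global and local class scores for one image.

    s_global: (C,) global-view scores.
    s_local:  (R, C) per-crop scores.
    known_pos: index of the single annotated positive class.
    """
    # Local threshold: capped by the global score of the known positive.
    zeta_local = min(nu, s_global[known_pos])
    omega = s_local.max(axis=0)  # strongest local evidence per class
    psi = s_local.min(axis=0)    # weakest local evidence per class
    # Confident classes keep their max; the rest fall back to their min.
    s_agg = np.where(omega >= zeta_local, omega, psi)
    return 0.5 * (s_global + s_agg)
```

The max/min switch is what makes the aggregation nonlinear: a class seen clearly in even one crop is boosted, while a class that never clears the threshold is pulled down to its weakest local score.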

D. Positive and Negative Pseudo-label Extraction

For a global threshold $\zeta^{\mathrm{global}}$ and top- $k$ , positive pseudo-label: $\ell_{n,i}'=1$ if $s_{n,i}^{\mathrm{final}}\geq\zeta^{\mathrm{global}}$ and $i$ is among top- $k$ classes, else $0$.
Negative mining: Compute $s_{n,i}^{\mathrm{avg}} = 0.5(s_{n,i}^{\mathrm{g}} + \frac{1}{R}\sum_{z=1}^R s_{n,z,i}^{\mathrm{l}})$ ; denoting $\theta_{\Delta_{\text{neg}}}$ as the $\Delta_{\text{neg}}$ -th percentile, $\ell_{n,i}=-1$ for $s_{n,i}^{\mathrm{avg}} \leq \theta_{\Delta_{\text{neg}}}$ , else keep $\ell_{n,i}'$ .
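
The extraction step can be sketched as below, following the two rules as stated: top-$k$ plus global threshold for positives, then a percentile cut on the averaged scores for negatives. The function name `extract_pseudo_labels` is hypothetical, the percentile is assumed to be taken over the $C$ classes of a single image, and `delta_neg=25.0` is a placeholder inside the reported range. As in the formula, the negative rule is applied last, so it overrides a positive if both conditions hold.

```python
import numpy as np

def extract_pseudo_labels(s_final, s_global, s_local,
                          zeta_global=0.7, k=3, delta_neg=25.0):
    """Return labels in {-1, 0, 1}^C for one image.

    s_final:  (C,) aggregated scores.
    s_global: (C,) global-view scores.
    s_local:  (R, C) per-crop scores.
    """
    labels = np.zeros(s_final.shape[0], dtype=int)

    # Positives: among the top-k classes, keep those above the global threshold.
    top_k = np.argsort(-s_final)[:k]
    labels[top_k[s_final[top_k] >= zeta_global]] = 1

    # Negatives: classes whose averaged score is in the bottom percentile.
    s_avg = 0.5 * (s_global + s_local.mean(axis=0))
    theta = np.percentile(s_avg, delta_neg)
    labels[s_avg <= theta] = -1
    return labels
```

Classes that are neither confidently positive nor in the bottom percentile stay at 0, meaning "unknown" rather than "negative".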

E. Downstream Loss

The constructed pseudo-labels $\ell_n \in \{-1,0,1\}^C$ , together with the supervision label $\hat y_n$ , are used by the GPR Loss:

$\mathcal{L}^{\mathrm{GPR}} = \frac{1}{NC}\sum_{n,i} v^{\mathrm{new}}(p_{n,i};\alpha)\,\mathcal{L}_{n,i}^{\mathrm{new}} + \eta\left(\frac{\hat m - m}{C}\right)^2$

where the loss terms and weights depend on the ( $\hat y$ , $\ell$ ) case (Section 4.2 of Tran et al., 28 Aug 2025).

F. High-Level Training Loop

Step	Main Operation	Key Hyperparameters
Global & Local Aug	Image augmentation, patch extraction	$g=4$ , $r\in[1.0,1.2]$
CLIP Inference	Image-text similarity scoring (per crop, per class)	$\tau$ , GCN noise
Threshold Evaluation	Derive $\zeta^{\mathrm{local}}$ , $\zeta^{\mathrm{global}}$	$\nu=0.8$ , $\zeta^{\mathrm{global}}=0.7$
Score Aggregation	Local-global, min-max for local crops	—
Label Extraction	Top-K positive, bottom $\Delta_{\text{neg}}$ negative	$k=3$ , $\Delta_{\text{neg}}$
Loss/Learning	GPR Loss, Adam optimizer	batch size 8–16, lr=1e-5

3. Implementation and Hyperparameter Settings

DAMP design relies on a set of practical settings for robust operation:

Grid Partition: $g=4$ (CUB: $g=5$ ), yielding $R = g^2$ local crops.
Augmentation: Weak (horizontal flip, color jitter), random enlargements ( $r\sim U(1.0,1.2)$ ).
Positive Thresholds: Local $\nu=0.8$ , global $\zeta^{\mathrm{global}}=0.7$ , and an upper limit of $k=3$ positives per image.
Negative Mining: The bottom $\Delta_{\text{neg}}=20\%$ – $30\%$ of average CLIP scores per image are assigned as negatives; the remaining classes are handled by the DAMP extraction step.
GPR Loss Coupling: All pseudo-labels are processed with GPR Loss, with parameters $q_1=1, q_2=1, q_3=0.3$ , $\lambda_1=0.1, \lambda_2=0.9$ , $\eta=1.0$ .
Optimizer and Training: Adam, learning rate $1\times10^{-5}$ , batch size $8$–$16$, $8$–$10$ epochs.
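
For reference, the settings listed above can be gathered into a single configuration object. This is only a convenience sketch: the class name `DampConfig` and its field names are hypothetical, and where the source gives a range (negative percentile, batch size, epochs) the default shown is one value picked from that range. The derived property assumes one CLIP image-encoder pass per view.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DampConfig:
    grid: int = 4              # g (5 for CUB)
    r_min: float = 1.0         # lower bound of crop enlargement
    r_max: float = 1.2         # upper bound of crop enlargement
    nu: float = 0.8            # cap on the local threshold
    zeta_global: float = 0.7   # global positive threshold
    top_k: int = 3             # max positives per image
    delta_neg: float = 25.0    # assumption: a value inside the 20-30% range
    lr: float = 1e-5           # Adam learning rate
    batch_size: int = 8        # reported range: 8-16
    epochs: int = 8            # reported range: 8-10

    @property
    def num_local_crops(self) -> int:
        """R = g^2 local crops per image."""
        return self.grid ** 2

    @property
    def clip_passes_per_image(self) -> int:
        """One global view plus R local views (assumes one pass per view)."""
        return 1 + self.num_local_crops
```

With the default grid this gives 16 local crops and 17 image-encoder passes per image; with the CUB setting `DampConfig(grid=5)` it gives 25 and 26.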

Training stability is promoted by stopping gradients through $v^i(p;\alpha)$ and $\hat k(p;\beta)$ , linearly warming up $\alpha$ from a conservative initialization, and capping the per-image increment in positive pseudo-labels.

4. Experimental Evaluation and Performance Analysis

DAMP, integrated as part of the AEVLP framework (DAMP + GPR), demonstrates state-of-the-art mAP on four canonical SPML benchmarks:

Dataset	mAP (AEVLP)	mAP (Next Best Prior)
VOC	90.46%	89.83% (GR-Loss), 89.10% (VLPL)
COCO	73.54%	73.17% (GR-Loss), 71.45% (VLPL)
NUS	50.70%	49.59%
CUB	24.89%	24.02%
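
To make the size of the gains concrete, the margins over the next-best prior result can be computed directly from the table. The snippet below only restates the reported numbers; the dictionary layout and variable names are illustrative.

```python
# (AEVLP mAP, next-best prior mAP) as reported in the table above.
results = {
    "VOC": (90.46, 89.83),
    "COCO": (73.54, 73.17),
    "NUS": (50.70, 49.59),
    "CUB": (24.89, 24.02),
}

# Absolute improvement in mAP percentage points, rounded to two decimals.
margins = {name: round(ours - prior, 2) for name, (ours, prior) in results.items()}
```

The resulting margins are 0.63 (VOC), 0.37 (COCO), 1.11 (NUS), and 0.87 (CUB) percentage points, so the largest gain is on NUS and the smallest on COCO.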

Ablation studies confirm the significance of each DAMP design feature:

Augmentation and GCN Noise: Removing either drops mAP by $\sim0.3\%$ – $0.5\%$ .
GPR’s Positive Re-weighting: Removing $v^4(p;\alpha)$ reduces mAP by $\sim0.2\%$ .
Negative Loss Term: Removing $\mathcal L^3$ results in a further mAP decrease of $0.1\%$ – $0.2\%$ .
Pseudo-labeling Alone: Using DAMP with BCE yields $58.31\%$ mAP, which increases to $59.90\%$ (+1.6%) when replacing BCE with GPR (Table 2 of Tran et al., 28 Aug 2025).

Pseudo-label recall and purity are empirically high: missing positive labels are recalled at rates $>25\%$ on COCO, with cumulative recall exceeding $27\%$ at $\sim81\%$ precision (Table 3). Negative mining is robust for percentiles between $20\%$ – $30\%$ , with performance insensitive to further increases (Fig. 6). Probability distributions demonstrate that AEVLP consistently pushes true positives toward $1$ and negatives near $0$, outperforming prior GR-Loss (Fig. 5).

5. Comparative Approaches and Context

Traditional SPML approaches that assign all unobserved classes as negatives or generate static pseudo-labels face critical limitations, including excessive false negatives and propagation of model error. DAMP differs considerably by:

Using dynamic augmentations and view diversity to continually refresh pseudo-labels, thereby avoiding label locking.
Exploiting the CLIP model’s vision-language alignment and aggregating per-view evidence, which improves the visibility of occluded or subtle class cues.
Explicitly integrating negative mining as a percentile-based threshold, sidestepping the binary assumption of negative by default.

A plausible implication is that this dynamic multi-view pseudo-labeling methodology may be relevant to other semi-supervised or weakly supervised settings where natural view diversity can be exploited.

6. Practical Considerations and Recommendations

Practical deployment of DAMP is guided by the following considerations:

Hyperparameter Sensitivity: mAP is stable across moderate ranges of $g$ , $\nu$ , $\zeta^{\mathrm{global}}$ , and $\Delta_{\text{neg}}$ ; fine-tuning $k$ (the maximum number of positives per image) may yield marginal gains.
Augmentation Quality: Only weak augmentations are recommended as stronger augmentations may impair label consistency.
Compute Cost: DAMP’s augmentation/multi-crop adds modest inference and memory overhead due to multiple CLIP passes per image, but avoids model retraining or additional networks.
Training Stability: Limiting the growth of positive pseudo-labels per epoch and stopping gradients through certain weighting functions are essential for convergence, as documented in the AEVLP experiments.

7. Significance and Broader Impact

DAMP advances the frontier of SPML by enabling more reliable discovery of missing positives while curbing false negatives and pseudo-label noise. Its integration with the noise-robust GPR Loss yields the AEVLP framework, which reports state-of-the-art mAP on VOC, COCO, NUS, and CUB (Tran et al., 28 Aug 2025). DAMP’s systematic use of dynamic, multi-focus, CLIP-powered pseudo-labeling, together with label aggregation and targeted negative mining, provides a general template for robust pseudo-label generation in regimes where ground-truth annotation is highly incomplete but class semantics are available.

References (1)

More Reliable Pseudo-labels, Better Performance: A Generalized Approach to Single Positive Multi-label Learning (2025)
