
Multi-Label Adaptive Contrastive Learning

Updated 26 December 2025
  • MACL is a framework that adapts contrastive learning to multi-label data by explicitly modeling label co-occurrence and addressing class imbalance.
  • It leverages various loss formulations and adaptive strategies to dynamically select and weight positive/negative pairs based on label overlap.
  • MACL demonstrates improved performance in text classification, image tagging, and remote sensing retrieval through its label-adaptive loss innovations.

Multi-Label Adaptive Contrastive Learning (MACL) is a principled framework for adapting contrastive representation learning to multi-label data scenarios. In contrast to conventional single-label paradigms, MACL explicitly models label-set structure and co-occurrence, corrects for label imbalance, and dynamically modulates positive/negative pair selection and weighting. This family of approaches has demonstrated consistent superiority in tasks such as multi-label text classification, image tagging, and remote sensing retrieval, across both head and tail classes, through empirically validated, label-adaptive loss formulations (Lin et al., 2022, Audibert et al., 12 Apr 2024, Chen et al., 31 Jan 2025, Amir et al., 18 Dec 2025).

1. Fundamental Challenges in Multi-Label Contrastive Learning

Multi-label datasets introduce substantial representational and inferential challenges due to three core factors: (i) semantic overlap—instances often share complex, partial label-set intersections, (ii) label imbalance—heavy-tailed frequency distributions wherein rare (“tail”) labels lack sufficient positive pairs, and (iii) intricately structured co-occurrence patterns—some label pairs are common, others mutually exclusive, requiring adaptive treatment (Amir et al., 18 Dec 2025, Audibert et al., 12 Apr 2024). Inadequate modeling of these properties results in suboptimal representations where dominant labels suppress minority label structure, positive/negative mining lacks semantic granularity, and classification performance on rare categories stagnates.
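For concreteness, a minimal sketch (not taken from the cited papers; the helper name is illustrative) of the Jaccard overlap between two multi-hot label vectors, the quantity several of the losses below use to measure partial label-set intersection:

```python
import numpy as np

def jaccard_similarity(y_a: np.ndarray, y_b: np.ndarray) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two multi-hot label vectors."""
    intersection = np.logical_and(y_a, y_b).sum()
    union = np.logical_or(y_a, y_b).sum()
    return float(intersection) / union if union > 0 else 0.0

# Example: two samples sharing two labels but each carrying one label the other lacks.
y_a = np.array([1, 1, 0, 1, 0])
y_b = np.array([1, 1, 1, 0, 0])
print(jaccard_similarity(y_a, y_b))  # 2 / 4 = 0.5
```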

2. Core MACL Loss Formulations

MACL instantiates a spectrum of loss families, generalizing supervised contrastive learning by parameterizing positive/negative pair construction, weighting, and adaptive temperature scaling. Key representatives include:

| Loss Name | Positive Definition | Adaptive Weight / Temperature |
|---|---|---|
| SCL (Strict Contrastive Loss) (Lin et al., 2022) | Identical label-sets | Hard positives/negatives, no weights |
| JSCL (Jaccard Similarity CL) (Lin et al., 2022) | All pairs, weighted by Jaccard | Soft weights proportional to Jaccard similarity |
| JSPCL (Jaccard Prob CL) (Lin et al., 2022) | Probabilities, weighted by Jaccard | Soft weights on output probability space |
| SLCL (Stepwise Label CL) (Lin et al., 2022) | Per-label, cross-batch | Decomposed into independent per-label streams |
| ICL (Intra-label CL) (Lin et al., 2022) | Intra-sample label pairs | No cross-sample contrast, intra-sample only |
| MACL/ABALONE (Audibert et al., 12 Apr 2024) | MoCo-style queue, Jaccard/prototype | Real positives weighted by Jaccard, prototype positives, rescaled repulsion |
| MACL-LD (Chen et al., 31 Jan 2025) | Any label overlap (ANY) | Reweighting by learned label distributions (RBF- or CL-based); softmax head |
| Remote Sensing MACL (Amir et al., 18 Dec 2025) | Per-label positive mining | Pairwise weights inversely proportional to intersection frequency; dynamic temperature |

This parameterization enables strict (“only exact matches attract”), soft (“higher overlap, higher positive weight”), or fully adaptive “fuzzy” contrast across the batch, and supplies dynamic negative scaling to guard against head-label dominance.
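As an illustration of the soft regime, a minimal PyTorch sketch of a Jaccard-weighted supervised contrastive loss in the spirit of JSCL; the exact normalization and masking in Lin et al. (2022) may differ, and the function name is illustrative:

```python
import torch
import torch.nn.functional as F

def jaccard_weighted_contrastive_loss(z, labels, temperature=0.1):
    """Soft supervised contrastive loss: each pair (i, j) is pulled together
    with weight equal to the Jaccard similarity of their label-sets.

    z:      (B, d) L2-normalized embeddings
    labels: (B, C) multi-hot label matrix
    """
    B = z.size(0)
    sim = z @ z.t() / temperature                      # (B, B) similarity logits
    eye = torch.eye(B, dtype=torch.bool, device=z.device)
    sim.masked_fill_(eye, float('-inf'))               # exclude self-pairs

    # Pairwise Jaccard weights over label-sets.
    inter = labels.float() @ labels.float().t()        # |A ∩ B|
    union = labels.sum(1, keepdim=True) + labels.sum(1) - inter
    w = inter / union.clamp(min=1)                     # (B, B) weights in [0, 1]
    w.masked_fill_(eye, 0.0)

    log_prob = F.log_softmax(sim, dim=1)               # denominator over all other samples
    # Weighted average of positive log-probabilities per anchor.
    loss = -(w * log_prob).sum(1) / w.sum(1).clamp(min=1e-8)
    return loss.mean()
```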

3. Adaptive Strategies for Positive/Negative Mining and Weighting

Positives in MACL are dynamically defined according to label-set overlap, with multiple adaptive strategies:

  • Exact-match: Only samples with identical label-sets are positives (SCL).
  • Soft/Weighted: All pairs are considered; positives are weighted by normalized Jaccard similarity of label-sets (JSCL, JSPCL, MACL/ABALONE).
  • Per-label: For each positive label in the anchor, all batch samples sharing that label comprise positives (SLCL, Remote Sensing MACL).
  • Any-overlap (ANY): MACL-LD treats any batch/queue item with intersecting label-set as a positive, with all others as negatives; this is facilitated by momentum-encoded queues for rare labels (Chen et al., 31 Jan 2025).
  • Label-distribution weighting: MACL-LD further introduces a “distribution head” to recover non-uniform class weights from binary labels, interpolated using RBF or contrastive kernels, and uses these distributions to reweight class-specific contrastive losses.

Negative pairing is correspondingly handled:

  • By default, any batch/queue sample with no label overlap is a strict negative.
  • In MACL/ABALONE, repulsion scaling ensures that negative pairs from real examples are down-weighted (parameter $\beta$), while class prototypes retain maximal negative weight; a combined masking sketch for these rules follows this list.
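A compact sketch of how three of the positive-definition regimes above (exact-match, any-overlap, and Jaccard weighting) translate into batch masks, assuming multi-hot labels; the published implementations add momentum queues, prototypes, and per-label streams omitted here:

```python
import torch

def positive_masks(labels: torch.Tensor):
    """Build (B, B) positive masks under three MACL regimes.

    labels: (B, C) multi-hot label matrix.
    Returns exact-match, any-overlap, and Jaccard-weighted masks;
    anything with zero overlap is treated as a strict negative.
    """
    labels = labels.float()
    inter = labels @ labels.t()                        # |A ∩ B|
    union = labels.sum(1, keepdim=True) + labels.sum(1) - inter

    exact = (inter == union) & (union > 0)             # SCL: identical label-sets
    any_overlap = inter > 0                            # MACL-LD "ANY" rule
    jaccard = inter / union.clamp(min=1)               # JSCL/ABALONE soft weights

    eye = torch.eye(labels.size(0), dtype=torch.bool, device=labels.device)
    return exact & ~eye, any_overlap & ~eye, jaccard.masked_fill(eye, 0.0)
```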

4. Frequency-Sensitive and Temperature-Adaptive Extensions

MACL frameworks apply explicit corrections for class imbalance and semantic overlap strength using:

  • Pairwise Label Reweighting (PLR): Each positive pair is inversely weighted by the logarithm of their intersection frequency over the dataset. This suppresses the over-representation of frequent label-pairs and amplifies the pull of rare co-occurrences (Amir et al., 18 Dec 2025).
  • Dynamic-Temperature Scaling (DTS): The temperature parameter of the contrastive loss is dynamically reduced for higher Jaccard overlap (making these pulls sharper) and elevated for rare label-set anchors (widening the feasible representation window). The final temperature for a positive pair $(i, p)$ is

$$T_{i,p} = \exp\bigl(-\alpha\, J(y(i), y(p))\bigr) + \frac{\beta}{\log\bigl(1 + h(y(i))\bigr)},$$

with $\alpha$ and $\beta$ controlling semantic and rarity scaling, respectively (Amir et al., 18 Dec 2025). A small computational sketch follows.
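A minimal sketch of how PLR weights and DTS temperatures might be computed for a batch of positive pairs, following the formula above; the placement of the smoothing constant $\epsilon$ and the default values of $\alpha$ and $\beta$ are assumptions, not taken from the cited papers:

```python
import torch

def plr_weight(pair_count: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Pairwise Label Reweighting: down-weight positive pairs whose
    label intersection co-occurs frequently over the dataset."""
    return 1.0 / torch.log(eps + pair_count.clamp(min=1))

def dts_temperature(jaccard: torch.Tensor, anchor_freq: torch.Tensor,
                    alpha: float = 1.0, beta: float = 0.1) -> torch.Tensor:
    """Dynamic-Temperature Scaling:
    T_{i,p} = exp(-alpha * J(y_i, y_p)) + beta / log(1 + h(y_i)),
    i.e. sharper for high-overlap pairs, wider for rare anchors."""
    return torch.exp(-alpha * jaccard) + beta / torch.log1p(anchor_freq)

# Example: per-pair temperatures; the high-overlap pair receives the sharper (lower) value.
j = torch.tensor([0.9, 0.2])      # Jaccard overlap of two positive pairs
h = torch.tensor([3.0, 500.0])    # frequency of each anchor's label-set in the training set
print(dts_temperature(j, h))
```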

Both modules can be ablated or tuned (see §6), demonstrating additive performance gains.

5. Architectural and Algorithmic Implementations

Across domains, MACL solutions share modular architectural features:

  • Encoder backbone: Text (BERT, RoBERTa), image (ResNet-18/50), or MLP for vector data.
  • Projection head: Typically a two-layer MLP with batch normalization, with $\ell_2$-normalized output.
  • Memory queue: Momentum-updated embeddings (MoCo-like) or hard-mined batches for positive expansion, essential for rare label handling (Audibert et al., 12 Apr 2024, Chen et al., 31 Jan 2025).
  • Distribution head: Single-layer FC + softmax to recover class-importance distributions (Chen et al., 31 Jan 2025).
  • Training: Mixed-precision, large batch, early stopping by Micro/Macro-F1 or retrieval metrics.

Pretraining proceeds for 80–400 epochs depending on scale; after contrastive pretraining, the projection head is discarded and a linear head is fitted with standard BCE for multi-label prediction, optionally fine-tuning the encoder.
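A minimal sketch of this two-stage pattern (projection head used only during contrastive pretraining, then a linear BCE head on the frozen or fine-tuned encoder); layer widths and the label count are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Two-layer MLP with batch norm; output is L2-normalized for the
    contrastive loss and discarded after pretraining."""
    def __init__(self, in_dim=768, hidden_dim=512, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, h):
        return F.normalize(self.net(h), dim=-1)

class LinearClassifier(nn.Module):
    """Stage-2 head: plain linear layer trained with BCE on encoder features."""
    def __init__(self, in_dim=768, num_labels=54):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_labels)

    def forward(self, h):
        return self.fc(h)  # logits, used with nn.BCEWithLogitsLoss
```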

Key hyperparameter regimes:

  • Batch size: 32 (text), 128 (vision)
  • Adam(W) optimizer, learning rate ~1e-3 for the head, ~1e-5 for the backbone
  • Temperature $\tau$ in $[0.05, 0.3]$, tuned per dataset
  • Regularization: weight decay (1e-4 to 1e-3), gradient clipping
  • Additionally, tune the PLR $\epsilon$, the DTS $\alpha/\beta$, and the label-distribution loss weights (collected into an example configuration below)
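Collected as a single illustrative configuration; the keys do not correspond to any specific released codebase, and the values follow the ranges above (gradient-clipping threshold and the PLR/DTS defaults are assumptions):

```python
macl_config = {
    "batch_size": 32,        # 32 for text, 128 for vision
    "optimizer": "AdamW",
    "lr_head": 1e-3,
    "lr_backbone": 1e-5,
    "temperature": 0.1,      # tuned per dataset within [0.05, 0.3]
    "weight_decay": 1e-4,    # 1e-4 to 1e-3
    "grad_clip": 1.0,        # gradient clipping threshold (illustrative)
    "plr_epsilon": 1.0,      # PLR smoothing (tune per dataset)
    "dts_alpha": 1.0,        # DTS semantic scaling
    "dts_beta": 0.1,         # DTS rarity scaling
}
```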

6. Empirical Evaluations and Benchmarks

MACL and its variants achieve state-of-the-art performance across a range of applications and metrics. Key findings include:

  • Text Classification: On SemEval-2018 (emotion, English/Arabic/Spanish) and Indonesian News, MACL contrastive losses outperform SpanEmo and Indo-BERT+Sim baselines. JSCL and JSPCL yield maximal Macro-F1 and Jaccard gains, with SCL best on strict co-occurrence (Lin et al., 2022).
  • Long-Tailed Text: On AAPD, RCV1-v2, UK-LEX, MACL/ABALONE (MSC loss) yields absolute Macro-F1 improvements (e.g., AAPD Macro-F1 +0.77 over BCE, +3.33 over classical batch contrastive), with particularly strong impact on tail labels (Audibert et al., 12 Apr 2024).
  • Image/Vector Data: On MS-COCO, PASCAL VOC, NUS-WIDE, Bookmarks, Delicious, MACL-LD (label-distribution) attains best-in-class Hamming accuracy, Example-F1, and mAP. With distribution modeling enabled, ablations show up to 3–5 point performance increases (Chen et al., 31 Jan 2025).
  • Remote Sensing Retrieval: On DLRSD, ML-AID, WHDLD, MACL and Wg-MACL yield 2–4 percentage point mAP gains over leading multi-label SupCon baselines, with ablation confirming essentiality of both PLR and DTS (Amir et al., 18 Dec 2025).

Notably, t-SNE visualizations and internal clustering metrics show that the adaptive losses in MACL form sharper, more semantically faithful representation manifolds, especially for rare and overlapping classes (Lin et al., 2022).

7. Practical Recommendations and Application Scope

  • Begin with SCL or SLCL for datasets with highly distinct labels or a focus on rare labels; use JSCL/JSPCL for settings rich in partial overlap (Lin et al., 2022).
  • For extreme label-imbalance (long-tailed), ensure inclusion of memory queues and label prototypes, with weighted attraction and repulsion (Audibert et al., 12 Apr 2024).
  • For vision or heterogeneous-label domains, augment with label-distribution heads for adaptive pair weighting (Chen et al., 31 Jan 2025).
  • In remote sensing, combine PLR and DTS for robust retrieval–classification transfer (Amir et al., 18 Dec 2025).
  • Always interpolate BCE and contrastive losses (text: $\alpha \in [0.2, 0.4]$; vision: dataset-specific), and tune temperature and weighting hyperparameters by cross-validation (a minimal interpolation sketch follows this list).
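A minimal sketch of this interpolation; the convex form below, with $\alpha$ on the contrastive term, is an assumption rather than a formulation taken from the cited papers:

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, targets, contrastive_loss, alpha=0.3):
    """Interpolate the two objectives:
    L = alpha * L_contrastive + (1 - alpha) * L_BCE,
    with alpha in [0.2, 0.4] for text per the recommendation above."""
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    return alpha * contrastive_loss + (1 - alpha) * bce
```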

MACL, as a flexible toolbox, enables robust contrastive adaptation to diverse multi-label regimes without requiring hand-engineered sampling or static balancing heuristics. Its methods are readily extensible to multi-modal, semi-supervised, or large-vocabulary settings. Code and pretrained models for state-of-the-art MACL variants are publicly distributed (Amir et al., 18 Dec 2025).
