
Partitioned Adaptive Contrastive Learning (PACL)

Updated 28 November 2025
  • PACL is a representation learning technique that partitions samples by semantic and statistical criteria to tailor contrastive objectives in heterogeneous data settings.
  • It employs adaptive loss functions for tasks such as Visual Emotion Recognition and source-free domain adaptation, improving accuracy and robustness.
  • By blending global class-level and local instance-level losses, PACL effectively addresses noisy labels and diverse data distributions.

Partitioned Adaptive Contrastive Learning (PACL) is a class of representation learning techniques that partitions training samples according to structural or semantic characteristics and applies adaptively tailored contrastive strategies within each group. In contrast to standard contrastive learning, which applies a uniform objective to all samples, PACL employs a sample-wise partitioning criterion (such as factual/emotional alignment or prediction confidence) and leverages partition-specific loss formulations to maximize learning efficacy under noisy or heterogeneous data conditions. The PACL approach has been instrumental for problems characterized by noisy labels, cross-modal alignment, and unsupervised or source-free adaptation, most notably in Visual Emotion Recognition (VER) and source-free domain adaptation contexts (Wu et al., 21 Nov 2025, Zhang et al., 2022).

1. Motivation: Addressing Heterogeneity and Noisy Data

In many learning paradigms, including VER and domain adaptation, training data is marked by heterogeneity—samples differ in label noisiness, semantic relationships, or modality alignment. For example, pre-trained visual models (e.g. CLIP, ResNet) encode “factual-level” semantics but often fail to represent “emotional-level” cues necessary for emotion understanding, a phenomenon called the "affective gap" (Wu et al., 21 Nov 2025). Textual data, such as captions or tweets, tend to embed explicit affective information due to their compositional structure and vocabulary.

Standard contrastive losses impose uniform constraints irrespective of such differences, potentially conflating well-aligned (reliable) and noisy (ambiguous or mismatched) pairs, thereby degrading representation quality. PACL addresses these issues by:

  • Partitioning data based on semantic or statistical criteria (e.g., factual/emotional match, prediction confidence).
  • Designing adaptive contrastive objectives (e.g., global class-level vs. local instance-level) for each partition to maximize alignment where reliable and to exploit intrinsic structure where labels/links are noisy (Wu et al., 21 Nov 2025, Zhang et al., 2022).

2. Partitioning Schemes and Criteria

The core step in PACL is partitioning training samples into disjoint sets guided by semantic or statistical affinity. Two canonical approaches have been presented:

  • Visual Emotion Recognition (VER): Training samples, typically noisy image–text pairs from social media, are assigned to four partitions based on two binary criteria: factual match (via CLIP cosine similarity) and emotional match (via BERTweet embeddings of visual adjective–noun pairs and text descriptions). This yields four sets: one strong-coupled set (factual and emotional match), two partial-coupled sets (match on one axis, mismatch on the other), and one weak-coupled set (mismatch on both axes) (Wu et al., 21 Nov 2025).
  • Source-Free Domain Adaptation (SFUDA): Samples are divided by softmax prediction confidence. “Source-like” samples exhibit high confidence (reliable pseudo-labels, amenable to class-level supervision), and “target-specific” samples yield low confidence (likely mislabeled, best exploited via instance-level self-supervision). Class-conditional splits and memory banks further aid loss construction (Zhang et al., 2022).

Partition statistics, thresholds, and underlying encoders vary by domain and task; for VER, a threshold σ=0.7 yields roughly 22% strong-coupled, 51% partial, and 27% weak samples (Wu et al., 21 Nov 2025). For SFUDA, τ_c=0.95 is typical for source-like extraction (Zhang et al., 2022).
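The following sketch illustrates both partitioning criteria, assuming precomputed per-sample similarity and confidence scores; the function and variable names are illustrative, and only the thresholds σ=0.7 and τ_c=0.95 come from the setups above.

```python
import torch

def assign_ver_partitions(factual_sim: torch.Tensor,
                          emotional_sim: torch.Tensor,
                          sigma: float = 0.7):
    """Assign each image-text pair to one of four partitions S1..S4.

    factual_sim / emotional_sim: per-sample cosine similarities
    (e.g., CLIP image-text similarity and BERTweet similarity between
    adjective-noun phrases and the caption). Names are illustrative.
    """
    f_match = factual_sim >= sigma      # factual-level agreement
    e_match = emotional_sim >= sigma    # emotional-level agreement

    s1 = f_match & e_match              # strong-coupled
    s2 = f_match & ~e_match             # partial: factual match only
    s3 = ~f_match & e_match             # partial: emotional match only
    s4 = ~f_match & ~e_match            # weak-coupled
    return s1, s2, s3, s4

def split_source_like(probs: torch.Tensor, tau_c: float = 0.95):
    """Split SFUDA target samples by softmax confidence (Zhang et al., 2022)."""
    conf, pseudo = probs.max(dim=1)
    source_like = conf >= tau_c         # reliable pseudo-labels
    return source_like, ~source_like, pseudo
```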

3. Adaptive Contrastive Losses

Within each partition, PACL designs loss functions and sampling strategies that exploit the particular structure or reliability of that group. The table below summarizes key partition-specific objectives for the two main application domains:

| Domain | Partition Type | Loss Strategy |
|--------|----------------|---------------|
| Visual Emotion Recognition (Wu et al., 21 Nov 2025) | Strong-coupled | One-to-one cross-modal contrastive |
| | Partial-coupled | Filtered (factual/emotional) contrastive |
| | Weak-coupled | Top-similarity positive mining, filtered negatives |
| Source-Free Domain Adaptation (Zhang et al., 2022) | Source-like | Global supervised class-level loss |
| | Target-specific | Instance-level local self-supervised loss; K-NN positive mining |

In both cases, the per-sample contrastive loss is formulated as a log-ratio of positive to combined positive+negative similarities, using temperature scaling, and with positive/negative sets constructed based on the sample's partition (Wu et al., 21 Nov 2025, Zhang et al., 2022).
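In a generic InfoNCE-style form (the exact weighting and positive-set construction differ between the two papers), this per-sample loss can be written as

$$
\mathcal{L}_i = -\frac{1}{|P(i)|}\sum_{p \in P(i)} \log \frac{\exp\big(\mathrm{sim}(z_i, z_p)/\tau\big)}{\sum_{a \in P(i)\cup N(i)} \exp\big(\mathrm{sim}(z_i, z_a)/\tau\big)}
$$

where $z$ denotes projected embeddings, $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, $\tau$ is the temperature, and $P(i)$, $N(i)$ are the positive and negative sets determined by sample $i$'s partition.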

For example, in VER, for strong-coupled pairs the positive is the corresponding paired text or image and negatives are all other samples; for weak-coupled pairs, positives are selected by highest cross-modal similarity, and negatives exclude pseudo-matched samples from the factual and emotional clusters. In domain adaptation, a supervised contrastive loss is used for high-confidence (source-like) samples and an instance-level local prototype loss for target-specific samples; both are integrated with memory-bank alignment and self-training objectives (Zhang et al., 2022).
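A condensed sketch of how such a partition-aware loss could be computed for one image–text batch, assuming L2-normalized embeddings in the shared space; the positive/negative selection is deliberately simplified relative to the papers, and the helper name and partition encoding are hypothetical.

```python
import torch
import torch.nn.functional as F

def partition_contrastive_loss(img_z, txt_z, partition, tau=0.07):
    """img_z, txt_z: (B, D) L2-normalized embeddings of paired samples.
    partition: (B,) int tensor with values 1..4 for S1..S4.
    Simplified: strong-coupled pairs use the paired text as positive;
    weak-coupled pairs replace it with the most similar non-paired text
    (a rough stand-in for the top-similarity mining described above)."""
    sim = img_z @ txt_z.t() / tau                 # (B, B) similarity logits
    B = sim.size(0)
    target = torch.arange(B, device=sim.device)   # default: diagonal positives

    weak = partition == 4                         # weak-coupled samples
    if weak.any():
        masked = sim.clone()
        masked[torch.arange(B), torch.arange(B)] = float("-inf")
        target = torch.where(weak, masked.argmax(dim=1), target)

    # InfoNCE as cross-entropy over the (temperature-scaled) similarity rows.
    return F.cross_entropy(sim, target)
```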

4. Algorithmic Pipeline and Training Procedure

PACL training is staged, incorporating sample partitioning, representation clustering, contrastive loss computation, and progressive curriculum over epochs. A prototypical pipeline for VER (Wu et al., 21 Nov 2025) consists of:

  1. Data Partition: Compute factual similarity via CLIP, emotional similarity via BERTweet embeddings of DeepSentiBank adjective-noun phrases; assign to partitions S₁–S₄ based on thresholds.
  2. Representation Clustering: Apply k-means clustering for factual (images) and emotional (texts) pseudo-labels (K=2 typical).
  3. Epoch-wise Training: For each sample in a mini-batch, construct positives and negatives contingent on partition; compute per-pair losses; backpropagate. The training curriculum progressively incorporates partitions: S₁ only in early epochs, then adding S₂/S₃, and finally S₄ in late training (see the sketch after this list).
  4. Cross-modal Architecture: Visual encoder (ResNet, ViT, Swin, etc.) with trainable projector; frozen pre-trained textual encoder (e.g. SKEP) with projector; both projected to a shared embedding space for contrastive similarity.
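A minimal sketch of the curriculum in step 3, assuming partition ids 1–4 and the partition-aware loss sketched in Section 3; the epoch boundaries (thirds of training) are illustrative assumptions rather than values from the paper.

```python
def curriculum_partitions(epoch: int, total_epochs: int = 30):
    """Return the set of partition ids active at a given epoch.
    Boundaries at 1/3 and 2/3 of training are illustrative assumptions."""
    if epoch < total_epochs // 3:
        return {1}                  # strong-coupled only
    if epoch < 2 * total_epochs // 3:
        return {1, 2, 3}            # add partial-coupled sets
    return {1, 2, 3, 4}             # finally include weak-coupled

# Schematic usage inside a training loop (loader, optimizer, and the
# loss function are assumed to exist, as in the earlier sketches):
# for epoch in range(30):
#     active = curriculum_partitions(epoch)
#     for img_z, txt_z, part in loader:
#         keep = torch.tensor([p.item() in active for p in part])
#         loss = partition_contrastive_loss(img_z[keep], txt_z[keep], part[keep])
#         loss.backward(); optimizer.step(); optimizer.zero_grad()
```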

For source-free domain adaptation (Zhang et al., 2022), the procedure iteratively updates pseudo-labels, memory bank, partitions, and computes class-level, instance-level, and alignment losses in each mini-batch.

5. Network Architecture and Hyperparameterization

The PACL framework relies on modality-specific encoders projected into a shared embedding space, typically 512- or 256-dimensional depending on application. For VER (Wu et al., 21 Nov 2025):

  • Visual branch: backbone (ResNet-50, ViT-base, Swin-base, CLIP-ViT); trainable.
  • Textual branch: frozen SKEP encoder plus trainable 2-layer MLP projector.
  • Cosine similarity computed in embedding space for loss.

Key hyperparameters (VER): partition threshold σ=0.7, clusters K=2, contrastive τ=0.07, optimizer AdamW (backbone lr 1e-3/1e-5, projectors 1e-3), batch size 64, 30 epochs. The textual encoder is always frozen; projectors are trainable (Wu et al., 21 Nov 2025).
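A minimal sketch of this dual-branch design, assuming a torchvision ResNet-50 backbone and a frozen text encoder passed in as a generic nn.Module that returns pooled features (the SKEP wrapper and tokenization are omitted; the 512-d embedding and projector shapes are illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class DualBranchPACL(nn.Module):
    """Trainable visual branch + frozen textual branch, projected into a
    shared embedding space for cosine-similarity contrastive losses."""
    def __init__(self, text_encoder: nn.Module, text_dim: int, embed_dim: int = 512):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Identity()                    # keep 2048-d pooled features
        self.visual = backbone                         # trainable
        self.visual_proj = nn.Sequential(
            nn.Linear(2048, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.text = text_encoder.eval()                # frozen (e.g., SKEP)
        for p in self.text.parameters():
            p.requires_grad = False
        self.text_proj = nn.Sequential(                # trainable 2-layer MLP
            nn.Linear(text_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, images, text_inputs):
        v = F.normalize(self.visual_proj(self.visual(images)), dim=-1)
        with torch.no_grad():
            t_feat = self.text(text_inputs)            # assumed to return pooled features
        t = F.normalize(self.text_proj(t_feat), dim=-1)
        return v, t                                    # cosine similarity = v @ t.T
```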

For SFUDA (Zhang et al., 2022): encoders (ResNet-50/101/34) with 256-d projection, memory bank, confidence τ_c=0.95, contrastive τ=0.05, and batch size/epochs dependent on dataset.

6. Empirical Performance and Benchmark Results

PACL approaches have demonstrated superior performance on a range of benchmarks:

  • Visual Emotion Recognition:
    • Linear probing: e.g., ResNet-50 + PACL achieves a +5.3% accuracy gain on UnbiasedEmo; outperforms previous “probing pre-training” strategies.
    • CLIP-ViT + PACL improves Acc-8 on FI from 71.74% baseline to 73.17%, and mAP on Emotic from 38.50% to 56.13%.
    • End-to-end: CLIP-ViT + PACL reaches 80.55% Acc-8 on FI, exceeding SimEmotion; zero-shot: ResNet-50 + PACL attains 31.7% vs. 17.5% for CLIP using text prompts (Wu et al., 21 Nov 2025).
    • Qualitative: PACL-enhanced features demonstrate higher activation on emotion-relevant regions.
  • Source-Free Domain Adaptation:
    • VisDA: DaC (a PACL instance) achieves 87.3%, outperforming SHOT, NRC, and CPGA; DaC++ reaches 88.6%, close to the fully supervised baseline (89.6%).
    • Office-Home/DomainNet: DaC boosts average top-1 accuracy compared to strong baselines (Zhang et al., 2022).
    • Ablations demonstrate each PACL component’s necessity; e.g., removing instance-level loss or alignment loss degrades performance.

PACL builds upon and extends standard contrastive frameworks (e.g., SimCLR) by merging global class cues with local structure discovery via partitioning. In domain adaptation, the approach controls target risk terms (self-training fit, consistency error, partition divergence) with dedicated loss components, leading to robust theoretical guarantees (Zhang et al., 2022).

Related methodologies include global class-level clustering, local instance contrast, and distribution alignment (e.g., MMD). The PACL paradigm is distinguished by its partition-aware, adaptive mapping of loss type to data reliability and structure, applicable to noisy, uncertain, or cross-modal domains.

In summary, PACL is characterized by strategic data partitioning and tailored contrastive objectives that collectively yield substantial improvements in downstream tasks requiring robust representation learning from noisy, heterogeneous, or unaligned data (Wu et al., 21 Nov 2025, Zhang et al., 2022).
