NUCE Loss: Uncertainty-Aware Contractive Embeddings

Updated 18 December 2025

The paper introduces NUCE Loss, which combines per-sample uncertainty reweighting with prototype-based contraction to address extreme class imbalance and ambiguous cases.
The method leverages uncertainty-aware weighting to focus on low-confidence samples, thereby creating more calibrated and separable feature embeddings.
Empirical results show that NUCE Loss improves classification accuracy and F1-scores in both malicious content detection and medical imaging scenarios.

The Novel Uncertainty-aware Contractive Embedding (NUCE) Loss is a supervised embedding and classification loss designed to address severe class imbalance and sample uncertainty in both positive-unlabeled (PU) and conventional settings. NUCE integrates uncertainty-aware weighting and contractive regularization, yielding more calibrated, separable, and robust feature spaces, particularly for tasks with rare, ambiguous, or noisy class boundaries. Initially introduced for malicious content detection under PU learning (Hossain et al., 1 Dec 2025) and further extended to medical imaging applications such as cytoplasmic string detection in human embryo time-lapse videos (Sohail et al., 10 Dec 2025), NUCE loss unifies sample-specific uncertainty quantification and prototype-based embedding contraction in a plug-and-play optimization framework.

1. Motivation and Problem Context

NUCE addresses two principal challenges common in imbalanced and difficult classification scenarios:

Extreme Class Imbalance and Ambiguity: In domains such as PU learning—where a small set $D^1 = \{(x_i, +1)\}$ is labeled positive and most examples $D^U = \{x_j\}$ remain unlabeled—standard contrastive or cross-entropy objectives fail to properly separate positive and negative instances due to ambiguous or misclassified samples. This issue is further exacerbated in medical imaging settings (e.g., cytoplasmic string detection), where the minority class may constitute as little as 2% of all examples and distinguishing features are subtle and low-contrast.
Latent Feature Compactness and Separability: Traditional prototype-based regularizers (e.g., Center Loss, Affinity Loss) encourage intra-class compactness but ignore sample-level uncertainty, while standard risk minimization underweights minority and hard samples.

NUCE solves these challenges by (i) introducing per-sample uncertainty-driven loss reweighting, which focuses model attention on ambiguous and hard-to-classify instances, and (ii) adding a contractive term that explicitly pulls embeddings toward their class prototypes or anchors, improving separability and tightness of clusters in the latent space (Hossain et al., 1 Dec 2025, Sohail et al., 10 Dec 2025).

2. Mathematical Formulation and Derivation

The mathematical structure of NUCE varies slightly depending on the underlying task but shares a common principle across applications.

2.1 Uncertainty-Aware Reweighting

For each sample $x_i$ with predicted softmax probability vector $p(x_i)$ , a sample-specific uncertainty weight is defined: $w(x_i) = 1 - \max_k p_{i,k}$ or, more generally (for exponent $\gamma \ge 0$ ): $\omega_i = (1 - \max_k p_{i,k})^{\gamma}$ This upweights low-confidence (ambiguous) cases, penalizing errors more heavily where the model is less certain.

2.2 NUCE Loss for Classification With Prototypes

Given embeddings $H = [h_1; ...; h_B]$ , class indices $y_i \in \{1, ..., K\}$ , and learnable class anchors $A = [a_1; ...; a_K]$ , the full objective is: $\mathcal{L}_{\mathrm{NUCE}} = \lambda_r \left( -\frac{1}{B}\sum_{i=1}^B \omega_i \log p_{i,y_i} \right) + \lambda_c \left( \frac{1}{2B} \sum_{i=1}^B \| h_i - a_{y_i} \|_2^2 \right)$ where $\lambda_r, \lambda_c$ control the balance between classification risk and embedding contraction (Sohail et al., 10 Dec 2025).

2.3 NUCE for Positive–Unlabeled Contrastive Representation Learning

In PU contexts, define a mini-batch $S$ and auxiliary anchor positives $S^a$ , with the encoder $f_\theta$ mapping inputs to $z$ . For each $x_i \in S$ :

Positives: $B_1(x_i)$
Negatives: $B_0(x_i)$

Pairwise InfoNCE-based loss: $\ell(z_i, z_j) = -\log \frac{\exp(\mathrm{sim}(z_i, z_j) / \tau)}{\sum_{z_k \in A(x_i)} \exp( \mathrm{sim}(z_i, z_k) / \tau )}$ NUCE loss is then: $\mathcal{L}_\mathrm{NUCE} = \frac{1}{R}\sum_{i=1}^R w(x_i) \left[ \mathbb{I}(x_i) \frac{1}{|B_1(x_i)|} \sum_{x_p \in B_1(x_i)} \ell(z_i, z_p) + (1 - \mathbb{I}(x_i)) \frac{1}{|B_0(x_i)|} \sum_{x_n \in B_0(x_i)} \ell(z_i, z_n) \right]$ where $\mathbb{I}(x_i)$ is the positive indicator (Hossain et al., 1 Dec 2025).

2.4 Contractive Effect

The combination of uncertainty weights and explicit anchor/prototype contraction pulls features for ambiguous or minority samples closer to their class centers or positive anchors, while simultaneously sharpening decision boundaries.

3. Optimization Framework and Training Details

NUCE is implemented in a mini-batch SGD framework with the following key aspects:

Feature Extraction: Embeddings are produced by a domain-appropriate backbone (e.g., Bi-LSTM with self-attention (Hossain et al., 1 Dec 2025), ViT-B/DeiT-B/CLIP-B/32 transformers (Sohail et al., 10 Dec 2025)), typically fine-tuned from pretrained weights.
Risk and Contractive Loss Components: Both terms are computed in the forward pass; anchors / prototypes are usually learnable and updated jointly with model weights.
Adaptive Hyperparameters:
- $\lambda_r$ , $\lambda_c$ are tuned to balance task-specific emphasis between risk minimization and cluster compactness.
- Uncertainty exponent $\gamma$ controls reweighting sharpness ( $\gamma = 2$ , see (Sohail et al., 10 Dec 2025)).
- Batch size $B$ and learning rate are standard tuning handles.

A representative training loop is:

for epoch in range(E):
    for batch in data_loader:
        X, Y = batch
        H = f_theta(X)                # Feature extraction
        U = H @ W.T                   # Logits
        P = softmax(U, dim=1)         # Probabilities
        omega = (1 - P.max(dim=1))**gamma
        L_risk = - (omega * log(P[range(B), Y])).mean()
        L_contract = 0.5 * ((H - A[Y])**2).sum() / B
        loss = lambda_r * L_risk + lambda_c * L_contract
        loss.backward()
        optimizer.step()

(Sohail et al., 10 Dec 2025)

4. Architectures and Extensions

4.1 Backbones

Self-Attention-Guided Bi-LSTM: For sequential data (e.g., malicious session detection), combines bidirectional LSTM hidden states with a soft self-attention mechanism to produce a fixed-length representation $z$ (Hossain et al., 1 Dec 2025).
Transformer Variants: ViT-B, Swin-B, DeiT-B, DINOv2-B, and CLIP-B/32 are validated as effective image encoders for embedding cytoplasmic strings, supporting the assertion that NUCE is compatible with diverse backbones (Sohail et al., 10 Dec 2025).

4.2 Temperature Scaling in Contrastive Settings

An adaptive temperature

$\tau = \frac{\sigma(\mathbf{v}_D)}{\ln(1+\mathrm{epoch})}$

modulates the contrastive margin, annealing from high stability (large $\tau$ ) in early epochs to higher selectivity later, synchronized with batch embedding variance (Hossain et al., 1 Dec 2025).

4.3 Stabilization Tricks

Early warm-up (with contrastive head frozen) promotes classification head calibration.
Gradient clipping (norm 1.0), weight decay ( $10^{-5}$ ), and random negative mining are used for regularization and efficiency (Hossain et al., 1 Dec 2025).

4.4 Plug-and-Play Compatibility

NUCE is designed for drop-in use with minimal architectural modifications required—only the classification and loss layers need to be adapted, with direct benefits for any backbone supporting embedding-space computation (Sohail et al., 10 Dec 2025).

5. Empirical Validation, Ablation, and Visualization

5.1 Performance in PU and Medical Imaging

PU Learning (Malicious Content Detection): On a 15,000-sample benchmark, NUCE-driven representations enable traditional classifiers (Logistic Regression, SVM, Gradient Boosting) to achieve >93.38% accuracy, >0.93 precision, 1.00 recall, and F1 ≈ 0.9657 (Hossain et al., 1 Dec 2025).
Cytoplasmic String Detection: Across all five transformer backbones, NUCE consistently yields higher F1-score than cross-entropy, focal, Center, and Affinity Loss baselines. ViT-B backbone: F1 improves from 82.71 (CE) to 88.67 (NUCE); CLIP-B/32: 70.92 to 90.87 (Sohail et al., 10 Dec 2025).

5.2 Qualitative and Ablation Analyses

Separation in Embedding Space: NUCE produces tighter and more separated clusters (PCA/t-SNE visualization) than focal or classic CE, supporting its capacity for calibrated representation.
Component Ablation: Removing uncertainty weighting or contractive term reduces F1-score by 4~7 points, confirming both elements contribute additively (Sohail et al., 10 Dec 2025).
Hyperparameter Robustness: Moderate tuning of $\lambda_r$ , $\lambda_c$ , and $\gamma$ maintains F1 within a narrow performance window (Sohail et al., 10 Dec 2025).
Attention Maps: Models trained with NUCE direct attention more sharply to class-discriminative spatial regions than those trained with CE or focal loss.

5.3 Training Behavior

NUCE converges rapidly (contrastive loss 0.52 → 0.01 in 10 epochs (Hossain et al., 1 Dec 2025)).
Secondary triplet-refinement stages further sharpen cluster boundaries.

6. Strengths, Limitations, and Applicability

Strengths

Imbalance Robustness: Directly addresses skewed class distributions, especially for rare and ambiguous examples.
Enhanced Feature Compactness: Empirically produces more discriminative and structured latent spaces conducive to high downstream classification performance.
Architectural Flexibility: The loss is compatible with any embedding backbone, requiring only simple modification to the loss and classification head.
Empirical Stability: Robust across transformer architectures and to moderate hyperparameter variation.

Limitations

Frame-level Modality: Temporal structure remains unmodeled in frame-oriented applications (e.g., cytoplasmic string detection).
Site Generalization: Medical results are reported on single-center data; generalization to broader populations is untested.
Anchor Learning: Prototypes/anchors may benefit from more sophisticated update rules or explicit margin-based refinements (Sohail et al., 10 Dec 2025).

Generalization Potential

NUCE’s joint focus on uncertainty reweighting and prototype contraction is applicable to other domains facing extreme imbalance or uncertain labels, such as small lesion detection or ambiguous-label medical imaging tasks, as well as generic scenarios with underrepresented or ambiguous classes (Sohail et al., 10 Dec 2025).

7. Broader Impact and Future Directions

The NUCE loss unifies advances from contrastive representation learning, uncertainty modeling, and prototype-based clustering into a single, lightweight framework. Its empirical success across PU text mining and fine-grained medical imaging demonstrates its applicability to a broad class of imbalanced and noise-prone tasks. Potential future developments include extension to temporal models, anchor momentum updates, explicit margin-based objectives, and large-scale, multi-institutional validation in medical applications. By directly tackling the bottlenecks of hard/minority sample learning and embedding structure, NUCE constitutes a practical advancement in the design of robust, calibrated feature learning objectives (Hossain et al., 1 Dec 2025, Sohail et al., 10 Dec 2025).

PDF Markdown Chat (Pro)

References (2)

Learning Robust Representations for Malicious Content Detection via Contrastive Sampling and Uncertainty Estimation (2025)

Cytoplasmic Strings Analysis in Human Embryo Time-Lapse Videos using Deep Learning Framework (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Follow Topic

Get notified by email when new papers are published related to Novel Uncertainty-aware Contractive Embedding (NUCE) Loss.