Unsupervised Joint Loss

Updated 24 November 2025
  • Unsupervised Joint Loss is a method that combines pseudo-labeling and clustering objectives to learn discriminative features without relying on ground-truth labels.
  • It integrates a discriminative softmax loss with a center-pooling loss, using dynamic pseudo-centers to balance intra-class compactness with inter-class separation.
  • Optimized via mini-batch SGD, this approach enhances representation learning for applications in computer vision, domain adaptation, and remote sensing.

Unsupervised joint loss refers to a class of loss function designs and learning protocols in which multiple objectives, typically encompassing both instance-grouping or clustering criteria and feature discrimination, are optimized simultaneously in settings without ground-truth labels. This paradigm leverages the complementarity of different unsupervised signals (e.g. pseudo-label assignment, intra-class compactness, inter-class discrimination, information-theoretic criteria, and structured regularization) within a single end-to-end training regime. The approach is prevalent in representation learning, clustering, unsupervised adaptation, and related fields. The unsupervised joint loss is characterized by intertwined optimization over latent class structures, representation spaces, and sometimes clustering centers, often via mini-batch SGD and dynamic reconstruction of auxiliary targets.

1. Formal Definition and General Scheme

In unsupervised joint loss frameworks, sample features are mapped by a deep model to an embedding space, and pseudo-labels or latent assignments are inferred from clustering prototypes, pseudo-centers, or metric-based assignment. The loss comprises separable terms:

  • Discriminative loss (e.g., softmax using pseudo-labels):

$$L_{\text{softmax}} = - \sum_{i=1}^{|B|} \log \frac{ \exp\left( W_{0,z_i}^{T}\phi(x_i) + b_{0,z_i} \right) }{ \sum_{j=1}^{\Lambda} \exp\left( W_{0,j}^{T}\phi(x_i) + b_{0,j} \right) }$$

where $\phi(x_i)$ is the deep feature for sample $x_i$ and the pseudo-label $z_i$ is inferred by nearest-cluster (center) assignment.

  • Clustering/center-pooling loss:

$$L_{\text{pc}} = \sum_{i=1}^{|B|} \left\| \phi(x_i) - c_{z_i} \right\|^2$$

where $c_{z_i}$ is the center of pseudo-class $z_i$.

  • Joint objective:

$$L_{\text{total}} = L_{\text{softmax}} + \lambda L_{\text{pc}}$$

with $\lambda$ tuning the trade-off between discrimination and compactness.

Pseudo-labels $z_i$ are assigned by:

$$z_i = \arg\min_{1 \leq l \leq \Lambda} \left\| \phi(x_i) - c_l \right\|$$

and centers $c_j$ are updated dynamically according to the per-sample assignments.

The scheme's key architectural principle is end-to-end simultaneous optimization, with gradients propagated jointly to the feature extractor, the classifier weights, and the clustering prototypes (Gong et al., 2019).
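As a concrete illustration, the following is a minimal PyTorch sketch of the joint objective above, assuming a batch of features of shape (B, d), a linear classifier over $\Lambda$ pseudo-classes, and a learnable center matrix of shape ($\Lambda$, d); the function and variable names are illustrative rather than taken from the cited implementation.

```python
import torch
import torch.nn.functional as F

def joint_unsupervised_loss(feats, classifier, centers, lam=0.01):
    """Joint unsupervised loss: pseudo-label softmax + center-pooling term.

    feats      : (B, d) deep features phi(x_i) from the backbone
    classifier : nn.Linear(d, Lambda) providing the softmax weights W_0, b_0
    centers    : (Lambda, d) learnable pseudo-centers c_j
    lam        : trade-off weight lambda between discrimination and compactness
    """
    # Pseudo-label assignment: z_i = argmin_j ||phi(x_i) - c_j||
    with torch.no_grad():
        pseudo_labels = torch.cdist(feats, centers).argmin(dim=1)   # (B,)

    # Discriminative softmax loss on the pseudo-labels (summed over the batch)
    l_softmax = F.cross_entropy(classifier(feats), pseudo_labels, reduction="sum")

    # Center-pooling loss: pull each feature toward its assigned center
    l_pc = ((feats - centers[pseudo_labels]) ** 2).sum()

    # Joint objective L_total = L_softmax + lambda * L_pc
    return l_softmax + lam * l_pc, pseudo_labels
```

Registering the center matrix as an `nn.Parameter` lets the same backward pass update the prototypes, which is the mechanism elaborated in Section 2.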

2. Pseudo-Label Construction and Dynamic Centers

A distinguishing feature is the automatic creation of class proxies by learning pseudo-centers. All center parameters $c_j$ are typically initialized to zero and updated via gradient descent, with labels $z_i$ recalculated per batch based on nearest-center assignment:

$$z_i = \arg\min_{j} \left\| \phi(x_i) - c_j \right\|$$

The gradients with respect to sample features and centers are:

$$\frac{\partial L_{\text{total}}}{\partial \phi(x_i)} = \frac{\partial L_{\text{softmax}}}{\partial \phi(x_i)} + 2\lambda\left(\phi(x_i) - c_{z_i}\right)$$

$$\frac{\partial L_{\text{total}}}{\partial c_j} = 2\lambda \sum_{i : z_i = j} \left( c_j - \phi(x_i) \right)$$

This dynamic re-allocation and update alleviate prototype drift and avoid static clustering biases. The continuous evolution of centers and features allows clusters to adapt to the true data distribution, giving rise to stable intra-class compactness and inter-class separation even in the absence of labels (Gong et al., 2019).
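When the centers are stored as a learnable tensor, these gradients fall out of automatic differentiation. The short check below verifies both expressions for the center term; shapes are illustrative, and random centers are used for the check rather than the zero initialization described above.

```python
import torch

B, d, num_classes, lam = 32, 64, 10, 0.01
feats = torch.randn(B, d, requires_grad=True)
centers = torch.nn.Parameter(torch.randn(num_classes, d))  # random here purely for the check

# Nearest-center pseudo-labels; the argmin is treated as a constant by autograd
with torch.no_grad():
    z = torch.cdist(feats, centers).argmin(dim=1)

l_pc = lam * ((feats - centers[z]) ** 2).sum()
l_pc.backward()

# dL/dphi(x_i) = 2*lam*(phi(x_i) - c_{z_i})
assert torch.allclose(feats.grad, (2 * lam * (feats - centers[z])).detach())

# dL/dc_j = 2*lam * sum_{i : z_i = j} (c_j - phi(x_i)), accumulated per assigned class
expected = torch.zeros_like(centers)
expected.index_add_(0, z, (2 * lam * (centers[z] - feats)).detach())
assert torch.allclose(centers.grad, expected)
```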

3. Gradient Flow and Optimization

Joint unsupervised loss optimization integrates all components into standard SGD. Losses are calculated on each mini-batch:

  1. Compute features and assign pseudo-labels by nearest-center.
  2. Calculate both softmax (classification) and center (clustering) losses.
  3. Backpropagate through the CNN, classifier weights, and center parameters.
  4. Update all parameters, ensuring that the representation learning and clustering interact.

The SGD loop deviates from conventional supervised training only in the extra nearest-center assignment and in the explicit center loss with its center-gradient updates. This unified design enables simultaneous progress on the discriminative and grouping objectives, improving feature utility for downstream tasks.
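A hedged sketch of this loop in PyTorch is shown below; the backbone, dimensions, toy data loader, learning rate, $\lambda$, and pseudo-class count are placeholders, not values from the cited work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, num_pseudo, lam = 64, 10, 0.01
backbone = nn.Sequential(nn.Linear(128, feat_dim), nn.ReLU())  # stand-in for a CNN feature extractor
classifier = nn.Linear(feat_dim, num_pseudo)                   # softmax weights W_0, b_0
centers = nn.Parameter(torch.zeros(num_pseudo, feat_dim))      # pseudo-centers c_j, zero-initialized

optimizer = torch.optim.SGD(
    [*backbone.parameters(), *classifier.parameters(), centers], lr=0.01, momentum=0.9
)

loader = [torch.randn(32, 128) for _ in range(5)]              # toy unlabeled mini-batches

for x in loader:
    feats = backbone(x)                                        # 1. compute features ...
    with torch.no_grad():
        z = torch.cdist(feats, centers).argmin(dim=1)          #    ... and nearest-center pseudo-labels
    l_softmax = F.cross_entropy(classifier(feats), z, reduction="sum")  # 2. softmax (classification) loss
    l_pc = ((feats - centers[z]) ** 2).sum()                   #    center (clustering) loss
    loss = l_softmax + lam * l_pc                              #    joint objective
    optimizer.zero_grad()
    loss.backward()                                            # 3. backprop through backbone, classifier, centers
    optimizer.step()                                           # 4. update all parameters jointly
```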

4. Theoretical and Empirical Properties

Pure softmax (pseudo-label only) training fails to guarantee tight, well-separated clusters; pure clustering (center-pull only) lacks cross-class separation. Joint optimization fuses these effects: softmax discrimination encourages features of different (pseudo-)classes to be distinguishable, while the center-pull constrains intra-class variance. Controlled via $\lambda$, this balance is critical:

  • If $\lambda$ is too low, clusters remain diffuse.
  • If $\lambda$ is too high, clusters over-compact and lose class discriminability.

Empirically, an optimal $\lambda$ produces features that form clusters aligning closely with true semantic classes, as verified on the UC Merced Land Use and Brazilian Coffee Scenes benchmarks (e.g., the joint loss yields 94.33% accuracy, outperforming discriminative and clustering baselines, and behaves robustly under sweeps of $\lambda$ and the pseudo-class count) (Gong et al., 2019).
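One way to observe this trade-off is to track intra-class compactness and inter-class separation of the learned features as $\lambda$ is swept; the small diagnostic below is a hedged illustration under assumed inputs, not a metric from the cited paper.

```python
import torch

def compactness_and_separation(feats, labels, centers):
    """Diagnostics for the lambda trade-off: mean within-cluster squared distance
    (compactness, lower is tighter) and mean pairwise distance between centers
    (separation, higher is more spread out)."""
    intra = ((feats - centers[labels]) ** 2).sum(dim=1).mean()
    inter = torch.pdist(centers).mean()
    return intra.item(), inter.item()
```

Sweeping $\lambda$ and tracking these two quantities gives a label-free view of the compactness/separation balance described above.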

5. Variant Formulations Across Clustering, Adaptation, and Contrastive Learning

Unsupervised joint loss appears under different guises across clustering, domain adaptation, and contrastive learning:

  • Deep clustering with unsupervised joint representation learning: Weighted triplet losses are recurrently mined from evolving cluster assignments and jointly optimized with the embedding network. Triplet construction encourages both local neighborhood structure and global cluster separation (Yang et al., 2016).
  • Debiased representation + clustering: Statistics pooling blocks (mean, variance, cardinality) are included to mitigate skew and improve cluster assignment fairness, with clustering Kullback-Leibler divergence and reconstruction losses integrated (Rezaei et al., 2021).
  • Contrastive and discriminative joint losses: InfoNCE-type mutual information estimators are combined with classification or cluster-pull losses, maximizing conditional separation and overall discriminability (Park et al., 2020); see the sketch below.

All of the above frameworks instantiate the central principle of distinguishing features via multi-term, unsupervised, end-to-end optimized objectives, without reliance on ground-truth labels.
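As one concrete instantiation of the contrastive variant above, the sketch below combines an InfoNCE term over two augmented views with a cluster-pull term; it is a generic illustration under assumed names (`z1`, `z2`, `centers`, `tau`, `lam`), not the specific loss of Park et al. (2020).

```python
import torch
import torch.nn.functional as F

def contrastive_joint_loss(z1, z2, centers, tau=0.1, lam=0.1):
    """InfoNCE between two views of the same batch plus a cluster-pull term.

    z1, z2  : (B, d) L2-normalized embeddings of two augmentations of the batch
    centers : (K, d) learnable cluster prototypes
    """
    B = z1.size(0)
    logits = z1 @ z2.t() / tau                    # (B, B) cross-view similarities
    targets = torch.arange(B, device=z1.device)   # positive pairs sit on the diagonal
    l_nce = F.cross_entropy(logits, targets)

    # Cluster-pull: assign each embedding to its nearest prototype and pull it in
    with torch.no_grad():
        pseudo = torch.cdist(z1, centers).argmin(dim=1)
    l_pull = ((z1 - centers[pseudo]) ** 2).sum(dim=1).mean()

    return l_nce + lam * l_pull
```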

6. Implications and Impact

Unsupervised joint loss architectures have reshaped feature learning in remote sensing, computer vision, and domain adaptation:

  • They yield features with much higher downstream classification accuracy than independent clustering or naive pseudo-labeling approaches.
  • Learned representations generalize robustly across datasets and transfer to recognition and adaptation tasks.
  • The flexibility to tune intra-class versus inter-class optimization via loss-weighting enables practitioners to control cluster granularity and discriminativeness.

In summary, joint unsupervised loss design—where feature grouping and discrimination are co-optimized through tightly coupled objectives—constitutes a principled, empirically validated approach for unsupervised modeling. It fosters the emergence of semantic structure in representation spaces and forms the backbone of modern unsupervised learning in domains where ground-truth labels are unavailable or expensive (Gong et al., 2019).
