Hierarchical Multi-Label Contrastive Learning
- Hierarchical Multi-Label Contrastive Learning (HMCE) is a framework that uses hierarchical labels to enforce level-wise contrastive objectives for capturing semantic similarities across multiple granularities.
- It integrates specialized loss functions and constraint mechanisms to maintain both coarse and fine semantic relationships through level-specific projection heads and adaptive weighting.
- HMCE enhances performance across diverse domains such as vision, text, and bioinformatics by improving classification, clustering, retrieval, and cross-domain transfer, as evidenced by benchmark gains.
Hierarchical Multi-Label Contrastive Learning (HMCE) is a family of representation learning frameworks designed to incorporate fine-grained, multi-level supervisory signals determined by a hierarchy of labels. Unlike traditional contrastive learning, which typically relies on independent labels or views, HMCE explicitly leverages the relationships among labels defined in trees or directed acyclic graphs, enforcing both level-wise structure and consistency across the hierarchy. This approach produces embeddings that preserve semantic similarity at multiple granularities and enables improved performance on classification, clustering, retrieval, and transfer across domains with rich taxonomies (Zhang et al., 2022).
1. Problem Setting and Motivating Principles
The HMCE paradigm operates on data where each instance $x$ is associated with a label-path $(y^1, \dots, y^L)$ traversing a hierarchy (tree or DAG). Here, $y^l$ encodes the label at hierarchy level $l$, ranging from the coarsest ($l = 1$, e.g., super-category) to the finest ($l = L$, e.g., instance ID). A fundamental principle is that label relationships are not flat: instances can be semantically similar at one level but diverge at another. By constructing contrastive objectives at each level, HMCE enforces that representations reflect these nuanced similarities, beyond what single-label or conventional multi-label contrastive techniques achieve (Zhang et al., 2022, Ott et al., 1 Oct 2025, Ghanooni et al., 4 Feb 2025).
For a given anchor $x_i$, positives at level $l$ are defined via the lowest common ancestor: pairs sharing ancestry up to (but not beyond) level $l$ are considered positives at that level. This hierarchical positive-negative assignment generalizes standard paradigms (SimCLR, SupCon), reducing to SimCLR if no labels exist and to SupCon when a single label is given (Zhang et al., 2022).
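A minimal sketch of this lowest-common-ancestor positive assignment, assuming each sample carries an array of per-level labels (the `label_paths` layout and the per-level boolean mask convention are illustrative choices, not taken from any of the cited papers):

```python
import numpy as np

def level_positive_masks(label_paths: np.ndarray) -> np.ndarray:
    """Build one boolean positive mask per hierarchy level.

    label_paths: (N, L) array; column 0 is the coarsest level, column
    L-1 the finest. Sample j is a positive for anchor i at level l iff
    the two label paths agree on ALL levels 0..l (i.e. their lowest
    common ancestor sits at depth >= l). Self-pairs are excluded.
    """
    n, depth = label_paths.shape
    masks = np.zeros((depth, n, n), dtype=bool)
    agree = np.ones((n, n), dtype=bool)
    for l in range(depth):
        # Pairs must match at this level AND at every coarser level.
        agree &= label_paths[:, l][:, None] == label_paths[:, l][None, :]
        masks[l] = agree & ~np.eye(n, dtype=bool)
    return masks

# Toy hierarchy: (super-class, class) label paths for four samples.
paths = np.array([[0, 0], [0, 0], [0, 1], [1, 2]])
masks = level_positive_masks(paths)
# Samples 0, 1, 2 share super-class 0, so they are coarse-level
# positives; only samples 0 and 1 remain positives at the fine level.
```

Deeper hierarchies need no code change: each extra column of `label_paths` yields one more mask.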
2. Core Loss Functions and Hierarchy Enforcement
HMCE frameworks typically decompose their objectives into level-wise components, each with specialized penalties:
(a) Hierarchical Multi-Label Contrastive Loss (HiMulCon):
Averaged over all levels, this loss weights each positive pair according to its level, e.g.

$$\mathcal{L}_{\text{HiMulCon}} = \frac{1}{|\mathcal{L}|} \sum_{l \in \mathcal{L}} \frac{1}{|B|} \sum_{i \in B} \frac{-\lambda_l}{|P_l(i)|} \sum_{p \in P_l(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in B,\, a \neq i} \exp(z_i \cdot z_a / \tau)}$$

where $P_l(i)$ is the set of positives at level $l$ for anchor $i$, $z$ are normalized embeddings, $\tau$ is a temperature, and $\lambda_l$ is a monotonic, level-dependent penalty (linear or exponential). Finer levels have higher $\lambda_l$, thus forcing closer alignment (Zhang et al., 2022, Ott et al., 1 Oct 2025).
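A level-weighted InfoNCE-style loss in this spirit can be sketched in NumPy (one level per call, to be averaged over levels with increasing `lam`; the function name, default temperature, and exact normalization are assumptions of this sketch):

```python
import numpy as np

def level_contrastive_loss(z, pos_mask, lam=1.0, tau=0.1):
    """InfoNCE-style loss for a single hierarchy level, scaled by a
    level penalty lambda_l.

    z: (N, D) L2-normalized embeddings; pos_mask: (N, N) boolean
    positives at this level (diagonal must be False); lam: level
    penalty, larger at finer levels; tau: softmax temperature.
    """
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)            # exclude self-similarity
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    n_pos = pos_mask.sum(axis=1)
    valid = n_pos > 0                         # anchors with >=1 positive
    pos_log_prob = np.where(pos_mask, log_prob, 0.0)
    per_anchor = -pos_log_prob.sum(axis=1)[valid] / n_pos[valid]
    return lam * per_anchor.mean()
```

Calling this once per level with the corresponding mask and a monotonically increasing `lam`, then averaging, gives a simple instance of the level-weighted objective.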
(b) Hierarchical Constraint-Enforcing Loss (HiConE):
Soft constraints ensure that samples are never pulled closer at a coarser level than at a finer one. This is implemented by lower-bounding each pairwise loss at a level by the maximum pairwise loss at the next-finer level,

$$\ell_{\text{pair}}^{l}(z_i, z_p) \;\ge\; \max_{p' \in P_{l+1}(i)} \ell_{\text{pair}}^{l+1}(z_i, z_{p'}),$$

so the maximum pairwise loss at any level cannot exceed that of its parent level (Zhang et al., 2022, Liu et al., 3 Jul 2025).
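One illustrative way to realize this constraint, assuming per-anchor pairwise losses have already been computed for each level (the list-of-arrays layout and the finest-to-coarsest propagation are choices of this sketch, not prescribed by the papers):

```python
import numpy as np

def enforce_hierarchy(pair_losses):
    """Apply a HiConE-style constraint to per-level pairwise losses.

    pair_losses: list of 1-D arrays; pair_losses[l] holds the pairwise
    losses of one anchor at level l (0 = coarsest). Each level is
    clipped from below at the next-finer level's maximum, so a coarser
    pair is never assigned a smaller loss than any finer pair.
    """
    out = [pl.copy() for pl in pair_losses]
    # Walk from finest to coarsest, propagating the running maximum up.
    for l in range(len(out) - 2, -1, -1):
        finer_max = out[l + 1].max() if out[l + 1].size else -np.inf
        out[l] = np.maximum(out[l], finer_max)
    return out
```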
(c) Combined HMCE Loss:
A common strategy is to combine the two terms, e.g. $\mathcal{L}_{\text{HMCE}} = \mathcal{L}_{\text{HiMulCon}} + \mathcal{L}_{\text{HiConE}}$. By construction, this unified loss is data-driven and adapts to arbitrary taxonomy sizes and shapes, ensuring both level-weighted attraction and global constraint enforcement (Zhang et al., 2022).
In more recent variants, e.g., HCAL (Jiang et al., 19 Aug 2025), the hierarchy is enforced with prototype-based contrastive heads and adaptive loss-balancing mechanisms that further improve semantic consistency and address level-wise convergence imbalances.
3. Model Architectures and Sampling Techniques
HMCE is compatible with a wide range of encoders—ResNet-50 for vision, BERT for text, PointNet++ for point clouds, transformer variants for biological sequences (Zhang et al., 2022, Ott et al., 1 Oct 2025, Liu et al., 3 Jul 2025). Architectures may include:
- Level-specific projection heads: Each hierarchy level may be equipped with its own projection head, allowing embedding specialization (Ghanooni et al., 4 Feb 2025).
- Feature identification masks: Level-specific binary masks (G-HMLC) or attention mechanisms (A-HMLC) identify which features contribute to the representations at each hierarchy level (Ott et al., 1 Oct 2025).
- Prototype-based organization: Models like HCAL assign learnable prototypes to each label, sometimes perturbed by noise, and contrast samples to these prototypes, enforcing cluster compactness (Jiang et al., 19 Aug 2025).
Sampling and batch construction:
- Level-wise masks are built to efficiently identify pairs sharing a specific ancestor.
- For each anchor, a positive is sampled per level from instances sharing ancestry at that level but diverging below.
- Negatives are chosen to reflect level structure; some frameworks further refine this by mining local hard negatives (siblings and descendants) (Chen et al., 2024).
This approach ensures that embeddings from all levels can be learned efficiently, regardless of hierarchy depth or class imbalance.
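The per-level positive sampling described above can be sketched as follows (the `label_paths` layout, seeded generator, and dict-of-picks return format are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_positive_per_level(anchor, label_paths):
    """For one anchor, sample one positive index per level: a sample
    sharing the anchor's ancestry at that level but diverging below it.

    label_paths: (N, L) array, column 0 coarsest. Levels with no valid
    candidate are simply omitted from the returned dict.
    """
    n, depth = label_paths.shape
    picks = {}
    for l in range(depth):
        # Shares ancestry through level l ...
        share = np.all(label_paths[:, :l + 1] == label_paths[anchor, :l + 1],
                       axis=1)
        # ... but diverges from the anchor at the next level down.
        if l + 1 < depth:
            diverge = ~np.all(label_paths[:, :l + 2] == label_paths[anchor, :l + 2],
                              axis=1)
        else:
            diverge = np.ones(n, dtype=bool)
        cand = np.flatnonzero(share & diverge & (np.arange(n) != anchor))
        if cand.size:
            picks[l] = int(rng.choice(cand))
    return picks
```

Combined with precomputed level-wise masks, this keeps batch construction vectorized except for the final per-anchor draw.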
4. Extensions and Generalizations
Recent advances have enriched HMCE with several methodological and domain-specific augmentations:
- Multi-modal embedding fusion: For tasks like protein–protein interaction (PPI), sequence embeddings are fused with annotation embeddings, and graph neural networks further propagate hierarchical information through structured networks (Liu et al., 3 Jul 2025).
- Hierarchy-informed weighting: Adaptive weighting of contrastive losses across levels has been introduced, often using convergence-based reweighting schemes to mitigate optimization bias (Jiang et al., 19 Aug 2025).
- Alternate geometry: Hierarchy-preserving objectives have been instantiated in Euclidean and hyperbolic spaces, with explicit ancestor-dependent margin constraints and pairwise weighting, improving taxonomy faithfulness in embeddings (Khan, 5 Nov 2025).
- Unsupervised and semi-supervised HMCE: Approaches such as CsMl exploit pseudo-labeling and hierarchical semantic alignment in the absence of explicit labels, defining views at multiple levels using nearest-neighbor sets and mixup-style augmentation (Xu et al., 2020).
- Label-similarity integration: Methods such as LASCL factor in the similarity of different labels based on hierarchy (e.g., class centers or paths), scaling negative sampling in the contrastive loss accordingly (Lian et al., 2024).
5. Empirical Results, Advantages, and Limitations
HMCE methods consistently demonstrate improvements over flat and single-level contrastive baselines across a variety of benchmarks—vision (ImageNet, CIFAR-100, ModelNet40, DeepFashion), text (WOS, RCV1, NYT), biological sequences (STRING, SHS27k, SHS148k), and medical imaging (BreakHis).
Example results (Zhang et al., 2022, Ott et al., 1 Oct 2025, Liu et al., 3 Jul 2025):
- On ModelNet40, HMCE achieves up to 88.46% top-1 accuracy, compared to 81.60% for SupCon.
- In PPI prediction, HMCE delivers 2.9–10.9% higher micro-F1 under challenging splits and significantly better zero-shot cross-species transfer (Liu et al., 3 Jul 2025).
- Medical imaging applications yield 8–13 percentage point improvements in hierarchical F1 and 33–45% reductions in taxonomy violation rates (Khan, 5 Nov 2025).
Key benefits:
- Substantially increased accuracy, especially in low-shot regimes and in generalization to unseen classes.
- Embeddings exhibit improved cluster structure at both coarse and fine granularity, as measured by NMI, HF1, retrieval MAP, and visually via t-SNE (Zhang et al., 2022, Ott et al., 1 Oct 2025).
- Robustness to hierarchy order, class imbalance, and domain-specific noise.
Limitations and open considerations:
- In domains with noisy or rapidly changing hierarchies, enforcing strong constraints may lead to over-regularization.
- The sampling of truly hard negatives at deep levels can pose computational and memory challenges in very large taxonomies.
- Explicit structure encoders may suffer from poor parameter scaling for extremely large label trees (addressed via lightweight HMCE variants such as HiLight (Chen et al., 2024)).
6. Evaluation Metrics and Hierarchy-Aware Assessment
Proper evaluation of HMCE models requires metrics sensitive to hierarchical structure. Common metrics include:
- Hierarchical F1 (HF1): Accounts for overlap in ancestor sets between prediction and ground truth, emphasizing correctness at all levels (Khan, 5 Nov 2025).
- Tree-distance-weighted accuracy (H-Acc): Measures proximity in the hierarchy between prediction and ground truth.
- Parent-distance violation rate: Quantifies the fraction of predictions in which a child is mapped closer to a non-parent than its true parent.
- Cluster compactness and separation: intra- and inter-cluster distances and NMI at multiple levels confirm the preservation of both intra-cluster tightness and inter-cluster separation (Lian et al., 2024).
These metrics are critical for demonstrating not only flat classification performance but also faithfulness to the underlying taxonomy, which is central to the HMCE objective.
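As an illustration, a micro-averaged hierarchical F1 over ancestor sets can be computed as below (representing each label as the set of nodes on its root-to-label path is an assumption of this sketch):

```python
def hierarchical_f1(pred_ancestors, true_ancestors):
    """Micro-averaged hierarchical F1 over ancestor sets.

    Each element of the two lists is the set of nodes on the path from
    the root to the predicted (resp. true) label, label included.
    Partial credit accrues for every correctly shared ancestor.
    """
    tp = sum(len(p & t) for p, t in zip(pred_ancestors, true_ancestors))
    pred_total = sum(len(p) for p in pred_ancestors)
    true_total = sum(len(t) for t in true_ancestors)
    precision = tp / pred_total if pred_total else 0.0
    recall = tp / true_total if true_total else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Predicting "dog" when the truth is "cat" still earns credit for the
# shared "animal" ancestor, unlike flat 0/1 accuracy.
```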
7. Practical Considerations and Algorithmic Summary
Modern HMCE systems are compatible with existing contrastive learning pipelines with minimal architectural changes: often a matter of adding level-wise heads and level- and mask-aware sampler modules, and replacing or augmenting the loss function.
Key implementation aspects:
- Batch construction: Efficient vectorized computation of all level-wise pairwise similarities and masks.
- Loss balancing: Monotonic or adaptive per-level loss reweighting.
- Augmentation: Diverse views per sample (standard crops, flips, text augmentations) for robust representation.
- Prototype maintenance: Prototypes (possibly perturbed) are updated via mean aggregation or EMA; some methods use label-text encodings for initialization (Jiang et al., 19 Aug 2025, Lian et al., 2024).
- Training and inference: Encoders are often pretrained with HMCE, and a lightweight classifier is then trained on frozen features for downstream tasks (Zhang et al., 2022).
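An EMA prototype update of the kind mentioned above might look like this (per-class unit-norm prototypes; the momentum value and the renormalization step are assumptions of this sketch):

```python
import numpy as np

def ema_update_prototypes(prototypes, feats, labels, momentum=0.99):
    """Update per-class prototypes with an exponential moving average
    of the mean batch feature of each class present in the batch.

    prototypes: (C, D) array, one row per class; feats: (N, D) batch
    features; labels: (N,) integer class indices into prototypes.
    """
    protos = prototypes.copy()
    for c in np.unique(labels):
        batch_mean = feats[labels == c].mean(axis=0)
        protos[c] = momentum * protos[c] + (1 - momentum) * batch_mean
        protos[c] /= np.linalg.norm(protos[c])  # keep on the unit sphere
    return protos
```

Classes absent from the batch keep their previous prototype, which is what makes the EMA robust to class imbalance within batches.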
Adaptation to new tasks is facilitated by modularity: the same HMCE infrastructure can be applied to images, text, graphs, or biomolecular sequences with appropriate encoders.
References
- "Use All The Labels: A Hierarchical Multi-Label Contrastive Learning Framework" (Zhang et al., 2022)
- "Feature Identification for Hierarchical Contrastive Learning" (Ott et al., 1 Oct 2025)
- "Multi-level Supervised Contrastive Learning" (Ghanooni et al., 4 Feb 2025)
- "Hierarchy-Consistent Learning and Adaptive Loss Balancing for Hierarchical Multi-Label Classification" (Jiang et al., 19 Aug 2025)
- "Climbing the label tree: Hierarchy-preserving contrastive learning for medical imaging" (Khan, 5 Nov 2025)
- "Learning Label Hierarchy with Supervised Contrastive Learning" (Lian et al., 2024)
- "Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning" (Xu et al., 2020)
- "HiLight: A Hierarchy-aware Light Global Model with Hierarchical Local ConTrastive Learning" (Chen et al., 2024)
- "Hierarchical Multi-Label Contrastive Learning for Protein-Protein Interaction Prediction Across Organisms" (Liu et al., 3 Jul 2025)