Hierarchical Multi-label Contrastive Losses

Updated 4 May 2026

Hierarchical multi-label contrastive losses integrate instance similarity with label hierarchies to improve performance in detection, retrieval, and classification tasks.
They rely on level-wise sampling, adaptive regularization, and prototype-based architectures to address class imbalance and enforce taxonomic consistency.
Reported results show state-of-the-art gains in domains such as remote sensing, biomedical retrieval, and text classification, obtained by encoding multi-level semantic structure.

Hierarchical multi-label contrastive losses are a class of supervised or self-supervised objectives that structure a representation space so that both instance-level similarity and structural label relationships, including hierarchies and multi-label overlap, are respected. In contrast to flat contrastive objectives, they explicitly encode semantic proximity and dissimilarity between data instances through hierarchy-aware positive/negative pair mining, level- or path-dependent weighting, adaptive regularization, and sometimes prototypes or auxiliary constraints. Recent work reports state-of-the-art results with such losses on tasks ranging from fine-grained remote sensing detection and biomedical retrieval to hierarchical text classification and protein–protein interaction classification, while addressing common challenges such as class imbalance, taxonomic consistency, and zero-shot generalization (Chen et al., 30 Dec 2025, Lan et al., 17 Apr 2026, Zhu et al., 2024, Zhang et al., 2022, Tian et al., 22 Jan 2025, Liu et al., 3 Jul 2025).

1. Core Formulation and Fundamental Principles

Hierarchical multi-label contrastive learning extends the InfoNCE/SupCon family by introducing pairwise similarity structures and sampling rules that reflect a target hierarchy $H = (\mathcal{L}, E)$ (a tree or DAG). Each data point $x_i$ is annotated with a multi-level or multi-path label set $Y_i \subset \mathcal{L}$, and the hierarchy enters the loss through two mechanisms:

Positive and negative sampling across hierarchy levels: For any anchor $i$, "positives" are points that share a parent node, stand in a sibling relationship, or share a sufficient fraction of labels, depending on the hierarchy's structure and depth.
Hierarchy-aware penalties: Additional weighting schemes are defined using tree distance, level depth, or path overlap, e.g., via a penalty function $\Delta(Y_i, Y_j)$ that reflects the depth of the lowest common ancestor or the fraction of shared ancestors.
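
As an illustration of the penalty $\Delta$ described above, the following is a minimal sketch that derives it from the depth of the lowest common ancestor (LCA). It assumes a tree-shaped hierarchy and one leaf label per instance, which is a simplification of the multi-label setting. The toy taxonomy `PARENT` and the helpers `path_to_root`, `depth`, `lca_depth`, and `delta` are hypothetical names introduced here, and the normalization to $[0, 1]$ is one possible choice rather than the definition used in any cited paper.

```python
# Hypothetical toy taxonomy stored as child -> parent (None marks the root).
PARENT = {
    "root": None,
    "vehicle": "root",
    "animal": "root",
    "car": "vehicle",
    "truck": "vehicle",
    "cat": "animal",
}


def path_to_root(node, parent=PARENT):
    """Return the nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def depth(node, parent=PARENT):
    """Depth of a node, with the root at depth 0."""
    return len(path_to_root(node, parent)) - 1


def lca_depth(a, b, parent=PARENT):
    """Depth of the lowest common ancestor of nodes a and b."""
    ancestors_a = set(path_to_root(a, parent))
    for node in path_to_root(b, parent):  # walks upward from b
        if node in ancestors_a:
            return depth(node, parent)
    raise ValueError("nodes do not share a root")


def delta(a, b, parent=PARENT):
    """Assumed penalty in [0, 1]: 0 for identical labels, 1 if only the root is shared."""
    max_depth = max(depth(a, parent), depth(b, parent))
    if max_depth == 0:
        return 0.0
    return 1.0 - lca_depth(a, b, parent) / max_depth


print(delta("car", "car"))    # 0.0
print(delta("car", "truck"))  # 0.5
print(delta("car", "cat"))    # 1.0
```

Siblings under the same parent receive a smaller penalty than pairs that only meet at the root, which is the ordering the hierarchy-aware weighting is meant to induce.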

One explicit formulation is:

$\mathcal{L}_{\mathrm{hier}} = -\sum_{i=1}^N \frac{1}{|P_i|} \sum_{p \in P_i} \log \frac{\exp\left(\frac{\mathrm{sim}(z_i, z_p)}{\tau} - \lambda \Delta(Y_i, Y_p)\right)}{\sum_{a \ne i} \exp\left(\frac{\mathrm{sim}(z_i, z_a)}{\tau} - \lambda \Delta(Y_i, Y_a)\right)}$

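where $z_i$ is the embedding of $x_i$, $\mathrm{sim}$ denotes cosine similarity, $\tau$ is the temperature, $P_i$ is the set of hierarchy-positive indices for anchor $i$, and $\lambda$ trades off embedding similarity against the hierarchy penalty $\Delta$ (Zhang et al., 2022).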
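
To make the formula concrete, here is a minimal pure-Python sketch that evaluates $\mathcal{L}_{\mathrm{hier}}$ for a small batch. It is meant for reading, not for training: a practical implementation would use a tensor library with automatic differentiation. The function names, the precomputed `penalty` matrix holding $\Delta(Y_i, Y_j)$, and the default values of `tau` and `lam` are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hierarchical_contrastive_loss(z, positives, penalty, tau=0.1, lam=1.0):
    """Evaluate the penalized contrastive loss for a batch.

    z          -- list of embedding vectors z_i
    positives  -- positives[i] is the list of indices P_i (must not contain i)
    penalty    -- penalty[i][j] holds Delta(Y_i, Y_j)
    """
    n = len(z)
    total = 0.0
    for i in range(n):
        if not positives[i]:
            continue  # anchors without positives contribute nothing
        # Penalized logit for every other sample a != i.
        logits = {
            a: cosine(z[i], z[a]) / tau - lam * penalty[i][a]
            for a in range(n)
            if a != i
        }
        # Numerically stable log of the denominator (log-sum-exp).
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average the log-ratio over the positives of anchor i.
        total -= sum(logits[p] - log_denom for p in positives[i]) / len(positives[i])
    return total


# Three identical embeddings; samples 0 and 1 are mutual positives.
z = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
positives = [[1], [0], []]
no_penalty = [[0.0] * 3 for _ in range(3)]
print(hierarchical_contrastive_loss(z, positives, no_penalty))
```

With identical embeddings and no penalty, each of the two anchors sees two equally likely candidates, so the printed value is $2 \log 2$. Assigning a positive $\Delta$ to the pair (0, 1) lowers its logit relative to the other candidate and raises the loss.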

Multiple approaches introduce learnable or fixed prototypes for classes at each level, using them both for contrastive similarity computation and to ensure balanced gradient flow across head/tail classes (Chen et al., 30 Dec 2025, Jiang et al., 19 Aug 2025).
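
The prototype idea can be sketched with a momentum (exponential moving average) update. The snippet below is an illustrative assumption, not the update rule of any cited method: it keeps one prototype per class at a single level, moves it toward the batch mean of that class, and re-normalizes to unit length. The names `l2_normalize`, `ema_update`, and the default `momentum` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def ema_update(prototypes, embeddings, labels, momentum=0.9):
    """Return updated prototypes; classes absent from the batch are left untouched."""
    sums, counts = {}, {}
    for z, y in zip(embeddings, labels):
        acc = sums.setdefault(y, [0.0] * len(z))
        for k, value in enumerate(z):
            acc[k] += value
        counts[y] = counts.get(y, 0) + 1

    updated = dict(prototypes)
    for y, acc in sums.items():
        mean = [s / counts[y] for s in acc]
        if y not in prototypes:
            updated[y] = l2_normalize(mean)  # first sighting: initialize from the batch
        else:
            mixed = [momentum * p + (1 - momentum) * m
                     for p, m in zip(prototypes[y], mean)]
            updated[y] = l2_normalize(mixed)
    return updated


prototypes = {"car": [1.0, 0.0], "ship": [0.0, 1.0]}
batch = [[0.0, 1.0], [0.0, 1.0]]
new = ema_update(prototypes, batch, ["car", "car"], momentum=0.5)
print(new["car"], new["ship"])
```

Because prototypes persist across minibatches, a rare class that is missing from the current batch (here `"ship"`) still has a stable reference point that can enter the contrastive denominator.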

2. Hierarchy-specific Loss Construction and Pairwise Weighting

Contrastive losses for hierarchical multi-label settings operationalize hierarchy incorporation via:

Level-wise losses: Separate contrastive loss terms are computed at each hierarchy level and aggregated via convex weighting. For instance, in multi-level supervised contrastive learning, a projection head per level allows level-specific positive mining: for anchor $i$ and level $\ell$, $P_i^{(\ell)}$ contains all $j \ne i$ such that $y_j^{(\ell)} = y_i^{(\ell)}$ (same parent), and temperature $\tau_\ell$ can be adapted per level (Ghanooni et al., 4 Feb 2025).
Hierarchy-penalty shaping: The loss may penalize positive pairs that diverge from the anchor higher up in the tree by functions of LCA depth, path difference, or even subtree sizes, resulting in adaptive penalties $\Delta(Y_i, Y_p)$ that guide the learning to preserve close affinity only for close taxonomic relatives (Zhang et al., 2022, Kokilepersaud et al., 2024).
Prototype-based architectures: Some methods maintain momentum-updated or perturbed prototypes for each class at each level, using them as "anchor" points in the embedding space and including them even for rare/unseen classes in each minibatch. Denominator balancing (averaging negative mass by class size) guarantees that tail classes are not overwhelmed by head classes (Chen et al., 30 Dec 2025, Jiang et al., 19 Aug 2025).
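
The level-wise construction can be sketched as follows. This is a simplified illustration under stated assumptions: each level supplies its own precomputed cosine-similarity matrix (standing in for a per-level projection head), positives at a level are samples with the same label at that level, and the per-level loss is averaged over anchors that have at least one positive. The names `supcon_level_loss` and `multilevel_loss` are hypothetical.

```python
import math


def supcon_level_loss(sim, labels, tau):
    """SupCon-style loss at one hierarchy level, averaged over valid anchors."""
    n = len(labels)
    total, anchors = 0.0, 0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue  # no positive at this level for anchor i
        logits = [sim[i][a] / tau for a in range(n) if a != i]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits))
        total -= sum(sim[i][p] / tau - log_denom for p in pos) / len(pos)
        anchors += 1
    return total / max(anchors, 1)


def multilevel_loss(sims, labels_per_level, taus, weights):
    """Convex combination of per-level losses (weights are non-negative and sum to 1)."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    return sum(
        w * supcon_level_loss(s, y, t)
        for w, s, y, t in zip(weights, sims, labels_per_level, taus)
    )


# Two levels: coarse labels group samples 0 and 1; fine labels are all distinct.
sim = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
coarse = ["a", "a", "b"]
fine = ["a1", "a2", "b1"]
print(multilevel_loss([sim, sim], [coarse, fine], taus=[1.0, 1.0], weights=[0.5, 0.5]))
```

In this toy batch the fine level has no positive pairs and contributes zero, so the total is half of the coarse-level loss. Per-level temperatures and weights are the knobs that the level-wise methods tune.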

A key innovation is decoupled query design—especially in detection models—where classification and localization tasks split their query sets to avoid semantic entanglement, further improved by introducing balanced hierarchical contrastive losses (Chen et al., 30 Dec 2025).

3. Multi-Level and Multi-Label Sampling Strategies

Positive and negative pair construction is central to effective hierarchy-aware contrastive learning:

Level-wise sampling: For each anchor, positives are drawn based on hierarchy-level membership; e.g., positives share a parent at one level but differ at a finer level (Ghanooni et al., 4 Feb 2025).
Set-overlap or taxonomic proximity: In biomedical retrieval and multi-label scenarios, positive and negative sets use label set similarity (e.g., Jaccard or depth-weighted cosine on expanded ancestor label vectors). Positives are those with overlap greater than a threshold; negatives have zero overlap (Lan et al., 17 Apr 2026).
Label-path and multi-path annotation: In text domains (HMTC), positive pairs arise from identical hierarchy-path prefixes, and weights for pair losses are proportional to the depth or closeness of label set matches (Yu et al., 2023).
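
The set-overlap strategy can be illustrated with a short sketch that expands each label set with its ancestors and then mines pairs by Jaccard similarity. The label names, the `parent` mapping, the threshold of 0.5, and the function names are hypothetical. Top-level categories have no parent entry, so no universal root is added and unrelated pairs can reach zero overlap.

```python
def expand_with_ancestors(labels, parent):
    """Return the label set together with all of its ancestors."""
    expanded = set()
    for node in labels:
        while node is not None:
            expanded.add(node)
            node = parent.get(node)  # None once a top-level category is reached
    return expanded


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mine_pairs(label_sets, parent, pos_threshold=0.5):
    """Positives: overlap above the threshold. Negatives: zero overlap."""
    expanded = [expand_with_ancestors(y, parent) for y in label_sets]
    positives, negatives = [], []
    for i in range(len(expanded)):
        for j in range(i + 1, len(expanded)):
            score = jaccard(expanded[i], expanded[j])
            if score > pos_threshold:
                positives.append((i, j))
            elif score == 0.0:
                negatives.append((i, j))
            # Pairs with small but non-zero overlap are left unused.
    return positives, negatives


parent = {
    "lung_cancer": "cancer",
    "breast_cancer": "cancer",
    "asthma": "respiratory_disease",
}
label_sets = [
    {"lung_cancer"},
    {"lung_cancer", "breast_cancer"},
    {"asthma"},
    {"breast_cancer"},
]
positives, negatives = mine_pairs(label_sets, parent)
print(positives)  # [(0, 1), (1, 3)]
print(negatives)  # [(0, 2), (1, 2), (2, 3)]
```

Documents 0 and 3 share only the ancestor `cancer` (Jaccard 1/3), so they are neither positive nor negative, which shows how ancestor expansion creates a graded middle ground between the two sets.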

Sampling and weighting are often tuned by hyperparameters or data-driven schedules (e.g., per-level contribution weights, depth-based penalty scaling, ancestor expansion).

4. Joint Objectives, Constraints, and Optimization

Hierarchical multi-label contrastive loss functions are generally part of composite objectives:

Hybrid and multi-task losses: Contrastive terms are combined with per-level or per-node classification cross-entropy, with losses often weighted by adaptive or learnable factors (Tian et al., 22 Jan 2025, Jiang et al., 19 Aug 2025).
Hierarchy constraint enforcement: Some techniques introduce hinge-style "max-hierarchy" penalties, such that coarse-level grouping cannot be tighter in the embedding space than fine-level grouping. This can be formalized as a maximum-loss margin constraint across levels (Zhang et al., 2022, Liu et al., 3 Jul 2025).
Prototype updating and balancing: Exponential moving average (EMA) or adaptive perturbations are used for robust prototype maintenance, essential for overcoming class imbalance and supporting rare categories (Chen et al., 30 Dec 2025, Jiang et al., 19 Aug 2025).
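
One way to read the max-hierarchy constraint is as a running maximum over per-level losses, ordered from the finest to the coarsest level, so that a coarser level is never assigned a smaller loss than any finer one. The sketch below implements only that reading. The function name and the fine-to-coarse ordering are assumptions, and the cited methods may apply the constraint per pair rather than per level.

```python
def enforce_hierarchy(level_losses):
    """Clip losses so they are non-decreasing from the finest to the coarsest level.

    level_losses -- per-level loss values, ordered fine -> coarse
    """
    constrained = []
    running_max = float("-inf")
    for loss in level_losses:
        running_max = max(running_max, loss)
        constrained.append(running_max)
    return constrained


# The middle level is lifted to the loss of the finer level before it.
print(enforce_hierarchy([0.4, 0.3, 0.9]))  # [0.4, 0.4, 0.9]
```

Under this reading, the optimizer cannot make coarse groups tighter than fine groups, because any coarse-level loss that drops below a finer-level loss is replaced by the larger value.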

Optimization is typically performed with AdamW or SGD, often using batch-wise negatives and in-batch positive mining. Per-level loss weights are sometimes adjusted according to each level's convergence rate to counter multi-task optimization bias (the "one-strong-many-weak" problem) (Jiang et al., 19 Aug 2025).
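
A simple form of such dynamic weighting is a softmax over the current per-level losses, which shifts weight toward levels that are still far from converged. The sketch below is an assumption about how this could be done, not the schedule of any cited method. The function name and the `temperature` parameter are hypothetical.

```python
import math


def softmax_task_weights(task_losses, temperature=1.0):
    """Weights that sum to 1 and grow with the current loss of each task."""
    m = max(task_losses)  # subtract the maximum for numerical stability
    exps = [math.exp((loss - m) / temperature) for loss in task_losses]
    total = sum(exps)
    return [e / total for e in exps]


# The slowest-converging level (largest loss) receives the largest weight.
weights = softmax_task_weights([0.2, 1.5, 0.7])
print(weights)
```

A higher `temperature` flattens the weights toward uniform, while a lower one concentrates training on the single hardest level. In a training loop the weights would typically be treated as constants when backpropagating, which is a design choice rather than a requirement.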

5. Empirical Results and Applications

Hierarchical multi-label contrastive objectives have been reported to reach state-of-the-art results, or to improve measurably over flat baselines, in several settings:

Domain	Dataset/Task	Key Empirical Finding
Remote sensing	FAIR1M, OrientedFormer	Balanced hierarchical loss reverses degradation from unbalanced HCL; +0.28 AP₅₀ (Chen et al., 30 Dec 2025)
Biomedical IR	PubMed, MeSH	NDCG@10 rises from 0.529 to 0.543 (+2.6%), strong gains in QA recall (Lan et al., 17 Apr 2026)
Text HTC/HMTC	WOS, NYTimes, RCV1-v2	+1–2 pts Macro-F1, better generalization, improved multi-path consistency (Zhu et al., 2024, Yu et al., 2023)
Image/Audio	OrchideaSOL, CIFAR-100	Gains in MNR, NDCG, and retrieval metrics; enhanced zero-shot transfer (Tian et al., 22 Jan 2025, Ghanooni et al., 4 Feb 2025)
Proteomics	SHS27k, cross-organism PPI	Hierarchical penalty boosts hard-case F₁ up to +15%, +5–12% micro-F₁ in zero-shot transfer (Liu et al., 3 Jul 2025)

These improvements are attributed mainly to the ability of hierarchy-aware objectives to capture multi-granular semantic structure, counter class imbalance, and align fine- and coarse-level relationships. Ablations consistently show that removing hierarchical penalties or using flat contrastive losses degrades representation structure, retrieval, and rare-class accuracy.

6. Key Implementation and Theoretical Considerations

Intrinsic challenges and solutions addressed in the literature include:

Class imbalance: Head/tail disparity in real label hierarchies—especially in remote sensing and biomedical data—necessitates balancing terms (prototype inclusion/denominator averaging) to ensure rare classes are not ignored (Chen et al., 30 Dec 2025).
Structural consistency: Metrics such as Hierarchical Violation Rate (HVR) quantify the degree to which predictions respect parent–child consistency, providing a direct measure of effectiveness (Jiang et al., 19 Aug 2025).
Adaptive and data-driven weighting: Loss weights per hierarchy level are often set inversely proportional to class frequency or dynamically (e.g., via softmax over per-task losses) to focus learning where most needed and avoid domination by trivial or fast-converging tasks (Chen et al., 30 Dec 2025, Jiang et al., 19 Aug 2025).
Generalization and zero-shot transfer: By enforcing hierarchy-aware embedding layouts, models can generalize to unseen fine-grained labels via shared ancestry, supporting robust zero-shot or low-data tasks (Tian et al., 22 Jan 2025, Liu et al., 3 Jul 2025).
Robustness and regularization: Controlled prototype perturbation, entropy minimization of hierarchical codes, and cross-modal/symmetric alignments all aim to produce more stable, information-preserving embeddings (Zhu et al., 2024, Jiang et al., 19 Aug 2025, Liu et al., 3 Jul 2025).
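
A structural-consistency metric of this kind can be sketched as follows. The definition used here, the fraction of samples whose predicted label set contains a node without its parent, is an assumption made for illustration, and the exact definition of HVR in the cited work may differ. The `parent` mapping and label names are hypothetical.

```python
def hierarchical_violation_rate(predictions, parent):
    """Fraction of samples whose predicted set contains a node but not its parent."""
    if not predictions:
        return 0.0
    violations = 0
    for predicted in predictions:
        if any(
            parent.get(node) is not None and parent[node] not in predicted
            for node in predicted
        ):
            violations += 1
    return violations / len(predictions)


parent = {"car": "vehicle", "cat": "animal"}
predictions = [
    {"vehicle", "car"},  # consistent
    {"car"},             # violation: parent "vehicle" is missing
    {"animal"},          # consistent: top-level label only
    {"vehicle", "cat"},  # violation: "cat" predicted without "animal"
]
print(hierarchical_violation_rate(predictions, parent))  # 0.5
```

A value of 0 means every predicted child is accompanied by its parent. Tracking this alongside accuracy separates gains in raw classification from gains in taxonomic consistency.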

Limitations noted include the reliance on externally supplied, well-annotated hierarchies, the potential to inherit bias from them, and the need for further exploration in DAG or multi-modal contexts (Zhang et al., 2022).

7. Active Research Directions and Extensions

Current research continues to expand the expressiveness and applicability of hierarchical multi-label contrastive losses:

Fine-grained transformer architectures: Decoupled query design and graph-augmented encoders advance the flexibility of attention-based models in hierarchical scenarios, especially in remote sensing detection (Chen et al., 30 Dec 2025, Kumar et al., 2024).
Cross-modal and multi-modal hierarchies: BioHiCL and HIPPO extend contrastive objectives to handle text, protein sequences, and structured annotations, laying groundwork for unified, cross-domain models (Lan et al., 17 Apr 2026, Liu et al., 3 Jul 2025).
Constraint relaxation and hybridization: Hybrid loss frameworks couple contrastive, cross-entropy, focal, and ranking losses, with empirical evidence that such combinations outperform either component alone (Tian et al., 22 Jan 2025, Kokilepersaud et al., 2024).
Hierarchy discovery and learning: Emerging lines of work propose learning the hierarchy end-to-end from data, including extension to non-tree, DAG, or even implicit relational structures (Zhang et al., 2022).

The trajectory of research suggests ongoing expansion toward more flexible, data-driven, and generalizable forms of hierarchical multi-label contrastive learning, especially for applications in low-resource, cross-domain, and multi-modal settings.