
Hierarchical Contrastive Loss (HCL)

Updated 23 February 2026
  • Hierarchical Contrastive Loss (HCL) is a framework that integrates multi-level hierarchical structures into contrastive learning to enhance semantic representations.
  • It employs level-dependent weighting and ordering constraints to enforce similarity and dissimilarity at multiple granularities.
  • HCL has been effectively applied in computer vision, NLP, graph learning, time series, and remote sensing to achieve improved accuracy and robustness.

Hierarchical Contrastive Loss (HCL) is a family of objectives that generalizes standard contrastive learning methods to explicitly account for the presence of hierarchical or multi-scale structure in data. HCL frameworks are designed to facilitate the learning of representations that respect multiple levels of semantic, structural, or temporal organization, surpassing the limitations of flat instance-level or class-level contrastive losses. HCL has been instantiated across a wide range of domains including computer vision, graph learning, natural language processing, time series analysis, and remote sensing.

1. Mathematical Formulation and Core Principles

The central insight of HCL is to structure the contrastive objective so that similarity and dissimilarity between samples are enforced not only at the "flat" instance or class level, but also at multiple levels of granularity or hierarchy. This is achieved by:

  • Defining a set of levels $\ell = 0, \dots, L$, corresponding to progressively coarser or more abstract groupings (e.g., leaf class, parent, and super-parent in label trees; sequence segment vs. full sequence in NLP; multiscale graph pools in GNNs).
  • At each level, constructing positive pairs whose lowest common ancestor (LCA) in the hierarchy is at depth $\ell$, and weighting their contribution to the loss by level-dependent penalties or constraints.
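As a concrete illustration of level-wise positive construction, the sketch below builds per-level positive masks from root-to-leaf label paths; the path encoding and helper name are assumptions for illustration, not taken from any of the cited papers.

```python
import numpy as np

def positives_at_level(paths, level):
    # Boolean mask of pairs sharing the same ancestor at the given depth,
    # i.e. whose root-to-leaf label paths agree on the first `level`+1 nodes.
    n = len(paths)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and paths[i][:level + 1] == paths[j][:level + 1]:
                mask[i, j] = True
    return mask

# Toy two-level taxonomy: (superclass, class)
paths = [("animal", "dog"), ("animal", "cat"), ("vehicle", "car")]
coarse = positives_at_level(paths, 0)  # positives share the superclass
fine = positives_at_level(paths, 1)    # positives share the leaf class
```

One such mask per level then drives the per-level loss terms described below.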

For example, the general multi-level supervised contrastive form as in "Use All The Labels" (Zhang et al., 2022):

$$\mathcal{L}_{\mathrm{HiMulCon}} = \frac{1}{L+1} \sum_{\ell=0}^{L} \sum_{i\in A} \left( -\frac{\lambda_\ell}{|P_\ell(i)|} \sum_{p\in P_\ell(i)} \log\frac{\exp(f_i \cdot f_p/\tau)}{\sum_{a\in A\setminus\{i\}} \exp(f_i \cdot f_a/\tau)} \right)$$

where $P_\ell(i)$ is the set of positives for anchor $i$ at level $\ell$ (samples sharing exactly the same ancestor at that level), $\lambda_\ell$ is a level weight that increases monotonically with $\ell$, and $f_i$ are normalized feature embeddings.
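A minimal NumPy sketch of this level-weighted loss, assuming precomputed per-level positive masks; names, shapes, and the temperature value are illustrative, not the paper's code.

```python
import numpy as np

def himulcon_loss(feats, level_masks, lambdas, tau=0.1):
    # feats: (n, d) L2-normalized embeddings; level_masks[l]: (n, n) boolean
    # positive mask for level l; lambdas[l]: level weight, increasing in l.
    n = feats.shape[0]
    sim = feats @ feats.T / tau
    off_diag = ~np.eye(n, dtype=bool)
    total = 0.0
    for lam, pos in zip(lambdas, level_masks):
        for i in range(n):
            p = np.flatnonzero(pos[i])
            if p.size == 0:
                continue  # anchor has no positives at this level
            # log-sum-exp denominator over all a != i, as in the formula
            denom = np.log(np.exp(sim[i][off_diag[i]]).sum())
            total += -(lam / p.size) * (sim[i, p] - denom).sum()
    return total / len(level_masks)
```

Each inner term is the usual supervised-contrastive log-ratio; the only hierarchical ingredients are the per-level masks and the weights $\lambda_\ell$.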

Hierarchy constraints can also enforce distance ordering (HiConE), ensuring that samples grouped more specifically (deeper in the tree) are embedded closer together than those grouped coarsely. The combination form (HiMulConE) jointly applies level weighting and order constraints.
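The distance-ordering constraint can be sketched as a simple hinge on pairwise distances; the function name and margin parameter are illustrative, not HiConE's exact formulation.

```python
import numpy as np

def ordering_penalty(d_fine, d_coarse, margin=0.0):
    # Hinge penalty: a pair whose lowest common ancestor is deeper (finer)
    # should be embedded at least as close as a pair that only shares a
    # coarser ancestor; violations of d_fine <= d_coarse - margin are
    # penalized linearly.
    return np.maximum(0.0, d_fine - d_coarse + margin)
```

When the ordering already holds (the finer pair is closer), the penalty is zero, so the term only activates on hierarchy violations.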

Contrastive objectives can be extended to multi-level settings with other forms, such as ratio-based losses across network outputs at multiple resolutions (2212.11473), or contrastive regularization using taxonomy-aware negative weighting (Kokilepersaud et al., 2024).

2. Operationalization Across Modalities

2.1 Computer Vision

In fine-grained and hierarchical vision tasks, HCL is used to enforce inter-class separation and intra-class consistency at every hierarchical level, from object parts up to whole-entity categories. Papers such as "H3Former: Hypergraph-based Semantic-Aware Aggregation via Hyperbolic Hierarchical Contrastive Loss for Fine-Grained Visual Classification" introduce Hyperbolic HCL (HHCL), which combines Euclidean and hyperbolic distances at each semantic granularity, along with a partial order preservation term to strengthen parent–child consistency in hyperbolic space (Zhang et al., 13 Nov 2025).

Hierarchy-aware contrastive variants have also been proposed for tasks requiring taxonomic awareness, e.g., "Taxes Are All You Need" (Kokilepersaud et al., 2024), which up-weights the penalty on 'taxonomic negatives'—samples sharing a coarse semantic label but differing in fine class—thus encouraging discriminability both within and across hierarchical superclasses.

2.2 Natural Language Processing

HCL is applied in hierarchical text classification ("HILL: Hierarchy-aware Information Lossless Contrastive Learning for Hierarchical Text Classification" (Zhu et al., 2024)) by constructing positive pairs using both a semantic encoder and a structure encoder. The latter injects label hierarchy via a coding tree, whose construction is guided by structural entropy minimization. The HCL objective here is the NT-Xent (normalized temperature-scaled cross entropy) loss over semantic–hierarchy positive pairs, with batch negatives.
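A simplified, one-directional NT-Xent over paired semantic/structure embeddings might look as follows; this is a sketch under assumed names, not HILL's implementation, and the full NT-Xent is typically symmetrized over both views.

```python
import numpy as np

def nt_xent(z_sem, z_struct, tau=0.5):
    # Row i of z_sem (semantic encoder output) is positive with row i of
    # z_struct (structure encoder output); all other in-batch rows of
    # z_struct serve as negatives.
    z_sem = z_sem / np.linalg.norm(z_sem, axis=1, keepdims=True)
    z_struct = z_struct / np.linalg.norm(z_struct, axis=1, keepdims=True)
    sim = z_sem @ z_struct.T / tau  # (n, n) cross-view similarities
    # Softmax cross-entropy with the matching index as the target
    logits = sim - sim.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

The loss is minimized when each semantic embedding is most similar to its own structure-encoder counterpart.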

For unsupervised sentence representations, HCL interpolates between local (segment-level) and global (sequence-level) InfoNCE losses, facilitating improved sample discrimination and reduced complexity (Wu et al., 2023).

2.3 Graph Representation Learning

Hierarchical graph contrastive learning formalizes positive and negative sample construction at multiple coarsenings of the graph structure. The HCL framework in "HCL: Improving Graph Representation with Hierarchical Contrastive Learning" (Wang et al., 2022) defines a contrastive MI-maximizing loss at each level of a hierarchically pooled graph, fused by scale-dependent weighting. This approach yields improved representations for node-level, cluster-level, and graph-level tasks.

Multi-view graph scenarios employ hierarchy-informed negative sampling, as in Neighbor Hierarchical Sifting (NHS) (Ai et al., 2024), where candidate negatives are filtered based on first-order adjacency, multi-view similarity, and attribute similarity. This structure-guided sifting reduces false negatives and aligns embedding separation with semantic structure.
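The sifting idea can be illustrated schematically; the specific criteria, names, and threshold below are assumptions for illustration, not NHS's exact procedure.

```python
import numpy as np

def sift_negatives(anchor, adj, cross_view_sim, sim_threshold=0.9):
    # Candidate negatives for `anchor` exclude: the anchor itself,
    # first-order neighbors (adjacency), and nodes whose cross-view
    # similarity exceeds the threshold (likely false negatives).
    n = adj.shape[0]
    keep = []
    for j in range(n):
        if j == anchor or adj[anchor, j] or cross_view_sim[anchor, j] > sim_threshold:
            continue
        keep.append(j)
    return keep
```

Only the surviving indices are then used as negatives in the contrastive denominator for that anchor.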

2.4 Remote Sensing and Detection

"Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images" (Chen et al., 30 Dec 2025) extends HCL by introducing EMA-updated class prototypes for every hierarchy level, ensuring that each class—regardless of frequency—contributes equally to the contrastive gradients. The loss balances classes by re-weighting negative terms inversely by class occupancy, and is evaluated on each decoder layer separately, supporting fine-grained detection tasks with long-tailed label distributions.
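The EMA prototype bookkeeping can be sketched as follows, with assumed shapes and momentum value; this is not the paper's implementation.

```python
import numpy as np

def ema_prototype_update(prototypes, feats, labels, momentum=0.99):
    # Exponential-moving-average update of one unit-norm prototype per
    # class from the current batch; classes absent from the batch keep
    # their previous prototype unchanged.
    for c in np.unique(labels):
        batch_mean = feats[labels == c].mean(axis=0)
        prototypes[c] = momentum * prototypes[c] + (1 - momentum) * batch_mean
        prototypes[c] /= np.linalg.norm(prototypes[c])
    return prototypes
```

Because prototypes persist across batches, rare classes retain a stable anchor in the contrastive loss even when they appear infrequently.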

2.5 Time Series Anomaly Detection

In the multivariate time series domain, HCL has been employed to capture multi-scale consistency across measurement, sample, channel, and process levels, finding anomalies by examining systematic deviations from multi-level data associations (Sun et al., 2024). This approach has been shown to surpass previous models in F1 performance across industrial datasets.

3. Hierarchical Structure Construction and Sample Formulation

Construction of the hierarchy depends on the data domain:

  • Label Tree / Taxonomy: Common in supervised tasks; hierarchy can be induced from domain knowledge, taxonomies, or via clustering centroid assignment (e.g., CHIP (Mittal et al., 2023)).
  • Structural or Coding Trees: Built by minimizing structural entropy or by hyperbolic K-means clustering for unsupervised settings (Wei et al., 2022).
  • Multiscale Network Outputs: Used in neural networks operating on images or sequences, where outputs at different resolutions (coarse-to-fine) yield hierarchical exemplar sets (2212.11473).

In batch-wise implementation, positive and negative sets are defined per hierarchy level or cross-level, and losses are computed accordingly, often with per-level weighting or balancing.

4. Integration with Other Learning Objectives and Optimization

HCL is typically integrated as an additive regularizer to existing objectives such as reconstruction (e.g., Charbonnier loss in dehazing (2212.11473)), cross-entropy, or standard supervised contrastive loss. In semi-/self-supervised contexts, as in S5CL (Tran et al., 2022), HCL unifies branch-specific contrastive objectives for labeled, unlabeled, and pseudo-labeled data, assigning distinct temperatures to induce an ordering of intra-instance, intra-class, and inter-class distances.
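Schematically, the additive integration reduces to a weighted sum; the names and the trade-off coefficient below are hypothetical.

```python
def combined_objective(task_loss, hcl_levels, level_weights, alpha=0.5):
    # Primary task loss (e.g., cross-entropy or reconstruction) plus a
    # level-weighted sum of HCL terms, scaled by a global trade-off alpha.
    return task_loss + alpha * sum(w * l for w, l in zip(level_weights, hcl_levels))
```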

Sophisticated adaptive weighting schemes, instance-prototype alignment (e.g., hyperbolic prototype losses (Wei et al., 2022)), and negative-sample refinement (e.g., NHS (Ai et al., 2024); taxonomy-weighted negatives (Kokilepersaud et al., 2024)) further tailor the HCL objective to challenge-specific requirements.

Optimization strategies remain standard, but algorithms carefully manage numerics associated with, e.g., hyperbolic Riemannian gradients (Ge et al., 2022), batch prototype updates (Chen et al., 30 Dec 2025), and temperature annealing across levels.

5. Theoretical Properties and Empirical Insights

HCL procedures are supported by both theoretical and empirical evidence:

  • Information Preservation: Lossless positive augmentation (as in HILL) guarantees retention of maximal mutual information between input and target, eliminating information drop-out from arbitrary data augmentation (Zhu et al., 2024).
  • Hierarchical Distance Consistency: Explicitly enforced ordering of embeddings by specificity preserves hierarchy in latent space and improves generalization, clustering, and retrieval (see macro F1/precision/recall gains in (Zhang et al., 2022, Wu et al., 2023)).
  • Mitigating Class Imbalance: Balanced hierarchical objectives with per-class gradient normalization ensure that rare classes do not "vanish" during optimization (Chen et al., 30 Dec 2025).
  • Empirical Superiority: HCL variants consistently outperform flat or supervised-contrastive baselines across diverse metrics—classification accuracy, F1 score, clustering NMI/ARI, and mAP—on fine-grained, long-tailed, zero-shot, and anomaly detection tasks.

Ablation studies across contrastive levels, pooling strategies, and prototype integration confirm that performance gains arise from respecting hierarchical structure, not merely from added loss terms.

6. Current Limitations and Open Challenges

Open issues documented in the literature include:

  • Hyperparameter Sensitivity: Level penalties, temperature schedules, prototype learning rates, and negative sampling thresholds remain empirical; robustness to these design choices is not always established (Ai et al., 2024, Chen et al., 30 Dec 2025).
  • Computational Complexity: Some forms of HCL involve quadratic pairwise computations (e.g., exhaustive similarity matrices) or extra per-level/prototype storage, which may become prohibitive on large-scale or deep hierarchies (Ai et al., 2024, Wei et al., 2022).
  • Hierarchy Construction: Automated or learned hierarchy induction may introduce noise or sub-optimal partitioning, especially in weakly-structured or unsupervised domains.
  • Extension to Non-Tree Structures: Most HCL formulations assume a tree or strict level structure; extension to DAGs or cyclic label graphs requires further methodological development.
  • Interference with Other Tasks: For tasks where multiple objectives compete (e.g., class-agnostic localization vs. semantic contrastive grouping), decoupled optimization or composite strategies are required (Chen et al., 30 Dec 2025).

Research continues on dynamic level weighting, multi-modal HCL, cross-domain transfer, and integrating hyperbolic and Euclidean embedding schemes for more general relational data.

7. Representative Instantiations and Performance Summary

The following table summarizes key HCL variants and their domain of application:

| Paper Title/ID | Domain | HCL Principle |
|---|---|---|
| "Use All The Labels" (Zhang et al., 2022) | Vision, ML | Level-weighted, hierarchy-enforcing |
| "Taxes Are All You Need" (Kokilepersaud et al., 2024) | Vision | Taxonomy-weighted negatives |
| "H3Former" (Zhang et al., 13 Nov 2025) | Fine-grained vision | Hyperbolic, multi-level, partial-order preservation penalty |
| "HILL" (Zhu et al., 2024) | NLP | NT-Xent on semantic+structure pairs |
| "HiCL" (Wu et al., 2023) | Sentence embeddings | Combined segment/global InfoNCE |
| "Restoring Vision in Hazy Weather" (2212.11473) | Image restoration | Cross-scale positive/negative ratios |
| "HCL: Improving Graph Representation" (Wang et al., 2022) | Graphs | Multiscale MI maximization |
| "BHCL with Decoupled Queries" (Chen et al., 30 Dec 2025) | Remote sensing | Balanced prototypes, multi-level |
| "Neighbor Hierarchical Sifting" (Ai et al., 2024) | Graph/NLP | Attribute/graph-guided negative filtering |
| "Hyperbolic Hierarchical Contrastive Hashing" (Wei et al., 2022) | Hashing | Hyperbolic instance- and prototype-wise |
| "Hyperbolic Contrastive Learning" (Ge et al., 2022) | Vision | Scene–object hierarchy in Poincaré ball |
| "CHIP" (Mittal et al., 2023) | Few-shot image classification | Level-hinge contrastive per encoder |
| "S5CL" (Tran et al., 2022) | Semi-supervised vision | Labeled/unlabeled/pseudo triple loss |

In all cases, empirical results demonstrate that HCL leads to more semantically and structurally faithful embeddings, with downstream task gains documented across benchmarks.


HCL represents a broad, unifying objective class for hierarchy- and multi-scale-aware contrastive learning, deploying domain-specific formulations to formalize and exploit the structure of real-world data. Its ongoing development continues to shift the standard for representation learning in structured and hierarchical domains.
