Modality-Aware Hierarchical Contrastive Learning

Updated 22 November 2025
  • The paper introduces MHCL, which treats hierarchical data as a formal modality to enhance contrastive learning by modulating loss functions.
  • It employs hierarchy-aware sampling and dynamic margin scheduling to structure embedding spaces from coarse global to fine local details.
  • MHCL demonstrates significant improvements in clustering, classification, and multi-modal retrieval across vision, recommendation, and 3D data tasks.

Modality-aware Hierarchical Contrastive Learning (MHCL) is a paradigm within representation learning that incorporates explicit hierarchical or multi-level structure as an additional modality in the process of contrastive self-supervised or multi-modal representation learning. By treating hierarchy, item-item semantic structure, or ordinal feature relations as a formal "modality"—analogous to text, image, or point cloud signals—MHCL enables models to encode both intra-modal and cross-modal conceptual organization, going beyond naive instance discrimination or unimodal contrastive paradigms. The methodology has been instantiated in diverse applications including hierarchical vision datasets, multimodal recommendation, large-scale human-centric pretraining, and hyperbolic multi-modal architectures spanning text, 2D, and 3D data (Bhalla et al., 6 Jan 2024, Zhang et al., 2021, Hong et al., 2022, Liu et al., 4 Jan 2025).

1. Concept and Scope of Modality-Aware Hierarchical Contrastive Learning

MHCL formalizes data hierarchy—whether explicit (as in spatial or semantic trees) or latent (as in modality-specific affinity structures)—as a first-class signal within contrastive learning. In settings such as the WikiScenes cathedral dataset, each data point (e.g., image) is indexed not only by its raw content, but also by a position in a hierarchy (e.g., “Interior → Nave → Altar”). MHCL utilizes this hierarchical index as a form of weak supervision: samples closer in the tree are treated as more semantically similar, while their separation within the hierarchy modulates the contrastive learning process (Bhalla et al., 6 Jan 2024). In multi-modal contexts (e.g., human-centric or point cloud data), hierarchical structure manifests both within and across modalities (dense/sparse views, part-whole organization, text-image-3D relations) and is encoded via hierarchical or compositional contrastive objectives (Hong et al., 2022, Liu et al., 4 Jan 2025).
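As a toy illustration of this weak-supervision signal, the depth of the lowest common ancestor of two hierarchy paths can serve as a tree-proximity score. The `lca_depth` helper and the path tuples below are hypothetical, not taken from the papers:

```python
def lca_depth(path_a, path_b):
    """Depth of the lowest common ancestor of two hierarchy paths."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth

# Example paths in a WikiScenes-style tree (illustrative names only).
p1 = ("Interior", "Nave", "Altar")
p2 = ("Interior", "Nave", "Pulpit")
p3 = ("Exterior", "Facade")

# Samples sharing a deeper common ancestor are treated as more similar.
assert lca_depth(p1, p2) == 2   # share "Interior/Nave"
assert lca_depth(p1, p3) == 0   # diverge at the root
```

In this view, the hierarchy index supplies graded similarity labels without any manual annotation: the deeper the shared ancestor, the stronger the positive signal fed to the contrastive objective.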

2. Model Architectures and Hierarchical Encoding Strategies

MHCL frameworks adapt their architectures to the target domain and modality composition:

  • MHCL with Explicit Tree Structure (e.g., WikiScenes): Employs a visual backbone (e.g., VGG-16, with frozen convolutional layers), with latent representations modulated only via hierarchy-aware sampling and loss scheduling. No dedicated hierarchy encoder is used; instead, the role of hierarchy is realized through informed triplet selection and adaptive margin parameters, structuring the embedding space according to the data's tree (Bhalla et al., 6 Jan 2024).
  • Multimodal Hierarchical Design (HCMoCo, MICRO): Separate encoders per modality (RGB, depth, text, keypoints, etc.) project into a shared latent space; hierarchy appears in the structure-mining phase (modality-aware affinity graphs) and/or in task-specific or cross-level contrastive losses (Hong et al., 2022, Zhang et al., 2021).
  • Hyperbolic MHCL for 3D and Cross-modal Alignment: Integrates text, image, and 3D point cloud encoders, projecting all outputs onto a shared hyperbolic manifold (Lorentz model) via exponential map to encode both intra-modal and inter-modal hierarchy. Hierarchy is further modulated by regularizers that capture entailment, centroid gaps, and part-whole semantic relationships (Liu et al., 4 Jan 2025).
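The Lorentz-model projection can be sketched numerically. The exponential map at the origin below follows the standard curvature −1 formulation; `lorentz_expmap0` and `lorentz_inner` are illustrative names, and the papers' actual implementations may differ:

```python
import numpy as np

def lorentz_expmap0(v, eps=1e-9):
    """Exponential map at the Lorentz-model origin (curvature -1).

    Maps a Euclidean feature vector v (tangent at the origin) onto the
    hyperboloid {x : -x0^2 + ||x_1:||^2 = -1}, a generic way to project
    per-modality encoder outputs into a shared hyperbolic space.
    """
    r = np.linalg.norm(v)
    if r < eps:
        return np.concatenate(([1.0], np.zeros_like(v)))
    return np.concatenate(([np.cosh(r)], np.sinh(r) * v / r))

def lorentz_inner(x, y):
    """Minkowski inner product <x, y>_L = -x0*y0 + <x_1:, y_1:>."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

v = np.array([0.3, -0.5, 0.2])
x = lorentz_expmap0(v)
# Points on the hyperboloid satisfy <x, x>_L = -1.
assert abs(lorentz_inner(x, x) + 1.0) < 1e-6
```

Once embeddings from every modality live on the same hyperboloid, hyperbolic distances (and the entailment-cone regularizers described below) can compare and constrain them directly.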

3. Hierarchical Contrastive Learning Objectives and Losses

A defining feature of MHCL is the modulation of contrastive objectives by hierarchy:

  • Triplet Margin Loss with Level-specific Scheduling: For anchor $x_a$, positive $x_p$ (same or descendant node), and negative $x_n$ (sibling node), the loss is

$$\mathcal{L}(x_a, x_p, x_n; f) = \max\left\{ 0,\; \|f(x_a) - f(x_p)\|_2^2 - \|f(x_a) - f(x_n)\|_2^2 + \alpha(h) \right\}$$

where $\alpha(h)$ is a margin that decreases with hierarchy depth, promoting broader cluster separation at higher levels and finer local structure at the leaves (Bhalla et al., 6 Jan 2024).

  • Multilevel Hierarchical Losses (HCMoCo):
    • Sample-level InfoNCE on global pooled representations ($\mathcal{L}_g$),
    • Dense intra-sample contrastive loss ($\mathcal{L}_d$) aligning structured pixel maps between paired modalities with spatially weighted soft assignments,
    • Sparse structure-aware contrastive loss ($\mathcal{L}_s$) for anchor-keypoint correspondences.
    • The final loss combines all levels: $\mathcal{L} = \lambda_g \mathcal{L}_g + \lambda_d \mathcal{L}_d + \lambda_s \mathcal{L}_s$ (Hong et al., 2022).
  • Hierarchical Regularization in Hyperbolic Space: Regularizers target
    • Entailment cone inclusions for compositional alignment across modalities,
    • Modality-gap (centroid distance) constraints for inter-modal hierarchy,
    • Part-whole cone alignment for point cloud substructure (Liu et al., 4 Jan 2025).
  • Contrastive Fusion in Modality Graphs (MICRO): Fuses graph-convolved embeddings over modality-specific affinity graphs, then enforces agreement between the fused and per-modality views using an InfoNCE-style loss (Zhang et al., 2021).
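A minimal NumPy sketch of the hierarchy-modulated triplet objective above, combined with the depth-scheduled margin $\alpha(h) = (h_{\max} - h)^2 + \alpha_{\min}$ from the training protocol. Function names and toy embeddings are illustrative only:

```python
import numpy as np

def margin(h, h_max, alpha_min=0.1):
    """Depth-scheduled margin alpha(h) = (h_max - h)^2 + alpha_min:
    wide at shallow (coarse) levels, small near the leaves."""
    return (h_max - h) ** 2 + alpha_min

def triplet_loss(f_a, f_p, f_n, h, h_max):
    """Hierarchy-modulated triplet margin loss on embedded features."""
    d_pos = np.sum((f_a - f_p) ** 2)
    d_neg = np.sum((f_a - f_n) ** 2)
    return max(0.0, d_pos - d_neg + margin(h, h_max))

# Toy embeddings: positive closer to the anchor than the negative.
f_a = np.array([1.0, 0.0])
f_p = np.array([0.9, 0.1])
f_n = np.array([-1.0, 0.0])

# The same triplet incurs at least as large a loss at a coarse level
# (h=0) as at a leaf level (h=3), because the margin is larger there.
assert triplet_loss(f_a, f_p, f_n, h=0, h_max=3) >= \
       triplet_loss(f_a, f_p, f_n, h=3, h_max=3)
```

The quadratic schedule makes the objective demand wide separation between coarse clusters early in the tree while tolerating tightly packed fine-grained leaves.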

4. Hierarchy-Aware Sampling and Training Protocols

MHCL leverages tailored sampling, margin scheduling, and training to encode hierarchical relations:

  • Level-specific Triplet Sampling: Triplets are drawn respecting the data tree. Anchors and positives are chosen from the same node or its descendants; negatives are sampled from sibling nodes. The margin $\alpha(h) = (h_{\max} - h)^2 + \alpha_{\min}$ enforces large inter-cluster distances at high levels, promoting coarse-to-fine latent organization (Bhalla et al., 6 Jan 2024).
  • Replay Regularization: To prevent catastrophic forgetting of coarse categories, batches probabilistically revisit higher hierarchy levels at rate $r_p$, ensuring stable encoding of both global and local structure throughout training (Bhalla et al., 6 Jan 2024).
  • Curricular Training (HCMoCo): Hierarchical losses are phased: initial alignment on global embeddings, then progressive addition of dense and sparse hierarchy-aware objectives (Hong et al., 2022).
  • Graph-based Structure Learning (MICRO): Initial modality graphs mined via top-$k$ nearest-neighbor affinities; latent graphs refined by linearly projecting features, with fusion via skip connections to preserve stable semantics before contrastive fusion (Zhang et al., 2021).
  • Optimization: Adam or AdamW optimizers are standard, with careful hyperparameter selection for margin, replay rate, batch size, and graph construction (Bhalla et al., 6 Jan 2024, Zhang et al., 2021, Hong et al., 2022, Liu et al., 4 Jan 2025).
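The sampling and replay protocol can be sketched as follows. The flat path-keyed tree, `sample_level`, and `sample_triplet` are hypothetical simplifications of the level-specific sampling and replay regularization described above:

```python
import random

def sample_level(h_max, replay_rate=0.3):
    """Pick a hierarchy level for this batch.

    With probability replay_rate (the role played by r_p in the text),
    revisit a coarser level to avoid forgetting global structure;
    otherwise train at the current finest level.
    """
    if random.random() < replay_rate:
        return random.randint(0, h_max - 1)   # replay a coarser level
    return h_max                              # current finest level

def sample_triplet(tree, node):
    """Anchor/positive from the same node; negative from a sibling
    (a node sharing the same parent path), respecting the tree."""
    anchor, positive = random.sample(tree[node], 2)
    parent = node.rsplit("/", 1)[0]
    siblings = [n for n in tree
                if n != node and n.rsplit("/", 1)[0] == parent]
    negative = random.choice(tree[random.choice(siblings)])
    return anchor, positive, negative

# Toy two-level tree keyed by path strings (illustrative).
tree = {"interior/nave": ["img1", "img2", "img3"],
        "interior/altar": ["img4", "img5"]}
a, p, n = sample_triplet(tree, "interior/nave")
assert a != p and a in tree["interior/nave"] and n in tree["interior/altar"]
```

Each batch would first pick a level with `sample_level`, then draw triplets at that level and apply the depth-scheduled margin for it.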

5. Evaluation Methodologies and Empirical Insights

MHCL frameworks consistently assess representation coherence and downstream task transfer:

  • Clustering and Latent Structure Visualization: Post-training t-SNE or PCA reveals that embedding geometry encodes both explicit hierarchy and downstream class separation, exceeding naive contrastive and off-the-shelf multimodal baselines in cluster purity and conceptual coherence (Bhalla et al., 6 Jan 2024).
  • Downstream Classification and Transfer: Pretrained encoders are frozen; a single-layer classifier is trained over semantic or task-specific classes. MHCL consistently improves mean average precision (mAP, mAP*) and per-class accuracy compared to both unimodal and text-informed contrastive methods (Bhalla et al., 6 Jan 2024).
  • Recommendation and Retrieval (MICRO): MHCL-augmented embeddings drive substantial improvements in Recall@20 in multimedia recommendation, especially under cold-start and low-interaction scenarios, with ablation showing structure learning and contrastive fusion as key contributors (Zhang et al., 2021).
  • Data-efficient Human-centric Tasks: In dense pose estimation, human parsing, and 3D pose, MHCL delivers strong gains (+7.16% GPS AP, +6.45 mIoU in scarce data settings), and enables missing-modality and cross-modality inference beyond standard supervised transfer (Hong et al., 2022).
  • 3D and Multimodal Hyperbolic Embeddings: On ModelNet40/10 and ShapeNetPart, hyperbolic MHCL outperforms unregularized or non-hierarchical baselines in both fine-tuning and few-shot regimes. Hierarchy regularizers reduce Gromov $\delta$-hyperbolicity, confirming sharper tree-like embedding structure (Liu et al., 4 Jan 2025).
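The linear-probe protocol (frozen encoder, single-layer classifier) can be mimicked on synthetic features. Everything below, including the stand-in "embeddings," is illustrative rather than an evaluation from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen MHCL embeddings: two well-separated classes.
X = np.vstack([rng.normal(-2, 0.5, (50, 8)), rng.normal(2, 0.5, (50, 8))])
y = np.array([0] * 50 + [1] * 50)

# A single linear layer trained on top of the frozen features
# (softmax regression via plain gradient descent).
W = np.zeros((8, 2))
for _ in range(200):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    onehot = np.eye(2)[y]
    W -= 0.1 * X.T @ (p - onehot) / len(X)

acc = (np.argmax(X @ W, axis=1) == y).mean()
assert acc > 0.95  # linearly separable features probe cleanly
```

The probe's accuracy reflects how linearly separable the frozen representation is, which is exactly the property the clustering and transfer evaluations above are designed to test.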

6. Extensions, Limitations, and Future Directions

MHCL generalizes naturally to a range of hierarchically structured and multi-modal domains:

  • Broader Application Domains: Suitable for semantic trees (e.g., ImageNet), e-commerce taxonomies, 3D part-whole hierarchies, and any scenario where explicit or latent hierarchy is present (Bhalla et al., 6 Jan 2024, Liu et al., 4 Jan 2025).
  • Combinatorial and Multimodal Fusions: MHCL can be integrated with textual captioning, 3D geometry, or further modalities to enrich representation space (Bhalla et al., 6 Jan 2024, Liu et al., 4 Jan 2025).
  • Adaptive Hierarchy Mechanisms: Dynamic adjustment of the maximum hierarchy depth $h_{\max}$ or architectural parameters to accommodate unbalanced trees or complex data structures (Bhalla et al., 6 Jan 2024).
  • Alternative Contrastive Objectives: Exploration of NT-Xent, cosine-margin, or N-pair losses as substitutes for triplet-based schemes (Bhalla et al., 6 Jan 2024).
  • Joint Fine-tuning and Full-stack Adaptation: Moving from partial to end-to-end fine-tuning of backbones in response to new modality signals (Bhalla et al., 6 Jan 2024, Hong et al., 2022).
  • Further Modalities and Structures: Current approaches are limited to static text/2D/3D signals; extensions to video, audio, streaming, or graph-structured data are plausible next targets (Liu et al., 4 Jan 2025).

7. Summary and Implications

MHCL demonstrates that hierarchical structure, when treated as a first-class modality, acts as a potent supervisory signal for contrastive representation learning. Incorporating data hierarchy via margin scheduling, targeted sampling, or hierarchical regularization produces embedding spaces that are visually discriminative, conceptually consistent, and robust even under weak or scarce supervision. The paradigm is competitive with or surpasses purely visual or text-informed contrastive frameworks, especially where explicit or discoverable hierarchy exists. Notwithstanding current limitations (modality coverage, manual tree construction), MHCL's architecture-agnostic nature and adaptability suggest wide applicability in domains where embedding semantic or spatial hierarchies confers downstream efficiency and transferability (Bhalla et al., 6 Jan 2024, Zhang et al., 2021, Hong et al., 2022, Liu et al., 4 Jan 2025).
