Hierarchical Embeddings (HIER-Embeds)
- Hierarchical Embeddings (HIER-Embeds) are methods that represent structured, tree-like data by encoding parent-child and ancestor-descendant relations in continuous vector spaces.
- They leverage geometric frameworks—hyperbolic, region-based Euclidean, and variable-curvature models—to faithfully mimic hierarchical inclusion and metric separation.
- Optimization employs margin-based ranking losses and specialized training objectives to ensure scalable and interpretable embedding of complex hierarchies.
Hierarchical Embeddings (HIER-Embeds) denote a class of embedding methods explicitly designed to represent hierarchical data—such as trees, directed acyclic graphs (DAGs), taxonomies, or partial orders—in continuous geometric or probabilistic vector spaces. The principal aim is to encode semantics and relationships so that hierarchical structure (e.g., parent-child, ancestor-descendant, is-a, hypernym) is faithfully reflected in terms of spatial inclusion, metric distance, or distributional ordering. These models provide a mathematical and algorithmic framework for learning representations that naturally encode hierarchy-specific properties distinct from flat or unstructured embedding techniques.
1. Geometric Frameworks for Hierarchical Embeddings
The HIER-Embeds literature has converged on two main geometric strategies, constant-curvature Riemannian manifolds (most notably hyperbolic geometry) and region-based Euclidean models, together with variable-curvature extensions for structures that are not strictly tree-like.
Hyperbolic Geometry: Hyperbolic spaces (Poincaré ball, Lorentz model) are the canonical continuous analogues of trees due to their exponential volume growth and negative curvature. Embeddings in the Poincaré ball model produce geodesic distances
that grow rapidly near the boundary and naturally separate hierarchical levels (“roots” near the origin, “leaves” near the boundary), allowing low-distortion representation of trees with large fan-out and depth (Ganea et al., 2018, Dhingra et al., 2018, Chatterjee et al., 2021, Li et al., 23 Dec 2025).
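As a concrete illustration, the following minimal NumPy sketch computes the standard closed-form Poincaré-ball distance $d(x,y) = \operatorname{arcosh}\bigl(1 + 2\|x-y\|^2 / ((1-\|x\|^2)(1-\|y\|^2))\bigr)$; the point values are illustrative only.

```python
import numpy as np

def poincare_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Geodesic distance in the Poincare ball (standard closed form)."""
    sq_diff = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_diff / denom))

# Points near the origin act as "roots"; points near the boundary as "leaves".
root, leaf_a, leaf_b = np.array([0.01, 0.0]), np.array([0.95, 0.0]), np.array([0.0, 0.95])
print(poincare_distance(root, leaf_a))    # ~3.6: root-to-leaf
print(poincare_distance(leaf_a, leaf_b))  # ~6.6: leaves separate sharply near the boundary
```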
Region-based Euclidean Models: Alternatively, HIER-Embeds may represent concepts as geometric regions (balls, boxes, convex cones) in $\mathbb{R}^n$, encoding hierarchy via set inclusion. Recent region-distance (RegD) models define depth and boundary metrics over these regions to emulate hyperbolic separation and enforce inclusion, thus bridging the geometric gap between Euclidean and hyperbolic paradigms (Yang et al., 29 Jan 2025). Region-based approaches are especially pertinent when explicit set-based logical relations (inclusion, intersection) need to be modeled.
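As one simplified instance of region-based inclusion, the sketch below represents concepts as axis-aligned boxes and scores containment asymmetrically; the box parameterization and the `inclusion_violation` score are illustrative assumptions, not the RegD metrics themselves.

```python
import numpy as np

def inclusion_violation(child_lo, child_hi, parent_lo, parent_hi) -> float:
    """How far the child box sticks out of the parent box (0 => child inside parent)."""
    under = np.maximum(parent_lo - child_lo, 0.0)  # child extends below parent's lower corner
    over = np.maximum(child_hi - parent_hi, 0.0)   # child extends above parent's upper corner
    return float(np.sum(under + over))

animal = (np.array([0.0, 0.0]), np.array([1.0, 1.0]))
dog = (np.array([0.2, 0.1]), np.array([0.5, 0.4]))
print(inclusion_violation(*dog, *animal))  # 0.0: dog is contained in animal
print(inclusion_violation(*animal, *dog))  # 1.4: the relation is asymmetric
```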
Variable-Curvature Models: To handle structures not strictly tree-like, embedding in variable-curvature spaces—such as the complex hyperbolic unit ball with the Bergman metric—allows for local adaptation of negative curvature, enabling faithful embedding of multitrees and overlapping hierarchies (Xiao et al., 2021).
2. Hierarchy as Partial Order: Representation and Entailment
The essential property of HIER-Embeds is that the embedding reflects a partial order: for a DAG or poset $(V, \preceq)$, each node $u \in V$ is mapped to a point $x_u$ in the continuous space so that parent-child or ancestor-descendant relationships manifest as geometric inclusion, angular constraints, metric dominance, or probabilistic encapsulation:
- Entailment Cones: A concept $u$ is associated with a convex or geodesically convex cone $\mathfrak{S}_{x_u}$ rooted at its embedding $x_u$. The entailment region of $u$ is the set of all points $y$ lying within the cone's half-aperture $\psi(x_u)$, frequently defined via a closed-form function of $\|x_u\|$. The partial order is then $v \preceq u \iff x_v \in \mathfrak{S}_{x_u}$, with the cone constructed in hyperbolic space (Ganea et al., 2018) or Euclidean space (Dhall et al., 2020); see the sketch after this list.
- Metric Separation: In hyperbolic models, hierarchy is further encoded by the monotonic increase of distance from the origin with depth: $d(o, x_u) \le d(o, x_v)$ if $u$ is higher in the hierarchy than $v$. Pairwise distances between unrelated concepts increase with their separation in the tree.
- Region Inclusion: In region-based HIER-Embeds, set inclusion $\mathcal{R}_v \subseteq \mathcal{R}_u$ is enforced whenever $v \preceq u$, measured via asymmetric boundary distances (Yang et al., 29 Jan 2025).
- Probabilistic Encapsulation: Density order embeddings (DOE) map each concept $u$ to a Gaussian $\mathcal{N}(\mu_u, \Sigma_u)$ such that higher-level concepts have larger covariance (volume) and entailments correspond to strict or soft encapsulation of one density's high-probability region within another (Athiwaratkun et al., 2018).
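A minimal sketch of the hyperbolic entailment-cone test, following the closed-form half-aperture and cone-angle expressions of Ganea et al. (2018); the constant `K` and the example points are illustrative.

```python
import numpy as np

K = 0.1  # aperture constant; points are kept away from the origin so psi stays defined

def aperture(x: np.ndarray) -> float:
    """Half-aperture psi(x) of the entailment cone rooted at x."""
    nx = np.linalg.norm(x)
    return float(np.arcsin(K * (1.0 - nx ** 2) / nx))

def cone_angle(x: np.ndarray, y: np.ndarray) -> float:
    """Angle Xi(x, y) between the cone axis at x and the direction toward y."""
    nx2, ny2, dot = np.sum(x ** 2), np.sum(y ** 2), np.dot(x, y)
    num = dot * (1.0 + nx2) - nx2 * (1.0 + ny2)
    den = np.linalg.norm(x) * np.linalg.norm(x - y) * np.sqrt(1.0 + nx2 * ny2 - 2.0 * dot)
    return float(np.arccos(np.clip(num / den, -1.0, 1.0)))

def cone_energy(x: np.ndarray, y: np.ndarray) -> float:
    """0 iff y lies inside the entailment cone of x, i.e., x entails y."""
    return max(0.0, cone_angle(x, y) - aperture(x))

parent = np.array([0.3, 0.0])
print(cone_energy(parent, np.array([0.6, 0.0])))  # 0.0: deeper along the same ray
print(cone_energy(parent, np.array([0.0, 0.6])))  # > 0: outside the cone
```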
3. Learning Objectives and Optimization Techniques
Margin-Based Ranking Losses and Constraints
A prevalent approach is a margin-based loss that penalizes violations of the geometric entailment condition. For a positive (ancestor, descendant) pair $(u, v)$ and negative pairs $(u', v')$ (non-descendants), the loss takes the general form
$$\mathcal{L} = \sum_{(u,v) \in \mathcal{P}} E(u, v) + \sum_{(u',v') \in \mathcal{N}} \max\bigl(0,\, \gamma - E(u', v')\bigr),$$
where the energy $E(u, v)$ quantifies whether $x_v$ falls within the entailment region of $x_u$ (for cone models, $E(u,v) = \max(0,\, \Xi(x_u, x_v) - \psi(x_u))$) or, for region-based models, is measured via depth/boundary distances.
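A minimal sketch of this ranking loss, reusing `cone_energy` from the entailment-cone sketch above as the energy $E$; the margin value is illustrative.

```python
def margin_ranking_loss(positives, negatives, energy, gamma: float = 1.0) -> float:
    """Penalize positives outside the entailment region and negatives inside the margin."""
    pos_term = sum(energy(x_u, x_v) for x_u, x_v in positives)
    neg_term = sum(max(0.0, gamma - energy(x_u, x_v)) for x_u, x_v in negatives)
    return pos_term + neg_term

# e.g., loss = margin_ranking_loss(pos_pairs, neg_pairs, energy=cone_energy)
```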
For hyperbolic models, Riemannian gradient descent or Riemannian Adam is employed, requiring conversion of Euclidean gradients to manifold gradients via the inverse metric tensor. Dilation operations and transitive closure regularization are used to address “capacity illnesses” (i.e., local overcrowding, subtree overlap) and enforce global separation between subtrees (Wang et al., 2024).
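A minimal sketch of one Riemannian SGD step in the Poincaré ball: because the ball's metric is conformal, converting the Euclidean gradient is a pointwise rescaling by the inverse metric factor; the projection constant `EPS` is an illustrative choice.

```python
import numpy as np

EPS = 1e-5  # keeps updated points strictly inside the open unit ball

def rsgd_step(x: np.ndarray, euclidean_grad: np.ndarray, lr: float) -> np.ndarray:
    """One Riemannian SGD step in the Poincare ball.

    The metric is g_x = (2 / (1 - ||x||^2))^2 * I, so the Riemannian gradient
    is the Euclidean gradient rescaled by the inverse metric factor.
    """
    scale = ((1.0 - np.sum(x ** 2)) ** 2) / 4.0
    x_new = x - lr * scale * euclidean_grad
    norm = np.linalg.norm(x_new)
    if norm >= 1.0:  # retract points that left the ball back onto its interior
        x_new = x_new / norm * (1.0 - EPS)
    return x_new
```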
Hierarchical Clustering and Probabilistic Generative Models
Probabilistic models (e.g., HCRL (Shin et al., 2019)) couple variational autoencoders with hierarchical Gaussian mixture priors. Level-proportion variables dictate the degree to which an instance is generated at each tree level, and inference involves optimizing an ELBO that integrates mixture assignment, level choice, and reconstruction terms.
Matryoshka embeddings (Hanley et al., 30 May 2025) leverage a dimensional truncation framework, where successive prefixes of the base embedding represent coarser-to-finer granularity and are used in level-wise clustering algorithms.
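A plausible sketch of level-wise clustering over truncated prefixes, using scikit-learn's `KMeans`; the prefix dimensions and cluster counts are illustrative assumptions, and the actual algorithm in Hanley et al. (30 May 2025) may differ in its details.

```python
import numpy as np
from sklearn.cluster import KMeans

def matryoshka_level_clusters(embeddings: np.ndarray, prefix_dims, ks):
    """Cluster successive prefixes of the same base embeddings:
    short prefixes yield coarse groupings, longer prefixes refine them."""
    labels = {}
    for d, k in zip(prefix_dims, ks):
        prefix = embeddings[:, :d]  # truncate; no retraining needed
        prefix = prefix / np.linalg.norm(prefix, axis=1, keepdims=True)
        labels[d] = KMeans(n_clusters=k, n_init=10).fit_predict(prefix)
    return labels

# Coarse themes from a 32-dim prefix, fine-grained topics from the full vector:
# labels = matryoshka_level_clusters(X, prefix_dims=(32, 128, 768), ks=(10, 50, 200))
```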
Specialized Training Objectives
- Multi-similarity Contrastive Losses: HiPrBERT trains using multi-level contrastive objectives, partitioning positive and negative pairs by explicit levels in biomedical ontologies, with hard-negative mining to enforce separation at each distance category (Cai et al., 2023); a generic variant is sketched after this list.
- Hyperbolic Cross-Modal Losses: H²em integrates hyperbolic entailment cones and multi-modal alignment (via contrastive InfoNCE) with hard-negatives for compositional zero-shot learning, using hyperbolic cross-modal attention for fusion (Li et al., 23 Dec 2025).
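A generic level-weighted contrastive sketch in the spirit of these objectives; the grouping of positives by ontology level and the `level_weights` scheme are assumptions for illustration, not the exact HiPrBERT or H²em losses.

```python
import torch
import torch.nn.functional as F

def multilevel_infonce(anchor, positives_by_level, negatives, level_weights, tau=0.07):
    """InfoNCE summed over ontology levels: each level's positives must beat
    the shared negatives, with closer levels weighted more heavily."""
    a = F.normalize(anchor, dim=-1)          # (d,)
    neg = F.normalize(negatives, dim=-1)     # (N, d)
    neg_logits = neg @ a / tau               # (N,)
    loss = torch.tensor(0.0)
    for level, pos in positives_by_level.items():
        p = F.normalize(pos, dim=-1)         # (M, d) positives at this level
        pos_logits = p @ a / tau             # (M,)
        logits = torch.cat([pos_logits.unsqueeze(1),
                            neg_logits.expand(len(pos_logits), -1)], dim=1)
        targets = torch.zeros(len(pos_logits), dtype=torch.long)  # positive at index 0
        loss = loss + level_weights[level] * F.cross_entropy(logits, targets)
    return loss
```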
4. Hierarchy-Aware Applications and Empirical Findings
Natural Language and Lexical Semantics
- Lexical Hierarchies: Hyperbolic and order-encapsulating models excel at capturing hypernymy and entailment in WordNet, graded lexical entailment (HyperLex), and multilingual generalization (Ganea et al., 2018, Nguyen et al., 2017, Dhingra et al., 2018, Athiwaratkun et al., 2018).
- Multi-Label Classification: Joint embedding of documents and labels in hyperbolic space yields superior micro-F1, higher correlation with latent tree distances, and more faithful recovery of unknown taxonomy (Chatterjee et al., 2021).
- LLMs and Attention: Hierarchical manifold projections and geodesic-aware attention improve lexical alignment, generalization, interpretability, and robustness in multi-domain transformer architectures (Martus et al., 8 Feb 2025).
Computer Vision and Generative Models
- Hierarchical Image Classification: Injecting entailment cones into CNNs provides higher F1 on taxonomically organized datasets (e.g. ETHEC, with >700 labels spanning 4 levels) relative to flat classifiers (Dhall et al., 2020).
- Global Visual Geolocation: Hyperbolic entity embeddings of geographic hierarchies (country, region, subregion, city) enable efficient, memory-saving, and accurate image-to-location alignment, outperforming retrieval-based and generative baselines on large benchmarks (Gadi et al., 30 Jan 2026).
- Phylogenetic Conditioning in Diffusion Models: Concatenative HIER-Embeds reflecting multi-level ancestry enable controlled generative trait edits (masking, swapping) in diffusion-based generation, with trait changes measurable via classification shifts and embedding distances closely mirroring true tree distances (Khurana et al., 2024).
Graph and Network Embedding
- Mixed-Membership and Clustering: Hierarchical attentive membership models infer context-dependent groupings and hierarchical partitions in graphs, integrating dual attention and skip-gram objectives, outperforming flat GNNs and pooling baselines on node classification and link prediction (Lin et al., 2021).
- Hierarchical Clustering: Matryoshka and HCRL models yield multiscale clusterings across language, vision, and graph domains, allowing interpretable theme-topic-story or abstraction-granularity decompositions (Shin et al., 2019, Hanley et al., 30 May 2025).
5. Theoretical Properties and Extensions
Volume Growth and Capacity
A core reason for the success of hyperbolic geometry in HIER-Embeds is its exponential volume growth: the number of nodes that can be embedded without crowding increases exponentially with radius, matching the growth of trees and deep hierarchies. In contrast, Euclidean volume increases only polynomially, rapidly resulting in crowding and poor discrimination of siblings or deep leaves (Ganea et al., 2018, Wang et al., 2024, Li et al., 23 Dec 2025).
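A quick numeric illustration of this gap, comparing the area of a hyperbolic disk (curvature $-1$, area $2\pi(\cosh r - 1)$) with a Euclidean disk of the same radius:

```python
import numpy as np

for r in (1, 5, 10):
    hyp_area = 2 * np.pi * (np.cosh(r) - 1)  # grows like e^r: room for exponential fan-out
    euc_area = np.pi * r ** 2                # grows polynomially: siblings crowd together
    print(f"r={r:2d}  hyperbolic={hyp_area:10.1f}  euclidean={euc_area:7.1f}")
# r= 1: ~3.4 vs ~3.1; r= 5: ~460 vs ~79; r=10: ~69192 vs ~314
```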
Variable Curvature and Hybrid Models
In taxonomies with heterogeneous or multitree substructures, variable-curvature complex hyperbolic models allow local adaptation: highly tree-like regions use strong negative curvature; densely connected regions use milder curvature. Angular separation in the complex unit ball further disambiguates siblings and overlapping subtrees (Xiao et al., 2021).
Extensions include mixed-curvature manifolds and product geometric spaces (hyperbolic×Euclidean), region models with ellipsoids or polytopes, and combinatorial-analytic frameworks for representation guarantees (Chatterjee et al., 2021, Yang et al., 29 Jan 2025, Xiao et al., 2021).
6. Limitations and Open Challenges
- Optimization Complexity: Riemannian optimization in curved spaces or with variable curvature is less numerically robust and computationally more intensive than standard Euclidean SGD. Dilation and regularization schedules must be tuned to avoid capacity problems such as intra-subtree crowding and inter-subtree collapse (Wang et al., 2024).
- Non-tree Structures: DAGs and multitrees violate constant-curvature idealizations. While region, hybrid, or variable-curvature models help, they do not fully close the representational gap.
- Interpretable Latent Structure: In continuous hyperbolic embeddings (especially with neural encoders over infinite domains), direct extraction of discrete parent-child relationships is non-trivial without additional probing or regularization (Dhingra et al., 2018).
- Modeling Logical Combinations: Embedding arbitrary logical rules (conjunctions/disjunctions/negations) in a fully geometric way remains an open area, though region-based approaches offer potential solutions (Yang et al., 29 Jan 2025).
7. Cross-Domain and Practical Impact
HIER-Embeds have established state-of-the-art performance in settings where deep structure, label explosion, or fine semantic discrimination dominates:
- Biomedical concept normalization and ontology modeling (Cai et al., 2023, Li et al., 23 Dec 2025).
- Lexical taxonomy and entailment inference (Nguyen et al., 2017, Athiwaratkun et al., 2018).
- Global image geolocation and retrieval (Gadi et al., 30 Jan 2026).
- Hierarchical clustering in multilingual and cross-modal settings (Hanley et al., 30 May 2025, Shin et al., 2019).
- Open-world compositional reasoning and zero-shot learning (Li et al., 23 Dec 2025).
- Structured sequence modeling and multi-hop reasoning in LLMs (Patil et al., 25 May 2025).
Hierarchical Embeddings, formalized and operationalized across these domains, provide a principled geometric, probabilistic, and algorithmic foundation for scalable, expressive, and interpretable multi-resolution representation learning in modern machine intelligence.