
Hyperbolic Image-Text Representations

Updated 4 February 2026
  • Hyperbolic image-text representations are multimodal embedding techniques leveraging hyperbolic geometry to naturally capture hierarchical and tree-like data structures.
  • They employ contrastive learning and entailment cones within Lorentz and Poincaré models to align visual and linguistic information effectively.
  • Recent extensions include density-based embeddings to model semantic uncertainty, leading to improved retrieval, classification, and generative performance.

Hyperbolic image-text representations define a class of multimodal embedding techniques that model images and text within hyperbolic manifolds, typically the Lorentz (hyperboloid) or Poincaré ball models, to directly encode hierarchical relationships and partial-order entailment structures. Unlike traditional Euclidean or spherical embeddings, hyperbolic geometry—characterized by negative curvature and exponential volume growth with radius—naturally accommodates tree-like, coarse-to-fine data structures. These geometric properties are aligned with the semantic and compositional hierarchies prevalent in visual and linguistic data, enabling improved alignment, retrieval, and interpretability for numerous downstream applications. Contemporary work extends this paradigm from deterministic (point) embeddings to probabilistic (density-based) hyperbolic representations, further capturing semantic uncertainty in ambiguous image-text pairs.

1. Hyperbolic Geometry and Its Suitability for Hierarchy

Hyperbolic space, with constant negative curvature, expands volume exponentially with distance from the origin, making it an ideal embedding substrate for data exhibiting hierarchical or tree-like organization. The most widely used models are the Lorentz (hyperboloid) model and the Poincaré ball. In the Lorentz model, a point $x \in \mathbb{R}^{n+1}$ satisfies the constraint $\langle x, x\rangle_{\mathcal{L}} = -1/c$ for curvature $c > 0$, with the Lorentzian inner product defined as $\langle x, y\rangle_{\mathcal{L}} = -x_0 y_0 + \sum_{i=1}^{n} x_i y_i$. The geodesic distance is $d_{\mathcal{L}}(x, y) = \frac{1}{\sqrt{c}}\,\mathrm{arccosh}\big(-c\,\langle x, y\rangle_{\mathcal{L}}\big)$.
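
The inner product, hyperboloid constraint, and geodesic distance above can be sketched in a few lines of NumPy (a minimal illustration; the `lift_to_hyperboloid` helper, which solves the constraint for the time coordinate, is our own naming, not from any of the cited papers):

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_distance(x, y, c=1.0):
    """Geodesic distance d_L(x, y) = arccosh(-c <x, y>_L) / sqrt(c)."""
    inner = -c * lorentz_inner(x, y)
    # Clamp to the valid arccosh domain to absorb floating-point error.
    return np.arccosh(np.maximum(inner, 1.0)) / np.sqrt(c)

def lift_to_hyperboloid(v, c=1.0):
    """Choose the time coordinate x0 so that <x, x>_L = -1/c holds."""
    x0 = np.sqrt(1.0 / c + np.dot(v, v))
    return np.concatenate([[x0], v])

x = lift_to_hyperboloid(np.array([0.3, -0.1]))
y = lift_to_hyperboloid(np.array([0.0, 0.5]))
print(lorentz_inner(x, x))      # ≈ -1: x satisfies the constraint
print(lorentz_distance(x, y))   # positive geodesic distance
```

Note the clamp before `arccosh`: since $\langle x, y\rangle_{\mathcal{L}} \le -1/c$ for any two points on the hyperboloid, $-c\,\langle x, y\rangle_{\mathcal{L}} \ge 1$ holds exactly in theory but can dip below 1 numerically.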

These properties are exploited to embed general (text, broad image contexts) concepts near the origin and specific (detailed image content, fine linguistic phrases) concepts near the boundary, preserving both semantic similarity and hierarchical depth (Desai et al., 2023, Qiao et al., 2024, Pal et al., 2024, Baek et al., 26 Nov 2025).

2. Contrastive Learning and Entailment Cones in Hyperbolic Space

Contrastive learning in hyperbolic space builds on standard dual-encoder architectures (e.g., Vision Transformer and text Transformer) but projects outputs onto a shared hyperbolic manifold using the exponential map at the origin. The typical approach, as in MERU (Desai et al., 2023), is to append a zero-valued time component to the Euclidean feature and apply the exponential map to obtain a hyperbolic embedding.
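
A minimal sketch of this projection, assuming curvature $c = 1$ and the standard hyperboloid origin (function names are ours, not MERU's API):

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def exp_map_origin(v_space, c=1.0):
    """Exponential map at the origin O = (1/sqrt(c), 0, ..., 0).

    A Euclidean encoder output v_space is treated as a tangent vector
    at O with a zero time component, then pushed onto the hyperboloid.
    """
    r = np.linalg.norm(v_space)
    if r < 1e-12:  # the zero vector maps to the origin itself
        return np.concatenate([[1.0 / np.sqrt(c)], np.zeros_like(v_space)])
    x0 = np.cosh(np.sqrt(c) * r) / np.sqrt(c)
    xs = np.sinh(np.sqrt(c) * r) * v_space / (np.sqrt(c) * r)
    return np.concatenate([[x0], xs])

feat = np.array([0.2, -0.4, 0.1])   # stand-in for an encoder output
x = exp_map_origin(feat)
print(lorentz_inner(x, x))           # ≈ -1: x lies on the hyperboloid
```

Because $\cosh^2 r - \sinh^2 r = 1$, the mapped point satisfies $\langle x, x\rangle_{\mathcal{L}} = -1/c$ by construction.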

The hyperbolic contrastive loss replaces cosine or Euclidean distance with the negative Lorentzian geodesic distance, making the InfoNCE objective geometry-aware. To model partial-order entailment (“text entails image” relations), hyperbolic entailment cones are defined: each “parent” point in hyperbolic space defines a cone of child points encapsulating more specific concepts. The cone half-aperture narrows with increasing radial distance, enforcing that specific points fall within the entailment region of their parent. Losses penalize embeddings that violate this soft containment (Desai et al., 2023, Pal et al., 2024, Chen et al., 8 Jan 2026).
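
The cone construction can be sketched for a single parent-child pair as follows (curvature 1; the boundary constant K = 0.1 follows MERU's reported choice, but the function names and the simplified single-pair form are ours):

```python
import numpy as np

K = 0.1  # constant controlling cone width near the origin (MERU uses 0.1)

def exp_map_origin(v, c=1.0):
    """Push a tangent vector at the hyperboloid origin onto the manifold."""
    r = np.linalg.norm(v)
    if r < 1e-12:
        return np.concatenate([[1.0 / np.sqrt(c)], np.zeros_like(v)])
    return np.concatenate([[np.cosh(np.sqrt(c) * r) / np.sqrt(c)],
                           np.sinh(np.sqrt(c) * r) * v / (np.sqrt(c) * r)])

def lorentz_inner(x, y):
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def half_aperture(x):
    """Cone half-aperture; narrows as the point moves away from the origin."""
    return np.arcsin(np.clip(2.0 * K / np.linalg.norm(x[1:]), -1.0, 1.0))

def exterior_angle(x, y):
    """Angle at x between the geodesic from the origin and the geodesic to y."""
    inner = lorentz_inner(x, y)
    num = y[0] + x[0] * inner
    den = np.linalg.norm(x[1:]) * np.sqrt(max(inner ** 2 - 1.0, 1e-12))
    return np.arccos(np.clip(num / den, -1.0, 1.0))

def entailment_loss(parent, child):
    """Penalize children that fall outside the parent's entailment cone."""
    return max(0.0, exterior_angle(parent, child) - half_aperture(parent))

parent = exp_map_origin(np.array([0.2, 0.0]))
inside = exp_map_origin(np.array([1.0, 0.0]))    # same direction, deeper
outside = exp_map_origin(np.array([-1.0, 0.0]))  # opposite direction
print(entailment_loss(parent, inside))   # 0.0: contained in the cone
print(entailment_loss(parent, outside))  # > 0: violates containment
```

The loss is zero whenever the child's exterior angle is smaller than the parent's half-aperture, i.e. the child sits inside the cone; only violations contribute gradient.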

Table 1 below summarizes common hyperbolic contrastive and entailment strategies:

| Model | Hyperbolic Loss | Entailment Geometry |
| --- | --- | --- |
| MERU (Desai et al., 2023) | Lorentzian contrastive | Hyperbolic cones |
| HyCoCLIP (Pal et al., 2024) | Lorentz contrastive (boxes) | Compositional cones |
| HyperAlign (Chen et al., 8 Jan 2026) | Regression + entailment cone loss | Dynamic cones, exterior angle |

3. Density-Based Hyperbolic Representations and Semantic Uncertainty

Point embeddings are insufficient to represent semantic ambiguity or uncertainty in cross-modal pairs; for example, a medical image may have multiple plausible textual interpretations. HYDEN (Qiao et al., 2024) introduces hyperbolic density embeddings, where each image and text is mapped to a pseudo-Gaussian probability density on the hyperboloid manifold. The mean and covariance of each distribution are produced by small MLPs operating in the Euclidean tangent space at the hyperbolic origin, then mapped to the manifold.
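
The general idea of defining a density on the manifold through the tangent space at the origin can be illustrated with a wrapped-normal-style sampler (a sketch only; HYDEN's exact pseudo-Gaussian parameterization and MLP heads are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def exp_map_origin(v, c=1.0):
    """Push a tangent vector at the hyperboloid origin onto the manifold."""
    r = np.linalg.norm(v)
    if r < 1e-12:
        return np.concatenate([[1.0 / np.sqrt(c)], np.zeros_like(v)])
    return np.concatenate([[np.cosh(np.sqrt(c) * r) / np.sqrt(c)],
                           np.sinh(np.sqrt(c) * r) * v / (np.sqrt(c) * r)])

def sample_density(mu, sigma, n=500):
    """Illustrative wrapped-normal-style density: draw Gaussian tangent
    vectors (mean mu, per-dimension std sigma) in the Euclidean tangent
    space at the origin and map them onto the hyperboloid."""
    vs = rng.normal(loc=mu, scale=sigma, size=(n, len(mu)))
    return np.stack([exp_map_origin(v) for v in vs])

pts = sample_density(np.array([0.5, 0.0]), 0.1)
# every sample satisfies the constraint <x, x>_L = -1
constraint = -pts[:, 0] ** 2 + np.sum(pts[:, 1:] ** 2, axis=1)
print(np.allclose(constraint, -1.0))  # True
```

A larger `sigma` spreads mass over the manifold, which is the knob that lets a density represent an ambiguous concept rather than a single point.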

Partial order is then modeled via encapsulation: a density $f$ is said to be entailed (contained) by $g$, written $f \preceq g$, if the superlevel sets of $f$ are contained in those of $g$ for all thresholds. An encapsulation loss softly penalizes violations, typically using an asymmetric Rényi-$\alpha$ divergence. This allows explicit modeling and control of semantic uncertainty and one-to-many relationships in image-text alignment, which is especially critical in domains such as radiology (Qiao et al., 2024).
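
Why an asymmetric divergence matters here can be seen from the closed-form Rényi-$\alpha$ divergence between one-dimensional Gaussians (the generic Euclidean formula, used only to illustrate the asymmetry, not HYDEN's manifold-valued version):

```python
import numpy as np

def renyi_gaussian_1d(mu0, s0, mu1, s1, alpha=2.0):
    """Renyi-alpha divergence D_alpha(N(mu0, s0^2) || N(mu1, s1^2)).

    Finite only when the mixed variance (1-alpha)*s0^2 + alpha*s1^2 > 0.
    """
    va = (1.0 - alpha) * s0 ** 2 + alpha * s1 ** 2
    if va <= 0:
        return np.inf  # outside the divergence's domain
    return (alpha * (mu0 - mu1) ** 2 / (2.0 * va)
            - np.log(va / (s0 ** (2.0 * (1.0 - alpha)) * s1 ** (2.0 * alpha)))
              / (2.0 * (alpha - 1.0)))

d_fg = renyi_gaussian_1d(0.0, 1.0, 0.0, 1.2)  # narrow f against broader g
d_gf = renyi_gaussian_1d(0.0, 1.2, 0.0, 1.0)  # the reverse direction
print(d_fg, d_gf)  # the two directions differ
```

Because the two directions differ, penalizing one direction but not the other gives the loss the orientation needed to express "f is contained in g" rather than mere closeness.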

4. Hierarchical and Compositional Representation Learning

Hierarchical and compositional learning in hyperbolic representations extends beyond just aligning image and text pairs. For instance, HyCoCLIP (Pal et al., 2024) exploits not only full-image/full-text alignment but also compositional sub-structures: object boxes in images linked via phrase grounding to noun-chunked sub-phrases in captions. This allows the construction and enforcement of nested composition and entailment relationships—e.g., full image → box; caption → noun-phrase—across both modalities.

The combination of hyperbolic contrastive alignment and compositional entailment cones allows models to encode explicit hierarchical structure. Empirical evaluations show that such compositional approaches lead to improved zero-shot classification, retrieval, and hierarchical metrics compared to both Euclidean CLIP and prior hyperbolic point-based methods (Pal et al., 2024).
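
A single-example sketch of how alignment and entailment terms combine (the parent/child pairings, the distance-based alignment term, and the weight are illustrative placeholders, not HyCoCLIP's exact objective; here text and phrase are treated as parents of image and box):

```python
import numpy as np

def expmap(v):
    """Exponential map at the hyperboloid origin (curvature 1)."""
    r = np.linalg.norm(v)
    if r < 1e-12:
        return np.concatenate([[1.0], np.zeros_like(v)])
    return np.concatenate([[np.cosh(r)], np.sinh(r) * v / r])

def inner(x, y):
    return -x[0] * y[0] + x[1:] @ y[1:]

def dist(x, y):
    return np.arccosh(np.maximum(-inner(x, y), 1.0))

def cone_violation(parent, child, K=0.1):
    """Exterior angle minus half-aperture, clipped at zero."""
    i = inner(parent, child)
    ext = np.arccos(np.clip(
        (child[0] + parent[0] * i)
        / (np.linalg.norm(parent[1:]) * np.sqrt(max(i ** 2 - 1.0, 1e-12))),
        -1.0, 1.0))
    aper = np.arcsin(np.clip(2.0 * K / np.linalg.norm(parent[1:]), -1.0, 1.0))
    return max(0.0, ext - aper)

def compositional_loss(img, box, txt, phr, w=0.2):
    """Illustrative whole/part objective: distance-based alignment of
    matched pairs plus entailment of the specific embeddings within
    the cones of their more general counterparts."""
    align = dist(img, txt) + dist(box, phr)
    hierarchy = cone_violation(txt, img) + cone_violation(phr, box)
    return align + w * hierarchy

img, box = expmap(np.array([1.0, 0.1])), expmap(np.array([0.8, 0.0]))
txt, phr = expmap(np.array([0.5, 0.05])), expmap(np.array([0.4, 0.0]))
print(compositional_loss(img, box, txt, phr))  # nonnegative scalar loss
```

A real training objective would use InfoNCE with in-batch negatives for the alignment term; plain geodesic distance is used here only to keep the sketch self-contained.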

5. Hyperbolic Alignment Assessment and Adaptive Scoring

Hyperbolic geometry is also exploited for alignment assessment. HyperAlign (Chen et al., 8 Jan 2026) introduces an adaptive text-to-image alignment assessment pipeline where CLIP features are lifted to Lorentz hyperbolic space, and their alignment is evaluated both by classic Euclidean cosine similarity and hyperbolic entailment cones. A sample-adaptive modulation regressor predicts weights for cosine similarity based on hyperbolic geometric features—distance, exterior angle, and cone aperture—providing increased robustness and improved agreement with human judgment in automatic alignment scoring.

Dynamic supervision is employed by modulating cone aperture with ground-truth scores, ensuring that highly aligned pairs fall within narrow cones and poorly aligned pairs within wider cones. This mechanism is validated by improved SRCC (Spearman rank) and PLCC (Pearson linear) correlations with human ratings on multiple text-to-image alignment benchmarks (Chen et al., 8 Jan 2026).
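
The sample-adaptive blending idea can be sketched as follows (the regressor here is a toy sigmoid stand-in and the feature set is assumed; HyperAlign's learned network is not reproduced):

```python
import numpy as np

def toy_regressor(feats):
    """Stand-in for the learned modulation network: a sigmoid of a
    linear combination of hyperbolic geometric features."""
    z = feats["distance"] - feats["cone_aperture"]
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_alignment_score(cos_sim, feats, regressor=toy_regressor):
    """Blend Euclidean cosine similarity with a hyperbolic entailment
    term, weighted per-sample by the predicted modulation weight."""
    w = regressor(feats)                              # weight in (0, 1)
    hyp_term = 1.0 - feats["exterior_angle"] / np.pi  # 1 = deep inside the cone
    return w * cos_sim + (1.0 - w) * hyp_term

feats = {"distance": 1.2, "exterior_angle": 0.4, "cone_aperture": 0.8}
score = adaptive_alignment_score(0.7, feats)
print(score)  # scalar score for this sample
```

The key property is that the exterior angle enters the score: holding everything else fixed, a pair that drifts out of the entailment cone (larger angle) is scored lower.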

6. Generative Modeling, Sampling, and Diversity Control

Hyperbolic latent spaces have been leveraged for few-shot and controllable image generation. The HypDAE diffusion autoencoder (Li et al., 2024) combines CLIP-based encoders, hyperbolic mappers (in the Poincaré ball), and Stable Diffusion decoders to map images and text into a shared hierarchical code. Geodesic distance from the origin controls semantic specificity: small radii yield more abstract content, larger radii encode fine details.

Sampling in the hyperbolic ball allows for explicit trade-off between diversity and quality: the radius of sampled codes around a reference point controls semantic variability among generated images. Geodesic interpolation enables smooth, artifact-free traversals between concepts, and hyperbolic delta-mapping facilitates text-driven edits and fusion operations. This approach achieves state-of-the-art few-shot generation quality, interpretability, and diversity control (Li et al., 2024).
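
Geodesic interpolation in the Poincaré ball can be sketched with the standard Möbius operations (curvature -1; an illustrative NumPy sketch of the textbook formulas, not HypDAE's implementation):

```python
import numpy as np

def mobius_add(x, y):
    """Mobius addition in the Poincare ball (curvature -1)."""
    xy, x2, y2 = x @ y, x @ x, y @ y
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    return num / (1 + 2 * xy + x2 * y2)

def expmap(x, v):
    """Exponential map at x: follow tangent vector v onto the ball."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    lam = 2.0 / (1.0 - x @ x)  # conformal factor at x
    return mobius_add(x, np.tanh(lam * nv / 2.0) * v / nv)

def logmap(x, y):
    """Logarithmic map at x: tangent vector pointing toward y."""
    u = mobius_add(-x, y)
    nu = np.linalg.norm(u)
    if nu < 1e-12:
        return np.zeros_like(x)
    lam = 2.0 / (1.0 - x @ x)
    return (2.0 / lam) * np.arctanh(nu) * u / nu

def geodesic(x, y, t):
    """Point a fraction t along the geodesic from x to y."""
    return expmap(x, t * logmap(x, y))

a = np.array([0.1, 0.2])
b = np.array([-0.4, 0.3])
mid = geodesic(a, b, 0.5)  # stays strictly inside the unit ball
```

Interpolating in `t` traces the geodesic between two codes, and scaling the tangent vector's norm before the exponential map is one way to realize the radius-based diversity control described above.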

7. Applications, Generalization, and Impact

Hyperbolic image-text representations have demonstrated benefits across vision-language retrieval, zero-shot classification, medical image-report linking, neuroimaging meta-analysis, and generative modeling. Notable empirical findings include:

  • MERU and HyCoCLIP consistently outperform CLIP and other Euclidean methods on hierarchical tasks, with significant gains in zero-shot and compositional generalization (Desai et al., 2023, Pal et al., 2024).
  • HYDEN’s density encapsulation leads to improved zero-shot AUC and F1 on tasks such as RSNA Pneumonia, SIIM-ACR Pneumothorax, and ChestXray14 retrieval benchmarks, and ablation shows that both the use of Rényi divergence and the encapsulation loss are necessary for top performance (Qiao et al., 2024).
  • MNM leverages Lorentz hyperbolic embeddings for joint brain-text meta-analysis, successfully aligning textual neuroimaging descriptions with brain activation maps and preserving structural neuroscience hierarchies, outperforming Euclidean contrastive baselines in retrieval and activation prediction (Baek et al., 26 Nov 2025).
  • HyperAlign provides improved, sample-adaptive interpretability and robustness in alignment assessment over prior approaches (Chen et al., 8 Jan 2026).

These approaches suggest broad utility for any domain in which multimodal data exhibits latent hierarchy and uncertainty: product images and descriptions, structured medical archives, knowledge graphs, and compositional scene understanding. When semantic ambiguity or hierarchy is central, the negative curvature and partial-order modeling capabilities of hyperbolic spaces provide structural advantages that are inaccessible to flat Euclidean embeddings.
