Hyperbolic Timbre Embedding

Updated 22 June 2026

Hyperbolic timbre embedding is a method that uses constant negative curvature to represent hierarchical, tree-structured audio timbre data.
It leverages models like the Poincaré ball and Lorentz hyperboloid to generate compact, order-sensitive embeddings for tasks such as instrument classification and effect chain recognition.
The approach improves performance in audio applications by preserving semantic hierarchies and enabling efficient fusion of heterogeneous acoustic representations.

Hyperbolic timbre embedding refers to the use of hyperbolic geometry (spaces of constant negative curvature) to represent and manipulate audio timbre, particularly where the underlying acoustic or semantic relationships are hierarchical, tree-structured, or exponentially growing. Traditional Euclidean embeddings exhibit only polynomial volume growth, which limits their ability to capture the hierarchical relationships prevalent in musical timbre, instrument taxonomy, audio effect chains, and the fusion of heterogeneous audio representations. Hyperbolic methods use models such as the Poincaré ball or the Lorentz hyperboloid, enabling more compact, order-sensitive, and hierarchy-preserving embeddings for timbre-centric audio tasks.

1. Mathematical Foundations

Hyperbolic space expands exponentially with radius, in contrast to the polynomial volume growth of Euclidean space. This property is exploited through models such as the Poincaré ball and the Lorentz (hyperboloid) model, parameterized by curvature $K<0$ (hyperboloid) or $-c<0$ (Poincaré ball). The $n$-dimensional Poincaré ball is defined as $B^n_c = \{ x \in \mathbb{R}^n \mid c\|x\|^2 < 1 \}$, with metric $g_c(x) = (\lambda_x^c)^2 I_n$ and conformal factor $\lambda_x^c = 2/(1 - c\|x\|^2)$ (Wada et al., 28 Jul 2025, Phukan et al., 3 Jun 2025). Central operations include:

Exponential Map: Maps a tangent vector $v\in\mathbb{R}^n$ at the origin onto the manifold,

$\exp_0^c(v) = \frac{\tanh(\sqrt{c}\|v\|)}{\sqrt{c}\|v\|} v , \quad v \neq 0$

Möbius Addition: A non-commutative "vector addition" in the Poincaré ball,

$x \oplus_c y = \frac{ (1 + 2c\langle x, y \rangle + c\|y\|^2) x + (1 - c\|x\|^2) y }{ 1 + 2c\langle x, y \rangle + c^2 \|x\|^2 \|y\|^2 }$

Distance: The geodesic distance,

$d_c(x, y) = \frac{2}{\sqrt{c}} \tanh^{-1} ( \sqrt{c}\| -x \oplus_c y \| )$

As $c \to 0$, these converge to their Euclidean analogues (Wada et al., 28 Jul 2025, Phukan et al., 3 Jun 2025).
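The three operations above can be sketched in a few lines of plain Python. This is a minimal illustration using only the standard library, not code from the cited papers; the function names and the sample vectors are assumptions chosen for the example. It also checks two properties numerically: Möbius addition is not commutative, and for very small $c$ the distance approaches the Euclidean distance scaled by 2 (the factor inherited from the conformal factor $\lambda_x^c$).

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def expmap0(v, c):
    """Exponential map at the origin: tangent vector -> point in the ball."""
    n = norm(v)
    if n == 0.0:
        return [0.0] * len(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * a for a in v]

def mobius_add(x, y, c):
    """Mobius addition of x and y; not commutative in general."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def dist(x, y, c):
    """Geodesic distance d_c(x, y) on the Poincare ball."""
    diff = mobius_add([-a for a in x], y, c)
    return 2.0 / math.sqrt(c) * math.atanh(math.sqrt(c) * norm(diff))

c = 1.0
x = expmap0([0.3, 0.1], c)    # hypothetical tangent vectors
y = expmap0([-0.2, 0.5], c)
xy = mobius_add(x, y, c)
yx = mobius_add(y, x, c)
print("x (+) y =", xy)
print("y (+) x =", yx)
print("d_c(x, y) =", dist(x, y, c))

# Small-curvature check on two fixed points p and q.
p, q = [0.3, 0.1], [-0.2, 0.5]
euclid = norm([a - b for a, b in zip(p, q)])
print(dist(p, q, 1e-8), 2.0 * euclid)
```

The two Möbius sums have the same norm but point in different directions, which is the order sensitivity that later sections rely on.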

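The Lorentz model defines the $n$-dimensional hyperboloid as

$\mathbb{H}^n_K = \{ x \in \mathbb{R}^{n+1} \mid \langle x, x \rangle_{\mathcal{L}} = 1/K,\ x_0 > 0 \}$

with the Lorentzian inner product $\langle x, y \rangle_{\mathcal{L}} = -x_0 y_0 + \sum_{i=1}^{n} x_i y_i$ (Nakashima et al., 2022).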
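A short sketch may help make the hyperboloid concrete. It assumes unit curvature ($K=-1$) and the standard Lorentzian inner product $-x_0 y_0 + \sum_i x_i y_i$; the helper names and the sample vector are hypothetical. It lifts a tangent vector at the origin $(1, 0, \dots, 0)$ onto the hyperboloid, verifies the defining constraint, and projects the point to the Poincaré ball with $c=1$.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def expmap0_lorentz(v):
    """Lift v in R^n (tangent at the origin) onto the unit hyperboloid in R^(n+1)."""
    n = math.sqrt(sum(a * a for a in v))
    if n == 0.0:
        return [1.0] + [0.0] * len(v)
    return [math.cosh(n)] + [math.sinh(n) * a / n for a in v]

def lorentz_dist(x, y):
    """Geodesic distance; the clamp guards against round-off below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

def to_poincare(x):
    """Stereographic projection from the hyperboloid to the Poincare ball (c = 1)."""
    return [a / (1.0 + x[0]) for a in x[1:]]

v = [0.6, -0.8]                 # hypothetical tangent vector with Euclidean norm 1.0
p = expmap0_lorentz(v)
origin = [1.0, 0.0, 0.0]
ball_point = to_poincare(p)

print(lorentz_inner(p, p))      # approximately -1: p lies on the hyperboloid
print(lorentz_dist(origin, p))  # approximately the Euclidean norm of v
print(ball_point)
```

The projected point has Euclidean norm $\tanh(\|v\|/2)$, so the two models describe the same geometry in different coordinates.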

2. Rationale for Hyperbolic Timbre Embedding

Hyperbolic space is uniquely suited for embedding tree-like, hierarchical, and order-dependent timbral data for several reasons:

Hierarchical Instrument Taxonomies: The Hornbostel–Sachs instrument classification and similar schemes are tree-structured, with exponentially branching layers (Nakashima et al., 2022).
Order-Sensitive Effect Chains: Audio effect (AFX) chains can be modeled as trees with nodes as effects and edges encoding order; the number of unique permutations grows exponentially with chain depth (Wada et al., 28 Jul 2025).
Timbre Representations: Fine-grained timbral distinctions and their semantic categories are often embedded in hierarchical feature taxonomies (Phukan et al., 3 Jun 2025).

Hyperbolic space's exponential volume growth enables the faithful embedding of such structures with reduced distortion compared to Euclidean space, leading to improved downstream task performance and more meaningful geometric relationships among audio representations (Nakashima et al., 2022, Wada et al., 28 Jul 2025).
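The volume-growth argument can be illustrated numerically. The sketch below is a toy calculation, not an experiment from the cited papers: it places the leaves of a regular tree on a circle whose radius grows by $\log b$ per level (an illustrative spacing, with branching factor $b$) and compares the circumference available per leaf in the Euclidean plane, $2\pi r$, with that in the hyperbolic plane of curvature $-1$, $2\pi \sinh r$.

```python
import math

def room_per_leaf(depth, branching=2):
    """Circumference available per leaf of a regular tree placed on a circle.

    The radius grows by log(branching) per tree level; this spacing is an
    illustrative choice, not taken from the cited papers.
    """
    r = depth * math.log(branching)
    leaves = branching ** depth
    euclidean = 2.0 * math.pi * r / leaves
    hyperbolic = 2.0 * math.pi * math.sinh(r) / leaves  # curvature -1
    return euclidean, hyperbolic

for depth in (1, 2, 4, 8, 10):
    e, h = room_per_leaf(depth)
    print(f"depth={depth:2d}  Euclidean={e:.4f}  hyperbolic={h:.4f}")
```

In the Euclidean case the room per leaf shrinks toward zero as the tree deepens, while in the hyperbolic case it levels off near $\pi$, which is why deep hierarchies can be embedded with less crowding.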

3. Principal Architectures and Embedding Pipelines

3.1 Effect Chain Recognition (Poincaré Ball)

In order-aware AFX chain classification, the pipeline is as follows (Wada et al., 28 Jul 2025):

Feature Extraction: Raw audio waveform is encoded via a large-scale pretrained model (MERT) into a 1024-dimensional Euclidean embedding.
Projection: Sequential fully connected blocks reduce the dimensionality and project to $\mathbb{R}^J$.
Hyperbolic Mapping: The vector is mapped into the Poincaré ball via the exponential map at the origin.
Classification (Hyperbolic Multinomial Logistic Regression, MLR): Each class (corresponding to an effect chain) is parameterized by a support point in the Poincaré ball and a tangent-space normal vector. Chain probabilities are assigned based on decision boundaries derived from hyperbolic geometry.
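The last two steps of this pipeline can be sketched as follows. The logit formula is the one commonly used for hyperbolic MLR in the hyperbolic neural networks literature (Ganea et al.); whether the cited system uses exactly this form is an assumption. The symbols `supports` and `normals`, the three-dimensional feature vector, and the three classes are hypothetical stand-ins for the projected encoder output and the learned class parameters.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def expmap0(v, c):
    n = norm(v)
    if n == 0.0:
        return [0.0] * len(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * a for a in v]

def mobius_add(x, y, c):
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def hyperbolic_mlr_logits(x, supports, normals, c):
    """One logit per class: signed, scaled distance of x to the class hyperplane."""
    sqrt_c = math.sqrt(c)
    logits = []
    for p, a in zip(supports, normals):
        z = mobius_add([-v for v in p], x, c)     # x relative to the support point
        lam_p = 2.0 / (1.0 - c * dot(p, p))       # conformal factor at p
        a_norm = norm(a)
        arg = 2.0 * sqrt_c * dot(z, a) / ((1.0 - c * dot(z, z)) * a_norm)
        logits.append(lam_p * a_norm / sqrt_c * math.asinh(arg))
    return logits

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

c = 1.0
feature = [0.4, -0.1, 0.25]   # hypothetical projected encoder output
x = expmap0(feature, c)       # hyperbolic mapping step
supports = [expmap0([0.5, 0.0, 0.0], c),
            expmap0([0.0, 0.5, 0.0], c),
            [0.0, 0.0, 0.0]]
normals = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
logits = hyperbolic_mlr_logits(x, supports, normals, c)
probs = softmax(logits)
print(logits)
print(probs)
```

A point that coincides with a class's support point lies on that class's decision boundary and receives a logit of zero; in training, the probabilities would feed a standard cross-entropy loss.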

3.2 Instrument Sound Synthesis (Hyperboloid VAE)

The variational autoencoder (VAE) for musical instrument sound synthesis employs (Nakashima et al., 2022):

Input: Mel-spectrogram (256 × 43) and pitch label.
Encoder: A CNN stack maps the input to a DNN embedding; pitch and timbre components are separated, with the timbre component embedded in the tangent space and mapped onto the hyperboloid via the exponential map.
Latent Codes: The latent representation is disentangled into a Euclidean pitch code and a hyperbolic timbre code.
Decoder: Transposed CNN reconstructs the mel-spectrogram from concatenated codes.
Latent Sampling: Timbre latent follows a pseudo-hyperbolic Gaussian for hierarchy-inducing structure.
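The latent sampling step can be sketched with the usual construction of a pseudo-hyperbolic (wrapped) Gaussian: draw a Euclidean Gaussian in the tangent space at the origin, parallel-transport it to the mean, and apply the exponential map there. The code below assumes unit curvature, an isotropic Gaussian, and a two-dimensional timbre latent; the function names and the chosen mean are hypothetical, and the cited VAE may differ in details.

```python
import math
import random

def lorentz_inner(x, y):
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def expmap(mu, u):
    """Exponential map at mu on the unit hyperboloid for a tangent vector u."""
    n = math.sqrt(max(lorentz_inner(u, u), 0.0))
    if n == 0.0:
        return list(mu)
    return [math.cosh(n) * m + math.sinh(n) * a / n for m, a in zip(mu, u)]

def transport_from_origin(v, mu):
    """Parallel transport of v from the origin (1, 0, ..., 0) to mu."""
    origin = [1.0] + [0.0] * (len(mu) - 1)
    alpha = -lorentz_inner(origin, mu)
    shifted = [m - alpha * o for m, o in zip(mu, origin)]
    coef = lorentz_inner(shifted, v) / (alpha + 1.0)
    return [a + coef * (o + m) for a, o, m in zip(v, origin, mu)]

def sample_wrapped_normal(mu, sigma, rng):
    """Draw one sample centred at mu with tangent-space scale sigma."""
    v = [0.0] + [rng.gauss(0.0, sigma) for _ in range(len(mu) - 1)]
    u = transport_from_origin(v, mu)
    return expmap(mu, u)

rng = random.Random(0)                    # fixed seed for repeatability
origin = [1.0, 0.0, 0.0]
mu = expmap(origin, [0.0, 0.8, -0.3])     # hypothetical timbre mean
samples = [sample_wrapped_normal(mu, 0.3, rng) for _ in range(5)]
for z in samples:
    print(z, lorentz_inner(z, z))         # second value stays close to -1
```

Because every step is a differentiable map of a Gaussian draw, this construction is compatible with the reparameterization trick used in VAE training.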

3.3 Speech Representation Fusion (HYFuse)

In speech emotion recognition, the HYFuse system fuses timbral encodings from compression-based representations (CBR, e.g., Soundstream) with semantic encodings from representation-learning-based representations (RLR, e.g., x-vector/WavLM) as follows (Phukan et al., 3 Jun 2025):

Encoding: CBR and RLR features are individually mapped to vectors in $\mathbb{R}^n$ by Conv1D backbones.
Hyperbolic Embedding: The exponential map at the origin (Section 1) places each encoding in a shared Poincaré ball.
Fusion: Möbius addition combines the embeddings, preserving their geometric properties and order sensitivity.
Euclidean Projection: The Möbius sum is mapped back to the Euclidean tangent space via the logarithmic map for final classification.
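The fusion steps above can be sketched as a single function. This is an illustrative reading of the pipeline, not the authors' implementation: it assumes the logarithmic map is taken at the origin, uses $c=1$, and treats the two short vectors as hypothetical stand-ins for the Conv1D outputs.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def expmap0(v, c):
    n = norm(v)
    if n == 0.0:
        return [0.0] * len(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * a for a in v]

def logmap0(y, c):
    """Inverse of expmap0: point in the ball -> tangent vector at the origin."""
    n = norm(y)
    if n == 0.0:
        return [0.0] * len(y)
    scale = math.atanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * a for a in y]

def mobius_add(x, y, c):
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def hyperbolic_fuse(first, second, c=1.0):
    """Embed both encodings in the ball, Mobius-add them, return to tangent space."""
    fused = mobius_add(expmap0(first, c), expmap0(second, c), c)
    return logmap0(fused, c)

timbral = [0.2, -0.4, 0.1]     # hypothetical CBR encoding
semantic = [0.5, 0.3, -0.2]    # hypothetical RLR encoding
fused = hyperbolic_fuse(timbral, semantic)
swapped = hyperbolic_fuse(semantic, timbral)
print(fused)
print(swapped)
```

The fused vector changes when the two inputs are swapped, and fusing with an all-zero encoding returns the other encoding unchanged; the Euclidean output can then be passed to an ordinary classifier head.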

4. Empirical Evidence and Quantitative Performance

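The use of hyperbolic geometry for timbre embedding yields gains over Euclidean baselines in all three reported settings, ranging from under one point of macro F₁ for AFX chain recognition (0.756 vs. 0.748) to about 20 percentage points on instrument classification (59.9% vs. 39.5%):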

Task / System	Geometry	Macro F₁ / Accuracy	Hierarchy/Order Impact	Source
AFX Chain Recognition (J=128, c=1.0)	Hyperbolic	0.756 / 0.758	Best for order-sensitive metrics	(Wada et al., 28 Jul 2025)
AFX Chain Recognition (best Euclidean)	Euclidean	0.748 / 0.750	Uniformly outperformed by hyperbolic	(Wada et al., 28 Jul 2025)
MISS - Instrument Class (Dₜ=2, R=100)	Hyperbolic VAE	59.9%	Higher hierarchical separability S	(Nakashima et al., 2022)
MISS - Instrument Class (Dₜ=2, Euclid)	Euclidean VAE	39.5%	Lower hierarchical separability S	(Nakashima et al., 2022)
SER, CREMA-D (Wav2vec2+Soundstream fusion) HYFuse	Hyperbolic	79.29%	Superior class separation (t-SNE)	(Phukan et al., 3 Jun 2025)
SER, CREMA-D (Wav2vec2+Soundstream fusion) Euclidean concat	Euclidean	76.15%	Lower inter-class cluster separation	(Phukan et al., 3 Jun 2025)

Order-preserving and hierarchy-inducing properties of hyperbolic metrics resulted in improved performance for chain order recognition, instrument family clustering, and fusion of heterogeneous timbral features, with substantial improvements at low dimensionality or fine-grained settings (Wada et al., 28 Jul 2025, Nakashima et al., 2022, Phukan et al., 3 Jun 2025).

5. Representation of Hierarchy and Order

The ability to embed tree-structured or order-dependent data efficiently is central to hyperbolic timbre embedding:

AFX Chains: The non-commutativity of Möbius addition ($x \oplus_c y \neq y \oplus_c x$ in general) allows embeddings to preserve effect order, matching the non-commutative nature of effect chain composition, where order affects the resulting timbre (Wada et al., 28 Jul 2025).
Instrument Families: Placing a prototype for each instrument on the hyperboloid clusters related instruments hierarchically according to the taxonomy; hyperbolic distances between families are larger than those within a family, reflecting tree separation (Nakashima et al., 2022).
Fusion: Hyperbolic fusion strategies (e.g., HYFuse) align disparate feature sets (CBR and RLR) in a shared space where both fine-grained timbral and higher-level semantic hierarchies are preserved (Phukan et al., 3 Jun 2025).

Empirically, hierarchical separability scores and cluster visualizations indicate higher correspondence to semantic structure in Poincaré or Lorentz spaces than in Euclidean analogues.
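One way to see why hyperbolic distances reflect tree separation is to compare the direct distance between two points with the length of the path through their common "ancestor" at the origin. The sketch below uses synthetic points in the two-dimensional Poincaré ball with $c=1$ (no data from the cited papers); the helper names are hypothetical. In a tree the two lengths are equal, so a ratio close to 1 indicates tree-like behaviour.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def mobius_add(x, y):
    """Mobius addition in the Poincare ball with c = 1."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1.0 + 2.0 * xy + y2
    coef_y = 1.0 - x2
    denom = 1.0 + 2.0 * xy + x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def dist(x, y):
    return 2.0 * math.atanh(norm(mobius_add([-a for a in x], y)))

def point_at(radius, angle):
    """2-D point at hyperbolic distance `radius` from the origin."""
    rho = math.tanh(radius / 2.0)
    return [rho * math.cos(angle), rho * math.sin(angle)]

def detour_ratio(radius, angle=math.pi / 2):
    """d(x, y) divided by the length of the path x -> origin -> y."""
    x = point_at(radius, 0.0)
    y = point_at(radius, angle)
    return dist(x, y) / (2.0 * radius)

euclidean_ratio = math.sqrt(2.0) / 2.0   # same construction in the flat plane
for r in (1.0, 3.0, 6.0):
    print(r, detour_ratio(r), euclidean_ratio)
```

The Euclidean ratio is fixed at about 0.71 regardless of radius, whereas the hyperbolic ratio climbs toward 1 as the points move outward, so two points in different branches are effectively as far apart as the route through their shared root.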

6. Training Objectives and Practical Instantiations

Across systems, optimization is based on cross-entropy or ELBO objectives that leverage hyperbolic geometry:

AFX Chain Recognition: Hyperbolic MLR employs geodesic decision boundaries. No margin or geodesic loss terms are required; the cross-entropy targets chain labels directly (Wada et al., 28 Jul 2025).
MISS VAE: The negative ELBO combines reconstruction loss, priors over hyperbolic and Euclidean latent codes, and discriminative cross-entropy for instrument and pitch prediction (Nakashima et al., 2022).
HYFuse: A standard cross-entropy is applied after mapping fused hyperbolic codes back to the Euclidean tangent space. All parameters, including, optionally, the curvature $c$, are end-to-end trainable (Phukan et al., 3 Jun 2025).

Optimization is generally performed with Riemannian Adam variants (e.g., as provided by geoopt). Models typically use batch or layer normalization, and the curvature is either tuned as a hyperparameter or learned.
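To show what a Riemannian update involves, the sketch below implements plain Riemannian SGD on the Poincaré ball. It is deliberately simpler than Riemannian Adam and does not use geoopt: the Euclidean gradient is estimated by finite differences, rescaled by the inverse metric $1/(\lambda_x^c)^2$, and applied through the exponential map at the current point. The toy loss (squared distance to a fixed target) and all names are assumptions made for illustration.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def mobius_add(x, y, c):
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def dist(x, y, c):
    diff = mobius_add([-a for a in x], y, c)
    return 2.0 / math.sqrt(c) * math.atanh(math.sqrt(c) * norm(diff))

def expmap(x, v, c):
    """Exponential map at x: move from x along the tangent vector v."""
    n = norm(v)
    if n == 0.0:
        return list(x)
    lam = 2.0 / (1.0 - c * dot(x, x))
    scale = math.tanh(math.sqrt(c) * lam * n / 2.0) / (math.sqrt(c) * n)
    return mobius_add(x, [scale * a for a in v], c)

def numerical_grad(f, x, h=1e-6):
    """Central finite-difference estimate of the Euclidean gradient."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

def rsgd_step(f, x, lr, c):
    lam = 2.0 / (1.0 - c * dot(x, x))
    egrad = numerical_grad(f, x)
    rgrad = [g / lam ** 2 for g in egrad]      # rescale by the inverse metric
    return expmap(x, [-lr * g for g in rgrad], c)

c = 1.0
target = [0.6, 0.5]                            # hypothetical target embedding

def loss(p):
    return dist(p, target, c) ** 2

x = [0.1, -0.2]
history = [loss(x)]
for _ in range(40):
    x = rsgd_step(loss, x, lr=0.1, c=c)
    history.append(loss(x))
print(history[0], history[-1], x)
```

Updating through the exponential map keeps every iterate inside the ball, which a naive Euclidean step does not guarantee near the boundary; adaptive optimizers such as Riemannian Adam build on the same two ingredients.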

7. Limitations and Future Directions

Reported challenges include numerical instability in the mapping operations at very high curvature (small $R$) or low latent dimensionality, and the fixed-curvature assumption of standard models. Proposed extensions include learning mixed-curvature or product spaces for more adaptable hierarchies, extending to raw-waveform VAEs for end-to-end synthesis, and incorporating deeper or finer-grained taxonomies within instrument or effect families (Nakashima et al., 2022). A plausible implication is that, as datasets grow in diversity and depth, adaptive or more complex hyperbolic geometries may yield further gains in modeling timbre.

Cited works: (Wada et al., 28 Jul 2025, Phukan et al., 3 Jun 2025, Nakashima et al., 2022).