Siegel Neural Networks
- Siegel neural networks are discriminative architectures defined on Siegel spaces, which generalize SPD matrices and complex-hyperbolic geometry.
- They employ novel formulations for multiclass logistic regression and fully-connected layers with Riemannian optimization, achieving state-of-the-art performance on radar clutter and node classification tasks.
- The design leverages closed-form layer constructions and group symmetries, but faces challenges in parameter efficiency and computational overhead.
Siegel neural networks are a class of discriminative architectures defined over Siegel spaces: Riemannian symmetric spaces (RSS) generalizing both symmetric positive definite (SPD) matrices and complex-hyperbolic geometry. By leveraging the quotient structure and symmetries of Siegel upper half-spaces, these networks enable learning and classification with data that naturally reside on disconnected or highly curved geometric domains. Siegel neural networks introduce new formulations for multiclass logistic regression (MLR) and fully-connected (FC) layers, allowing end-to-end training with Riemannian optimization tools. The approach yields state-of-the-art performance on radar clutter classification and node classification tasks.
1. Geometric Foundation: The Siegel Upper Half-Space
The Siegel upper half-space of complex dimension $m$ is defined as
$\mathbb{SH}_m = \left\{x = u + iv \,\,\Big|\,\, u \in \Sym_m, \, v \in \Sym_m^+ \right\},$
where $\Sym_m$ denotes the real symmetric matrices and $\Sym_m^+$ denotes symmetric positive definite matrices of the same size.
Siegel spaces possess a transitive isometric action by the real symplectic group
$\Sp_{2m} = \left\{\begin{pmatrix} a & b \\ c & d \end{pmatrix}: ab^T = ba^T,\, cd^T = dc^T,\, ad^T - bc^T = I_m \right\},$
through generalized Möbius transformations $\begin{pmatrix} a & b \\ c & d \end{pmatrix} \cdot x = (ax + b)(cx + d)^{-1}$. The stabilizer of the basepoint $iI_m$ is $\SpO_{2m} = \Sp_{2m} \cap O_{2m}$, making the symmetric space realization explicit: $\mathbb{SH}_m \cong \Sp_{2m} / \SpO_{2m},$ with rank $m$ and nonpositive sectional curvature.
A canonical $\Sp_{2m}$-invariant metric on $\mathbb{SH}_m$ is the Siegel metric: for $x = u + iv \in \mathbb{SH}_m$, $ds^2 = \operatorname{tr}\!\left(v^{-1}\, dx \; v^{-1}\, d\bar{x}\right)$.
On any noncompact RSS $G/K$, one defines a vector-valued (Weyl-chamber-valued) distance as the $G$-congruence-invariant translation vector in a fixed maximal flat. This metric structure underpins the network constructions.
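To make these definitions concrete, here is a minimal NumPy sketch (with illustrative helper names, not from any released implementation) that samples a point of $\mathbb{SH}_m$, checks the symplectic condition, and applies the generalized Möbius action:

```python
import numpy as np

def random_siegel_point(m, rng):
    """Sample x = u + i v with u symmetric and v symmetric positive definite."""
    a = rng.standard_normal((m, m))
    u = (a + a.T) / 2                      # symmetric real part
    b = rng.standard_normal((m, m))
    v = b @ b.T + m * np.eye(m)            # SPD imaginary part
    return u + 1j * v

def is_symplectic(g, tol=1e-8):
    """Check membership in Sp_2m via g^T J g = J for the standard symplectic form J."""
    m = g.shape[0] // 2
    J = np.block([[np.zeros((m, m)), np.eye(m)],
                  [-np.eye(m), np.zeros((m, m))]])
    return np.allclose(g.T @ J @ g, J, atol=tol)

def mobius(g, x):
    """Generalized Moebius action (a x + b)(c x + d)^{-1} of g on SH_m."""
    m = x.shape[0]
    a, b, c, d = g[:m, :m], g[:m, m:], g[m:, :m], g[m:, m:]
    return (a @ x + b) @ np.linalg.inv(c @ x + d)

rng = np.random.default_rng(0)
x = random_siegel_point(3, rng)
u_shift = np.real(random_siegel_point(3, rng))            # any symmetric matrix
g = np.block([[np.eye(3), u_shift], [np.zeros((3, 3)), np.eye(3)]])
assert is_symplectic(g)
assert np.allclose(mobius(g, x), x + u_shift)             # translation of the real part
```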
2. Layer Construction on $\mathbb{SH}_m$
2.1 Multiclass Logistic Regression (MLR)
In Euclidean settings, MLR relies on linear scoring of the form $p(y = k \mid x) \propto \exp\!\left(\langle a_k, x\rangle + b_k\right)$; this score can be interpreted as proportional to the exponential of the signed distance from $x$ to a class hyperplane.
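Concretely, with the hyperplane $H_{a_k,p} = \{x : \langle a_k, x - p\rangle = 0\}$ and bias $b_k = -\langle a_k, p\rangle$, the Euclidean logit factors as a signed distance (a standard identity, restated here for context):
$\langle a_k, x\rangle + b_k = \langle a_k, x - p\rangle = \operatorname{sign}\!\left(\langle a_k, x - p\rangle\right)\, \lVert a_k\rVert \, d\!\left(x, H_{a_k,p}\right),$
since $d(x, H_{a_k,p}) = |\langle a_k, x - p\rangle| / \lVert a_k\rVert$; it is this distance-based form that is generalized to $\mathbb{SH}_m$.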
For $\mathbb{SH}_m$, two MLR constructions are defined:
(i) Quotient-Structure MLR (QMLR)
A class hyperplane is parameterized by two points of $\mathbb{SH}_m$. The signed distance from an input to this hyperplane (Thm 2.1) admits a closed form expressed through the map
$\phi: u + iv \mapsto \begin{bmatrix} v^{1/2} & u v^{-1/2} \\ 0 & v^{-1/2} \end{bmatrix} \in \Sp_{2m},$
which sends a point of $\mathbb{SH}_m$ to a symplectic representative (so that $\phi(x)$ maps the basepoint $iI_m$ to $x$). Class scores are these signed distances, and class probabilities are obtained from them by a softmax.
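Continuing the NumPy sketch above (reusing `mobius`, `is_symplectic`, and the sampled point `x`), the map $\phi$ can be checked to produce a symplectic matrix that moves the basepoint $iI_m$ to $x$; again, this is an illustration rather than the authors' code:

```python
from scipy.linalg import sqrtm

def phi(x):
    """phi(u + i v) = [[v^{1/2}, u v^{-1/2}], [0, v^{-1/2}]], a symplectic representative of x."""
    u, v = np.real(x), np.imag(x)
    v_half = np.real(sqrtm(v))             # principal square root of the SPD part
    v_inv_half = np.linalg.inv(v_half)
    m = u.shape[0]
    return np.block([[v_half, u @ v_inv_half],
                     [np.zeros((m, m)), v_inv_half]])

g = phi(x)
assert is_symplectic(g)
assert np.allclose(mobius(g, 1j * np.eye(3)), x)   # phi(x) maps the basepoint i*I_m to x
```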
(ii) Vector-Valued-Distance MLR (VMLR)
Fix a direction in the Weyl chamber and a basepoint in $\mathbb{SH}_m$. The class score is built from the vector-valued distance between the input and the basepoint, paired with the chosen direction, and is controlled via the distance upper bound of Prop 2.7.
In both cases, training minimizes the standard cross-entropy loss $\mathcal{L} = -\sum_i \log p\!\left(y_i \mid x_i\right)$ over the resulting class probabilities.
2.2 Fully-Connected (FC) Layers
Two FC designs are given for $\mathbb{SH}_m$:
(i) Affine via Group Action (AFC)
Let the weights be $(a, b)$ with $a\in\Sym_m$, $b\in\Sym_m^+$; the layer acts on the input through the symplectic representative $\phi(a + ib)$, i.e. $x \mapsto \phi(a + ib)\cdot x = b^{1/2}\, x\, b^{1/2} + a$.
(ii) Dimensionality-Reducing FC (DFC)
Let $w$ lie on the Stiefel manifold of $m_1 \times m_2$ column-orthonormal matrices and $a\in \Sym_{m_2}$; the layer maps $\mathbb{SH}_{m_1}$ to the lower-dimensional $\mathbb{SH}_{m_2}$.
Pointwise nonlinearities, such as an SPD-valued ReLU acting on the imaginary part, are applied after these mappings.
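A sketch of how these layers can be realized, assuming the group-action form for AFC ($x \mapsto b^{1/2} x b^{1/2} + a$), a congruence-type map for DFC, and an eigenvalue-clamping ReLU on the SPD part; the precise formulas and names are assumptions for illustration, not the paper's reference implementation (`random_siegel_point` is reused from the earlier sketch):

```python
import numpy as np
from scipy.linalg import sqrtm

def afc(x, a, b):
    """Affine FC via the group action of phi(a + i b): x -> b^{1/2} x b^{1/2} + a."""
    b_half = np.real(sqrtm(b))
    return b_half @ x @ b_half + a

def dfc(x, w, a):
    """Dimension-reducing FC (assumed congruence form): x in SH_{m1} -> w^T x w + a in SH_{m2}."""
    return w.T @ x @ w + a

def spd_relu(x, eps=1e-4):
    """Pointwise nonlinearity: clamp the eigenvalues of the imaginary (SPD) part from below."""
    u, v = np.real(x), np.imag(x)
    eigval, eigvec = np.linalg.eigh(v)
    v_rect = eigvec @ np.diag(np.maximum(eigval, eps)) @ eigvec.T
    return u + 1j * v_rect

# Example: reduce SH_4 to SH_2, then apply the nonlinearity.
rng = np.random.default_rng(1)
w, _ = np.linalg.qr(rng.standard_normal((4, 2)))   # column-orthonormal (Stiefel) weight
x4 = random_siegel_point(4, rng)                   # from the earlier sketch
y = spd_relu(dfc(x4, w, np.zeros((2, 2))))
```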
3. Training Procedures and Riemannian Optimization
3.1 Riemannian Backpropagation
Parameters may reside in vector spaces or on manifolds:
- For $b \in \Sym_m^+$ (SPD): Project gradients onto $\Sym_m$ and update via exponential retraction:
$b \leftarrow \Exp_b\left(-\eta \nabla_b \mathcal{L}\right),\quad \Exp_b(H) = b^{1/2} \exp(b^{-1/2} H b^{-1/2}) b^{1/2}.$
- For Stiefel parameters $w$: take a gradient step in the ambient Euclidean space, then re-orthonormalize via QR (both this and the SPD retraction are sketched after this list).
- For parameters that are points of $\mathbb{SH}_m$: compute a tangent gradient (via Jacobians of the distance) and retract back to the manifold (via the group action or a geodesic).
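A minimal NumPy sketch of the SPD and Stiefel updates just described, following the $\Exp_b$ formula above literally; the helper names are illustrative:

```python
import numpy as np
from scipy.linalg import expm, sqrtm

def spd_step(b, grad, lr):
    """b <- Exp_b(-lr * sym(grad)) with Exp_b(H) = b^{1/2} expm(b^{-1/2} H b^{-1/2}) b^{1/2}."""
    H = -lr * (grad + grad.T) / 2                  # project the gradient onto Sym_m
    b_half = np.real(sqrtm(b))
    b_inv_half = np.linalg.inv(b_half)
    return b_half @ expm(b_inv_half @ H @ b_inv_half) @ b_half

def stiefel_step(w, grad, lr):
    """Euclidean gradient step followed by QR re-orthonormalization (a first-order retraction)."""
    q, r = np.linalg.qr(w - lr * grad)
    return q * np.sign(np.diag(r))                 # fix the sign ambiguity of QR
```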
Standard Riemannian optimizers, such as Riemannian SGD or Riemannian Adam (e.g., from the Geoopt library), can be used directly with conventional hyperparameters.
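For reference, a minimal Geoopt setup along these lines might look as follows, assuming a Geoopt version that provides `SymmetricPositiveDefinite`, `Stiefel`, `ManifoldParameter`, and `RiemannianAdam`; the parameter shapes are placeholders:

```python
import torch
import geoopt

m = 4
# Manifold-valued parameters are registered together with their manifolds.
b = geoopt.ManifoldParameter(torch.eye(m), manifold=geoopt.SymmetricPositiveDefinite())
w = geoopt.ManifoldParameter(torch.linalg.qr(torch.randn(m, 2))[0], manifold=geoopt.Stiefel())
a = torch.nn.Parameter(torch.zeros(m, m))   # unconstrained; symmetrize as (a + a.T) / 2 in the forward pass

opt = geoopt.optim.RiemannianAdam([b, w, a], lr=1e-3)
# In the training loop: loss.backward(); opt.step(); opt.zero_grad()
# The optimizer applies the appropriate retraction to each manifold parameter automatically.
```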
3.2 Regularization and Projection
No additional regularization is required beyond maintaining parameter feasibility via manifold-valued retractions. Optional penalties on tangent-space parameters, such as spectral- or Frobenius-norm penalties, can control model complexity.
4. Empirical Performance and Evaluation
4.1 Applications and Experimental Setup
Radar clutter classification: Uses simulated autoregressive (AR) Gaussian time series of order $q-1$, summarized as $(\tilde p_0, z_1, \ldots, z_{q-1}) \in \Sym_m^+ \times \mathbb{SH}_m^{q-1}$. Four datasets with parameters (3,2), (4,2), (5,2), (6,2) and varying sample sizes. Network: one FC layer (AFC or DFC) followed by QMLR. Training: Riemannian Adam, learning rate 1e-3, batch size 32, 80 epochs.
Node classification: Datasets (Glass, Iris, Zoo from UCI) are treated as small graphs. All-pairs “ground-truth” cosine distances are embedded into $\mathbb{SH}_m$ by minimizing a distance-distortion objective.
Network: AFC followed by QMLR or VMLR. Training: Riemannian Adam, learning rate 1e-3, 100 epochs.
4.2 Key Quantitative Results
Table 1. Radar clutter classification accuracy (%), mean over 10 runs
| Method | Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4 |
|---|---|---|---|---|
| kNN (Kähler dist.) | 76.22 | 93.00 | 76.75 | 73.20 |
| SPDNet [17] | 63.44 | 41.50 | 45.88 | 66.80 |
| SiegelNet–AFC–QMLR (Ours) | 80.94 | 96.50 | 91.00 | 85.60 |
Table 2. Node classification accuracy (%)
| Method | Glass | Iris | Zoo |
|---|---|---|---|
| kNN | 29.65 | 31.66 | 33.33 |
| LogEig [21] | 41.54 | 34.33 | 51.04 |
| SiegelNet–BFC–BMLR [25] | 41.12 | 37.26 | 48.12 |
| SiegelNet–AFC–QMLR (Ours) | 45.79 | 38.20 | 53.37 |
Siegel neural networks demonstrate superior performance across all datasets compared to SPD-based and kNN baselines.
5. Analysis, Limitations, and Prospects
5.1 Advantages
- Expressivity: Siegel spaces naturally generalize SPD and complex-hyperbolic settings, enabling the representation of intricate correlations and dependencies.
- Closed-form FC layers: The symplectic group action allows explicit formulae for affine mappings within the space.
- Empirical results: State-of-the-art accuracy on radar signal and node classification benchmarks.
5.2 Limitations
- Parameter efficiency: QMLR structure requires two points per class, effectively doubling the parameter count relative to Euclidean and SPD analogues.
- Computational overhead: Riemannian distance calculations involve eigen-decompositions and matrix logarithms. Retractions and Cayley transforms further increase computation.
- Curvature restriction: Only nonpositive curvature is supported; thus, structures with intrinsic positive curvature are not accommodated.
- Architectural scope: Convolutional, batch-normalization, pooling, and attention layers on $\mathbb{SH}_m$ have not been developed.
5.3 Potential Extensions
- Compact MLR: Design of more parameter-efficient Siegel hyperplane representations.
- Convolutional layers: Definition of local Siegel-valued filters via horospheres or $\Sp_{2m}$-equivariant constructions.
- Horospherical nonlinearities: Proposals to mimic ReLU via projections onto convex Weyl chambers.
- Generative models: Development of Riemannian normalizing flows on $\mathbb{SH}_m$ for generative modeling.
- Hybrid manifolds: Integration of Siegel spaces with other curvature components in product manifold networks.
Siegel neural networks formalize geometric deep learning within a rich family of symmetric spaces, providing theoretical and practical advances for data with complex intrinsic geometry.