
SE(3)-Equivariant Graph Embeddings

Updated 2 December 2025
  • SE(3)-equivariant graph embeddings are structured representations that predictably transform under 3D rotations and translations, preserving geometric integrity.
  • Architectural implementations include equivariant message passing, grid-structured latent codes, and implicit neural field models, enabling applications in molecular modeling, protein structures, and volumetric reconstructions.
  • Advanced methods such as vector-quantized VAEs and tensor-normal embeddings achieve efficient compression and high fidelity in representing complex 3D structures while retaining symmetry properties.

SE(3)-equivariant graph embeddings are representations of structured data, particularly graphs and geometric objects, that are equivariant under the special Euclidean group in three dimensions, SE(3). This group encompasses all rigid-body transformations of $\mathbb{R}^3$, combining 3D rotations and translations. SE(3)-equivariant architectures and methods design neural features or latent codes such that, under any SE(3) transformation of the input, the output transforms in a mathematically prescribed way (equivariance), rather than remaining fixed (invariance) or changing in an unstructured way. These methods are central for modeling natural data with geometric and relational structure, including molecules, proteins, medical volumes, and three-dimensional scenes, where physical symmetries must be preserved by design.

1. Mathematical Formulation of SE(3)-Equivariant Embeddings

Given a group element $g \in \mathrm{SE}(3)$ acting on a structured input $X$ (such as a graph or grid in $\mathbb{R}^3$), a mapping $f$ is SE(3)-equivariant if

$$f(g \cdot X) = \rho(g)\, f(X)$$

where $\rho(g)$ is a prescribed group representation on the embedding space. For 3D data, $g$ consists of a rotation $R \in \mathrm{SO}(3)$ and a translation $t \in \mathbb{R}^3$, and the representation $\rho(g)$ defines how features transform, e.g., as scalars (invariant), vectors, or higher-order tensors.
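As a concrete illustration, the centroid of a 3D point cloud is an SE(3)-equivariant map with $\rho(g) = g$ itself: rotating and translating the input rotates and translates the output. The following NumPy check is illustrative only and is not tied to any architecture cited here:

```python
import numpy as np

def centroid(X):
    """f(X): mean of the 3D points; SE(3)-equivariant with rho(g) = g."""
    return X.mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))            # toy point cloud (e.g., atom coordinates)

# A random rigid transform g = (R, t): orthogonalize, then force det(R) = +1.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))       # proper rotation in SO(3)
t = rng.normal(size=3)

lhs = centroid(X @ R.T + t)             # f(g . X)
rhs = centroid(X) @ R.T + t             # rho(g) f(X)
assert np.allclose(lhs, rhs), "equivariance violated"
```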

SE(3)-equivariant embeddings can be realized via several architectural choices:

  • Equivariant message-passing in graph neural networks, where each node or edge carries features that transform as irreducible representations under SE(3) (a minimal sketch follows this list).
  • Grid-based latent codes with explicit spatial structure, equipped with parameterizations (e.g., matrix-variate normals or tensor normals) such that spatial correlations and equivariance are preserved (Wang et al., 2017).
  • Implicit field representations (e.g., neural radiance fields) conditioned on latent codes exhibiting equivariant properties (Kosiorek et al., 2021).
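For the first bullet, one widely used realization is an EGNN-style layer: messages depend only on invariant quantities (scalar features and squared pairwise distances), and coordinates are updated along relative position vectors scaled by invariant weights. The NumPy sketch below is illustrative only and not drawn from the cited works; the small random-weight MLPs stand in for learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)
F, H, M = 4, 16, 8                       # node-feature, hidden, and message dimensions

def init_mlp(d_in, d_out):
    """Placeholder two-layer MLP weights (stand-ins for learned parameters)."""
    return (0.1 * rng.normal(size=(d_in, H)), np.zeros(H),
            0.1 * rng.normal(size=(H, d_out)), np.zeros(d_out))

def mlp(x, W1, b1, W2, b2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

params = {"edge": init_mlp(2 * F + 1, M),   # phi_e: messages from invariant inputs
          "coord": init_mlp(M, 1),          # phi_x: scalar weight for coordinate updates
          "node": init_mlp(F + M, F)}       # phi_h: node-feature update

def egnn_layer(h, x, params):
    """One EGNN-style layer on a fully connected graph.

    h: (N, F) invariant node features; x: (N, 3) coordinates.
    h stays invariant and x transforms equivariantly under rigid motions, because
    messages use only ||x_i - x_j||^2 and updates move x_i along (x_i - x_j).
    """
    N = x.shape[0]
    h_new, x_new = np.empty_like(h), np.empty_like(x)
    for i in range(N):
        msgs, shift = [], np.zeros(3)
        for j in range(N):
            if i == j:
                continue
            d2 = np.sum((x[i] - x[j]) ** 2)                        # SE(3)-invariant edge input
            m_ij = mlp(np.concatenate([h[i], h[j], [d2]]), *params["edge"])
            msgs.append(m_ij)
            shift += (x[i] - x[j]) * mlp(m_ij, *params["coord"])   # direction times invariant scalar
        m_i = np.sum(msgs, axis=0)
        h_new[i] = mlp(np.concatenate([h[i], m_i]), *params["node"])
        x_new[i] = x[i] + shift / (N - 1)
    return h_new, x_new

# Rotating and translating the input coordinates rotates and translates the output
# coordinates and leaves the scalar features unchanged.
h, x = rng.normal(size=(5, F)), rng.normal(size=(5, 3))
h_out, x_out = egnn_layer(h, x, params)
```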

2. Grid-Structured and Tensor Normal Embeddings

Classical VAEs typically use vector latent codes, losing explicit spatial structure. To address this, spatial VAEs replace $1 \times 1$ Gaussian codes with $m \times n$ matrices sampled from matrix-variate normal (MVN) distributions (Wang et al., 2017). The MVN density for $Z \in \mathbb{R}^{m \times n}$ is:

$$p(Z \mid U, \Sigma_r, \Sigma_c) = \frac{\exp\!\left(-\tfrac{1}{2}\,\mathrm{tr}\!\left[\Sigma_r^{-1}(Z-U)\,\Sigma_c^{-1}(Z-U)^\top\right]\right)}{(2\pi)^{mn/2}\, |\Sigma_r|^{n/2}\, |\Sigma_c|^{m/2}}$$

where $U$ is the mean, and $\Sigma_r$ and $\Sigma_c$ encode row and column covariances, specifying correlated fluctuations across the grid. Low-rank parameterizations further reduce parameters and encourage global structure.
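Sampling from this MVN never requires the full $mn \times mn$ covariance: by the standard matrix-normal identity, if $\Sigma_r = L_r L_r^\top$ and $\Sigma_c = L_c L_c^\top$, then $Z = U + L_r E L_c^\top$ with i.i.d. standard-normal entries in $E$ has exactly this density. A minimal sketch with illustrative shapes and toy covariances, not the configuration of the cited work:

```python
import numpy as np

def sample_mvn(U, Sigma_r, Sigma_c, rng):
    """Draw Z ~ MVN(U, Sigma_r, Sigma_c) via Z = U + L_r E L_c^T."""
    L_r = np.linalg.cholesky(Sigma_r)        # (m, m) row factor
    L_c = np.linalg.cholesky(Sigma_c)        # (n, n) column factor
    E = rng.standard_normal(U.shape)         # i.i.d. N(0, 1) entries
    return U + L_r @ E @ L_c.T

rng = np.random.default_rng(0)
m, n = 8, 8                                  # illustrative latent grid size
U = np.zeros((m, n))
# Toy covariances with smooth spatial correlation along rows and columns.
idx = np.arange(m)
Sigma_r = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2 / 2.0**2) + 1e-6 * np.eye(m)
Sigma_c = Sigma_r.copy()                     # m == n here, so reuse the same kernel
Z = sample_mvn(U, Sigma_r, Sigma_c, rng)     # one spatially correlated latent grid
```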

This approach generalizes to volumetric (3D) latent tensors using tensor-normal distributions, parameterized by mode-wise covariances (for axes $\ell, m, n$):

$$\operatorname{vec}(T) \sim \mathcal{N}\big(\operatorname{vec}(U),\; \Sigma_1 \otimes \Sigma_2 \otimes \Sigma_3\big)$$

Such grid-structured embeddings inherently reflect translation (and, if features are chosen appropriately, rotation) equivariance (Wang et al., 2017).
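Sampling from the tensor-normal prior follows the same pattern as the matrix case: Cholesky factors of the mode covariances yield a factor of the Kronecker-structured covariance, since $(L_1 \otimes L_2 \otimes L_3)(L_1 \otimes L_2 \otimes L_3)^\top = \Sigma_1 \otimes \Sigma_2 \otimes \Sigma_3$. The brief sketch below forms the Kronecker product explicitly, which is fine for small grids; larger grids would use mode-wise products instead. Shapes and covariances are illustrative only.

```python
import numpy as np

def sample_tensor_normal(U, Sigmas, rng):
    """Draw a tensor T with vec(T) ~ N(vec(U), Sigma_1 (x) Sigma_2 (x) Sigma_3)."""
    Ls = [np.linalg.cholesky(S) for S in Sigmas]
    L = np.kron(np.kron(Ls[0], Ls[1]), Ls[2])       # factor of the Kronecker covariance
    z = rng.standard_normal(L.shape[0])
    return U + (L @ z).reshape(U.shape)

rng = np.random.default_rng(0)
dims = (4, 5, 6)                                     # illustrative latent tensor shape
U = np.zeros(dims)
Sigmas = [np.eye(d) + 0.5 * np.ones((d, d)) for d in dims]  # toy mode-wise covariances
T = sample_tensor_normal(U, Sigmas, rng)
```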

3. Latent Modeling Across Slices and Volumetric Assemblies

To construct embeddings that maintain long-range 3D consistency, slice-wise and volumetric methods model inter-slice correlations in latent space. For 3D MR brain volumes, a 2D slice VAE is trained on individual slices, producing latent means $\mu_\phi(X(t)) \in \mathbb{R}^L$ for each slice position $t$. For each latent dimension $l$, its behavior across $T$ slices is aggregated as $y^{(i)}_l = [y^{(i)}_l(1), \ldots, y^{(i)}_l(T)]^\top$ for volume $i$, building data matrices $Y_l \in \mathbb{R}^{T \times N}$ over $N$ training volumes. The sample mean and covariance are then:

$$\mu_l = \frac{1}{N} \sum_{i=1}^N y^{(i)}_l, \qquad \Sigma_l = \frac{1}{N} \sum_{i=1}^N \big[y^{(i)}_l - \mu_l\big]\big[y^{(i)}_l - \mu_l\big]^\top$$

Sampling new coherent latent stacks is achieved by generating $z_l \sim \mathcal{N}(0, I_T)$ and setting $y_l = \Sigma_l^{1/2} z_l + \mu_l$, then decoding each slice-wise latent vector to form a 3D volume. This explicitly enforces anatomical consistency and grid structure across the embedding volume (Volokitin et al., 2020).
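A minimal sketch of this procedure, under the notation above, follows. The `decode_slice` function and the random latent trajectories are hypothetical stand-ins for the trained slice decoder and encoded training data:

```python
import numpy as np

def fit_slice_gaussians(Y):
    """Y: (L, T, N) latent trajectories of N volumes over T slices and L latent dims.
    Returns per-dimension slice means mu_l (L, T) and covariances Sigma_l (L, T, T)."""
    mu = Y.mean(axis=2)
    centered = Y - mu[:, :, None]
    Sigma = np.einsum('ltn,lsn->lts', centered, centered) / Y.shape[2]
    return mu, Sigma

def sample_volume(mu, Sigma, decode_slice, rng):
    """Draw a coherent latent stack y_l = Sigma_l^{1/2} z_l + mu_l, decode slice by slice."""
    L, T = mu.shape
    latents = np.empty((T, L))
    for l in range(L):
        root = np.linalg.cholesky(Sigma[l] + 1e-6 * np.eye(T))  # any factor R with R R^T = Sigma_l works
        latents[:, l] = root @ rng.standard_normal(T) + mu[l]
    return np.stack([decode_slice(latents[t]) for t in range(T)])  # stacked slices = one volume

# Toy usage with fake latent trajectories and a hypothetical stand-in decoder.
rng = np.random.default_rng(0)
L, T, N = 3, 10, 20
Y = rng.normal(size=(L, T, N)).cumsum(axis=1)        # smooth-ish fake latent trajectories
mu, Sigma = fit_slice_gaussians(Y)
decode_slice = lambda z: np.outer(z, z)              # placeholder for the trained slice decoder
volume = sample_volume(mu, Sigma, decode_slice, rng)
```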

4. Vector Quantization and Equivariant Volumetric Compression

In vector-quantized VAEs (VQ-VAE), encoder outputs for 3D volumes are quantized against a codebook at every spatial location, yielding discrete grid-shaped embeddings $z_q(x)_p = e_{\arg\min_j \|z_e(x)_p - e_j\|_2}$ for voxel $p$, where $e_j \in \mathbb{R}^D$ are learned centroids (Tudosiu et al., 2020). The entire 3D volume embedding is thus a spatial grid of codebook entries, preserving neuromorphological structure and enabling extremely high compression rates (down to $0.825\%$ of the original size) while maintaining spatial and anatomical fidelity. The code grid structure directly supports translation equivariance; explicit use of higher-order or vector features would further extend this to rotation equivariance. The multidimensional code structure is compatible with advanced loss functions (e.g., a 3D DCT-domain robust loss) and segmentation-based evaluation metrics, evidencing preservation of meaningful geometric and topological relationships.
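The nearest-codebook lookup itself is only a few lines. The sketch below quantizes a 3D grid of encoder outputs against a codebook; shapes are illustrative, the codebook is random rather than learned, and the straight-through gradient estimator used during training is omitted:

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """z_e: (D1, D2, D3, D) continuous encoder outputs; codebook: (K, D) centroids.
    Returns the index grid and the quantized embedding z_q with the same shape as z_e."""
    flat = z_e.reshape(-1, z_e.shape[-1])                          # (P, D), one row per voxel p
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (P, K) squared distances
    idx = d2.argmin(axis=1)                                        # argmin_j ||z_e(x)_p - e_j||
    z_q = codebook[idx].reshape(z_e.shape)                         # grid of codebook entries
    return idx.reshape(z_e.shape[:-1]), z_q

rng = np.random.default_rng(0)
z_e = rng.normal(size=(6, 6, 6, 8))         # toy 6x6x6 latent grid with D = 8
codebook = rng.normal(size=(32, 8))         # K = 32 centroids (random here, learned in practice)
indices, z_q = vector_quantize(z_e, codebook)
```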

5. SE(3)-Equivariance in Implicit 3D Representations

Implicit neural field models such as NeRF-VAE embed equivariant structure through the design of both the latent space and the decoding function. Here, a compact latent vector $z$ is inferred for each scene, and the NeRF-style decoder $G_\theta(\mathbf{x}, \mathbf{d} \mid z)$, which maps a 3D position $\mathbf{x}$ and view direction $\mathbf{d}$ to density and color, is conditioned on $z$ via two mechanisms:

  • “Shift & scale” MLP conditioning, equivariant by design to the latent representation;
  • Attention-based spatial conditioning, where $z$ is reshaped into a learnable 3D grid $Z$ over which queries (Fourier-lifted $(\mathbf{x}, \mathbf{d})$) attend. The output at each ray position is thus a function of local context within the grid, preserving translation and, if properly parameterized, rotation equivariance (Kosiorek et al., 2021).
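A stripped-down reading of the second mechanism is sketched below: each Fourier-lifted query $(\mathbf{x}, \mathbf{d})$ attends over the flattened latent grid $Z$ with single-head dot-product attention. The projection matrices are random placeholders for learned parameters; this is an illustrative sketch, not the cited implementation.

```python
import numpy as np

def fourier_features(v, n_freqs=4):
    """Lift a low-dimensional input with sin/cos features at octave frequencies."""
    ang = v[..., None] * (2.0 ** np.arange(n_freqs))              # (..., dim, n_freqs)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1).reshape(*v.shape[:-1], -1)

def attend_to_latent_grid(x, d, Z, Wq, Wk, Wv):
    """Single-head attention of one query point (x, d) over the flattened latent grid Z.

    x, d: (3,) position and view direction; Z: (G, G, G, C) latent grid."""
    q = fourier_features(np.concatenate([x, d])) @ Wq            # query vector (dk,)
    tokens = Z.reshape(-1, Z.shape[-1])                          # (G^3, C): grid cells as tokens
    K, V = tokens @ Wk, tokens @ Wv                              # keys and values per cell
    logits = K @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max()); w /= w.sum()              # softmax attention weights
    return w @ V                                                 # conditioning vector for the decoder

rng = np.random.default_rng(0)
G, C, dk = 4, 8, 16
Z = rng.normal(size=(G, G, G, C))
d_in = 2 * 4 * 6                                                 # sin/cos, 4 freqs, 6 input dims
Wq, Wk, Wv = (0.1 * rng.normal(size=(d_in, dk)),
              0.1 * rng.normal(size=(C, dk)),
              0.1 * rng.normal(size=(C, C)))
cond = attend_to_latent_grid(rng.normal(size=3), rng.normal(size=3), Z, Wq, Wk, Wv)
```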

The underlying differentiable volume rendering step keeps geometry explicit throughout the process. Each rendered view is synthesized by integrating predicted color and density along rays, ensuring that equivariant input transformations propagate through to image outputs in a predictable and mathematically prescribed way.
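The rendering quadrature itself is the standard compositing rule used in NeRF-style models: densities and colors sampled along a ray are blended with transmittance weights. A minimal sketch, assuming the decoder has already been queried at the sample points of one ray:

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Numerical volume rendering along one ray.

    sigmas: (S,) densities, colors: (S, 3) RGB values, deltas: (S,) inter-sample distances."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]   # transmittance up to each sample
    weights = trans * alphas
    return weights @ colors                                          # expected color along the ray
```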

6. Evaluation Metrics and Practical Considerations

The assessment of SE(3)-equivariant embeddings depends on both standard image and volumetric metrics and novel, geometry-aware evaluations:

  • Maximum Mean Discrepancy (MMD) and Multiscale Structural Similarity (MS-SSIM) to quantify fidelity and diversity in generated volumes (Volokitin et al., 2020, Tudosiu et al., 2020);
  • Segmentation-based metrics such as the Realistic Atlas Score (RAS)—based on multi-label Dice overlaps after affine spatial registration—which directly reflect anatomical and structural realism in the embedding's decoded outputs (Volokitin et al., 2020);
  • Morphological fidelity assessments via voxel-based morphometry (VBM) and Dice overlap for key tissue classes, verifying the preservation of fine-scale geometrical features after latent embedding, quantization, and decoding (Tudosiu et al., 2020).

Computational efficiency is addressed via several strategies: by factorizing high-dimensional distributions using block-diagonal or low-rank forms, exploiting parallelism across slices or spatial locations, and leveraging grid- and tensor-based parameterizations to maintain equivariance without requiring full dense 3D convolutions (Wang et al., 2017, Volokitin et al., 2020).
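As an example of the low-rank strategy mentioned above, sampling from a covariance of diagonal-plus-low-rank form $\Sigma = \operatorname{diag}(d) + V V^\top$ costs $O(nk)$ per draw rather than $O(n^2)$, because a sample can be built from $k$ latent factors plus independent per-coordinate noise. A brief illustrative sketch (shapes are arbitrary):

```python
import numpy as np

def sample_low_rank(mu, d, V, rng):
    """Sample from N(mu, diag(d) + V V^T) without forming the full covariance.

    mu, d: (n,) mean and diagonal variances; V: (n, k) low-rank factor."""
    eps = rng.standard_normal(V.shape[1])        # k shared latent factors
    noise = rng.standard_normal(mu.shape)        # n independent perturbations
    return mu + V @ eps + np.sqrt(d) * noise

rng = np.random.default_rng(0)
n, k = 1000, 8
sample = sample_low_rank(np.zeros(n), np.full(n, 0.1), 0.05 * rng.normal(size=(n, k)), rng)
```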

7. Broader Significance and Extensions

SE(3)-equivariant graph embeddings are foundational for scientific and engineering domains where symmetries must be strictly enforced. Methods described here demonstrate, through architectures such as spatial VAEs with matrix- or tensor-normal latents (Wang et al., 2017), slice-VAE with inter-slice Gaussian models (Volokitin et al., 2020), multi-scale vector-quantized VAEs (Tudosiu et al., 2020), and implicit neural field models with spatial grid attention (Kosiorek et al., 2021), that spatial coherence and geometric equivariance are compatible with high-capacity generative modeling, compression, and downstream structure-aware tasks.

A plausible implication is that, while many designs focus on translation equivariance by encoding grid structure in the latent space, true SE(3)-equivariance, encompassing both rotation and translation, requires careful architectural choices in the covariance parameterization, codebook semantics, and neural module construction. These trends anticipate further research into higher-order tensor embeddings, continuous equivalents of discrete grid parameterizations, and group-convolutional neural architectures, pushing the expressive power and symmetry guarantees of graph and volumetric encodings across scientific machine learning.
