EDM: 3D Keypoint Discovery & Shape Generation
- EDM is a generative model that unifies unsupervised 3D keypoint discovery with conditional latent diffusion, producing sparse yet robust shape representations.
- The architecture employs a PointTransformer V3 backbone and attention-based keypoint extraction with Fourier positional encodings to capture local and global geometric features.
- The training protocol integrates geometric, Chamfer, and diffusion losses to ensure semantic consistency, high-fidelity reconstruction, and diverse generative outputs.
An Elucidated Diffusion Model (EDM) is a generative model for point cloud data that unifies unsupervised 3D keypoint discovery and generative shape modeling via a conditional latent diffusion process. In this context, EDMs provide a framework for encoding a 3D object as a sparse set of spatially organized keypoints, which serve as the latent code for an EDM-based diffusion decoder capable of reconstructing or generating entire 3D shapes. The EDM backbone enables fully unsupervised learning of compact, interpretable keypoint representations that remain consistent across intra-category shape variability, while simultaneously supporting unconditional and conditional generation of 3D shapes from such concise codes (Newbury et al., 3 Dec 2025).
1. Core Model Architecture
The EDM-based framework comprises two principal modules: a keypoint encoder and a conditional Elucidated Diffusion Model (EDM) decoder. The encoder processes an unordered point cloud $X = \{x_i\}_{i=1}^{N}$ with a PointTransformer V3 backbone, encoding both local and global geometric features. Coordinates are lifted via learnable Fourier positional encodings. Keypoint discovery leverages a set of $K$ learnable query vectors $\{q_k\}_{k=1}^{K}$ applied via multi-head cross-attention to the per-point features, producing attention weight vectors $w_k \in \mathbb{R}^{N}$ (with $\sum_{i} w_{k,i} = 1$ for each $k$). Each keypoint is a convex combination $p_k = \sum_{i} w_{k,i}\, x_i$, yielding the keypoint set $\mathcal{K} = \{p_k\}_{k=1}^{K}$.
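The attention-based keypoint head can be summarized with a minimal sketch, assuming a PyTorch implementation; the module name `KeypointHead`, the feature dimension, and the single cross-attention layer are illustrative choices, and the PointTransformer V3 backbone is abstracted as a per-point feature tensor.

```python
import torch
import torch.nn as nn

class KeypointHead(nn.Module):
    """Sketch: K learnable queries cross-attend to per-point features; the resulting
    attention weights define each keypoint as a convex combination of input points."""
    def __init__(self, num_keypoints: int = 16, feat_dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_keypoints, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, points: torch.Tensor, feats: torch.Tensor):
        # points: (B, N, 3) raw coordinates; feats: (B, N, C) backbone features
        B = points.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)              # (B, K, C)
        attended, attn_w = self.attn(q, feats, feats, need_weights=True,
                                     average_attn_weights=True)      # attn_w: (B, K, N)
        # Each row of attn_w is softmax-normalized, so this is a convex combination:
        keypoints = torch.bmm(attn_w, points)                        # (B, K, 3)
        return keypoints, attended
```

Because each attention row sums to one, the predicted keypoints necessarily lie inside the convex hull of the input coordinates, which is the anchoring property noted in the next section.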
A global auxiliary latent $z_g$ is sampled from a learned Gaussian $\mathcal{N}(\mu_\phi, \mathrm{diag}(\sigma_\phi^2))$, parameterized by an MLP on the pooled attention queries. The complete latent representation is $z = (\mathcal{K}, z_g)$. This is used as the conditioning input for the diffusion decoder.
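A corresponding sketch for the auxiliary latent, assuming mean pooling over the attended query features and the standard reparameterization trick; dimensions and layer choices are illustrative.

```python
import torch
import torch.nn as nn

class AuxLatent(nn.Module):
    """Sketch: pooled attention-query features -> Gaussian parameters -> sampled z_g."""
    def __init__(self, feat_dim: int = 256, latent_dim: int = 64):
        super().__init__()
        self.to_stats = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.SiLU(),
            nn.Linear(feat_dim, 2 * latent_dim),        # predicts mean and log-variance
        )

    def forward(self, query_feats: torch.Tensor):
        # query_feats: (B, K, C) attended query features from the keypoint head
        pooled = query_feats.mean(dim=1)                              # (B, C)
        mu, logvar = self.to_stats(pooled).chunk(2, dim=-1)
        z_g = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)     # reparameterization
        return z_g, mu, logvar       # (mu, logvar) also feed the KL regularizer
```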
The diffusion decoder, following the EDM paradigm [Karras et al. 2022], reconstructs the underlying shape from noisy inputs $x_\sigma = x_0 + \sigma\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, where $\sigma$ is the noise level. The denoising U-Net $F_\theta$ is conditioned throughout its layers on the latent code $z$ using both FiLM (feature-wise affine modulation) and cross-attention. The decoder predicts the denoised shape $D_\theta(x_\sigma; \sigma, z) = c_{\mathrm{skip}}(\sigma)\, x_\sigma + c_{\mathrm{out}}(\sigma)\, F_\theta\big(c_{\mathrm{in}}(\sigma)\, x_\sigma;\, c_{\mathrm{noise}}(\sigma), z\big)$, with the $\sigma$-dependent skip connection and per-step preconditioning factors as in the original EDM.
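The EDM preconditioning can be written compactly as below; the coefficient formulas follow Karras et al. (2022) with a data scale `sigma_data`, while the conditional network `net` and its FiLM/cross-attention conditioning pathways are placeholders.

```python
import torch

def edm_denoise(net, x_noisy, sigma, latent, sigma_data: float = 0.5):
    """EDM preconditioning (Karras et al., 2022): the network output is mixed with a
    sigma-dependent skip connection of the noisy input to form the denoised prediction."""
    sigma = sigma.reshape(-1, 1, 1)                        # broadcast over (B, N, 3)
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / (sigma**2 + sigma_data**2).sqrt()
    c_in = 1.0 / (sigma**2 + sigma_data**2).sqrt()
    c_noise = sigma.log() / 4.0                            # input to the noise-level embedding
    # `net` is the conditional U-Net; `latent` = (keypoints, z_g) enters it via
    # FiLM and cross-attention (not shown here).
    return c_skip * x_noisy + c_out * net(c_in * x_noisy, c_noise.flatten(), latent)
```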
2. Keypoint Representation and Conditioning Mechanism
Keypoints are parameterized as attention-weighted means that lie strictly within the convex hull of the input point cloud, enhancing surface anchoring and geometric fidelity. The auxiliary vector $z_g$ encodes additional global context. In the EDM decoder, the latent $z$ is split into a global channel (processed by an MLP for FiLM scaling and shifting) and a per-point channel (enabling cross-attention between diffusion features and the keypoint code); a sketch of this conditioning split follows the table below. This design keeps the generative model's output tightly coupled to the spatial keypoints, enforcing semantic and structural alignment between the latent code and the resulting 3D geometry.
| Component | Role | Conditioning Mode |
|---|---|---|
| Keypoint vector $\mathcal{K}$ | Encodes spatial landmarks (shape support) | Injected via cross-attention |
| Aux latent $z_g$ | Captures global variations/context | FiLM modulates decoder layers |
| EDM decoder | Reconstructs/generates from latent | Conditioned at all levels |
The combination of attention-based keypoint extraction and deep conditional generative modeling enables the discovery of repeatable, interpretable landmarks with strong generative utility.
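A condensed sketch of how one decoder block might consume the two conditioning channels is given below; the layer shapes, the linear keypoint projection, and the residual wiring are assumptions for illustration rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    """Sketch of one decoder block: z_g modulates features via FiLM, while the
    keypoint code is injected through cross-attention."""
    def __init__(self, feat_dim: int = 256, latent_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.film = nn.Linear(latent_dim, 2 * feat_dim)    # -> (scale, shift)
        self.kp_proj = nn.Linear(3, feat_dim)              # lift keypoints to feature space
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, h: torch.Tensor, z_g: torch.Tensor, keypoints: torch.Tensor):
        # h: (B, N, C) diffusion features; z_g: (B, D); keypoints: (B, K, 3)
        scale, shift = self.film(z_g).unsqueeze(1).chunk(2, dim=-1)
        h = h * (1 + scale) + shift                        # FiLM modulation by z_g
        kp = self.kp_proj(keypoints)                       # (B, K, C)
        attended, _ = self.cross_attn(h, kp, kp)           # queries = features, keys/values = keypoints
        return h + attended
```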
3. Learning Objectives and Losses
The total loss for EDM-based keypoint learning is a weighted sum of several self-supervised terms (a code sketch of the core geometric terms appears after this list):
- Furthest-Point Sampling (FPS) Loss $\mathcal{L}_{\mathrm{FPS}}$: Bootstraps the placement of keypoints near representative surface anchors by minimizing the Chamfer distance to farthest-point-sampled anchors.
- Chamfer Alignment Loss $\mathcal{L}_{\mathrm{CD}}$: Ensures keypoints remain close to the object surface, promoting geometric saliency.
- Deformation Consistency Loss $\mathcal{L}_{\mathrm{def}}$: Enforces that under a random deformation or affine transform $T$, the predicted keypoints transform coherently (the keypoints of $T(X)$ match $T$ applied to the keypoints of $X$), increasing semantic stability.
- Diffusion/Generative Loss $\mathcal{L}_{\mathrm{gen}}$: Symmetrized Chamfer and repulsion losses between reconstructed and true point clouds; governs the quality of shape generation from the latent code.
- KL Divergence $\mathcal{L}_{\mathrm{KL}}$: Regularizes the auxiliary latent $z_g$ toward a standard Gaussian, facilitating sampling and interpolation.
This objective jointly optimizes the keypoints for both geometric faithfulness and generative utility, enabling the EDM to both reconstruct and synthesize novel shapes from compact spatial latent codes.
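The core geometric terms referenced in the list above can be made concrete with a short sketch. The symmetrized Chamfer distance below is the standard nearest-neighbor form; the consistency term assumes a `model` callable that maps a point cloud to keypoints and a `transform` applicable to both points and keypoints, which are illustrative interfaces rather than the paper's API.

```python
import torch

def chamfer_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetrized Chamfer distance between point sets a: (B, M, 3) and b: (B, N, 3)."""
    d = torch.cdist(a, b)                                   # (B, M, N) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

def deformation_consistency(model, points: torch.Tensor, transform) -> torch.Tensor:
    """Keypoints predicted on a transformed cloud should match the transformed keypoints."""
    kp = model(points)                                      # (B, K, 3)
    kp_of_transformed = model(transform(points))            # predict on the deformed input
    return ((kp_of_transformed - transform(kp)) ** 2).mean()
```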
4. Training Protocol and Inference
Training is performed on collections of normalized ShapeNet point clouds (2048 points per object). Geometric data augmentation, including random rigid and nonrigid transforms, increases keypoint repeatability and generalization. A dynamic noise schedule is used for EDM training, with the noise-level range linearly annealed over most of the training epochs.
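As a rough illustration of such a schedule, the sketch below assumes the annealed quantity is the maximum noise level and that per-sample noise levels are drawn log-uniformly from the current range; neither choice is specified by the source and both may differ from the actual protocol.

```python
import math
import torch

def sample_training_sigma(batch: int, epoch: int, total_epochs: int,
                          sigma_min: float = 0.002, sigma_max_final: float = 80.0):
    """Illustrative dynamic noise schedule: the upper noise level grows linearly over
    roughly the first 80% of training; per-sample sigmas are log-uniform in the range."""
    progress = min(1.0, epoch / (0.8 * total_epochs))
    sigma_max = sigma_min + progress * (sigma_max_final - sigma_min)
    u = torch.rand(batch)
    log_sigma = math.log(sigma_min) + u * (math.log(sigma_max) - math.log(sigma_min))
    return log_sigma.exp()
```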
At inference, the model can support:
- Unconditional generation: sample a latent from a fitted kernel density estimate (KDE) over the keypoint and auxiliary latent spaces, then run EDM reverse denoising (see the sampling sketch after this list).
- Interpolation/editing: interpolate between discovered keypoint sets, generating continuous, semantically plausible shape morphs.
- Reconstruction: encode a new object’s keypoints and reconstruct its geometry with high fidelity.
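Unconditional generation can be sketched as fitting a KDE over encoded training latents and running a deterministic reverse-diffusion loop; scikit-learn's KernelDensity, the Euler sampler, and the Karras-style noise schedule below are common choices used for illustration, not necessarily the paper's exact procedure.

```python
import numpy as np
import torch
from sklearn.neighbors import KernelDensity

def sample_latents(train_latents: np.ndarray, n_samples: int, bandwidth: float = 0.1):
    """Fit a Gaussian KDE over flattened (keypoints, z_g) latents and draw new samples."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(train_latents)
    return torch.from_numpy(kde.sample(n_samples)).float()  # splitting back into (K, z_g) omitted

@torch.no_grad()
def generate(denoiser, latent, num_points: int = 2048, steps: int = 64,
             sigma_min: float = 0.002, sigma_max: float = 80.0, rho: float = 7.0):
    """Deterministic Euler sampler over a Karras-style sigma schedule."""
    B = latent.shape[0]
    t = torch.linspace(0, 1, steps)
    sigmas = (sigma_max ** (1 / rho) + t * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    x = torch.randn(B, num_points, 3) * sigmas[0]            # start from pure noise
    for i in range(steps - 1):
        sigma = sigmas[i].expand(B)
        denoised = denoiser(x, sigma, latent)                # conditional denoiser D(x, sigma, z)
        d = (x - denoised) / sigmas[i]                       # update direction (probability-flow ODE)
        x = x + d * (sigmas[i + 1] - sigmas[i])              # Euler step toward lower noise
    return x
```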
5. Empirical Results and Comparisons
EDM-based keypoint models yield significant relative gains in semantic consistency and generative performance:
| Metric | KeyPointDiffuser | Best prior (SC3K, K-Deformer, etc.) |
|---|---|---|
| DAS (keypoint consistency; higher is better) | 0.76 | 0.71 |
| Keypoint-Part Correlation (higher is better) | 0.98 | 0.91 |
| Shape MMD-CD (lower is better) | 0.016 | 0.017 (KD), 0.029 (DPM) |
The keypoints discovered are spatially repeatable, semantically aligned across intra-class instances, and robust to deformation. Output shapes from the diffusion decoder are diverse and of high quality; unconditional generation and direct latent interpolation both produce continuous shape variations with maintained geometric structure (Newbury et al., 3 Dec 2025).
6. Theoretical Significance and Extensions
EDM-based approaches establish a tight link between unsupervised spatial landmark discovery and probabilistic shape modeling. The combination of convex-hull keypoint parameterization, deformation-aware losses, and generative diffusion decoding encourages the emergence of informative, repeatable, and generative-friendly keypoints. This suggests that the structural abstraction provided by such keypoints is both necessary and sufficient for capturing the essential variability of 3D shape categories in a generative setting.
A plausible implication is that incorporating additional priors—such as part symmetry, mesh topology, or conditional attributes—could further improve semantic coherence and generative control. The EDM framework is also extensible to conditional shape generation, shape editing with attribute control, and multi-modal stochastic shape reasoning, by expanding the latent conditioning interface.
7. Limitations and Open Challenges
Reported limitations include the reliance on dense, pre-aligned, and normalized point clouds; surface outputs remain sparse and mesh extraction is not addressed. The model does not explicitly leverage part priors or semantic labels, and symmetry is only implicitly encouraged. Robustness to extreme noise, partial observation, or mesh-domain output remains an open direction.
Current EDM-based keypoint frameworks represent a significant advance by bridging unsupervised landmark learning and powerful shape synthesis, but further work is needed to realize mesh, scene-scale, or richly semantic 3D generative modeling (Newbury et al., 3 Dec 2025).