
KeypointDeformer: Unsupervised Geometry Control

Updated 10 November 2025
  • KeypointDeformer is a method that learns semantically meaningful, sparse keypoints to enable unsupervised deformation of 2D images and 3D shapes.
  • It employs a PointNet-type encoder and a differentiable cage-based mapping to preserve structural details during shape manipulation.
  • The framework balances similarity, keypoint, and influence regularizers to achieve robust shape alignment and user-controllable deformations.

KeypointDeformer methods refer to a class of architectures that discover or leverage semantically meaningful, sparse keypoints to enable intuitive, efficient, and controllable deformation of 2D images or 3D shapes in an unsupervised or weakly supervised fashion. These frameworks have been developed for diverse application domains including unsupervised 3D shape alignment and control (Jakab et al., 2021), joint retrieval and deformation of 3D CAD models (Zhang et al., 15 Mar 2024), and keypoint-guided image manipulation using diffusion models (Oh et al., 16 Jan 2024). The central paradigm is to learn keypoints that serve as interpretable “handles” for semantics-aware deformation, thus enabling tasks ranging from CAD shape correspondence to video frame interpolation.

1. Unsupervised 3D Keypoint Discovery and Deformation

KeypointDeformer (Jakab et al., 2021) addresses the challenge of controlling and aligning 3D objects without annotated keypoints or deformation correspondences. Given two 3D shapes $x$ (source) and $x'$ (target) from the same object category, the framework comprises:

  • A keypoint predictor $\Phi$ that maps an input point cloud $x \in \mathbb{R}^{3 \times N}$ to an ordered set of $K$ 3D keypoints $p = (p_1, \ldots, p_K) \in \mathbb{R}^{3 \times K}$.
  • A deformation model $\Psi$ that, conditioned on the source shape $x$, its keypoints $p$, and the target keypoints $p' = \Phi(x')$, produces a deformed shape $x^* \approx x'$.

The overall deformation pipeline consists of:

  • Keypoint Discovery: A PointNet-type Siamese encoder computes $p = \Phi(x)$ and $p' = \Phi(x')$.
  • Cage Extraction: A geometric “cage” with vertices $c_V \in \mathbb{R}^{3 \times C}$ snugly encapsulates the shape, forming a coarse control mesh.
  • Keypoint-Driven Deformation: The displacement $\Delta p = p' - p$ is computed, and an influence matrix $W \in \mathbb{R}^{C \times K}$ is formed as $W(x) = W_C + W_I(x)$, where $W_C$ is a canonical matrix and $W_I(x)$ is a learned, instance-specific offset. The cage vertices are deformed linearly:

$$c^*_V = c_V + W \Delta p$$

The original mesh or point cloud is then deformed by a differentiable cage-based mapping $\beta(\cdot)$ (e.g., using mean-value coordinates):

$$x^* = \beta(x, c, c^*)$$

This structure enables completely unsupervised discovery of keypoints that are semantically and geometrically consistent across category instances.
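The cage update and mapping above can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's implementation: the precomputed `weights` matrix stands in for the mean-value coordinates of $\beta(\cdot)$, and all shapes and values are toy placeholders.

```python
import numpy as np

def deform_with_keypoints(x, cage, weights, W, p_src, p_tgt):
    """Keypoint-driven cage deformation (illustrative sketch).

    x       : (N, 3) source point cloud
    cage    : (C, 3) coarse control-cage vertices
    weights : (N, C) precomputed generalized barycentric coordinates
              (a stand-in for the paper's mean-value coordinates)
    W       : (C, K) influence matrix mapping keypoint motion to cage motion
    p_src   : (K, 3) predicted source keypoints
    p_tgt   : (K, 3) target keypoints
    """
    delta_p = p_tgt - p_src            # keypoint displacements
    cage_star = cage + W @ delta_p     # linear cage update: c* = c + W * dp
    return weights @ cage_star         # x* = beta(x, c, c*) via fixed coordinates

# Toy example: 4-vertex cage, normalized random weights, one keypoint moved.
rng = np.random.default_rng(0)
x = rng.random((100, 3))
cage = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
weights = rng.random((100, 4))
weights /= weights.sum(axis=1, keepdims=True)
W = rng.random((4, 2))
p_src = rng.random((2, 3))
p_tgt = p_src.copy()
p_tgt[0] += 0.1                        # move the first keypoint slightly
x_star = deform_with_keypoints(x, cage, weights, W, p_src, p_tgt)
print(x_star.shape)  # (100, 3)
```

Because every step is a matrix product, gradients flow from $x^*$ back through the cage to the keypoints, which is what makes end-to-end unsupervised training possible.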

2. Training Objectives and Regularization

The end-to-end objective balances three terms:

  1. Similarity Loss: Bidirectional Chamfer distance between deformed and target point clouds:

$$L_{\mathrm{sim}} = \text{Chamfer}(x^*, x')$$

  2. Farthest-Point Keypoint Regularizer: Drives keypoints to be evenly distributed and to lie on the object surface by introducing stochastic farthest-point-sampled points $q$ from $x$ and penalizing the Chamfer distance $\text{Chamfer}(p, q)$.
  3. Influence Matrix Regularizer: Frobenius-norm penalty $\| W_I(x) \|_F^2$ that restricts overfitting of the instance-specific influence.

The composite loss is

$$L = L_{\mathrm{sim}} + \alpha_{\mathrm{kpt}} L_{\mathrm{kpt}} + \alpha_{\mathrm{inf}} L_{\mathrm{inf}}$$

where hyperparameters are chosen so no term overwhelms the others during training.
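A minimal sketch of the three terms, assuming a naive $O(NM)$ Chamfer distance; the loss weights `a_kpt` and `a_inf` below are illustrative defaults, not the paper's values.

```python
import numpy as np

def chamfer(a, b):
    """Bidirectional Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def total_loss(x_star, x_tgt, keypoints, fps_points, W_I, a_kpt=1.0, a_inf=0.1):
    """Composite objective L = L_sim + a_kpt * L_kpt + a_inf * L_inf."""
    L_sim = chamfer(x_star, x_tgt)          # similarity to the target shape
    L_kpt = chamfer(keypoints, fps_points)  # farthest-point keypoint regularizer
    L_inf = np.sum(W_I ** 2)                # Frobenius penalty on instance offset
    return L_sim + a_kpt * L_kpt + a_inf * L_inf
```

In practice the Chamfer distance is computed with a GPU nearest-neighbor kernel rather than a dense pairwise matrix, but the objective is the same.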

As a result, the network discovers low-dimensional, semantics-preserving latent keypoints, permitting both automatic shape alignment and user-controllable part manipulations.

3. Deformation Algorithms and User Control

Once trained, the KeypointDeformer model propagates arbitrary user-specified semantic edits by allowing direct manipulation of the latent keypoints. The keypoint-driven cage deformation ensures smooth, detail-preserving transformations across a range of shape categories such as airplanes, chairs, and cars.

The influence matrix mediates how each individual keypoint impacts each cage vertex, and the differentiable cage mapping ensures that gradients flow from the output shape-similarity loss to both the keypoints and the deformation parameters. Notably, the framework can skip the explicit target shape $x'$ at test time, enabling open-ended shape generation and intuitive editing.
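Test-time editing can be sketched as follows: no target shape is needed, because the user's drag of a discovered keypoint directly supplies the displacement that drives the cage. All shapes and values are toy placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((8, 3))              # keypoints discovered by the predictor
p_edit = p.copy()
p_edit[3, 2] += 0.2                 # user lifts keypoint 3 along z

cage = rng.random((16, 3))          # control cage around the shape
W = rng.random((16, 8)) * 0.1       # influence matrix (canonical + learned offset)
cage_star = cage + W @ (p_edit - p)  # same linear update as during training

# Only keypoint 3 moved, and only along z, so every cage vertex shifts
# purely in z (by W[:, 3] * 0.2); x and y coordinates are untouched.
print(np.allclose(cage_star[:, :2], cage[:, :2]))  # True
```

This locality is exactly what makes the keypoints usable as semantic "handles": a single drag affects only the cage vertices that the influence matrix ties to that keypoint.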

4. Experimental Evaluation and Quantitative Outcomes

Datasets: ShapeNet CAD models (chair, airplane, car, motorbike, table), Google Scanned Objects (real 3D scans, e.g. shoes).

Metrics:

  • Semantic-part correlation (keypoints landing on consistent annotated parts)
  • Percentage of Correct Keypoints (PCK at a fixed distance threshold)
  • Chamfer distance for alignment
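The PCK metric can be computed as below; the threshold value here is illustrative, not necessarily the paper's exact setting.

```python
import numpy as np

def pck(pred, gt, threshold=0.1):
    """Percentage of Correct Keypoints: fraction of predicted keypoints that
    land within `threshold` (in normalized coordinates) of the ground truth."""
    dists = np.linalg.norm(pred - gt, axis=-1)  # (K,) per-keypoint error
    return (dists < threshold).mean()

pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.05, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(pck(pred, gt))  # 0.5: first keypoint within threshold, second not
```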

Key empirical results (Jakab et al., 2021, Tables 1, 4, 6d):

| Metric | KeypointDeformer | Prior Unsupervised | Manual Keypoints |
|---|---|---|---|
| Semantic corr. (airplane) | 0.85 | 0.78 | 0.69 |
| PCK (airplane) | 0.61 | 0.49 | 0.36 |
| Align. Chamfer ($\times 10^{-3}$) | 3.02 | >5.9 | 4.20 |

Qualitative results show that moving a single semantic keypoint (e.g. airplane wingtip) appropriately deforms the associated part while retaining nearby geometry (e.g. engines, windows) and symmetry.

5. Ablation Studies and Variants

Ablation analysis highlights:

  • The farthest-point keypoint regularizer ($L_{\mathrm{kpt}}$) prevents collapse of all keypoints to a shape centroid or single part.
  • The influence matrix penalty controls overfitting to individual shapes.
  • There is an optimal range for the number of keypoints ($K = 8$–$12$): too few cannot represent fine layout; too many collapse despite regularization.
  • The differentiable cage-based mapping $\beta(\cdot)$ is essential for surface detail preservation; alternative cage construction methods showed only minor differences in performance.
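The farthest-point sampling behind the first bullet can be sketched as below: the sampled points $q$ are well spread over the surface, so penalizing $\text{Chamfer}(p, q)$ pulls keypoints onto the shape and apart from one another.

```python
import numpy as np

def farthest_point_sample(points, m, seed=0):
    """Greedy farthest-point sampling of m points from an (N, 3) cloud.

    Starts from a random point (making the regularizer stochastic), then
    repeatedly picks the point farthest from everything chosen so far.
    """
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(points)))]
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(m - 1):
        idx = int(dist.argmax())  # farthest from all chosen points
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return points[chosen]

pts = np.random.default_rng(2).random((500, 3))
q = farthest_point_sample(pts, 8)
print(q.shape)  # (8, 3)
```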

6. Limitations and Future Directions

Key limitations and open directions include:

  • The method assumes input shapes are approximately aligned (alignment is handled by off-the-shelf ICP or normalization; severe misalignments compromise keypoint consistency).
  • The current formulation models only translations at each keypoint; it does not natively capture local rotations or articulated part transformations.
  • Handling topological changes (e.g., adding/removing handles), scaling to very high-resolution meshes, and sensitivity to input point cloud resolution constitute open research directions.
  • The unsupervised keypoints offer promising potential for robotics (as morphology-aware manipulation handles), 3D retrieval, and self-supervised scene understanding.

Subsequent works such as KP-RED (Zhang et al., 15 Mar 2024) extend the KeypointDeformer paradigm to joint retrieval and deformation: learned category-consistent keypoints control both a retrieval space (via transformer-aggregated local-global embeddings) and a neural cage deformation pipeline. This achieves state-of-the-art robustness to partial shape inputs and strong real-world scan alignment. Keypoint-based methods have further inspired controllable image and video generation via keypoint-guided diffusion models (Oh et al., 16 Jan 2024), and efficient keypoint extraction and descriptor learning in high-dimensional visual geolocalization (Zhao et al., 2023).

The emergence of KeypointDeformers marks a systematic advance in unsupervised geometric representation, aligning with a broader movement in geometric learning towards interpretable, low-dimensional and task-driven latent control of complex visual and spatial domains.
