UKPGAN: Self-Supervised 3D Keypoint Detector
- UKPGAN is a 3D keypoint detector that reformulates keypoint extraction as an information compression task to capture semantically significant points.
- It employs rotation-invariant feature extraction and GAN-based sparsity control to ensure robustness and repeatability in diverse 3D environments.
- Empirical evaluations show that UKPGAN achieves superior alignment with human annotations, enhancing tasks like registration, tracking, and pose estimation.
UKPGAN is a general self-supervised 3D keypoint detector that formulates keypoint detection as an information compression task: the model extracts a sparse set of spatially significant points from a point cloud, ensuring these keypoints are sufficient to reconstruct the original object. The architecture integrates GAN-based keypoint sparsity control and salient information distillation, providing keypoints that are semantically meaningful, repeatable under rigid and non-rigid transformations, and robust across synthetic and real-world data domains. Unlike hand-crafted or limited earlier approaches, UKPGAN achieves state-of-the-art alignment with human annotations and enables direct application to geometric registration, tracking, and pose estimation tasks.
1. Motivation and Conceptual Framework
The classical 3D keypoint detection task targets identification of a compact, informative subset of points from unstructured point clouds. These keypoints form the basis for geometric tasks such as matching, registration, and pose estimation. Traditional techniques—including Harris-3D, ISS, HKS, and SIFT-3D—employ hand-crafted geometric criteria but lack semantic consistency and robustness. Recent learning-based approaches, such as USIP and D3Feat, improve upon these deficits but are constrained by needs for real scan data, auxiliary training objectives, or limited sparsity control.
UKPGAN reframes keypoint detection as distilling the minimum information necessary for reconstructing an object. The approach follows the information compression paradigm: keypoints must "compress" essential structure to facilitate reconstruction. A sparsity mechanism encourages the selection of only the most salient points, echoing the assertion that perception is fundamentally an act of information compression.
2. System Architecture
The UKPGAN architecture incorporates five primary modules: (1) rotation-invariant local feature extraction, (2) dual-branch prediction, (3) GAN-based keypoint sparsity control, (4) salient information distillation via a decoder, and (5) optional symmetric regularization.
2.1. Rotation-Invariant Local Feature Extraction
Given a point cloud $X = \{x_i\}_{i=1}^{N}$, each point $x_i$'s local neighborhood is established via a Euclidean radius search. Covariance analysis and eigen-decomposition of the neighborhood covariance matrix yield a local reference frame (LRF), ensuring invariance to rotation. Neighbors are re-expressed in this LRF and voxelized as smoothed density values (SDV) on a regular grid. The voxelized SDVs are processed by seven 3D convolutional layers, providing a robust, rotation-invariant descriptor for subsequent processing.
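The LRF construction described above (radius search, covariance analysis, eigen-decomposition) can be sketched in NumPy as follows; the axis sign-disambiguation step that real implementations add is omitted for brevity:

```python
import numpy as np

def local_reference_frame(points, center, radius):
    """Build a local reference frame (LRF) for `center` from its radius
    neighborhood via covariance eigen-decomposition. Sign disambiguation
    of the axes is omitted for brevity."""
    diffs = points - center
    nbrs = diffs[np.linalg.norm(diffs, axis=1) < radius]   # radius search
    cov = nbrs.T @ nbrs / len(nbrs)                        # 3x3 covariance
    _, eigvecs = np.linalg.eigh(cov)                       # ascending eigenvalues
    axes = eigvecs[:, ::-1]                                # major axis first
    local = nbrs @ axes           # neighbors re-expressed in the LRF
    return axes, local
```

Because the axes rotate with the input, the re-expressed neighbor coordinates are (up to sign) invariant to rigid rotation of the cloud.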
2.2. Dual-Branch Keypoint Detector
The descriptor feeds into two multi-layer perceptron (MLP) branches:
- Saliency probability $s_i$: a probability in $[0, 1]$ indicating the point's likelihood of being a keypoint.
- Embedding $F_i$: a high-dimensional per-point embedding intended for downstream matching and description.
Shared layers process the feature vector from 512 to 256 dimensions, after which the branches split:
- $s$-branch: an MLP head outputting the scalar saliency probability $s_i$.
- $F$-branch: an MLP head outputting the embedding $F_i$.
This dual-branch design ensures both the spatial selection of keypoints and their descriptive utility.
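A minimal NumPy sketch of this dual-branch head; the 512-to-256 shared trunk follows the text, while the 128-dimensional embedding size and the single-layer heads are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_layer(x, w, b):
    return np.maximum(x @ w + b, 0.0)

# Hypothetical weights: a 512 -> 256 shared trunk as in the text, then two
# single-layer heads. The 128-d embedding size is an illustrative assumption.
W_sh, b_sh = rng.normal(scale=0.05, size=(512, 256)), np.zeros(256)
W_s, b_s = rng.normal(scale=0.05, size=(256, 1)), np.zeros(1)
W_f, b_f = rng.normal(scale=0.05, size=(256, 128)), np.zeros(128)

def dual_branch(features):
    """Map per-point descriptors to (saliency s_i, embedding F_i)."""
    h = relu_layer(features, W_sh, b_sh)          # shared 512 -> 256
    s = 1.0 / (1.0 + np.exp(-(h @ W_s + b_s)))    # s-branch, sigmoid into (0, 1)
    F = h @ W_f + b_f                             # F-branch embedding
    return s.squeeze(-1), F

s, F = dual_branch(rng.normal(size=(1024, 512)))  # 1024 points, 512-d descriptors
```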
2.3. GAN-Based Sparsity Control
Sparsity and bimodality in $s_i$ are imposed by aligning their empirical distribution to a Beta prior $\mathrm{Beta}(\alpha, \beta)$. A Wasserstein GAN with gradient penalty (WGAN-GP) formalism is employed: the "generator" is the saliency branch itself, and a Conv1D-based critic (five layers: 512, 256, 128, 64, 1 + max pooling) discriminates between real (Beta-distributed) and generated ($s_i$) keypoint probabilities. The critic objective is

$L_{GAN} = \mathbb{E}_{\hat{s} \sim P_g}[D(\hat{s})] - \mathbb{E}_{z \sim \mathrm{Beta}(\alpha, \beta)}[D(z)] + \lambda_{gp}\, \mathbb{E}_{\tilde{s}}\big[(\lVert \nabla_{\tilde{s}} D(\tilde{s}) \rVert_2 - 1)^2\big],$

with $\lambda_{gp} = 10$ as is standard for WGAN-GP. This setup robustly enforces control over keypoint sparsity without explicit parameter sharing.
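The critic loss can be illustrated with a toy: a linear critic whose input gradient is known analytically, so the gradient penalty can be written without autodiff (a real implementation would use the Conv1D critic and automatic differentiation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear critic D(v) = v @ w: its gradient w.r.t. the input is just w,
# so the WGAN-GP gradient penalty can be computed analytically here.
w = rng.normal(scale=0.01, size=1024)

def critic(v):
    return v @ w

def critic_loss(real, fake, lam_gp=10.0):
    """WGAN-GP critic loss: Wasserstein gap plus unit-gradient penalty,
    evaluated at random interpolates between real and fake batches."""
    eps = rng.uniform(size=(real.shape[0], 1))
    interp = eps * real + (1.0 - eps) * fake   # random interpolates
    grad = np.broadcast_to(w, interp.shape)    # analytic input gradient of D
    gp = lam_gp * np.mean((np.linalg.norm(grad, axis=1) - 1.0) ** 2)
    return np.mean(critic(fake)) - np.mean(critic(real)) + gp

# "Real" samples come from the bimodal Beta prior; "fake" samples stand in
# for the saliency probabilities s_i predicted over one point cloud.
real = rng.beta(0.01, 0.05, size=(8, 1024))
fake = rng.uniform(size=(8, 1024))
loss = critic_loss(real, fake)
```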
2.4. Salient Information Distillation Decoder
Keypoint extraction is forced to be informative by recasting full point-set reconstruction as the training objective. After weighting embeddings by $s_i$, channel-wise max pooling selects salient features in both positive and negative directions:

$h = \big[\max_i (s_i F_i);\ \max_i (-s_i F_i)\big].$

The resulting vector $h$ is provided to a TopNet-style tree decoder to reconstruct the input cloud $\hat{X}$. The reconstruction loss is the Chamfer distance:

$L_{rec} = \sum_{x \in X} \min_{\hat{x} \in \hat{X}} \lVert x - \hat{x} \rVert_2^2 + \sum_{\hat{x} \in \hat{X}} \min_{x \in X} \lVert x - \hat{x} \rVert_2^2.$

This design compels the saliency branch to select keypoints vital for representing the object's global geometry, demonstrating the tight coupling of saliency and embedding selection.
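A NumPy sketch of the dual-direction salient pooling and the Chamfer reconstruction loss:

```python
import numpy as np

def salient_pool(s, F):
    """Channel-wise max pooling of saliency-weighted embeddings, taken in
    both positive and negative directions as described above."""
    weighted = s[:, None] * F                                 # (N, D)
    return np.concatenate([weighted.max(axis=0), (-weighted).max(axis=0)])

def chamfer(X, Y):
    """Symmetric Chamfer distance between point sets X (N, 3) and Y (M, 3)."""
    d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)   # pairwise squared
    return d.min(axis=1).sum() + d.min(axis=0).sum()
```

The pooled vector has length $2D$ for $D$-dimensional embeddings, and the Chamfer distance is zero exactly when the decoder reproduces the input set.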
2.5. Symmetric Regularization
When object symmetries are available at training time, predictions for symmetric point pairs $(i, j)$ are regularized by penalizing their saliency discrepancy:

$L_{sym} = \sum_{(i, j) \in \mathcal{S}} |s_i - s_j|,$

where $\mathcal{S}$ is the set of known symmetric pairs. This approach ensures symmetry-consistent keypoint extraction, with the term active only during training.
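A minimal sketch of the symmetry term; averaging over pairs and the L1 penalty form are assumptions of this sketch:

```python
import numpy as np

def symmetry_loss(s, pairs):
    """Mean saliency discrepancy over known symmetric point pairs.
    `pairs` is an (M, 2) index array; the L1 penalty form is an assumption."""
    i, j = pairs[:, 0], pairs[:, 1]
    return np.abs(s[i] - s[j]).mean()
```

The loss vanishes exactly when symmetric points receive identical saliency.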
2.6. Overall Training Objective
The loss function integrates all components:

$L = \eta_1 L_{GAN} + \eta_2 L_{rec} + \eta_3 L_{sym},$

with the weights $\eta_1, \eta_2, \eta_3$ tuned per dataset (ShapeNet and SMPL use different settings), optimized using Adam with a learning rate of $10^{-4}$.
3. Inference and Keypoint Representation
At inference, candidate keypoints are points with $s_i > \tau$ (default $\tau = 0.5$). Non-maximum suppression (NMS) over a geodesic radius $r$ yields the sparsest and most distinct subset, typically extracting a fixed number of keypoints. The use of LRF-based features ensures rotation invariance without explicit data augmentation or auxiliary objectives. The embeddings $F_i$ can serve directly as local descriptors for downstream correspondence or matching tasks.
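The inference procedure can be sketched as thresholding followed by greedy NMS, with Euclidean distance standing in for the geodesic radius used in the text:

```python
import numpy as np

def nms_keypoints(points, saliency, tau=0.5, radius=0.1, k=100):
    """Keep at most k points with saliency > tau, visiting them in order of
    decreasing saliency and suppressing any point within `radius` of one
    already kept. Euclidean distance approximates the geodesic radius."""
    order = np.argsort(-saliency)
    order = order[saliency[order] > tau]          # threshold at tau
    kept = []
    for idx in order:
        if all(np.linalg.norm(points[idx] - points[j]) > radius for j in kept):
            kept.append(idx)
        if len(kept) == k:
            break
    return np.array(kept, dtype=int)
```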
4. Empirical Performance and Comparative Results
UKPGAN's effectiveness is validated through extensive quantitative and qualitative evaluations.
4.1. Human Keypoint Alignment
On the ShapeNet-Chair keypoint set and KeypointNet (airplane, chair, table), UKPGAN achieves mean intersection-over-union (mIoU) scores surpassing both recent learning-based (USIP: 25–30%) and traditional (<10%) baselines, e.g., achieving 36.2% IoU for chairs. Visualizations reveal high semantic alignment with human annotations, particularly at object corners and edges.
4.2. Repeatability under Non-Rigid Deformation
SMPL body keypoint detection across non-rigid pose variations shows UKPGAN attaining the highest IoU and lowest consistency error, outperforming USIP (23.9% IoU), D3Feat (20.3%), and hand-crafted methods (<10%).
4.3. Generalization and Registration
Trained on synthetic ShapeNet data, UKPGAN generalizes to real scan datasets (3DMatch, ETH). When combined with PerfectMatch and D3Feat descriptors, UKPGAN keypoints yield higher feature matching and registration recall, especially at low keypoint budgets (e.g., at 250 points per fragment, UKPGAN recall is 63.6% vs ISS 37.9%).
4.4. Rotation Repeatability
Randomly rotated models from KeypointNet show near-perfect repeatability (98–100%) for UKPGAN keypoints, significantly above alternative methods (typically below 80–90% repeatability for 4–8 keypoints).
4.5. Ablation Findings
Omitting GAN sparsity control decreases IoU by ~30%. Removing salient distillation (using average instead of max) reduces IoU by 10–15%. Excluding LRF-based features drops repeatability to sub-20%. Symmetric regularization marginally boosts IoU and repeatability while enforcing geometric symmetry.
5. Implementation and Application Guidelines
Data Preparation and Training
- Normalize object meshes or point clouds to the unit sphere and uniformly sample a fixed number of points per instance.
- Beta prior hyperparameters ($\alpha$, $\beta$) adjust the sparsity/bimodality of keypoint selection.
- Supply symmetric point pairs for $L_{sym}$ if available.
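To see how the Beta prior shapes the target saliency distribution, one can sample it directly; with small $\alpha$ and $\beta$ nearly all probability mass sits at the extremes, which is what drives bimodal, sparse selection (the values 0.01 and 0.05 match the training command's defaults):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw saliency targets from the Beta prior. Small alpha and beta push
# almost all mass toward 0 and 1: a bimodal "keypoint / not keypoint" target.
samples = rng.beta(0.01, 0.05, size=10_000)
frac_extreme = np.mean((samples < 0.05) | (samples > 0.95))  # near 0 or 1
frac_on = np.mean(samples > 0.95)                            # near "keypoint"
```

Increasing $\alpha$ relative to $\beta$ shifts mass toward 1 (more keypoints); shrinking both sharpens the bimodality.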
Software Use
- Clone the repository: https://github.com/qq456cvb/UKPGAN
- Dependencies: PyTorch, NumPy, etc.
- Data preprocessing: run scripts/prep_data.sh for SDV voxelization and LRF features.
- Training:

```shell
python train.py --data_dir ./ShapeNet --alpha 0.01 --beta 0.05 --lr 1e-4 --eta1 10 --eta2 1 --eta3 0.1
```

- Inference:

```shell
python detect_keypoints.py --model_path best.pth --threshold 0.5 --nms_radius 0.1 --num_keypoints 100
```
Integration into Geometric Pipelines
- Registration: Detect keypoints, compute or external descriptors, and match using nearest neighbor search with RANSAC.
- Shape Correspondence: Use thresholded keypoints and their embeddings as anchors for functional maps or spectral alignment.
- Pose Estimation: Combine UKPGAN keypoints with rigid alignment solvers such as Umeyama (1991).
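The rigid-alignment step can be sketched with the rotation-and-translation variant of the Umeyama solution (i.e., without its scale factor), given matched keypoint pairs:

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rotation R and translation t with dst_i ~ R @ src_i + t,
    i.e. the Umeyama (1991) solution without its scale factor."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)          # cross-covariance of matches
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

In practice the matches come from nearest-neighbor search on the keypoint embeddings, with RANSAC rejecting outlier correspondences before this solve.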
6. Limitations and Interpretive Notes
While UKPGAN demonstrates strong performance in both synthetic and real domains, its reliance on clean object collections for training may limit initial generalization if input data is noisy or incomplete. The need to compute local reference frames and voxelized SDVs can increase preprocessing overhead. The selection of $\alpha$ and $\beta$ for the Beta prior is essential for balancing sparsity and coverage; smaller values force more bimodal distributions and greater sparsity.
Symmetric regularization requires known symmetry correspondences; its impact is limited on categories lacking explicit symmetries.
7. Conclusion
UKPGAN establishes a general self-supervised framework for detecting sparse, semantically aligned, and geometrically robust 3D keypoints via an information-compression approach. Its core modules—GAN-based sparsity control and salient information distillation—support high repeatability under diverse transformations and high alignment with human-annotated ground truth. UKPGAN-trained models generalize effectively across synthetic and real environments, providing valuable primitives for geometric registration, correspondence, and pose estimation in both rigid and non-rigid settings.