DLTPose: 6DoF Pose Estimation From Accurate Dense Surface Point Estimates (2504.07335v1)

Published 9 Apr 2025 in cs.CV

Abstract: We propose DLTPose, a novel method for 6DoF object pose estimation from RGB-D images that combines the accuracy of sparse keypoint methods with the robustness of dense pixel-wise predictions. DLTPose predicts per-pixel radial distances to a set of minimally four keypoints, which are then fed into our novel Direct Linear Transform (DLT) formulation to produce accurate 3D object frame surface estimates, leading to better 6DoF pose estimation. Additionally, we introduce a novel symmetry-aware keypoint ordering approach, designed to handle object symmetries that otherwise cause inconsistencies in keypoint assignments. Previous keypoint-based methods relied on fixed keypoint orderings, which failed to account for the multiple valid configurations exhibited by symmetric objects, which our ordering approach exploits to enhance the model's ability to learn stable keypoint representations. Extensive experiments on the benchmark LINEMOD, Occlusion LINEMOD and YCB-Video datasets show that DLTPose outperforms existing methods, especially for symmetric and occluded objects, demonstrating superior Mean Average Recall values of 86.5% (LM), 79.7% (LM-O) and 89.5% (YCB-V). The code is available at https://anonymous.4open.science/r/DLTPose_/ .

Summary

The paper introduces a novel DLT formulation that computes dense 3D object surface points using per-pixel radial distances to predefined keypoints.
It employs a symmetry-aware keypoint ordering mechanism that dynamically adjusts based on object orientation to reduce regression errors.
It achieves state-of-the-art performance on LINEMOD, Occlusion LINEMOD, and YCB-Video, with Mean Average Recall scores of 86.5%, 79.7%, and 89.5%, respectively.

This paper introduces DLTPose, a method for 6DoF object pose estimation from RGB-D images that aims to combine the accuracy of sparse keypoint methods with the robustness of dense prediction approaches (2504.07335). The core idea is to predict dense, per-pixel radial distances to a predefined set of at least four object keypoints. These radial distances, together with the known 3D coordinates of the keypoints in the object's reference frame, are then used in a novel Direct Linear Transform (DLT) formulation to estimate the 3D coordinates of the visible object surface points directly in the object frame.

The key contributions are:

Novel DLT Formulation: Instead of regressing object coordinates directly or voting for keypoint locations, DLTPose estimates per-pixel radial distances ($\hat{r}_j$) to known object keypoints ($\overline{k}_j$). The relationship $r_j = \|\overline{p} - \overline{k}_j\|$, where $\overline{p}$ is the unknown object surface point, is squared, expanded, and rearranged into the linear form $A\overline{X}=0$. Here, $A$ is constructed using the known keypoint coordinates $\overline{k}_j$ and the estimated radial distances $\hat{r}_j$, and $\overline{X}$ contains the unknown object coordinates $\overline{p}=(\bar{x}, \bar{y}, \bar{z})$ and their squared norm $\|\overline{p}\|^2$. With at least four non-coplanar keypoints, this system can be solved using Singular Value Decomposition (SVD) to find $\overline{p}$ for each pixel, effectively estimating the object's visible surface in its own coordinate frame. This process is analogous to finding the intersection point of multiple spheres.
Symmetry-Aware Keypoint Framework: To handle object symmetries, which cause ambiguities in keypoint assignments for traditional fixed-order methods, DLTPose introduces a dynamic keypoint ordering approach. Symmetric keypoints are first generated based on the object's Oriented Bounding Box (OBB). During training and inference, the order of the radial map channels (corresponding to these keypoints) is dynamically determined based on the keypoints' relative proximity to the camera origin. This ensures a consistent representation for the network to learn, regardless of the object's symmetric orientation, reducing regression errors.
State-of-the-Art Performance: The method achieves superior results on standard benchmarks like LINEMOD (LM), Occlusion LINEMOD (LM-O), and YCB-Video (YCB-V), outperforming previous methods especially on symmetric and occluded objects. It reports Mean Average Recall (AR) scores of 86.5% (LM), 79.7% (LM-O), and 89.5% (YCB-V).
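The first two contributions can be illustrated with a minimal NumPy sketch; it is not the authors' implementation. It assumes the homogeneous unknown is laid out as $\overline{X} = (\bar{x}, \bar{y}, \bar{z}, \|\overline{p}\|^2, 1)^\top$, so that squaring $r_j = \|\overline{p} - \overline{k}_j\|$ yields one row $(-2\overline{k}_j^\top,\ 1,\ \|\overline{k}_j\|^2 - \hat{r}_j^2)$ of $A$ per keypoint; the paper's exact layout may differ. The ordering helper assumes nearest-first sorting and a known object pose, as would be available at training time. The names `dlt_surface_points` and `order_keypoints_by_camera_distance` are hypothetical.

```python
import numpy as np


def dlt_surface_points(keypoints, radii):
    """Recover object-frame points from radial distances (illustrative sketch).

    keypoints: (Nk, 3) known keypoints in the object frame, Nk >= 4, non-coplanar.
    radii:     (N, Nk) radial distances from each of N pixels to each keypoint.
    Returns an (N, 3) array of estimated object-frame surface points.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    radii = np.asarray(radii, dtype=float)
    n_pix, n_kp = radii.shape

    # Assumed row layout: [-2*k_j, 1, ||k_j||^2 - r_j^2] acting on
    # X = (x, y, z, ||p||^2, 1), which follows from squaring r_j = ||p - k_j||.
    A = np.empty((n_pix, n_kp, 5))
    A[:, :, :3] = -2.0 * keypoints
    A[:, :, 3] = 1.0
    A[:, :, 4] = np.sum(keypoints**2, axis=1) - radii**2

    # Batched SVD; the last right singular vector spans the (approximate)
    # null space. full_matrices=True is needed so Vh is 5x5 when Nk == 4.
    _, _, vh = np.linalg.svd(A, full_matrices=True)
    x = vh[:, -1, :]

    # De-homogenize by the last component to read off (x, y, z).
    return x[:, :3] / x[:, 4:5]


def order_keypoints_by_camera_distance(keypoints, R, t):
    """Return keypoint indices sorted nearest-first to the camera origin.

    Assumes a known object-to-camera pose (R, t); the nearest-first
    convention is an assumption of this sketch.
    """
    kp_cam = np.asarray(keypoints, dtype=float) @ np.asarray(R).T + np.asarray(t)
    return np.argsort(np.linalg.norm(kp_cam, axis=1), kind="stable")
```

With exact radial distances the null vector reproduces the surface point up to floating-point error; with noisy predictions and more than four keypoints, the same call returns the least-squares direction instead.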

Implementation Details:

Network Architecture: Uses a ResNet-152 backbone, similar to PVNet but with ReLU activations and more skip connections. It takes segmented RGB-D images as input and outputs an $N_k$-channel radial map $\hat{R}$.
Loss Function: A weighted sum of three losses:
- $\mathcal{L}_R$: Mean absolute error between predicted and ground truth radial distances.
- $\mathcal{L}_C$: Soft L1 loss for regressing normalized object coordinates (NOCS), aiding geometric understanding.
- $\mathcal{L}_P$: A pseudo-symmetric loss based on discretized NOCS coordinates to handle near-symmetries. For fully symmetric objects, only $\mathcal{L}_R$ is used along with the symmetry-aware keypoint ordering.
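How the three terms combine can be sketched as follows. This is a hedged illustration in NumPy, not the training code: the weights `w_r`, `w_c`, `w_p` are hypothetical placeholders (the summary does not give their values), "Soft L1" is assumed here to mean the common smooth-L1 (Huber-style) form with threshold `beta`, and $\mathcal{L}_P$ is passed in as a precomputed value because its discretization scheme is not specified above.

```python
import numpy as np


def radial_l1_loss(r_pred, r_gt, mask):
    """L_R: mean absolute error over masked pixels.

    r_pred, r_gt: (H, W, Nk) radial maps; mask: (H, W) object segmentation.
    """
    m = np.asarray(mask).astype(bool)
    return np.abs(r_pred[m] - r_gt[m]).mean()


def smooth_l1_loss(pred, target, beta=1.0):
    """L_C under the assumption that 'Soft L1' means smooth L1.

    Quadratic for small residuals, linear for large ones.
    """
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta).mean()


def total_loss(l_r, l_c, l_p, w_r=1.0, w_c=1.0, w_p=1.0, fully_symmetric=False):
    """Weighted sum of the three terms; weights are hypothetical.

    For fully symmetric objects only the radial term is kept, as stated above.
    """
    if fully_symmetric:
        return w_r * l_r
    return w_r * l_r + w_c * l_c + w_p * l_p
```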
Pose Estimation Pipeline:

1. Segment the object using Mask R-CNN.
2. Feed the segmented RGB-D image to the DLTPose network to get the estimated radial map $\hat{\mathbf{R}}_{\hat{S}}$.
3. Use the DLT formulation with $\hat{\mathbf{R}}_{\hat{S}}$ and known object keypoints $\overline{k}_j$ to compute object frame surface points $\widehat{\overline{\mathbf{X}}}$.
4. Pair each estimated object point $\hat{\bar{p}}_i \in \widehat{\overline{\mathbf{X}}}$ with its corresponding camera frame point $p_i$ (derived from the input depth map $\mathbf{I}_{\hat{S}^D}$).
5. Use a RANSAC-based Umeyama algorithm to align $\widehat{\overline{\mathbf{X}}}$ and $\mathbf{I}_{\hat{S}^D}$ to get the 6DoF pose $[\boldsymbol{\mathcal{R}} \mid \boldsymbol{t}]$.
6. Optionally refine the pose using ICP.
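The alignment in step 5 can be sketched as below, assuming a rigid (scale-free) variant of the Umeyama least-squares fit wrapped in a basic RANSAC loop. The function names, the three-point minimal sample, the iteration count, and the 5 mm inlier threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np


def rigid_umeyama(src, dst):
    """Least-squares rotation R and translation t such that dst ~ src @ R.T + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    cov = (dst - mu_d).T @ (src - mu_s) / len(src)
    U, _, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    t = mu_d - R @ mu_s
    return R, t


def ransac_umeyama(obj_pts, cam_pts, n_iters=200, inlier_thresh=0.005, seed=0):
    """Robustly align object-frame points to camera-frame points.

    obj_pts: (N, 3) estimated object-frame surface points.
    cam_pts: (N, 3) matching camera-frame points from the depth map.
    Returns (R, t, inlier_mask). Threshold is in the units of the points.
    """
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_iters):
        # Fit a candidate pose to a minimal sample of three correspondences.
        idx = rng.choice(len(obj_pts), size=3, replace=False)
        R, t = rigid_umeyama(obj_pts[idx], cam_pts[idx])
        err = np.linalg.norm(obj_pts @ R.T + t - cam_pts, axis=1)
        inliers = err < inlier_thresh
        if best is None or inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 3:
        raise ValueError("RANSAC found fewer than three inliers")
    # Refit on the full consensus set for the final pose.
    R, t = rigid_umeyama(obj_pts[best], cam_pts[best])
    return R, t, best
```

Because every pixel contributes one correspondence, the consensus set is typically large, which is one reason a dense surface estimate pairs well with a robust fit of this kind.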

Evaluation:

Experiments show that the DLT-based surface estimation produces more accurate and denser point clouds than methods such as SurfEmb.
Adding noise to the estimated surface points directly degrades pose estimation accuracy, highlighting the importance of accurate surface estimation.
Ablation studies confirm the benefits of the symmetry-aware keypoint framework (improving ADD-S and AR for symmetric objects compared to using KeyGNet keypoints) and the pseudo-symmetric loss term (improving ADD(-S) for LM-O objects).

In conclusion, DLTPose leverages a novel DLT formulation to accurately estimate dense object surface points from predicted radial distances, improving 6DoF pose estimation. Its symmetry-aware keypoint ordering further enhances robustness for symmetric objects, leading to state-of-the-art results on challenging benchmarks.