Overview of "PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation"
The paper "PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation" introduces a method for estimating an object's 6DoF pose (3D rotation and translation) from a single RGB image, designed in particular to handle occlusion and truncation. Aimed at applications that demand precise pose estimates in 3D space, such as robotics and augmented reality, the proposed method offers contributions that improve robustness and accuracy over existing techniques.
Key Contributions
- Pixel-wise Voting Network (PVNet): The core of the proposed approach is PVNet, which departs from traditional keypoint regression techniques. PVNet outputs pixel-wise unit vectors pointing towards keypoints and employs a RANSAC-based scheme for keypoint localization. This methodology effectively addresses the issues associated with occlusion and truncation by leveraging local features and spatial relations between object parts.
- Uncertainty-driven PnP Solver: Another important contribution is the uncertainty-driven PnP solver, which minimizes the Mahalanobis distance of the reprojection errors, weighted by the covariances of the voted keypoint distributions. This improves pose accuracy, particularly when keypoint predictions are noisy or uncertain.
- Datasets and Evaluation: The proposed PVNet demonstrates superior performance on several benchmark datasets, namely LINEMOD, Occlusion LINEMOD, and YCB-Video. Furthermore, the authors introduce a new Truncation LINEMOD dataset to specifically test robustness against truncated objects. Across these datasets, PVNet exhibits significant improvements over state-of-the-art methods.
Methodology
PVNet Architecture
PVNet is a fully convolutional network with a modified ResNet-18 backbone that outputs both semantic labels and, for each pixel, unit vectors pointing to the object's keypoints. Keypoints are localized by generating hypotheses from the intersections of pairs of voting directions and tallying votes in a RANSAC-based scheme. This dense prediction-and-voting strategy is robust to occlusion: even keypoints that are not visible can be inferred from the votes cast by the visible object pixels.
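The hypothesis-generation and voting step for a single keypoint can be sketched as follows. This is a simplified, illustrative implementation, not the authors' code: the function name `ransac_vote`, the hypothesis count, and the inlier threshold are assumptions.

```python
import numpy as np

def ransac_vote(pixels, vectors, n_hyps=128, inlier_thresh=0.99, rng=None):
    """Sketch of PVNet-style RANSAC voting for one keypoint.

    pixels  : (N, 2) pixel coordinates belonging to the object mask
    vectors : (N, 2) predicted unit vectors at those pixels
    Returns the hypothesis with the most votes and its vote count.
    """
    rng = np.random.default_rng(rng)
    n = len(pixels)
    best_hyp, best_votes = None, -1
    for _ in range(n_hyps):
        i, j = rng.choice(n, size=2, replace=False)
        # Intersect the two rays p_i + t*v_i and p_j + s*v_j.
        A = np.stack([vectors[i], -vectors[j]], axis=1)  # 2x2 system
        b = pixels[j] - pixels[i]
        if abs(np.linalg.det(A)) < 1e-6:  # near-parallel rays: skip
            continue
        t, _ = np.linalg.solve(A, b)
        hyp = pixels[i] + t * vectors[i]
        # A pixel votes if its vector points (almost) at the hypothesis.
        d = hyp - pixels
        d /= np.linalg.norm(d, axis=1, keepdims=True) + 1e-8
        votes = int(np.sum(np.sum(d * vectors, axis=1) > inlier_thresh))
        if votes > best_votes:
            best_hyp, best_votes = hyp, votes
    return best_hyp, best_votes
```

With a clean vector field, the winning hypothesis coincides with the true keypoint; in practice the inliers of all hypotheses also feed the mean/covariance estimates used later in the PnP stage.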
Keypoint Selection
Unlike methods that use the corners of the 3D bounding box as keypoints, PVNet selects keypoints on the surface of the object. Surface keypoints lie closer to the object pixels that vote for them, so small errors in the predicted directions translate into smaller keypoint localization errors. The farthest point sampling (FPS) algorithm is used to choose these keypoints, balancing coverage of the object with detection precision.
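Greedy farthest point sampling can be sketched as below. The paper initializes the selection with the object center; in this minimal sketch the starting index is a free parameter, and the function name is our own.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: pick k points that are maximally spread out.

    points : (N, d) candidate points (e.g. object surface vertices)
    Returns the indices of the k selected points.
    """
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))  # farthest from the current set
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(selected)
```

For example, on ten evenly spaced points on a line, starting from one end, the first three selections are the two endpoints and the midpoint, illustrating how FPS spreads keypoints over the shape.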
Uncertainty-driven PnP
The RANSAC-based voting yields a spatial probability distribution (mean and covariance) for each keypoint. Using these distributions, the pose is estimated by minimizing the Mahalanobis-distance reprojection error in a PnP solver, refined with Levenberg-Marquardt optimization, so that the uncertainty in keypoint localization is properly taken into account.
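The uncertainty-driven refinement can be illustrated with the sketch below. This is not the authors' implementation: it assumes an axis-angle pose parameterization, a hand-rolled Levenberg-Marquardt loop with a numerical Jacobian, and invented function names; it only shows how covariance-weighted (Mahalanobis) reprojection errors enter the objective.

```python
import numpy as np

def rodrigues(rv):
    """Axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(rv)
    if theta < 1e-12:
        return np.eye(3)
    k = rv / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def uncertainty_pnp(pts_3d, means_2d, covs_2d, K, pose0, iters=100):
    """Refine a pose (pose = [rotvec, translation]) by minimizing the
    Mahalanobis reprojection error with a small Levenberg-Marquardt loop."""
    # Whiten each residual: ||L^T r||^2 = r^T Sigma^{-1} r.
    Ls = [np.linalg.cholesky(np.linalg.inv(c)) for c in covs_2d]

    def residuals(pose):
        R, t = rodrigues(pose[:3]), pose[3:]
        cam = pts_3d @ R.T + t
        proj = cam @ K.T
        uv = proj[:, :2] / proj[:, 2:3]
        return np.concatenate([L.T @ r for L, r in zip(Ls, uv - means_2d)])

    pose, lam, eps = np.asarray(pose0, float).copy(), 1e-3, 1e-6
    r = residuals(pose)
    for _ in range(iters):
        # Forward-difference Jacobian of the whitened residuals.
        J = np.stack([(residuals(pose + eps * np.eye(6)[i]) - r) / eps
                      for i in range(6)], axis=1)
        step = np.linalg.solve(J.T @ J + lam * np.eye(6), -J.T @ r)
        r_new = residuals(pose + step)
        if r_new @ r_new < r @ r:    # accept step, relax damping
            pose, r, lam = pose + step, r_new, lam * 0.5
        else:                        # reject step, increase damping
            lam *= 10.0
    return rodrigues(pose[:3]), pose[3:]
```

Keypoints with tight covariances (confident votes) dominate the objective, while uncertain keypoints are down-weighted, which is the core idea behind the uncertainty-driven solver.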
Experimental Results
Experiments conducted on the LINEMOD, Occlusion LINEMOD, and YCB-Video datasets confirm the effectiveness of PVNet. Notably, on LINEMOD, PVNet surpasses prior methods such as BB8 and the approach of Tekin et al. on both the 2D projection and ADD(-S) metrics. On Occlusion LINEMOD it demonstrates markedly higher robustness under occlusion, and it does so without an additional refinement stage, giving it an advantage in efficiency and simplicity.
The introduction of the Truncation LINEMOD dataset further validates the method's resilience: performance remains strong even when significant portions of the objects are missing from the input image. Across all datasets, PVNet is efficient enough for real-time applications, processing images at 25 fps on a GTX 1080 Ti GPU.
Implications and Future Directions
The practical implications of PVNet are considerable in fields that require reliable pose estimation under complex conditions such as occlusion and truncation. The dense keypoint prediction approach could also be extended to other forms of data (e.g., video sequences), potentially integrating temporal coherence for more robust tracking.
On the theoretical side, the vector-field representation and the uncertainty-driven solver push the boundaries of keypoint localization methodologies. Future research could explore more refined uncertainty modeling, or integrate higher-order geometric constraints to further improve robustness and accuracy.
Overall, the PVNet framework represents a meaningful advancement in 6DoF pose estimation, providing a substantial basis for further innovations in computer vision and real-time pose estimation applications.