
EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation (2203.13254v4)

Published 24 Mar 2022 in cs.CV

Abstract: Locating 3D objects from a single RGB image via Perspective-n-Points (PnP) is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest interpreting PnP as a differentiable layer, so that 2D-3D point correspondences can be partly learned by backpropagating the gradient w.r.t. object pose. Yet, learning the entire set of unrestricted 2D-3D points from scratch fails to converge with existing approaches, since the deterministic pose is inherently non-differentiable. In this paper, we propose the EPro-PnP, a probabilistic PnP layer for general end-to-end pose estimation, which outputs a distribution of pose on the SE(3) manifold, essentially bringing categorical Softmax to the continuous domain. The 2D-3D coordinates and corresponding weights are treated as intermediate variables learned by minimizing the KL divergence between the predicted and target pose distribution. The underlying principle unifies the existing approaches and resembles the attention mechanism. EPro-PnP significantly outperforms competitive baselines, closing the gap between PnP-based methods and the task-specific leaders on the LineMOD 6DoF pose estimation and nuScenes 3D object detection benchmarks.

Summary

  • The paper introduces a novel differentiable probabilistic pose formulation that overcomes non-differentiability in traditional PnP solvers.
  • It employs KL divergence minimization and an Adaptive Multiple Importance Sampling strategy to efficiently backpropagate pose errors.
  • Integration into dense and deformable correspondence networks demonstrates state-of-the-art performance on LineMOD and nuScenes.

The paper "EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation" (EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation, 2022) introduces a novel approach to address the challenge of training deep learning models end-to-end for monocular object pose estimation based on the Perspective-n-Points (PnP) algorithm. The core problem is that the standard PnP solver, which finds the optimal object pose from 2D-3D point correspondences, is inherently non-differentiable at certain points (e.g., due to pose ambiguity), hindering direct gradient backpropagation from a pose-level loss. Existing end-to-end methods typically learn only parts of the correspondence (e.g., 2D points, 3D points, or weights) or rely heavily on surrogate losses or regularization, which can limit performance and generalization.

EPro-PnP proposes interpreting the output of the PnP process not as a single deterministic pose, but as a probabilistic distribution over the SE(3) manifold (the space of rigid body transformations). This probabilistic formulation yields a probability density function for pose, which is differentiable, analogous to how the Softmax function provides a differentiable approximation for the discrete argmax operation in classification.

The paper defines the likelihood of observing the given 2D-3D correspondences $X$ for a specific pose $y$ based on the negative exponential of the cumulative weighted squared reprojection errors. Using Bayes' theorem with an uninformative prior, they derive a posterior pose distribution $p(y|X)$, which essentially corresponds to a normalized likelihood function.
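Concretely, writing the weighted reprojection residual of the $i$-th correspondence as $f_i(y)$, the posterior takes the form below (notation follows the paper up to minor symbol choices: $x_i$ are 3D points, $u_i$ are 2D points, $w_i$ are weights, and $\pi$ is the camera projection):

$$
f_i(y) = w_i \circ \big(\pi(R\,x_i + t) - u_i\big), \qquad y = (R, t) \in SE(3),
$$

$$
p(y|X) = \frac{\exp\big(-\tfrac{1}{2}\sum_{i=1}^{N} \lVert f_i(y) \rVert^2\big)}{\int \exp\big(-\tfrac{1}{2}\sum_{i=1}^{N} \lVert f_i(y') \rVert^2\big)\, \mathrm{d}y'}
$$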

Training is performed by minimizing the Kullback-Leibler (KL) divergence between this predicted pose distribution $p(y|X)$ and a target pose distribution $t(y)$, typically set as a narrow distribution centered at the ground truth pose $y_\text{gt}$. The resulting KL loss $L_\text{KL}$ consists of two main terms: the reprojection error at the ground truth pose and the logarithm of the normalizing integral over pose space.
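With a Dirac-like target centered at $y_\text{gt}$, the KL loss reduces (up to an additive constant) to the negative log-density of the ground truth under the predicted distribution:

$$
L_\text{KL} = \underbrace{\tfrac{1}{2}\sum_{i} \lVert f_i(y_\text{gt}) \rVert^2}_{\text{reprojection error at GT}} \;+\; \underbrace{\log \int \exp\Big(-\tfrac{1}{2}\sum_{i} \lVert f_i(y) \rVert^2\Big)\, \mathrm{d}y}_{\text{normalization integral}}
$$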

The main technical challenge is computing the integral term in $L_\text{KL}$. EPro-PnP tackles this using an efficient Monte Carlo approach based on the Adaptive Multiple Importance Sampling (AMIS) algorithm. AMIS iteratively refines a proposal distribution to better approximate the target integrand, allowing for efficient sampling of poses from regions of high likelihood. The backpropagation then occurs through the sampled poses and their computed importance weights.
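Below is a minimal, self-contained sketch of the AMIS idea on a toy quadratic energy in a flat vector parameterization. The paper itself samples on the SE(3) manifold with multivariate t (translation) and von Mises/ACG (rotation) proposals, so the Gaussian proposal and all names here are simplifying assumptions, not the authors' code:

```python
import torch

def energy(y):
    # Stand-in for E(y) = 0.5 * sum_i ||f_i(y)||^2; toy quadratic energy
    return 0.5 * (y ** 2).sum(-1)

def amis_log_integral(dim=6, iters=4, n=128):
    """Estimate log ∫ exp(-E(y)) dy via Adaptive Multiple Importance Sampling."""
    mean, std = torch.zeros(dim), torch.ones(dim)
    all_y, proposals = [], []
    for _ in range(iters):
        proposals.append((mean.clone(), std.clone()))
        all_y.append(mean + std * torch.randn(n, dim))  # draw from current proposal
        ys = torch.cat(all_y)                           # ALL samples so far
        # Deterministic-mixture weighting: each sample is scored against the
        # equal-weight mixture of every proposal used so far (the AMIS trick).
        logq = torch.stack([
            torch.distributions.Normal(m, s).log_prob(ys).sum(-1)
            for m, s in proposals
        ]).logsumexp(0) - torch.log(torch.tensor(float(len(proposals))))
        logw = -energy(ys) - logq                       # importance log-weights
        w = torch.softmax(logw, 0)
        mean = (w[:, None] * ys).sum(0)                 # moment-match the proposal
        std = ((w[:, None] * (ys - mean) ** 2).sum(0)).sqrt() + 1e-6
    # Self-normalized estimate: log ∫ exp(-E) ≈ logsumexp(logw) - log(#samples)
    return logw.logsumexp(0) - torch.log(torch.tensor(float(ys.shape[0])))

# Sanity check: for the toy energy the integral is (2π)^(dim/2),
# so the output should be close to 3 * log(2π) ≈ 5.51 for dim=6.
print(amis_log_integral())
```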

The gradients derived from this KL loss reveal an intuitive mechanism for learning the 2D-3D correspondence weights. The update rule balances two factors: reducing the reprojection error at the ground truth (down-weighting uncertain correspondences) and increasing the expected reprojection error over the predicted pose distribution (up-weighting discriminative correspondences that are sensitive to pose changes). This behavior resembles attention, focusing on reliable and informative point pairs.
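This balance is the standard energy-based gradient: with energy $E(y) = \tfrac{1}{2}\sum_i \lVert f_i(y) \rVert^2$ and any network output $\theta$ (such as a weight $w_i$),

$$
\frac{\partial L_\text{KL}}{\partial \theta} = \frac{\partial E(y_\text{gt})}{\partial \theta} \;-\; \mathbb{E}_{y \sim p(y|X)}\!\left[\frac{\partial E(y)}{\partial \theta}\right],
$$

so training decreases the energy at the ground truth while increasing it everywhere else the predicted distribution places mass.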

Additionally, EPro-PnP includes a derivative regularization loss $L_\text{reg}$, which encourages the local step of an iterative PnP solver (such as Gauss-Newton or Levenberg-Marquardt), taken at a locally optimal pose, to point towards the ground truth pose. This regularizes the derivatives around the PnP solution and improves stability, particularly when combined with coordinate regression losses.
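A sketch of one way to write such a regularizer, assuming a stacked residual $f(y)$ with Jacobian $J$ evaluated at the solver's solution $y^\ast$ (the exact form in the paper may differ):

$$
\Delta y = -\big(J^\top J + \lambda I\big)^{-1} J^\top f(y^\ast), \qquad L_\text{reg} = \ell\big(y^\ast \boxplus \Delta y,\; y_\text{gt}\big),
$$

where $\Delta y$ is the Levenberg-Marquardt increment, $\boxplus$ applies the increment on the SE(3) manifold, and $\ell$ is a robust distance to the ground truth.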

The paper demonstrates the generality and flexibility of EPro-PnP by integrating it into two distinct network architectures:

  1. Dense Correspondence Network: An adaptation of the CDPN [CDPN] framework for 6DoF pose estimation. The translation head is removed, and the network predicts dense pixel-wise 2D-3D correspondences with corresponding weights normalized by a spatial Softmax (see the sketch after this list). Trained on the LineMOD dataset [linemod], this setup shows a significant improvement over the CDPN baseline when converted to a pure PnP approach, achieving state-of-the-art results among geometric methods.
  2. Deformable Correspondence Network: A novel architecture for 3D object detection on the nuScenes dataset [nuscenes], built upon the FCOS3D [fcos3d] framework. Inspired by Deformable DETR [deformabledetr], this network learns sparse, deformable 2D-3D correspondences from scratch using deformable attention. The network predicts object-level properties (score, size, velocity, attribute) and point-level properties (3D coordinates in Normalized Object Coordinate space, weights). This approach demonstrates that EPro-PnP can enable learning correspondences entirely from image data without relying on strong geometric priors like predefined 3D models, and outperforms direct pose prediction methods in terms of pose accuracy and overall detection score.
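The following is a minimal sketch of the spatial-Softmax weight normalization mentioned in item 1, assuming a dense per-pixel weight map with one channel per residual component (shapes and names are ours, not the released code):

```python
import torch
import torch.nn.functional as F

def spatial_softmax_weights(logits: torch.Tensor) -> torch.Tensor:
    """Normalize dense correspondence weights over the spatial dimensions.

    logits: (B, 2, H, W) raw per-pixel weight logits (one channel per x/y
    residual component). Softmax over H*W fixes the total weight mass, so
    the global confidence scale cannot drift during end-to-end training.
    """
    b, c, h, w = logits.shape
    weights = F.softmax(logits.flatten(2), dim=-1)  # sum to 1 over H*W
    return weights.view(b, c, h, w)
```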

The experiments show that EPro-PnP successfully enables end-to-end learning for PnP-based methods. On LineMOD, it significantly improves the performance of a PnP-only variant of CDPN. On nuScenes, the deformable correspondence network trained with EPro-PnP achieves top-tier performance, particularly excelling in pose accuracy and demonstrating the ability to model pose ambiguity (e.g., for symmetric objects) through multimodal pose distributions.

Key practical implementation details include:

  • Using a Huber kernel to robustify the reprojection errors in the PnP solver and the Monte Carlo loss (see the sketch after this list).
  • Implementing a custom PyTorch batch Levenberg-Marquardt solver for efficiency.
  • Using appropriate proposal distributions (multivariate t for translation, von Mises/ACG for rotation) for Monte Carlo sampling.
  • Employing random sampling initialization for the PnP solver, especially crucial for the deformable network and general cases.
  • Adding auxiliary losses (e.g., dense reprojection, coordinate regression) can further enhance performance, although EPro-PnP is capable of learning correspondences from scratch.
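To illustrate the first bullet, here is a hedged sketch of a Huber-robustified, weighted reprojection residual in PyTorch, using the common trick of rescaling the raw residual so that its squared norm equals the robustified cost (shapes, names, and the exact kernel convention are assumptions, not the paper's code):

```python
import torch

def weighted_residuals(x3d, x2d, w2d, R, t, K, delta=1.0):
    """Huber-robustified weighted reprojection residuals.

    x3d: (N, 3) object-space points, x2d: (N, 2) image points,
    w2d: (N, 2) positive correspondence weights, R: (3, 3) rotation,
    t: (3,) translation, K: (3, 3) camera intrinsics, delta: Huber threshold.
    Returns (N, 2) residuals f_i(y) whose squared norms sum to the
    robustified reprojection cost.
    """
    cam = x3d @ R.T + t                      # transform into the camera frame
    proj = cam @ K.T
    uv = proj[:, :2] / proj[:, 2:3]          # perspective division
    r = uv - x2d                             # raw reprojection error
    # Apply the Huber kernel by rescaling r so that ||f||^2 equals the
    # robustified squared error: rho(s) = s^2 if s <= delta, else delta*(2s - delta).
    norm = r.norm(dim=-1, keepdim=True).clamp_min(1e-9)
    scale = torch.where(norm <= delta,
                        torch.ones_like(norm),
                        (delta * (2 * norm - delta)).sqrt() / norm)
    return w2d * (r * scale)
```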

While training with Monte Carlo sampling adds computational overhead compared to simpler methods, the paper argues that the flexibility and performance gains justify the cost, and that runtime can be controlled by adjusting the sampling parameters. The core principle of converting a nested, non-differentiable optimization problem into a differentiable probabilistic layer is presented as a general concept, potentially applicable to other "declarative networks" [declarative].