
Linear-Covariance Loss for End-to-End Learning of 6D Pose Estimation (2303.11516v2)

Published 21 Mar 2023 in cs.CV

Abstract: Most modern image-based 6D object pose estimation methods learn to predict 2D-3D correspondences, from which the pose can be obtained using a PnP solver. Because of the non-differentiable nature of common PnP solvers, these methods are supervised via the individual correspondences. To address this, several methods have designed differentiable PnP strategies, thus imposing supervision on the pose obtained after the PnP step. Here, we argue that this conflicts with the averaging nature of the PnP problem, leading to gradients that may encourage the network to degrade the accuracy of individual correspondences. To address this, we derive a loss function that exploits the ground truth pose before solving the PnP problem. Specifically, we linearize the PnP solver around the ground-truth pose and compute the covariance of the resulting pose distribution. We then define our loss based on the diagonal covariance elements, which entails considering the final pose estimate yet not suffering from the PnP averaging issue. Our experiments show that our loss consistently improves the pose estimation accuracy for both dense and sparse correspondence based methods, achieving state-of-the-art results on both Linemod-Occluded and YCB-Video.


Summary

  • The paper introduces a novel Linear-Covariance (LC) loss that linearly approximates the PnP solver around the ground truth to directly supervise 6D pose estimation.
  • It combines covariance, prior, and linear loss terms within a Laplace NLL framework, ensuring accurate and consistent gradient signals during training.
  • Experiments on LM-O and YCB-V datasets demonstrate that integrating LC loss improves both dense and sparse correspondence methods, achieving state-of-the-art performance.

The paper "Linear-Covariance Loss for End-to-End Learning of 6D Pose Estimation" (2023) addresses a core challenge in geometry-driven 6D object pose estimation pipelines: the difficulty of training the initial correspondence prediction network with a differentiable loss function that directly supervises the final 6D pose.

Most geometry-driven methods predict 2D-3D correspondences (e.g., 2D image points and their corresponding 3D object points) and then use a non-differentiable Perspective-n-Point (PnP) solver to estimate the 6D pose. Training typically relies on supervising the intermediate correspondences, which does not guarantee optimal performance on the final pose estimation task. While recent works have introduced differentiable PnP layers to enable end-to-end training with pose-driven losses, the authors argue that these methods still suffer from the "averaging nature" of PnP solvers. This averaging effect, which arises whenever more than the minimal number of correspondences (typically 4) is used, can produce gradients that inadvertently encourage the network to degrade the accuracy of some individual correspondences, yielding conflicting training signals.

The proposed solution is the Linear-Covariance (LC) Loss. Instead of solving the PnP problem and then computing a loss based on the resulting pose, the LC loss leverages the ground-truth pose before solving PnP. It linearizes the PnP solver around the ground-truth pose and computes the covariance matrix of the resulting pose distribution. This covariance matrix, derived from the residuals of the predicted correspondences with respect to the ground-truth pose, reflects the uncertainty of the pose estimate that would be obtained if the linearized PnP was used. The LC loss is then defined based on the diagonal elements of this covariance matrix, which represent the squared errors of the pose parameters. By minimizing this loss, the network is encouraged to predict correspondences that, when fed into a PnP-like process linearized around the ground truth, result in a pose estimate with low uncertainty.
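The linearization step described above can be sketched as follows, taking a weighted Gauss-Newton view of PnP. This is an illustrative reading, not the paper's exact formulation: the function name is hypothetical, and the rank-one outer product is used here only as a simple covariance proxy whose diagonal gives per-parameter squared errors.

```python
import numpy as np

def linearized_pnp_covariance(J, r, w):
    """Sketch of the linear-covariance idea: linearize the PnP problem
    around the ground-truth pose and propagate correspondence errors.

    J : (2N, 6) Jacobian of the reprojection residuals w.r.t. the 6D
        pose, evaluated at the ground-truth pose (treated as constant).
    r : (2N,)   reprojection residuals of the predicted correspondences
                at the ground-truth pose.
    w : (2N,)   per-correspondence weights predicted by the network.
    """
    W = np.diag(w)
    H = J.T @ W @ J                 # Gauss-Newton Hessian at the GT pose
    H_inv = np.linalg.inv(H)
    # One step of the linearized solver: the pose error that a PnP solver
    # linearized at the ground truth would commit given these residuals.
    delta = H_inv @ J.T @ W @ r
    # Rank-one covariance proxy; the loss only uses its diagonal,
    # i.e. the per-parameter squared pose errors.
    cov = np.outer(delta, delta)
    return delta, np.diag(cov)
```

Driving the diagonal entries toward zero pushes the network to predict correspondences whose residuals, after the linearized solve, leave no pose error in any parameter.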

The LC loss function is formulated based on a Laplace Negative Log Likelihood (NLL) framework. It includes three components:

  1. $E_{cov}$: a covariance loss (square root of the sum of the diagonal covariance elements) that directly measures the uncertainty of the pose estimate given the predicted correspondences.
  2. $E_{prior}$: a prior loss based on the inverse Hessian of the PnP NLL at the ground-truth pose, encouraging the predicted weights to act as priors on the reprojection errors.
  3. $E_{linear}$: a linear loss based on the squared difference between the ground-truth pose and the pose estimated by the PnP solver linearized around the ground truth, encouraging the correspondences and weights to support an accurate pose estimate.

The total LC loss combines these terms within the Laplace NLL structure: $L_{LC} = \log(E_{prior}) + 0.5\cdot\frac{E_{cov}+E_{linear}}{E_{prior}}$. This formulation minimizes the covariance and linear errors while encouraging the prior error (derived from the predicted weights) to predict their combined magnitude. Crucially, the gradients of the loss are computed only with respect to the correspondence residuals and weights, which are outputs of the network, while the linearized PnP components (such as the matrix A and the Hessian) are treated as constants derived from the ground truth and the predicted geometry. This avoids backpropagating through the full, potentially problematic PnP solve.
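The Laplace-NLL combination can be written down directly. This is a minimal sketch of the scalar combination only; the `eps` guard against a vanishing prior term is our own numerical assumption, not from the paper:

```python
import numpy as np

def lc_loss(e_cov, e_linear, e_prior, eps=1e-8):
    """Combine the three LC error terms in the Laplace NLL structure:
    L_LC = log(E_prior) + 0.5 * (E_cov + E_linear) / E_prior."""
    e_prior = max(e_prior, eps)  # assumed guard against log(0) / div by 0
    return np.log(e_prior) + 0.5 * (e_cov + e_linear) / e_prior
```

Note that the loss is minimized both by shrinking $E_{cov} + E_{linear}$ and by matching $E_{prior}$ to their scale, which is the usual behavior of an NLL with a predicted scale parameter.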

For practical implementation, the network architecture follows a standard encoder-decoder structure operating on cropped object regions (Figure 3 in the paper). The first-stage network predicts dense 3D coordinates and weights (for dense methods such as GDR-Net) or sparse keypoint heatmaps with associated standard deviations (for sparse methods). For dense prediction, visibility masks are used to select correspondences; for sparse prediction, the inverse of the predicted standard deviations serves as the weights. The authors also adapt ZebraPose's binary vertex encoding to be coordinate-wise, making it differentiable without a lookup table.

The LC loss can be computed efficiently by performing the covariance calculation in a compact 6D pose representation and then transforming it to the target representation (e.g., 8 bounding box corners). The gradients are clipped to handle potential outliers or unstable cases early in training, and the Huber function is used for robustness. The LC loss is added to the existing base loss functions of the baseline methods (GDR-Net and ZebraPose).
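The Huber function mentioned above is the standard robust penalty: quadratic near zero, linear in the tails, so the gradient magnitude is bounded for outlier residuals. A minimal sketch (the threshold value is illustrative, not the paper's setting):

```python
import numpy as np

def huber(x, delta=1.0):
    """Huber penalty: 0.5*x^2 for |x| <= delta, linear beyond.
    The gradient is clipped to +/- delta in the tails, which tames
    outliers and unstable cases early in training."""
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * x**2, delta * (a - 0.5 * delta))
```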

The method is evaluated on the Linemod-Occluded (LM-O) and YCB-Video (YCB-V) datasets using real and physically-based rendering (PBR) synthetic training data. Standard metrics like ADD(-S) and AUC of ADD(-S) are used.
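For reference, the ADD and ADD-S metrics can be sketched as below. This follows the standard definitions (mean distance between model points under the ground-truth and predicted poses; closest-point distance for symmetric objects) rather than the paper's exact evaluation code:

```python
import numpy as np

def add_metric(pts, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding model points
    transformed by the GT and predicted poses."""
    gt = pts @ R_gt.T + t_gt
    pred = pts @ R_pred.T + t_pred
    return np.linalg.norm(gt - pred, axis=1).mean()

def add_s_metric(pts, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: for symmetric objects, mean distance from each GT point
    to its closest predicted point (not its fixed correspondence)."""
    gt = pts @ R_gt.T + t_gt
    pred = pts @ R_pred.T + t_pred
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=-1)
    return d.min(axis=1).mean()
```

A pose is typically counted as correct when ADD(-S) falls below a fraction (commonly 10%) of the object diameter; the AUC variant integrates the accuracy over a range of thresholds.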

Experimental results show that applying the LC loss consistently improves the performance of both dense (GDR-Net) and sparse (a GDR-Net variant) correspondence-based methods compared to their baselines. Notably, integrating the LC loss with ZebraPose, a state-of-the-art method, achieves new state-of-the-art results on both LM-O and YCB-V (Tables 1, 2, and supplementary tables).

A key advantage highlighted is the "gradient correctness". The authors show that traditional differentiable PnP losses (BPnP, EPro-PnP) can produce incorrect gradients for some correspondences, especially pixels near or outside the object boundary, leading to inconsistent supervision. The LC loss, by linearizing around the ground truth, maintains a correctness close to 100% throughout training, providing more consistent guidance and enabling the network to learn robust correspondences even in extrapolated regions (Figure 5). This consistency also contributes to faster training runtime compared to iterative differentiable PnP solvers.

Ablation studies demonstrate the importance of each component of the LC loss ($E_{cov}$, $E_{linear}$, $E_{prior}$) for achieving optimal performance (Table 4). Detaching residuals or weights from $E_{cov}$, removing $E_{linear}$, or removing $E_{prior}$ all lead to performance drops and qualitatively impact the learned correspondence and weight maps (Figure 6). The studies also show that the LC loss is effective with both 3D-space and 2D-space pose representations, allowing adaptation to different application needs (e.g., robotics requiring 3D accuracy vs. AR requiring 2D reprojection accuracy). Furthermore, using the Laplace NLL formulation was found to be significantly more effective than a Gaussian formulation (Table 6).

In summary, the Linear-Covariance loss is a novel pose-driven loss for end-to-end training of 6D object pose estimation networks that utilize PnP solvers. By linearizing the PnP problem around the ground truth, it avoids the issues of directly backpropagating through an iterative solver's averaging effect. The loss, based on the estimated pose covariance, provides consistent gradient signals that encourage the network to predict correspondences and weights leading to a low-uncertainty pose estimate. This approach leads to improved pose accuracy and state-of-the-art results when integrated into existing pipelines. A limitation is that it requires an initial signal for correspondence learning, so it complements, rather than replaces, existing correspondence-level supervision. Future work includes applying the LC loss to category-level pose estimation where precise object models are not available.