
Linear-Covariance Loss for End-to-End Learning of 6D Pose Estimation (2303.11516v2)

Published 21 Mar 2023 in cs.CV

Abstract: Most modern image-based 6D object pose estimation methods learn to predict 2D-3D correspondences, from which the pose can be obtained using a PnP solver. Because of the non-differentiable nature of common PnP solvers, these methods are supervised via the individual correspondences. To address this, several methods have designed differentiable PnP strategies, thus imposing supervision on the pose obtained after the PnP step. Here, we argue that this conflicts with the averaging nature of the PnP problem, leading to gradients that may encourage the network to degrade the accuracy of individual correspondences. To address this, we derive a loss function that exploits the ground truth pose before solving the PnP problem. Specifically, we linearize the PnP solver around the ground-truth pose and compute the covariance of the resulting pose distribution. We then define our loss based on the diagonal covariance elements, which entails considering the final pose estimate yet not suffering from the PnP averaging issue. Our experiments show that our loss consistently improves the pose estimation accuracy for both dense and sparse correspondence based methods, achieving state-of-the-art results on both Linemod-Occluded and YCB-Video.


Summary

  • The paper introduces a novel Linear-Covariance (LC) loss that linearly approximates the PnP solver around the ground truth to directly supervise 6D pose estimation.
  • It combines covariance, prior, and linear loss terms within a Laplace NLL framework, ensuring accurate and consistent gradient signals during training.
  • Experiments on LM-O and YCB-V datasets demonstrate that integrating LC loss improves both dense and sparse correspondence methods, achieving state-of-the-art performance.

The paper "Linear-Covariance Loss for End-to-End Learning of 6D Pose Estimation" (2023) addresses a core challenge in geometry-driven 6D object pose estimation pipelines: the difficulty of training the initial correspondence prediction network with a differentiable loss function that directly supervises the final 6D pose.

Most geometry-driven methods predict 2D-3D correspondences (e.g., 2D image points and their corresponding 3D object points) and then use a non-differentiable Perspective-n-Point (PnP) solver to estimate the 6D pose. Training typically relies on supervising the intermediate correspondences, which does not guarantee optimal performance on the final pose estimation task. While recent works have introduced differentiable PnP layers to enable end-to-end training with pose-driven losses, the authors argue that these methods still suffer from the "averaging nature" of PnP solvers. This averaging effect, which arises whenever more than the minimal number of correspondences (typically 4) is used, can produce gradients that inadvertently encourage the network to degrade the accuracy of some individual correspondences, yielding conflicting training signals.

The proposed solution is the Linear-Covariance (LC) Loss. Instead of solving the PnP problem and then computing a loss based on the resulting pose, the LC loss leverages the ground-truth pose before solving PnP. It linearizes the PnP solver around the ground-truth pose and computes the covariance matrix of the resulting pose distribution. This covariance matrix, derived from the residuals of the predicted correspondences with respect to the ground-truth pose, reflects the uncertainty of the pose estimate that would be obtained if the linearized PnP was used. The LC loss is then defined based on the diagonal elements of this covariance matrix, which represent the squared errors of the pose parameters. By minimizing this loss, the network is encouraged to predict correspondences that, when fed into a PnP-like process linearized around the ground truth, result in a pose estimate with low uncertainty.
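The linearization step described above can be sketched as follows, taking a weighted Gauss-Newton view of PnP. This is an illustrative reading, not the paper's exact formulation: the function name is hypothetical, and the rank-one outer product is used here only as a simple covariance proxy whose diagonal gives per-parameter squared errors.

```python
import numpy as np

def linearized_pnp_covariance(J, r, w):
    """Sketch of the linear-covariance idea: linearize the PnP problem
    around the ground-truth pose and propagate correspondence errors.

    J : (2N, 6) Jacobian of the reprojection residuals w.r.t. the 6D
        pose, evaluated at the ground-truth pose (treated as constant).
    r : (2N,)   reprojection residuals of the predicted correspondences
                at the ground-truth pose.
    w : (2N,)   per-correspondence weights predicted by the network.
    """
    W = np.diag(w)
    H = J.T @ W @ J                 # Gauss-Newton Hessian at the GT pose
    H_inv = np.linalg.inv(H)
    # One step of the linearized solver: the pose error that a PnP solver
    # linearized at the ground truth would commit given these residuals.
    delta = H_inv @ J.T @ W @ r
    # Rank-one covariance proxy; the loss only uses its diagonal,
    # i.e. the per-parameter squared pose errors.
    cov = np.outer(delta, delta)
    return delta, np.diag(cov)
```

Driving the diagonal entries toward zero pushes the network to predict correspondences whose residuals, after the linearized solve, leave no pose error in any parameter.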

The LC loss function is formulated based on a Laplace Negative Log Likelihood (NLL) framework. It includes three components:

  1. $E_{cov}$: a covariance loss (square root of the sum of the diagonal covariance elements) that directly measures the uncertainty of the pose estimate given the predicted correspondences.
  2. $E_{prior}$: a prior loss based on the inverse Hessian of the PnP NLL at the ground-truth pose, encouraging the predicted weights to act as priors on the reprojection errors.
  3. $E_{linear}$: a linear loss based on the squared difference between the ground-truth pose and the pose estimated by the PnP solver linearized around the ground truth, encouraging the correspondences and weights to support an accurate pose estimate.

The total LC loss combines these terms within the Laplace NLL structure: $L_{LC} = \log(E_{prior}) + 0.5\cdot\frac{E_{cov}+E_{linear}}{E_{prior}}$. This formulation minimizes the covariance and linear errors while encouraging the prior error (derived from the predicted weights) to predict their combined magnitude. Crucially, the gradients of the loss are computed only with respect to the correspondence residuals and weights, which are outputs of the network, while the linearized PnP components (such as the matrix A and the Hessian) are treated as constants derived from the ground truth and the predicted geometry. This avoids backpropagating through the full, potentially problematic PnP solve.
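The Laplace-NLL combination can be written down directly. This is a minimal sketch of the scalar combination only; the `eps` guard against a vanishing prior term is our own numerical assumption, not from the paper:

```python
import numpy as np

def lc_loss(e_cov, e_linear, e_prior, eps=1e-8):
    """Combine the three LC error terms in the Laplace NLL structure:
    L_LC = log(E_prior) + 0.5 * (E_cov + E_linear) / E_prior."""
    e_prior = max(e_prior, eps)  # assumed guard against log(0) / div by 0
    return np.log(e_prior) + 0.5 * (e_cov + e_linear) / e_prior
```

Note that the loss is minimized both by shrinking $E_{cov} + E_{linear}$ and by matching $E_{prior}$ to their scale, which is the usual behavior of an NLL with a predicted scale parameter.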

For practical implementation, the network architecture follows a standard encoder-decoder structure operating on cropped object regions (Figure 3 in the paper). The first-stage network predicts dense 3D coordinates and weights (for dense methods such as GDR-Net) or sparse keypoint heatmaps with associated standard deviations (for sparse methods). For dense prediction, visibility masks are used to select correspondences; for sparse prediction, the inverse of the predicted standard deviations serves as the weights. The authors also adapt ZebraPose's binary vertex encoding to be coordinate-wise, making it differentiable without a lookup table.

The LC loss can be computed efficiently by performing the covariance calculation in a compact 6D pose representation and then transforming it to the target representation (e.g., 8 bounding box corners). The gradients are clipped to handle potential outliers or unstable cases early in training, and the Huber function is used for robustness. The LC loss is added to the existing base loss functions of the baseline methods (GDR-Net and ZebraPose).
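The Huber function mentioned above is the standard robust penalty: quadratic near zero, linear in the tails, so the gradient magnitude is bounded for outlier residuals. A minimal sketch (the threshold value is illustrative, not the paper's setting):

```python
import numpy as np

def huber(x, delta=1.0):
    """Huber penalty: 0.5*x^2 for |x| <= delta, linear beyond.
    The gradient is clipped to +/- delta in the tails, which tames
    outliers and unstable cases early in training."""
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * x**2, delta * (a - 0.5 * delta))
```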

The method is evaluated on the Linemod-Occluded (LM-O) and YCB-Video (YCB-V) datasets using real and physically-based rendering (PBR) synthetic training data. Standard metrics like ADD(-S) and AUC of ADD(-S) are used.
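For reference, the ADD and ADD-S metrics can be sketched as below. This follows the standard definitions (mean distance between model points under the ground-truth and predicted poses; closest-point distance for symmetric objects) rather than the paper's exact evaluation code:

```python
import numpy as np

def add_metric(pts, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding model points
    transformed by the GT and predicted poses."""
    gt = pts @ R_gt.T + t_gt
    pred = pts @ R_pred.T + t_pred
    return np.linalg.norm(gt - pred, axis=1).mean()

def add_s_metric(pts, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: for symmetric objects, mean distance from each GT point
    to its closest predicted point (not its fixed correspondence)."""
    gt = pts @ R_gt.T + t_gt
    pred = pts @ R_pred.T + t_pred
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=-1)
    return d.min(axis=1).mean()
```

A pose is typically counted as correct when ADD(-S) falls below a fraction (commonly 10%) of the object diameter; the AUC variant integrates the accuracy over a range of thresholds.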

Experimental results show that applying the LC loss consistently improves the performance of both dense (GDR-Net) and sparse (a GDR-Net variant) correspondence-based methods compared to their baselines. Notably, integrating the LC loss with ZebraPose, a state-of-the-art method, achieves new state-of-the-art results on both LM-O and YCB-V (Tables 1, 2, and supplementary tables).

A key advantage highlighted is the "gradient correctness". The authors show that traditional differentiable PnP losses (BPnP, EPro-PnP) can produce incorrect gradients for some correspondences, especially pixels near or outside the object boundary, leading to inconsistent supervision. The LC loss, by linearizing around the ground truth, maintains a correctness close to 100% throughout training, providing more consistent guidance and enabling the network to learn robust correspondences even in extrapolated regions (Figure 5). This consistency also contributes to faster training runtime compared to iterative differentiable PnP solvers.

Ablation studies demonstrate the importance of each component of the LC loss ($E_{cov}$, $E_{linear}$, $E_{prior}$) for achieving optimal performance (Table 4). Detaching residuals or weights from $E_{cov}$, removing $E_{linear}$, or removing $E_{prior}$ all lead to performance drops and qualitatively impact the learned correspondence and weight maps (Figure 6). The studies also show that the LC loss is effective with both 3D-space and 2D-space pose representations, allowing adaptation to different application needs (e.g., robotics requiring 3D accuracy vs. AR requiring 2D reprojection accuracy). Furthermore, using the Laplace NLL formulation was found to be significantly more effective than a Gaussian formulation (Table 6).

In summary, the Linear-Covariance loss is a novel pose-driven loss for end-to-end training of 6D object pose estimation networks that utilize PnP solvers. By linearizing the PnP problem around the ground truth, it avoids the issues of directly backpropagating through an iterative solver's averaging effect. The loss, based on the estimated pose covariance, provides consistent gradient signals that encourage the network to predict correspondences and weights leading to a low-uncertainty pose estimate. This approach leads to improved pose accuracy and state-of-the-art results when integrated into existing pipelines. A limitation is that it requires an initial signal for correspondence learning, so it complements, rather than replaces, existing correspondence-level supervision. Future work includes applying the LC loss to category-level pose estimation where precise object models are not available.