- The paper presents a dual-decoder architecture that combines explicit and implicit predictions to refine 6D object pose and size estimates.
- It couples spherical convolutions with a self-adaptive loss that enforces consistency between the two predictions, using only single-view RGB-D input.
- Extensive experiments on benchmarks such as CAMERA25 and REAL275 demonstrate superior performance, with significant mAP improvements over existing methods.
DualPoseNet: Category-level 6D Object Pose and Size Estimation
The paper presents DualPoseNet, an approach for category-level 6D object pose and size estimation built around a dual pose network trained with a refined learning procedure that enforces pose consistency. The core problem is estimating the full pose configuration (rotation, translation, and size) of object instances observed from a single arbitrary view in cluttered scenes, a capability central to augmented reality, robotics, and autonomous driving.
Methodology
Dual Pose Network Architecture
The main technical innovation of DualPoseNet is its use of two pose decoders on top of a shared pose encoder. The encoder is built on spherical convolutions to learn pose-sensitive features, and the two decoders, sketched in code after this list, play complementary roles:
- Explicit Decoder: directly regresses the rotation, translation, and size of the object with an MLP.
- Implicit Decoder: reconstructs the input point cloud in the canonical object space, providing an implicit representation of the pose. During testing, a self-adaptive loss enforces consistency between this canonical reconstruction and the explicit prediction, so pose estimates can be refined even though no CAD models of the test instances are available.
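The following sketch illustrates this dual-decoder layout in PyTorch. It is a minimal illustration rather than the authors' implementation: the shared encoder is treated as a black box, and the layer widths, quaternion parameterization of rotation, and point count are assumptions made for the example.

```python
import torch
import torch.nn as nn


class ExplicitDecoder(nn.Module):
    """MLP head that directly regresses rotation (as a quaternion), translation and size."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 4 + 3 + 3),  # quaternion + translation + size
        )

    def forward(self, feat):
        out = self.mlp(feat)
        quat = nn.functional.normalize(out[:, :4], dim=1)  # unit quaternion
        trans, size = out[:, 4:7], out[:, 7:10]
        return quat, trans, size


class ImplicitDecoder(nn.Module):
    """Head that reconstructs the observed points in the canonical object space."""
    def __init__(self, feat_dim=1024, num_points=1024):
        super().__init__()
        self.num_points = num_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, num_points * 3),
        )

    def forward(self, feat):
        return self.mlp(feat).view(-1, self.num_points, 3)


class DualPoseHeads(nn.Module):
    """Shared pose encoder (passed in as a black box) feeding both decoders."""
    def __init__(self, encoder, feat_dim=1024):
        super().__init__()
        self.encoder = encoder            # e.g. the spherical-fusion encoder
        self.explicit = ExplicitDecoder(feat_dim)
        self.implicit = ImplicitDecoder(feat_dim)

    def forward(self, observation):
        feat = self.encoder(observation)  # (B, feat_dim) pose-sensitive feature
        return self.explicit(feat), self.implicit(feat)
```

Sharing one encoder forces a single pose-sensitive embedding to serve both the direct regression and the canonical reconstruction, which is what makes a consistency constraint between the two decoders meaningful.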
Spherical Convolutions and Fusion
The spherical convolutions provide rotation equivariance, which helps the encoder capture pose-sensitive shape features from the RGB-D input. A Spherical Fusion module embedded within the encoder exchanges features between the appearance (RGB) and shape (point cloud) streams, so the two modalities inform each other as the encoder deepens; a sketch of the fusion pattern is given below.
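This is a minimal sketch of such a fusion step, assuming a two-stream encoder. Proper spherical convolutions require a dedicated implementation, so ordinary 2D convolutions over an equirectangular (theta, phi) grid stand in for them here; the block structure, channel counts, and residual connections are illustrative assumptions rather than the paper's exact module.

```python
import torch
import torch.nn as nn


class SphericalFusionBlock(nn.Module):
    """Exchanges features between appearance (RGB) and shape (geometry) streams."""
    def __init__(self, rgb_ch, geo_ch, out_ch):
        super().__init__()
        # Each stream keeps its own convolution over the spherical grid.
        self.rgb_conv = nn.Conv2d(rgb_ch, out_ch, kernel_size=3, padding=1)
        self.geo_conv = nn.Conv2d(geo_ch, out_ch, kernel_size=3, padding=1)
        # Fusion: concatenate both streams and mix them back into each branch.
        self.fuse = nn.Conv2d(2 * out_ch, 2 * out_ch, kernel_size=1)

    def forward(self, rgb_feat, geo_feat):
        # rgb_feat, geo_feat: (B, C, H, W) signals sampled on a spherical grid
        r = torch.relu(self.rgb_conv(rgb_feat))
        g = torch.relu(self.geo_conv(geo_feat))
        mixed = torch.relu(self.fuse(torch.cat([r, g], dim=1)))
        r_out, g_out = mixed.chunk(2, dim=1)  # fused features routed back to each stream
        return r + r_out, g + g_out           # residuals preserve each stream's identity
```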
Results
Extensive experiments cover both category-level benchmarks (CAMERA25 and REAL275) and instance-level benchmarks (YCB-Video and LineMOD). DualPoseNet outperforms existing methods, with the clearest gains on stricter, high-precision metrics such as IoU75.
Numerical Highlights:
- On REAL275, DualPoseNet achieved an mAP of 44.5% under the combined threshold of IoU50, 10° rotation error, and 10% scale error, significantly outperforming existing methods.
- Noteworthy improvements also appear on the synthetic CAMERA25 benchmark, where DualPoseNet outperforms prior methods with mAPs peaking at 86.4% under the IoU75 and translation/rotation thresholds.
Practical and Theoretical Implications
The dual pose estimation mechanism is an important advance for scenarios that lack CAD models for refinement as a post-processing step. The self-adaptive loss enforcing consistency between the two decoders points to a promising way of refining pose predictions at test time; one plausible formulation of such a consistency term is sketched below.
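As a concrete illustration, the sketch below shows one way such a consistency term can be written: the observed points, mapped into the canonical frame by the explicit prediction, should agree with the implicit decoder's canonical reconstruction. The quaternion convention, the scalar scale derived from the size vector, and the L1 penalty are assumptions for the example, not the paper's exact loss.

```python
import torch


def quat_to_rotmat(q):
    """Convert unit quaternions (B, 4) in (w, x, y, z) order to rotation matrices (B, 3, 3)."""
    w, x, y, z = q.unbind(dim=1)
    return torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=1).view(-1, 3, 3)


def pose_consistency_loss(points, quat, trans, size, canonical_pred):
    """points: (B, N, 3) observed points; canonical_pred: (B, N, 3) implicit decoder output."""
    R = quat_to_rotmat(quat)                              # (B, 3, 3)
    scale = size.norm(dim=1, keepdim=True).unsqueeze(-1)  # (B, 1, 1) overall object scale
    # Map observed points into the normalized canonical space: R^T (p - t) / s
    canonical_from_explicit = torch.bmm(points - trans.unsqueeze(1), R) / scale
    return (canonical_from_explicit - canonical_pred).abs().mean()
```

At test time, minimizing a term of this kind with respect to the pose parameters alone allows the explicit prediction to be refined without any CAD model of the observed instance.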
For real-world applications, the ability to infer precise 6D poses without relying on CAD models is pivotal for scalability and practical deployment, particularly in domains that require fast, adaptable object perception.
Future Developments
The research paves the way for further advances in 6D pose estimation, for example by replacing spherical convolutions with alternative rotation-equivariant backbones and by applying the architecture in more diverse and complex environments. Another interesting avenue is extending the method to purely monocular (RGB-only) input, which would broaden its utility across a wider range of real-world applications.
Overall, DualPoseNet embodies a significant step forward in category-level 6D object pose estimation, offering both practical benefits and a solid foundation for ongoing research.