Overview of RGB-based Category-level Object Pose Estimation via Decoupled Metric Scale Recovery
The paper "RGB-based Category-level Object Pose Estimation via Decoupled Metric Scale Recovery" addresses the challenge of object pose estimation using RGB images without the reliance on depth sensors. This task is essential in computer vision and robotics, particularly for applications needing accurate spatial perception in three-dimensional spaces. Traditionally, RGB-D methods rely heavily on depth sensors, but these can be limiting due to hardware requirements and environmental constraints. The approach proposed in this paper overcomes the inherent scale ambiguity in RGB-only methods by decoupling object pose and size estimation, thereby enhancing robustness and accuracy.
Methodology
The proposed framework decouples 6D pose estimation from size estimation, so that inaccuracies in scale prediction cannot corrupt the rotational component of the pose. At its core, the method uses a pre-trained monocular estimator to extract local geometric cues (predicted depth and normal maps), which aid in identifying inlier 2D-3D correspondences. A separate network branch regresses the metric scale of each object from category-level statistics, keeping absolute scale out of the correspondence-based pose computation.
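To make the decoupling concrete, the sketch below shows one plausible form of such a scale branch: a small MLP that regresses a multiplicative residual around a per-category mean size (the category-level statistic), so the pose branch can operate entirely in a scale-normalized object frame. This is a minimal illustration under stated assumptions, not the authors' implementation; the names `DecoupledScaleHead`, `feat_dim`, and `mean_scale` are hypothetical.

```python
import torch
import torch.nn as nn

class DecoupledScaleHead(nn.Module):
    """Hypothetical scale branch: predicts a metric-scale residual
    relative to a per-category mean size (a category-level statistic)."""

    def __init__(self, feat_dim: int, num_categories: int):
        super().__init__()
        # Per-category mean scale, e.g. computed from the training set.
        self.register_buffer("mean_scale", torch.ones(num_categories))
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, global_feat: torch.Tensor, category: torch.Tensor) -> torch.Tensor:
        # Predict a multiplicative residual around the category mean,
        # so the pose branch never has to reason about absolute scale.
        residual = self.mlp(global_feat).squeeze(-1)  # (B,)
        return self.mean_scale[category] * (1.0 + residual)
```

In a design like this, the pose branch consumes only scale-normalized coordinates (e.g., NOCS-style), so an error in the predicted scale shifts the recovered translation but leaves the rotation untouched.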
The method employs a transformer architecture for 2D-3D correspondence learning, fusing semantic features, geometric features derived from the predicted depth and normal maps, and category-level shape priors. The pose is then computed with a RANSAC-PnP (Perspective-n-Point) solver, which provides robustness against outlier correspondences.
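Since the final solve is a standard RANSAC-PnP over predicted 2D-3D correspondences, it can be expressed with OpenCV's `solvePnPRansac`. The helper below is an illustrative sketch rather than the authors' code, and the threshold and iteration values are assumptions.

```python
import cv2
import numpy as np

def solve_pose_ransac_pnp(pts3d: np.ndarray, pts2d: np.ndarray, K: np.ndarray):
    """Recover rotation/translation from predicted 2D-3D correspondences.

    pts3d: (N, 3) points in the (scale-normalized) object frame.
    pts2d: (N, 2) matching pixel coordinates.
    K:     (3, 3) camera intrinsics.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64),
        K.astype(np.float64), distCoeffs=None,
        reprojectionError=3.0,   # assumed inlier threshold in pixels
        iterationsCount=1000,    # assumed RANSAC budget
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        raise RuntimeError("RANSAC-PnP failed to find a consistent pose")
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> 3x3 rotation matrix
    return R, tvec.reshape(3), inliers
```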
Experimental Results
The authors provide extensive experimental validation on synthetic and real datasets (CAMERA25 and REAL275). Their method outperforms state-of-the-art RGB-based methods, excelling particularly in rotation accuracy: it improves significantly over baselines such as Synthesis, MSOS, and OLD-Net on both rotation metrics and 3D Intersection-over-Union (IoU) thresholds. This robustness supports the efficacy of decoupling scale from pose estimation, as well as the value of the 2.5D geometric cues supplied by the depth and normal predictions.
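For reference, rotation accuracy on these benchmarks is conventionally measured as the geodesic angle between the predicted and ground-truth rotation matrices, then thresholded (e.g., 5 or 10 degrees). A minimal implementation of that metric, independent of the paper's evaluation code, looks like this:

```python
import numpy as np

def rotation_error_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic distance on SO(3): angle of the relative rotation, in degrees."""
    cos = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    # Clip to guard against numerical drift outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```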
Implications and Future Directions
The separation of pose and scale estimation in RGB imagery holds potential for real-world applications where depth sensors are impractical. Such a method could enhance augmented reality and robotic perception, especially on consumer devices such as smartphones and VR headsets.
However, the framework's reliance on category-level statistics and pre-trained models like Omnidata might limit generalization across diverse environments and objects with high intra-class variability. Future research could focus on integrating unsupervised learning techniques to better handle unknown categories and leveraging more advanced neural architectures to strengthen feature extraction and correspondence learning.
In summary, this paper advances the field of RGB-based object pose estimation by presenting a method that effectively separates the estimation of metric scale from 6D pose computation, thus circumventing traditional limitations of depth sensor-based approaches. The experimental findings validate its potential application in environments where hardware simplicity and cost-effectiveness are paramount.