Critical Analysis of "CosyPose: Consistent Multi-view Multi-object 6D Pose Estimation"
The paper "CosyPose: Consistent Multi-view Multi-object 6D Pose Estimation" addresses the challenging problem of estimating the 6D pose of multiple known objects within a scene using multiple RGB images captured from unknown camera positions. The authors propose a novel method, CosyPose, which advances the state of the art in this domain by combining single-view 6D pose estimation with robust multi-view consistency checks and a global optimization step for enhanced accuracy.
Key Contributions
The paper introduces a three-stage approach to the problem:
- Single-View 6D Pose Estimation: The first stage builds on DeepIM and adds several technical improvements, including an EfficientNet-B3 backbone, a continuous rotation parametrization, and a disentangled pose loss that separates depth, image-plane translation, and rotation terms (a sketch of the rotation parametrization follows this list). It outputs initial 6D pose hypotheses for each detected object in each view, which serve as the foundation for the following stages.
- Multi-view Candidate Matching: Using a RANSAC-based matching strategy, CosyPose associates object candidates across views while estimating the relative camera poses, discarding candidates that cannot be explained consistently so that only a coherent scene-level hypothesis remains (see the matching sketch below).
- Global Scene Refinement: The final scene model is refined through an object-level bundle adjustment that jointly optimizes object poses and camera poses to minimize the overall reprojection error across all views (see the refinement sketch below).
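To make the rotation parametrization concrete, the following sketch maps the continuous 6D representation of Zhou et al. to a rotation matrix via Gram-Schmidt orthogonalization. The function name and the use of NumPy are my own; CosyPose applies this inside its PyTorch network, so treat the code as an illustration rather than the authors' implementation.

```python
import numpy as np

def rotation_from_6d(x):
    """Gram-Schmidt mapping from a 6D vector (two stacked 3D vectors) to a
    rotation matrix, following the continuous parametrization of Zhou et al.
    (function name and NumPy usage are illustrative)."""
    a1 = np.asarray(x[:3], dtype=float)
    a2 = np.asarray(x[3:6], dtype=float)
    b1 = a1 / np.linalg.norm(a1)            # normalize the first vector
    a2 = a2 - np.dot(b1, a2) * b1           # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)            # normalize the second vector
    b3 = np.cross(b1, b2)                   # third column completes the basis
    return np.stack([b1, b2, b3], axis=1)   # orthonormal columns, det = +1
```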
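The matching stage can be sketched as a toy RANSAC over object candidates from two views, under simplifying assumptions of my own (two views only, a translation-only agreement test, and hypothetical data structures). The actual method handles arbitrarily many views and scores hypotheses with a symmetry-aware distance.

```python
import numpy as np

def ransac_match_two_views(cands1, cands2, n_iters=200, tol=0.02, rng=None):
    """Toy RANSAC over object candidates from two views.

    Each candidate is a (label, T) pair, where T is the 4x4 pose of the object
    in that view's camera frame. A relative-camera-pose hypothesis T_c1_c2 is
    formed from one same-label pair and scored by how many other same-label
    pairs it brings into agreement (translation only, for brevity)."""
    rng = rng or np.random.default_rng(0)
    pairs = [(i, j) for i, (l1, _) in enumerate(cands1)
                    for j, (l2, _) in enumerate(cands2) if l1 == l2]
    if not pairs:
        return None, 0
    best_T, best_inliers = None, -1
    for _ in range(n_iters):
        i, j = pairs[rng.integers(len(pairs))]
        # Hypothesis: candidates i and j are the same physical object.
        T_c1_c2 = cands1[i][1] @ np.linalg.inv(cands2[j][1])
        inliers = 0
        for l1, T1 in cands1:
            for l2, T2 in cands2:
                if l1 == l2 and np.linalg.norm(
                        T1[:3, 3] - (T_c1_c2 @ T2)[:3, 3]) < tol:
                    inliers += 1
                    break
        if inliers > best_inliers:
            best_T, best_inliers = T_c1_c2, inliers
    return best_T, best_inliers
```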
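The refinement stage can likewise be illustrated as a small nonlinear least-squares problem: camera and object poses are stacked into one parameter vector, and each detection contributes the distance between object model points placed by the composed global estimate and by that detection's single-view estimate. This is a simplified sketch (no symmetry handling, an axis-angle parametrization, and a SciPy optimizer of my choosing), not the authors' implementation.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def to_matrix(p):
    """6-vector (axis-angle rotation, translation) -> 4x4 homogeneous pose."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(p[:3]).as_matrix()
    T[:3, 3] = p[3:]
    return T

def residuals(x, detections, model_points, n_cams, n_objs):
    """detections: list of (cam_index, obj_index, T_det), where T_det is the
    single-view pose estimate of the object in that camera's frame."""
    cams = [to_matrix(x[6 * i: 6 * i + 6]) for i in range(n_cams)]
    objs = [to_matrix(x[6 * (n_cams + j): 6 * (n_cams + j) + 6])
            for j in range(n_objs)]
    pts = np.hstack([model_points, np.ones((len(model_points), 1))])
    res = []
    for c, o, T_det in detections:
        T_pred = np.linalg.inv(cams[c]) @ objs[o]   # object in camera c's frame
        res.append(((pts @ T_pred.T) - (pts @ T_det.T))[:, :3].ravel())
    return np.concatenate(res)

# Usage (with hypothetical inputs): stack initial camera/object 6-vectors into
# x0 and run
#   least_squares(residuals, x0,
#                 args=(detections, model_points, n_cams, n_objs))
```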
Experimental Evaluation
The effectiveness of CosyPose is validated empirically on two challenging and diverse datasets, YCB-Video and T-LESS, where it improves significantly over existing methods. On YCB-Video, CosyPose reaches an AUC of 89.8% in the single-view setting, surpassing prior art such as DeepIM; on T-LESS, the method improves the VSD metric by 34.2% over previous approaches.
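For readers unfamiliar with the YCB-Video protocol, the AUC values above are areas under an accuracy-versus-threshold curve of a pose-error measure (ADD-S in its symmetry-aware variant). A minimal sketch, with function names of my own and assuming the standard 10 cm threshold cap:

```python
import numpy as np
from scipy.spatial import cKDTree

def add_s(T_pred, T_gt, model_points):
    """ADD-S: mean distance from each predicted model point to its nearest
    ground-truth model point (tolerant to object symmetries)."""
    P = model_points @ T_pred[:3, :3].T + T_pred[:3, 3]
    G = model_points @ T_gt[:3, :3].T + T_gt[:3, 3]
    return cKDTree(G).query(P)[0].mean()

def auc_of_errors(errors, max_err=0.10, steps=1000):
    """Area under the accuracy-vs-threshold curve, thresholds in [0, 10 cm],
    normalized to [0, 1]."""
    thresholds = np.linspace(0.0, max_err, steps)
    accuracy = [(np.asarray(errors) < t).mean() for t in thresholds]
    return np.trapz(accuracy, thresholds) / max_err
```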
Implications and Future Work
The practical implications of CosyPose are substantial, particularly for robotic applications in which accurate object localization and interaction in unknown environments are crucial. The approach's robustness to object symmetries, occlusions, and missing detections is a critical step towards real-world applicability.
On the theoretical side, the work motivates further exploration of object-level bundle adjustment frameworks and of robustness to the pose ambiguities created by symmetric objects. Future work might extend CosyPose to dynamic scenes or incorporate additional sensory inputs, such as depth, for more complete scene reconstruction.
Overall, CosyPose represents a significant step towards robust and reliable multi-view 6D pose estimation, with potential applications spanning robotics, augmented reality, and beyond. The combination of efficient single-view estimation, robust multi-view consistency checks, and global optimization provides a solid foundation for future research on 3D scene understanding.