Recovering 6D Object Pose and Predicting Next-Best-View in Crowded Environments
This paper addresses two challenging computer vision problems, 6D object pose estimation and next-best-view prediction, in cluttered scenes with multiple objects and significant occlusions. Both tasks are central to applications such as robotics and augmented reality, where understanding the spatial configuration of objects from sensor data is crucial.
Methodological Approach
The authors propose a framework built on Hough Forests, which they extend to handle not only object classification but also 6D pose regression and next-best-view selection. Departing from manually designed features, which have struggled with real-world cluttered environments, the method uses unsupervised feature learning with Sparse Autoencoders to obtain depth-invariant descriptors from RGB-D patches.
Key components of the framework are as follows:
- Unsupervised Feature Learning: Instead of relying on handcrafted features, the authors extract patches from rendered synthetic images of the objects and train a multilevel Sparse Autoencoder network to learn discriminative features from raw RGB-D data without supervision, coping with variation in viewpoint, background clutter, and occlusion (a minimal autoencoder sketch follows this list).
- 6D Pose Estimation: Hough Forests cast votes in a 6D Hough space, jointly handling classification and pose regression to estimate the object pose. Because decisions are made at the patch level, the method remains robust in cluttered scenes where holistic approaches break down (a voting sketch follows this list).
- Next-Best-View Prediction: By analyzing the entropy of the current object hypotheses, the framework predicts the next best viewpoint for the scene. The prediction uses information stored in the leaves of the Hough Forest, enabling camera view planning that accounts for occlusion and reduces uncertainty in pose estimation (an entropy-based view-selection sketch follows this list).
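To make the feature-learning step concrete, here is a minimal sketch of a single sparse autoencoder layer over flattened RGB-D patches, written in PyTorch. The patch size (32x32, four channels), hidden width, and the KL-divergence sparsity penalty and its weight are illustrative assumptions; the paper's exact network depth and hyperparameters are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """One layer of a stacked sparse autoencoder over flattened RGB-D patches."""
    def __init__(self, input_dim, hidden_dim, sparsity_target=0.05, sparsity_weight=1e-3):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, input_dim)
        self.rho = sparsity_target    # desired average activation per hidden unit
        self.beta = sparsity_weight   # weight of the sparsity penalty

    def forward(self, x):
        h = torch.sigmoid(self.encoder(x))       # hidden code (the learned descriptor)
        x_hat = torch.sigmoid(self.decoder(h))   # reconstruction of the input patch
        return x_hat, h

    def loss(self, x, x_hat, h):
        recon = F.mse_loss(x_hat, x)
        rho_hat = h.mean(dim=0).clamp(1e-6, 1 - 1e-6)  # empirical mean activation
        kl = (self.rho * torch.log(self.rho / rho_hat)
              + (1 - self.rho) * torch.log((1 - self.rho) / (1 - rho_hat))).sum()
        return recon + self.beta * kl

# Hypothetical setup: 32x32 RGB-D patches (4 channels) mapped to a 512-D code
model = SparseAutoencoder(input_dim=32 * 32 * 4, hidden_dim=512)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
patches = torch.rand(256, 32 * 32 * 4)  # stand-in for normalized synthetic patches
for _ in range(10):
    x_hat, h = model(patches)
    loss = model.loss(patches, x_hat, h)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

After unsupervised training, the hidden code `h` would serve as the patch descriptor fed to the forest; stacking further layers follows the same pattern.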
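The pose estimation step can be illustrated with a simplified voting scheme: each foreground patch contributes a weighted vote for a full 6D pose, votes are accumulated in a sparse Hough space, and the strongest bin yields the pose hypothesis. The independent per-dimension binning, the bin sizes, and the simple mode selection below are simplifying assumptions, not the paper's exact vote aggregation.

```python
import numpy as np
from collections import defaultdict

def hough_vote_6d(votes, weights, trans_bin=0.01, rot_bin=np.deg2rad(5)):
    """Accumulate weighted 6D pose votes (x, y, z, roll, pitch, yaw) into a
    sparse Hough accumulator and return the center of the strongest bin.

    votes:   (N, 6) array, one pose vote per patch reaching a forest leaf
    weights: (N,)   array, e.g. leaf purity or class probability
    """
    bins = np.array([trans_bin] * 3 + [rot_bin] * 3)
    accumulator = defaultdict(float)
    for v, w in zip(votes, weights):
        key = tuple(np.round(v / bins).astype(int))  # quantize the 6D vote
        accumulator[key] += w
    best_key, best_score = max(accumulator.items(), key=lambda kv: kv[1])
    return np.array(best_key) * bins, best_score

# Toy usage: votes clustered around one pose plus uniform clutter votes
rng = np.random.default_rng(0)
good = rng.normal([0.1, 0.0, 0.5, 0.0, 0.0, 1.0], 0.005, size=(200, 6))
clutter = rng.uniform(-1, 1, size=(50, 6))
votes = np.vstack([good, clutter])
pose, score = hough_vote_6d(votes, np.ones(len(votes)))
```

Because every patch votes independently, occluded or cluttered regions only remove some votes rather than invalidating the whole detection, which is the intuition behind the patch-level robustness claim.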
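The next-best-view idea reduces to comparing the entropy of the hypothesis distribution before and after a candidate camera move. The sketch below assumes the predicted per-view hypothesis distributions are already available (in the paper such predictions are derived from statistics stored in the forest leaves) and simply picks the view with the largest expected entropy reduction; the view names and distributions are made up for illustration.

```python
import numpy as np

def hypothesis_entropy(probs):
    """Shannon entropy of a discrete distribution over object/pose hypotheses."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def next_best_view(current_probs, predicted_probs_per_view):
    """Choose the candidate viewpoint whose predicted hypothesis distribution
    gives the largest entropy reduction relative to the current view."""
    h_now = hypothesis_entropy(current_probs)
    gains = {view: h_now - hypothesis_entropy(p)
             for view, p in predicted_probs_per_view.items()}
    return max(gains, key=gains.get), gains

# Toy usage: three candidate views; "left" is predicted to disambiguate best
current = [0.4, 0.35, 0.25]
candidates = {
    "left":  [0.85, 0.10, 0.05],
    "right": [0.50, 0.30, 0.20],
    "top":   [0.45, 0.35, 0.20],
}
best_view, gains = next_best_view(current, candidates)
```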
Experimental Validation
The framework is evaluated extensively on both public and newly introduced datasets designed to mimic realistic domestic and industrial scenarios. The results show that it significantly outperforms state-of-the-art methods in accuracy and robustness. In particular, the new datasets expose the limitations of conventional methods when confronted with real-world complexity, highlighting the practical advances offered by the proposed approach.
Implications and Future Directions
The insights and methodologies presented in this paper have implications for both theory and practice:
- Feature Learning: The shift toward unsupervised feature learning for 6D pose estimation points toward more general and adaptable vision systems that operate reliably in diverse environments where manually engineered features fail.
- Robust Active Vision: The integration of next-best-view planning within pose estimation contexts opens avenues for more autonomous systems capable of dynamically adjusting their sensory strategies, thus bridging perception and action more closely.
- Complex Scene Understanding: The methodology emphasizes the importance of tackling occlusions and clutter, pivotal areas for the deployment of AI systems in uncontrolled environments.
For future research, exploring how different patch sizes could be combined within a deep learning architecture may further improve robustness. Moreover, integrating convolutional neural network structures into this framework could yield richer, hierarchically learned feature representations, directly addressing depth invariance and improving performance under varying lighting and texture conditions.
The proposed method not only takes a significant step toward solving the intertwined challenges of 6D pose estimation and view planning but also lays the groundwork for future intelligent, perceptually capable AI systems.