RePOSE: Fast 6D Object Pose Refinement via Deep Texture Rendering (2104.00633v2)

Published 1 Apr 2021 in cs.CV

Abstract: We present RePOSE, a fast iterative refinement method for 6D object pose estimation. Prior methods perform refinement by feeding zoomed-in input and rendered RGB images into a CNN and directly regressing an update of a refined pose. Their runtime is slow due to the computational cost of the CNN, which is especially prominent in multiple-object pose refinement. To overcome this problem, RePOSE leverages image rendering for fast feature extraction using a 3D model with a learnable texture. We call this deep texture rendering, which uses a shallow multi-layer perceptron to directly regress a view-invariant image representation of an object. Furthermore, we utilize differentiable Levenberg-Marquardt (LM) optimization to refine a pose quickly and accurately by minimizing the feature-metric error between the input and rendered image representations without the need for zooming in. These image representations are trained such that differentiable LM optimization converges within a few iterations. Consequently, RePOSE runs at 92 FPS and achieves state-of-the-art accuracy of 51.6% on the Occlusion LineMOD dataset - a 4.1% absolute improvement over the prior art, and comparable results on the YCB-Video dataset with a much faster runtime. The code is available at https://github.com/sh8/repose.


Summary

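The paper introduces a method that combines deep texture rendering, in which a shallow MLP directly regresses a view-invariant image representation of an object, with differentiable LM optimization to refine 6D object poses rapidly.
It reports state-of-the-art accuracy on Occlusion LineMOD (51.6%, a 4.1% absolute improvement over the prior art), results comparable to prior work on YCB-Video, and a runtime of 92 FPS.
The approach reduces computational load by rendering the object's feature representation from a 3D model with a learnable texture instead of running a CNN on rendered images at every iteration, which makes it well suited to real-time AR and robotic applications.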

RePOSE: Accelerating 6D Object Pose Refinement

RePOSE introduces a method for refining 6D object poses with deep texture rendering, which removes the main runtime bottleneck of previous approaches: running a Convolutional Neural Network (CNN) on zoomed-in input and rendered RGB images at every refinement step. RePOSE instead uses a shallow multi-layer perceptron to produce a view-invariant image representation of the object, and refines the pose quickly and precisely with differentiable Levenberg-Marquardt (LM) optimization.
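The LM refinement step can be illustrated with a minimal NumPy sketch. It assumes a toy setting that is not taken from the paper: the "pose" is a 2-D translation rather than a 6-DoF pose, the feature maps are analytic Gaussian-weighted channels rather than learned representations, the Jacobian is computed by finite differences rather than through a differentiable renderer, and the damping factor `lam` is fixed. All helper names (`feature_field`, `numerical_jacobian`, `lm_refine`) are hypothetical. What carries over is the structure of each iteration: render features at the current estimate, form the feature-metric residual against the input features, and solve the damped normal equations for a pose update.

```python
import numpy as np

def feature_field(points, center, sigma=2.0):
    """Toy 3-channel feature map of an 'object' located at `center` (hypothetical)."""
    d = points - center                                   # (N, 2) offsets from the object
    w = np.exp(-np.sum(d ** 2, axis=1) / (2.0 * sigma ** 2))
    return np.concatenate([w, d[:, 0] * w, d[:, 1] * w])  # channels stacked into (3N,)

def numerical_jacobian(residual_fn, params, eps=1e-6):
    """Forward-difference Jacobian of residual_fn with respect to params."""
    r0 = residual_fn(params)
    J = np.zeros((r0.size, params.size))
    for k in range(params.size):
        step = np.zeros_like(params)
        step[k] = eps
        J[:, k] = (residual_fn(params + step) - r0) / eps
    return J

def lm_refine(residual_fn, params, num_iters=10, lam=1e-3):
    """Levenberg-Marquardt with a fixed damping factor: minimizes 0.5 * ||r(params)||^2."""
    params = np.asarray(params, dtype=float).copy()
    for _ in range(num_iters):
        r = residual_fn(params)                  # feature-metric residual: rendered - input
        J = numerical_jacobian(residual_fn, params)
        H = J.T @ J                              # Gauss-Newton approximation of the Hessian
        g = J.T @ r                              # gradient of the cost
        # Marquardt damping: scale the diagonal of H to stabilize the update.
        delta = np.linalg.solve(H + lam * np.diag(np.diag(H)), -g)
        params = params + delta
    return params

# Pixel grid and "input image" features of an object at an unknown position.
xs = np.linspace(-5.0, 5.0, 21)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
true_center = np.array([0.7, -0.4])
input_features = feature_field(grid, true_center)

def residual(center):
    """Features 'rendered' at the current estimate minus the input features."""
    return feature_field(grid, center) - input_features

refined = lm_refine(residual, np.zeros(2))
print(refined)
```

With a small `lam` and a reasonable initialization, the iteration behaves like Gauss-Newton and converges in a few steps; a larger `lam` shortens the step toward gradient descent, which is more robust when the linearization is poor.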

Methodological Innovations

The core innovation of RePOSE is deep texture rendering, in which a 3D model with a learnable texture is rendered for rapid feature extraction. Because the learned texture is attached to the 3D shape, the object's shape (the 3D model) is separated from its appearance (the texture), and rendering the model at a candidate pose directly yields a 2D image representation without a CNN pass over a rendered image. The pose is then refined by minimizing the feature-metric error between the representation of the input image and the rendered representation, and the representations are trained so that this optimization converges within a few iterations. This substantially reduces computational cost compared with methods that repeatedly compute CNN feature maps from zoomed-in input and rendered images.
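Deep texture rendering can be sketched in the same spirit. This sketch assumes that the shallow MLP maps each mesh vertex's 3D coordinate to a D-dimensional feature vector (the paper's actual texture parameterization may differ), and it replaces mesh rasterization with nearest-pixel, z-buffered point splatting to stay short. All names, sizes, and camera intrinsics are hypothetical. The point it illustrates is that the texture features depend only on the object, not on the pose, so the MLP runs once per object and each new pose costs only a projection and a rasterization pass, with no CNN.

```python
import numpy as np

def shallow_mlp(x, W1, b1, W2, b2):
    """Two-layer perceptron with a ReLU hidden layer."""
    hidden = np.maximum(x @ W1 + b1, 0.0)
    return hidden @ W2 + b2

def project(vertices, R, t, K):
    """Pinhole projection of object-frame vertices to pixel coordinates and depths."""
    cam = vertices @ R.T + t                      # object frame -> camera frame
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]

def splat_features(uv, depth, feats, height, width):
    """Nearest-pixel z-buffered splatting, a stand-in for mesh rasterization."""
    feature_map = np.zeros((height, width, feats.shape[1]))
    zbuf = np.full((height, width), np.inf)
    cols = np.round(uv[:, 0]).astype(int)
    rows = np.round(uv[:, 1]).astype(int)
    for r, c, z, f in zip(rows, cols, depth, feats):
        if 0 <= r < height and 0 <= c < width and 0.0 < z < zbuf[r, c]:
            zbuf[r, c] = z                        # keep the closest vertex per pixel
            feature_map[r, c] = f
    return feature_map, np.isfinite(zbuf)         # mask of pixels covered by the object

rng = np.random.default_rng(0)
vertices = rng.uniform(-0.05, 0.05, size=(500, 3))   # hypothetical object "mesh" vertices
feat_dim, hidden_dim = 8, 32
W1 = rng.normal(scale=0.1, size=(3, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.1, size=(hidden_dim, feat_dim))
b2 = np.zeros(feat_dim)

# The texture does not depend on the pose, so the MLP runs once per object.
vertex_feats = shallow_mlp(vertices, W1, b1, W2, b2)

K = np.array([[200.0, 0.0, 32.0], [0.0, 200.0, 32.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
for t in (np.array([0.0, 0.0, 0.5]), np.array([0.01, -0.01, 0.5])):
    # Each candidate pose only needs projection + rasterization, no network pass.
    uv, depth = project(vertices, R, t, K)
    feature_map, mask = splat_features(uv, depth, vertex_feats, 64, 64)
    print(t, mask.sum(), "pixels covered")
```

A practical system would typically replace the splatting loop with a differentiable mesh rasterizer that interpolates vertex features across faces, so that gradients can reach both the pose and the learnable texture during training.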

Numerical Results and Performance

RePOSE demonstrates state-of-the-art performance on the Occlusion LineMOD and YCB-Video benchmarks. On Occlusion LineMOD it reaches an accuracy of 51.6%, a 4.1% absolute improvement over the prior art, and on YCB-Video it achieves results comparable to the state of the art with a much faster runtime. Running at 92 FPS, RePOSE is substantially faster than contemporary refinement techniques, and the advantage is largest in multiple-object pose refinement, where the per-iteration CNN cost of prior methods is especially prominent; such scenes are common in augmented reality and robotic applications.
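The reported figures can be unpacked with simple arithmetic: a throughput of 92 FPS corresponds to roughly 10.9 ms per frame, and a 4.1-point absolute improvement reaching 51.6% implies a previous best of 47.5%, which is about an 8.6% relative gain. The variable names below are illustrative only.

```python
fps = 92.0
ms_per_frame = 1000.0 / fps                   # time budget per frame at 92 FPS

repose_accuracy = 51.6                        # Occlusion LineMOD accuracy, percent
absolute_gain = 4.1                           # percentage points over the prior art
prior_best = repose_accuracy - absolute_gain
relative_gain = 100.0 * absolute_gain / prior_best

print(f"{ms_per_frame:.2f} ms per frame")     # 10.87 ms per frame
print(f"prior best: {prior_best:.1f}%")       # prior best: 47.5%
print(f"relative gain: {relative_gain:.1f}%") # relative gain: 8.6%
```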

Analysis and Comparisons

The effectiveness of RePOSE is examined through comparisons with classical refinement strategies and CNN-heavy approaches. These comparisons indicate that RePOSE is more data-efficient, relying less on large training datasets, and that it is robust to variations in object texture and illumination, which typically challenge classical techniques based on photometric error.

Compared with other feature-based refinement methods that warp features, RePOSE benefits from its rendering strategy: it renders a complete image representation from scratch at every iteration, which aids convergence even when the initial pose error is large.

Practical and Theoretical Implications

The practical implications of RePOSE are clear: it provides a fast, data-efficient pipeline for real-time 6D pose estimation in dynamic settings such as robotic manipulation and AR systems. From a theoretical standpoint, embedding differentiable LM optimization in a learned pipeline is a compelling example of combining classical optimization with deep features, and may encourage further research on hybrid models in vision and graphics.

Prospects for Future Work

Looking forward, more robust strategies for learning texture representations are a promising direction for improving RePOSE. Maintaining real-time performance in more complex and cluttered scenes will also be important. Reinforcement learning could potentially make such frameworks more adaptive, adjusting pose refinement in response to real-time feedback and changes in the environment.

RePOSE is a significant step in 6D object pose estimation: it balances accuracy and efficiency, extending AI vision systems to applications that demand real-time performance without sacrificing precision.


GitHub

GitHub - sh8/RePOSE: Official Pytorch implementation of RePOSE (ICCV2021) (85 stars)
