U-ARE-ME: Uncertainty-Aware Rotation Estimation in Manhattan Environments (2403.15583v1)

Published 22 Mar 2024 in cs.CV

Abstract: Camera rotation estimation from a single image is a challenging task, often requiring depth data and/or camera intrinsics, which are generally not available for in-the-wild videos. Although external sensors such as inertial measurement units (IMUs) can help, they often suffer from drift and are not applicable in non-inertial reference frames. We present U-ARE-ME, an algorithm that estimates camera rotation along with uncertainty from uncalibrated RGB images. Using a Manhattan World assumption, our method leverages the per-pixel geometric priors encoded in single-image surface normal predictions and performs optimisation over the SO(3) manifold. Given a sequence of images, we can use the per-frame rotation estimates and their uncertainty to perform multi-frame optimisation, achieving robustness and temporal consistency. Our experiments demonstrate that U-ARE-ME performs comparably to RGB-D methods and is more robust than sparse feature-based SLAM methods. We encourage the reader to view the accompanying video at https://callum-rhodes.github.io/U-ARE-ME for a visual overview of our method.

Summary

  • The paper introduces U-ARE-ME, a novel uncertainty-aware framework that estimates camera rotation using RGB images under the Manhattan World assumption.
  • It employs single-frame and multi-frame optimization with pixel-wise confidence measures, achieving robust performance at 40 fps on an NVIDIA 4090 GPU.
  • Extensive experiments on synthetic and real-world datasets confirm its competitive accuracy, resilience to image degradation, and utility in diverse environments.

Uncertainty-Aware Rotation Estimation in Manhattan Environments

Introduction

The task of estimating camera rotation from a sequence of monocular images is critical for several applications in computer vision, such as visual odometry, image stabilization, and augmented reality. Traditional methods have relied on a combination of sensors, including cameras and inertial measurement units (IMUs), to achieve this goal. However, IMUs are prone to drift and are unsuitable for non-inertial reference frames, while depth sensors or known camera intrinsics are not always available, particularly in "in-the-wild" videos. To replace or complement these sensor-based approaches, the paper introduces U-ARE-ME, an algorithm for uncertainty-aware rotation estimation from uncalibrated RGB images using the Manhattan World assumption.

Related Work

The paper positions itself within the scope of rotation estimation methodologies, emphasizing the distinctions between approaches that use RGB, RGB-D, and depth data. Methods relying on RGB-D data and surface normal alignment, while accurate, are unsuitable for in-the-wild videos lacking depth sensing. Meanwhile, classic RGB approaches that depend on feature matching or vanishing point detection offer limited robustness and require known camera intrinsics. Highlighting recent advances in fast and accurate surface normal estimation from RGB inputs, the paper motivates the use of dense pixel-wise geometric priors for rotation estimation, an avenue largely unexplored by existing methods.

Methodology

U-ARE-ME capitalizes on improvements in single-image surface normal estimation to align predicted normals with the scene's principal directions under the Manhattan World assumption. The innovation lies in an uncertainty-aware optimization framework that incorporates pixel-wise confidence measures and extends to multi-frame analysis for enhanced temporal consistency. The key components of the method include:

  • Single-Frame Optimization: Formulates an uncertainty-weighted cost function that aligns the world-to-camera rotation matrix with the predicted surface normals, minimized with the Levenberg-Marquardt algorithm over the SO(3) manifold. This step accounts for the aleatoric uncertainty in surface normal predictions, down-weighting unreliable pixels near object boundaries or within textureless regions (see the first sketch after this list).
  • Multi-Frame Optimization: Employs a sliding-window, factor-graph optimization to integrate rotation estimates across frames, addressing the epistemic uncertainty that arises when few principal directions are visible. This step stabilizes global consistency and rejects outlier rotations using the covariance matrices estimated in the single-frame step (see the second sketch after this list).
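To make the single-frame step concrete, below is a minimal sketch of an uncertainty-weighted alignment cost. It is not the paper's exact formulation: it assumes per-pixel unit normals with scalar confidence weights, uses a simple one-minus-cosine residual to the best-matching Manhattan axis, and substitutes SciPy's Levenberg-Marquardt solver over a rotation-vector parameterization for the paper's optimisation directly on the SO(3) manifold.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

# The six signed Manhattan directions (+/-x, +/-y, +/-z) in the world frame.
MANHATTAN_DIRS = np.vstack([np.eye(3), -np.eye(3)])

def residuals(rotvec, normals, weights):
    """Confidence-weighted alignment error between predicted normals and
    the nearest Manhattan direction rotated into the camera frame.
    The residual 1 - cos(angle) is a stand-in for the paper's cost."""
    R = Rotation.from_rotvec(rotvec).as_matrix()   # world-to-camera rotation
    dirs_cam = MANHATTAN_DIRS @ R.T                # (6, 3) axes in camera frame
    cos_sim = normals @ dirs_cam.T                 # (N, 6) cosine similarities
    best = np.clip(cos_sim.max(axis=1), -1.0, 1.0) # best-matching axis per pixel
    return weights * (1.0 - best)

def estimate_rotation(normals, weights, rotvec0=np.zeros(3)):
    """Single-frame estimate: normals is (N, 3) unit vectors, weights is
    (N,) per-pixel confidences from the normal-prediction network."""
    sol = least_squares(residuals, rotvec0, args=(normals, weights), method="lm")
    return Rotation.from_rotvec(sol.x).as_matrix()
```

In practice the per-pixel weights would come from an uncertainty-aware normal estimator, and a covariance for the solved rotation (obtainable from the solver's Jacobian) is what feeds the multi-frame stage.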
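For the multi-frame step, the second sketch shows one way to fuse per-frame estimates over a sliding window as a rotation-only factor graph, here using GTSAM as an illustrative backend (the paper does not specify this library). The absolute-rotation priors are weighted by the single-frame covariances, a Huber robust kernel stands in for the paper's outlier rejection, and the between-frame rotations and noise values are assumed placeholders.

```python
import gtsam
from gtsam.symbol_shorthand import X

def fuse_window(abs_rotations, abs_covariances, rel_rotations):
    """Fuse absolute per-frame rotation estimates (3x3 matrices) with
    frame-to-frame relative rotations over a sliding window.
    abs_covariances: 3x3 covariances from the single-frame optimization."""
    graph = gtsam.NonlinearFactorGraph()
    values = gtsam.Values()
    for i, (R, cov) in enumerate(zip(abs_rotations, abs_covariances)):
        rot = gtsam.Rot3(R)
        # Absolute factor: single-frame estimate weighted by its covariance,
        # wrapped in a Huber kernel to down-weight outlier rotations.
        noise = gtsam.noiseModel.Gaussian.Covariance(cov)
        robust = gtsam.noiseModel.Robust.Create(
            gtsam.noiseModel.mEstimator.Huber.Create(0.1), noise)
        graph.add(gtsam.PriorFactorRot3(X(i), rot, robust))
        values.insert(X(i), rot)
    for i, R_rel in enumerate(rel_rotations):
        # Smoothness factor linking consecutive frames (placeholder sigma).
        odo_noise = gtsam.noiseModel.Isotropic.Sigma(3, 0.01)
        graph.add(gtsam.BetweenFactorRot3(X(i), X(i + 1),
                                          gtsam.Rot3(R_rel), odo_noise))
    result = gtsam.LevenbergMarquardtOptimizer(graph, values).optimize()
    return [result.atRot3(X(i)).matrix() for i in range(len(abs_rotations))]
```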

The method demonstrates real-time performance, achieving 40 fps on an NVIDIA 4090 GPU, illustrating a practical approach to accurate and robust rotation estimation in diverse environments without reliance on depth data or camera intrinsics.

Experiments and Results

The experimental validation of U-ARE-ME includes comparisons with existing RGB, RGB-D, and SLAM methods across various datasets, showcasing its competitive accuracy and robustness. In particular, the paper assesses the algorithm's performance on synthetic and real indoor scenes from the ICL-NUIM and TUM RGB-D datasets, and on challenging real-world scenarios from the ScanNet dataset. The results underscore the algorithm's resilience to image degradation, motion blur, and textureless environments. An ablation study further elucidates the contribution of each component of the method to the overall performance.

Implications and Future Directions

This research opens new avenues for rotation estimation in scenarios where traditional sensor setups are impractical or unavailable. The proposed method's reliance on RGB input alone, coupled with its robust handling of uncertainty, makes it a versatile tool for a wide range of applications in computer vision, robotics, and augmented reality. Future work could explore extending this approach to accommodate different world assumptions and further refine the optimization mechanisms to enhance accuracy and computational efficiency.

Conclusion

U-ARE-ME represents a significant step forward in the domain of camera rotation estimation, providing a novel, efficient, and accurate method that operates solely on RGB image sequences. By effectively leveraging advancements in surface normal estimation and introducing an innovative uncertainty-aware optimization framework, this algorithm sets a new benchmark for performance in challenging and diverse environments.