
Cameras as Rays: Pose Estimation via Ray Diffusion

Published 22 Feb 2024 in cs.CV and cs.LG (arXiv:2402.14817v3)

Abstract: Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparsely sampled views (<10). In contrast to existing approaches that pursue top-down prediction of global parametrizations of camera extrinsics, we propose a distributed representation of camera pose that treats a camera as a bundle of rays. This representation allows for a tight coupling with spatial image features improving pose precision. We observe that this representation is naturally suited for set-level transformers and develop a regression-based approach that maps image patches to corresponding rays. To capture the inherent uncertainties in sparse-view pose inference, we adapt this approach to learn a denoising diffusion model which allows us to sample plausible modes while improving performance. Our proposed methods, both regression- and diffusion-based, demonstrate state-of-the-art performance on camera pose estimation on CO3D while generalizing to unseen object categories and in-the-wild captures.

References (46)
  1. Direct linear transformation from comparator coordinates into object space coordinates in close-range photogrammetry. Photogrammetric engineering & remote sensing, 81(2):103–107, 2015.
  2. RelocNet: Continuous Metric Learning Relocalisation using Neural Nets. In ECCV, 2018.
  3. SURF: Speeded Up Robust Features. In ECCV, 2006.
  4. Extreme Rotation Estimation using Dense Correlation Volumes. In CVPR, 2021.
  5. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM. T-RO, 2021.
  6. Sparse 3d reconstruction via object-centric ray sampling. arXiv preprint arXiv:2309.03008, 2023.
  7. Wide-Baseline Relative Camera Pose Estimation with Directional Learning. In CVPR, 2021.
  8. MonoSLAM: Real-time Single Camera SLAM. TPAMI, 2007.
  9. SuperPoint: Self-supervised Interest Point Detection and Description. In CVPR-W, 2018.
  10. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR, 2021.
  11. Efficient Generic Calibration Method for General Cameras with Single Centre of Projection. Computer Vision and Image Understanding, 114(2):220–233, 2010.
  12. A general imaging model and a method for finding its parameters. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2, pp. 108–115. IEEE, 2001.
  13. Denoising Diffusion Probabilistic Models. NeurIPS, 2020.
  14. Few-View Object Reconstruction with Unknown Categories and Camera Poses. ArXiv, 2212.04492, 2022.
  15. A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses. TPAMI, 28(8):1335–1340, 2006.
  16. MegaDepth: Learning Single-View Depth Prediction from Internet Photos. In CVPR, 2018.
  17. RelPose++: Recovering 6D Poses from Sparse-view Observations. arXiv preprint arXiv:2305.04926, 2023.
  18. Pixel-Perfect Structure-from-Motion with Featuremetric Refinement. In ICCV, 2021.
  19. SparseNeuS: Fast Generalizable Neural Surface Reconstruction from Sparse Views. In ECCV, 2022.
  20. David G Lowe. Distinctive Image Features from Scale-invariant Keypoints. IJCV, 2004.
  21. An Iterative Image Registration Technique with an Application to Stereo Vision. In IJCAI, 1981.
  22. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras. T-RO, 2017.
  23. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. T-RO, 2015.
  24. DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193, 2023.
  25. Scalable Diffusion Models with Transformers. In ICCV, 2023.
  26. Julius Plücker. Analytisch-geometrische Entwicklungen, volume 2. GD Baedeker, 1828.
  27. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction. In ICCV, 2021.
  28. The 8-Point Algorithm as an Inductive Bias for Relative Pose Prediction by ViTs. In 3DV, 2022.
  29. From Coarse to Fine: Robust Hierarchical Localization at Large Scale. In CVPR, 2019.
  30. SuperGlue: Learning Feature Matching with Graph Neural Networks. In CVPR, 2020.
  31. Structure-from-Motion Revisited. In CVPR, 2016.
  32. Pixelwise View Selection for Unstructured Multi-View Stereo. In ECCV, 2016.
  33. BAD SLAM: Bundle Adjusted Direct RGB-D SLAM. In CVPR, 2019.
  34. Why Having 10,000 Parameters in Your Camera Model is Better Than Twelve. In CVPR, 2020.
  35. RANSAC-Flow: Generic Two-stage Image Alignment. In ECCV, 2020.
  36. SparsePose: Sparse-View Camera Pose Regression and Refinement. In CVPR, 2023.
  37. Photo Tourism: Exploring Photo Collections in 3D. In SIGGRAPH. ACM, 2006.
  38. BA-Net: Dense Bundle Adjustment Network. In ICLR, 2019.
  39. Bundle Adjustment—A Modern Synthesis. In International workshop on vision algorithms, 1999.
  40. Attention is All You Need. NeurIPS, 2017.
  41. PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment. In ICCV, 2023.
  42. Volumetric Correspondence Networks for Optical Flow. NeurIPS, 32, 2019.
  43. NeRS: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild. In NeurIPS, 2021.
  44. RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild. In ECCV, 2022.
  45. Stereo magnification: Learning view synthesis using multiplane images. SIGGRAPH, 37, 2018.
  46. SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction. In CVPR, 2023.

Summary

  • The paper introduces a novel ray-based representation that redefines camera pose estimation using denoising diffusion models.
  • It employs a transformer-based architecture to regress ray representations, converting them into precise camera parameters with enhanced generalization.
  • Experimental results show improved rotation and center accuracies, even for in-the-wild captures and unseen object categories.

Introduction

The paper "Cameras as Rays: Pose Estimation via Ray Diffusion" addresses the task of estimating camera poses from sparsely sampled views (fewer than 10), a persistent challenge for applications requiring high-fidelity 3D reconstruction. Whereas prior approaches rely on top-down prediction of global parameterizations of camera extrinsics, the authors propose a distributed representation that treats a camera as a bundle of rays. Tightly coupling this representation with spatial image features improves pose precision, and adapting denoising diffusion models to it allows sampling plausible modes while handling the uncertainty inherent in sparse-view pose inference.

Methodology

Ray Representation

The authors redefine camera parametrization by treating cameras as collections of rays instead of predicting global rotations and translations. Plücker coordinates are employed to parameterize each ray, providing a robust, distributed encoding of the camera parameters.

Figure 1: Recovering Sparse-view Camera Parameters by Denoising Rays.
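A Plücker ray is the pair ⟨d, m⟩ of a unit direction d and moment m = c × d, where c is the camera center. The sketch below (a minimal NumPy illustration under a standard pinhole-camera assumption; the function name and conventions are ours, not the paper's implementation) builds per-pixel Plücker rays from intrinsics and extrinsics:

```python
import numpy as np

def pixel_rays_plucker(K, R, t, pixels):
    """Convert pixel coordinates to Plucker ray coordinates <d, m>.

    K: (3,3) intrinsics, R: (3,3) world-to-camera rotation,
    t: (3,) translation, pixels: (N,2) pixel coordinates.
    Returns an (N,6) array of unit directions d and moments m = c x d.
    """
    c = -R.T @ t                       # camera center in the world frame
    # Unproject homogeneous pixels into world-space directions.
    uv1 = np.concatenate([pixels, np.ones((len(pixels), 1))], axis=1)
    d = (R.T @ np.linalg.inv(K) @ uv1.T).T
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    m = np.cross(c, d)                 # moment encodes the ray's offset from the origin
    return np.concatenate([d, m], axis=1)
```

Because the moment is taken about the world origin, rays through the origin have m = 0, and any point on a ray yields the same ⟨d, m⟩, which is what makes the encoding a well-defined per-patch target.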

Conversion Between Representations

Converting a traditional camera parameterization into a ray bundle involves unprojecting a ray from the camera center through each pixel coordinate. Conversely, the system solves an optimization problem to convert a ray bundle back into conventional camera parameters, giving flexible and precise camera pose predictions.

Figure 2: Converting Between Camera and Ray Representations.
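Going from rays back to a camera is overdetermined: every ray's moment constrains the center c through m = c × d, which is linear in c. A minimal least-squares sketch of this direction (our own illustrative code, not the paper's exact optimization, which also recovers rotation and intrinsics):

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]_x with [v]_x u = v x u."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def center_from_rays(rays):
    """Least-squares camera center from Plucker rays (N,6).

    Each ray satisfies m = c x d, i.e. -[d]_x c = m, so stacking all
    rays gives a linear system A c = b solvable in closed form.
    """
    A = np.concatenate([-skew(d) for d in rays[:, :3]], axis=0)
    b = rays[:, 3:].reshape(-1)
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    return c
```

Two non-parallel rays already determine the center uniquely; with noisy predicted rays the least-squares solution averages the constraints.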

Regression and Diffusion Approach

A transformer-based architecture processes image patches and regresses the corresponding rays. This regression-based method already achieves state-of-the-art results. The approach is then extended to a denoising diffusion model that accounts for the ambiguity and uncertainty in sparse visual data, training the denoiser with an L2 loss on the ray predictions.

Figure 3: Denoising Ray Diffuser Network.
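Concretely, a DDPM-style objective noises the clean ray bundle and trains the network to predict the injected noise under an L2 loss. The sketch below (a toy NumPy illustration of the standard DDPM forward process; the schedule values and function names are our assumptions, not the paper's hyperparameters) produces one training pair:

```python
import numpy as np

rng = np.random.default_rng(0)

def ddpm_schedule(T=100, beta_min=1e-4, beta_max=0.02):
    """Linear beta schedule; returns the cumulative alpha-bar terms."""
    betas = np.linspace(beta_min, beta_max, T)
    return np.cumprod(1.0 - betas)

def diffusion_training_pair(rays, alpha_bars):
    """Sample one (noisy rays, noise target, timestep) training tuple.

    rays: (N,6) clean Plucker rays. A denoiser conditioned on image
    features and t would be trained with ||eps_pred - eps||^2.
    """
    t = rng.integers(len(alpha_bars))
    eps = rng.standard_normal(rays.shape)
    noisy = np.sqrt(alpha_bars[t]) * rays + np.sqrt(1.0 - alpha_bars[t]) * eps
    return noisy, eps, t
```

The per-patch structure of the ray representation is what lets a set-level transformer denoise all rays jointly while attending to the aligned image features.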

Experimental Results

The paper reports superior camera rotation and center accuracies compared to traditional pipelines such as COLMAP and learning-based approaches such as RelPose and PoseDiffusion. The ray-based approaches generalize across both seen and unseen categories and remain accurate as the number of input images grows.

Figure 4: Qualitative Comparison Between Predicted Camera Poses.

Figure 5: Generalization to In-the-wild Self-captures.

Discussion

The proposed ray-based representation effectively handles the complexities of sparse-view camera pose estimation, offering significant improvements over existing methods. Leveraging this encoding for probabilistic modeling captures multiple plausible modes of the camera pose distribution, addressing uncertainties that arise from symmetries and limited observations.

Figure 6: Modeling Uncertainty Via Sampling Modes.
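Sampling the learned diffusion model several times surfaces these distinct modes. A toy sketch of repeated DDPM reverse sampling (our own simplified illustration; `denoise_fn` stands in for the trained ray denoiser and is not part of the paper's code):

```python
import numpy as np

def sample_modes(denoise_fn, shape, betas, n_samples=4, seed=0):
    """Draw several samples from a DDPM reverse process.

    denoise_fn(x, t) -> predicted noise eps. Running the chain from
    different initial noise yields distinct plausible ray bundles,
    exposing pose ambiguity (e.g. for symmetric objects).
    """
    rng = np.random.default_rng(seed)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    samples = []
    for _ in range(n_samples):
        x = rng.standard_normal(shape)
        for t in range(len(betas) - 1, -1, -1):
            eps = denoise_fn(x, t)
            x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
            if t > 0:  # no noise is added at the final step
                x += np.sqrt(betas[t]) * rng.standard_normal(shape)
        samples.append(x)
    return np.stack(samples)
```

Each returned sample is a full ray bundle that can be converted back to camera parameters, so clustering the samples gives a discrete set of pose hypotheses.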

The work presents substantial potential for future exploration in single-view and dense multi-view setups, including the possibility of jointly inferring rays and geometry under geometric-consistency constraints.

Conclusion

"Cameras as Rays: Pose Estimation via Ray Diffusion" introduces a promising framework for camera pose estimation when few views are available, showcasing robust performance improvements and offering insight into distributed camera representations. The approach rethinks traditional parameterizations, enabling high precision and adaptability in 3D reconstruction while extending to unseen object categories and real-world captures.
