FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation (2403.03221v1)

Published 5 Mar 2024 in cs.CV

Abstract: Estimating relative camera poses between images has been a central problem in computer vision. Methods that find correspondences and solve for the fundamental matrix offer high precision in most cases. Conversely, methods predicting pose directly using neural networks are more robust to limited overlap and can infer absolute translation scale, but at the expense of reduced precision. We show how to combine the best of both methods; our approach yields results that are both precise and robust, while also accurately inferring translation scales. At the heart of our model lies a Transformer that (1) learns to balance between solved and learned pose estimations, and (2) provides a prior to guide a solver. A comprehensive analysis supports our design choices and demonstrates that our method adapts flexibly to various feature extractors and correspondence estimators, showing state-of-the-art performance in 6DoF pose estimation on Matterport3D, InteriorNet, StreetLearn, and Map-free Relocalization.

Authors (6)
  1. Chris Rockwell (9 papers)
  2. Nilesh Kulkarni (17 papers)
  3. Linyi Jin (12 papers)
  4. Jeong Joon Park (24 papers)
  5. Justin Johnson (56 papers)
  6. David F. Fouhey (32 papers)
Citations (4)

Summary

Overview of "FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation"

The paper "FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation" introduces an innovative approach in the field of computer vision, particularly focusing on the estimation of relative camera poses between images. This task is of critical importance in applications such as augmented reality, robotics, and autonomous driving. The paper seeks to enhance the precision and robustness of camera pose estimations through a novel integration of classical and learning-based methods.

Summary of the Approach

The authors address a trade-off between accuracy and robustness in existing methods. Classical methods that find correspondences and solve for the fundamental matrix offer high precision but struggle under large viewpoint changes and cannot recover absolute translation scale. Deep networks that regress pose directly are more robust to limited overlap and can infer translation scale, but are typically less precise.
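
For context, the classical pipeline described above can be sketched in a few lines with OpenCV. The synthetic correspondences and intrinsics below are illustrative stand-ins for real matched keypoints from any matcher:

```python
import cv2
import numpy as np

# Synthetic, geometrically consistent matches for illustration:
# random 3D points viewed by two cameras related by a known R, t.
rng = np.random.default_rng(0)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pts3d = rng.uniform([-2, -2, 4], [2, 2, 8], size=(100, 3))
R_gt, _ = cv2.Rodrigues(np.array([0.0, 0.2, 0.0]))
t_gt = np.array([[0.5], [0.0], [0.0]])

proj0 = (K @ pts3d.T).T
pts0 = proj0[:, :2] / proj0[:, 2:]
proj1 = (K @ (R_gt @ pts3d.T + t_gt)).T
pts1 = proj1[:, :2] / proj1[:, 2:]

# Classical pipeline: essential matrix via RANSAC, then pose decomposition.
E, inliers = cv2.findEssentialMat(pts0, pts1, K, method=cv2.RANSAC,
                                  prob=0.999, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts0, pts1, K, mask=inliers)

# t is recovered only up to scale: the solver alone cannot provide the
# absolute translation magnitude that learned methods can regress.
print(R.round(3), t.round(3))
```

The up-to-scale translation in the final step is exactly the limitation that direct pose regression sidesteps.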

FAR balances the two by employing a Transformer that combines correspondence-based solver estimates with learned pose predictions. Given dense feature inputs, the Transformer learns to weigh the solved pose against its own regressed pose, and its prediction in turn serves as a prior that guides the solver. This dual-path design lets the system adapt across input conditions, drawing on the strengths of both correspondence-based and learning-based methods; a sketch of the weighting idea follows below.
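
A minimal PyTorch sketch of that weighting idea, assuming a simple linear blend of pose encodings and illustrative module names (this is not the authors' released architecture):

```python
import torch
import torch.nn as nn

class PoseBlender(nn.Module):
    """Illustrative sketch: blend a solver pose with a regressed pose
    using a learned confidence weight. Not the paper's exact design."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.pose_head = nn.Linear(feat_dim, 9)    # 6D rotation + 3D translation
        self.weight_head = nn.Linear(feat_dim, 1)  # scalar blend weight

    def forward(self, feats: torch.Tensor, solver_pose: torch.Tensor):
        # feats: (B, N, feat_dim) dense features from both images
        # solver_pose: (B, 9) pose from a classical solver, same encoding
        x = self.encoder(feats).mean(dim=1)        # pool over tokens
        learned_pose = self.pose_head(x)
        w = torch.sigmoid(self.weight_head(x))     # 0 -> trust solver, 1 -> trust network
        return w * learned_pose + (1.0 - w) * solver_pose
```

In practice, any such blending must respect the rotation representation; continuous 6D rotation encodings are a common choice for precisely this reason.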

Experimental Findings

The empirical evaluations show that FAR outperforms prior state-of-the-art methods in both accuracy and robustness across multiple datasets, including Matterport3D, InteriorNet, StreetLearn, and Map-free Relocalization. Performance is quantified with metrics such as mean and median rotation and translation errors.

On the Matterport3D dataset, for instance, FAR reduces both median and mean translation errors relative to previous methods while also improving rotation accuracy. Notably, it remains effective in low-correspondence settings, where robustness to noise and outliers is critical.
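
The error metrics above are standard for relative pose evaluation; a short sketch of how they are typically computed (the function names are ours, not the paper's):

```python
import numpy as np

def rotation_error_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic distance between two rotation matrices, in degrees."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_error(t_pred: np.ndarray, t_gt: np.ndarray) -> float:
    """Euclidean translation error; meaningful in metric units here
    because FAR predicts absolute translation scale."""
    return float(np.linalg.norm(t_pred - t_gt))
```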

Implications and Future Directions

FAR's robust performance across challenging datasets suggests practical applicability to real-world scenarios involving complex camera motion and varied environmental conditions. By pairing classical pose solvers with a Transformer, it also points toward hybrid models that combine geometric domain knowledge with modern learning techniques.

Because FAR adapts to different feature extractors and correspondence estimators, it sets a precedent for generalizing camera pose estimation models across diverse real-world situations. Future work might explore more sophisticated priors or integrate FAR into full SLAM systems, potentially improving performance in even more dynamic environments.

In summary, by integrating classical and neural approaches, FAR advances both the accuracy and the robustness of camera pose estimation. Its flexibility across feature extractors and correspondence estimators positions it as a versatile tool for 6DoF relative pose estimation across a range of computer vision applications.