
Marrying NeRF with Feature Matching for One-step Pose Estimation

(arXiv:2404.00891)
Published Apr 1, 2024 in cs.CV and cs.RO

Abstract

Given an image collection of an object, we aim to build a real-time image-based pose estimation method that requires neither a CAD model nor hours of object-specific training. Recent NeRF-based methods provide a promising solution by directly optimizing the pose from the pixel loss between rendered and target images. However, during inference they require a long convergence time and suffer from local minima, making them impractical for real-time robot applications. We solve this problem by marrying image matching with NeRF. With 2D matches and NeRF-rendered depth, we solve the pose in one step by building 2D-3D correspondences between the target and initial views, allowing for real-time prediction. Moreover, to improve the accuracy of the 2D-3D correspondences, we propose a 3D consistent point mining strategy, which effectively discards unfaithful points reconstructed by NeRF. Furthermore, current NeRF-based methods that naively optimize a pixel loss fail on occluded images, so we propose a 2D-matches-based sampling strategy to preclude the occluded area. Experimental results on representative datasets show that our method outperforms state-of-the-art methods and improves inference efficiency by 90x, achieving real-time prediction at 6 FPS.
A framework for one-step pose estimation using a feature matching strategy from an initial pose.

Overview

  • The study introduces a novel framework combining Neural Radiance Fields (NeRF) and feature matching to enable one-step pose estimation without a CAD model, improving both speed and accuracy for robotics and augmented reality.

  • NeRF is used to generate high-quality 3D scene representations, while feature matching techniques are employed to establish correspondences between different views, resulting in rapid and accurate pose estimation.

  • Significant innovations include real-time image-based inference, a 3D consistent point mining strategy for enhanced accuracy, and a matching point-based sampling strategy to handle occlusions effectively.

  • The framework outperforms existing methods in efficiency and robustness, showing a 90-fold improvement in inference speed and real-time prediction capabilities at 6 FPS, highlighting its potential for practical applications in robotics and AR.

Introduction to the Study

Recent advances in Neural Radiance Fields (NeRF) have enabled significant improvements in realistic 3D scene representation and rendering. Pose estimation, meanwhile, remains a critical challenge in robotics and augmented reality (AR): traditional methods rely on exhaustive feature matching and CAD models, or require hours of object-specific retraining for each novel object. The study discussed here reconciles these areas by proposing a framework that marries NeRF with feature matching, enabling a one-step pose estimation process that obviates the need for CAD models and circumvents the extensive training phase.

Underpinning Technologies

The framework integrates two primary components: NeRF and feature matching. NeRF efficiently encodes complex 3D geometry and renders high-quality 2D images, along with depth, from arbitrary viewpoints. Feature matching techniques, long used in structure-from-motion (SfM) and SLAM pipelines, offer a reliable means of establishing correspondences between different views of an object. Bridging the two combines NeRF's high-fidelity depth rendering with the agility of feature matching: 2D matches between the target image and a rendered initial view, lifted to 3D with the rendered depth, yield the 2D-3D correspondences from which the pose is solved directly.
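
Concretely, the one-step solve can be sketched as follows: matched pixels in the initial view are back-projected with NeRF-rendered depth to obtain 3D points, which pair with the corresponding 2D matches in the target image; a standard PnP solver (e.g. a RANSAC-based one such as OpenCV's `solvePnPRansac`) then recovers the pose in a single step. The snippet below is a minimal NumPy sketch of the back-projection step only; the function name and inputs are illustrative, not the paper's actual code.

```python
import numpy as np

def backproject(uv, depth, K):
    """Lift 2D pixels to 3D camera-frame points using rendered depth.

    uv:    (N, 2) pixel coordinates of matched keypoints in the initial view.
    depth: (N,)   NeRF-rendered depth at those pixels.
    K:     (3, 3) camera intrinsics.
    Returns (N, 3) points in the initial view's camera frame.
    """
    ones = np.ones((uv.shape[0], 1))
    # Unproject each pixel to a ray direction in camera coordinates.
    rays = np.linalg.inv(K) @ np.concatenate([uv, ones], axis=1).T  # (3, N)
    # Scale each ray by its rendered depth to obtain a 3D point.
    return (rays * depth).T
```

Feeding these 3D points together with their 2D matches in the target image to a PnP solver yields the target pose without any iterative render-and-compare loop.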

Core Contributions

The research introduces several innovative solutions to bolster pose estimation accuracy and expedite the estimation process:

  • Real-time Image-based Inference: The proposed method streamlines the pose estimation process, significantly reducing the iterations necessary for accurate pose approximation, thus enabling real-time inference capabilities.

  • 3D Consistent Point Mining Strategy: To counteract the inaccuracies inherent in depth information extracted from NeRF, the study presents a novel point mining strategy. This methodology effectively filters out unfaithful 3D points, refining the quality of 2D-3D correspondences and, by extension, the pose estimation accuracy.

  • Matching Point Based Sampling Strategy: This strategy adeptly handles occlusions by emphasizing the unoccluded regions indicated by matching points, thus preventing the optimization process from being misled by obscured parts of the image.
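
The 3D consistent point mining idea can be illustrated with a small sketch: a point reconstructed from the initial view is trusted only if an auxiliary NeRF rendering agrees with it, i.e. re-projecting the point into another view lands on a pixel whose rendered depth matches the point's geometric depth. The code below is a hypothetical NumPy illustration; the function name, the threshold `tau`, and the single-auxiliary-view setup are assumptions for clarity, not the paper's exact procedure.

```python
import numpy as np

def mine_consistent_points(pts3d_A, depth_B, K, R_AB, t_AB, tau=0.05):
    """Keep only 3D points whose NeRF-rendered depth is consistent across views.

    pts3d_A: (N, 3) points back-projected from view A (in A's camera frame).
    depth_B: (H, W) NeRF-rendered depth map of an auxiliary view B.
    R_AB, t_AB: rigid transform from A's camera frame to B's.
    tau: relative depth-agreement threshold (hypothetical value).
    Returns a boolean mask of points that pass the consistency check.
    """
    pts_B = (R_AB @ pts3d_A.T).T + t_AB   # express points in B's frame
    z_proj = pts_B[:, 2]                  # depth implied by the 3D geometry
    uv = (K @ pts_B.T).T
    uv = uv[:, :2] / uv[:, 2:3]           # project into B's image plane
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    H, W = depth_B.shape
    in_img = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (z_proj > 0)
    keep = np.zeros(len(pts3d_A), dtype=bool)
    idx = np.where(in_img)[0]
    z_render = depth_B[v[idx], u[idx]]    # depth NeRF renders at that pixel
    keep[idx] = np.abs(z_render - z_proj[idx]) / z_proj[idx] < tau
    return keep
```

Points that fail the check are plausibly artifacts of NeRF's reconstruction and are discarded before building the 2D-3D correspondences.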

Performance Evaluation

The proposed method was subjected to rigorous evaluation against state-of-the-art techniques across various datasets, including synthetic and real-world scenarios. It not only demonstrated a significant enhancement in inference efficiency, with a 90-fold increase compared to previous NeRF-based methods, but also showcased superior robustness to occlusions, achieving real-time prediction at 6 FPS.
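
The occlusion robustness follows from the matches-based sampling strategy: since occluders produce no reliable matches, restricting the sampled pixels to neighborhoods of matched keypoints keeps occluded regions out of the optimization. A minimal sketch of such a sampler, with hypothetical names and parameters (`radius`, `n_samples`), might look like:

```python
import numpy as np

def sample_near_matches(matches_uv, img_hw, n_samples=256, radius=8, rng=None):
    """Sample pixel locations only around matched keypoints, so that occluded
    regions (which yield no matches) are excluded from the loss.

    matches_uv: (M, 2) matched pixel coordinates in the target image.
    img_hw:     (H, W) image size, used to clip samples to the image bounds.
    """
    rng = rng or np.random.default_rng(0)
    H, W = img_hw
    # Pick a random matched keypoint for each sample...
    centers = matches_uv[rng.integers(0, len(matches_uv), n_samples)]
    # ...and jitter it within a small square neighborhood.
    offsets = rng.integers(-radius, radius + 1, size=(n_samples, 2))
    uv = centers + offsets
    uv[:, 0] = np.clip(uv[:, 0], 0, W - 1)
    uv[:, 1] = np.clip(uv[:, 1], 0, H - 1)
    return uv
```

Sampling this way concentrates the supervision on regions that are demonstrably visible in both views, which is what lets the method keep working when parts of the object are obscured.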

Theoretical and Practical Implications

From a theoretical perspective, this study bridges the gap between the dense 3D scene representation of NeRF and the agility of feature matching, providing fresh insights into efficient pose estimation methodologies. Practically, the framework's ability to perform CAD-free real-time pose estimation for novel objects makes it attractive for robotics and AR applications that must interact intelligently with an ever-changing environment.

Future Directions

The success of integrating NeRF with feature matching for pose estimation opens up several avenues for future research. Applying the methodology to robot manipulation and extending it to SLAM tasks are promising directions. Furthermore, incorporating learned dynamic feature matching and optimizing NeRF rendering could further enhance the efficiency and accuracy of pose estimation.

Conclusion

The proposed one-step pose estimation framework represents a significant stride towards real-time, accurate, and robust pose estimation for novel objects without reliance on CAD models or extensive retraining. By combining the strengths of NeRF and feature matching, the research paves the way for advanced applications in robotics and AR, ensuring seamless interaction with the 3D world.


