
ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization (2401.08937v1)

Published 17 Jan 2024 in cs.CV

Abstract: Neural Radiance Fields (NeRF) exhibit remarkable performance for Novel View Synthesis (NVS) given a set of 2D images. However, NeRF training requires accurate camera pose for each input view, typically obtained by Structure-from-Motion (SfM) pipelines. Recent works have attempted to relax this constraint, but they still often rely on decent initial poses which they can refine. Here we aim at removing the requirement for pose initialization. We present Incremental CONfidence (ICON), an optimization procedure for training NeRFs from 2D video frames. ICON only assumes smooth camera motion to estimate an initial guess for poses. Further, ICON introduces "confidence": an adaptive measure of model quality used to dynamically reweight gradients. ICON relies on high-confidence poses to learn NeRF, and high-confidence 3D structure (as encoded by NeRF) to learn poses. We show that ICON, without prior pose initialization, achieves superior performance in both CO3D and HO3D versus methods which use SfM pose.


Summary

  • The paper presents a novel method that alternately refines camera poses and NeRFs using an incremental confidence strategy.
  • The paper utilizes a Neural Confidence Field derived from photometric error to adaptively weight gradient updates for pose and 3D reconstruction.
  • The paper achieves competitive view synthesis on object-centric and forward-facing scenes, rivaling methods that rely on depth inputs.

Understanding ICON: A Technique for Complex 3D Reconstruction Tasks

Introduction

Reconstructing 3D models from 2D images is a long-standing challenge in computer vision, with applications ranging from virtual reality to robotics. Neural Radiance Fields (NeRF) are a particularly promising approach, showing impressive results when synthesizing novel views from a set of input images. However, training a NeRF requires an accurate camera pose for each image, and these poses are traditionally recovered with Structure-from-Motion (SfM) pipelines, which can be restrictive. ICON (Incremental CONfidence) is an optimization procedure that removes this reliance on SfM-initialized poses: assuming only smooth camera motion, it incrementally estimates camera poses while training a NeRF directly on video frames.
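To make the smooth-motion assumption concrete, the sketch below extrapolates an initial pose guess for a new frame from the two preceding ones. The constant-velocity model and the function name are illustrative assumptions, not the paper's exact initialization scheme.

```python
import numpy as np

def propagate_pose(T_prev: np.ndarray, T_curr: np.ndarray) -> np.ndarray:
    """Constant-velocity initial guess for the next frame's camera pose.

    Poses are 4x4 camera-to-world SE(3) matrices; the first two frames can
    simply start from the identity. The relative motion between the last
    two frames is re-applied to extrapolate the next pose, which is what a
    smooth-motion assumption buys you.
    """
    delta = np.linalg.inv(T_prev) @ T_curr  # motion from frame t-1 to t
    return T_curr @ delta                   # extrapolated guess for frame t+1
```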

Methodology Insight

ICON tackles camera pose estimation and 3D reconstruction jointly. When camera poses are uncertain or noisy, the errors propagate directly into the learned 3D map, so neither quantity can be trusted on its own. ICON addresses this with an adaptive strategy: "when pose is good, learn the NeRF; when the NeRF is good, learn pose." The mechanism is a quantity termed 'confidence', a measure of certainty in the model's understanding of spatial locations. ICON adapts the learning process based on this measure, giving more weight to gradient updates from high-confidence data points, as sketched below.
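The reweighting idea can be sketched in a few lines of PyTorch. Everything here (the `render` function, the `conf_of` lookup, the batch layout) is an illustrative placeholder rather than the paper's actual code; the point is only that per-ray confidence scales the photometric gradient flowing into both the NeRF and the pose parameters.

```python
import torch

def confidence_weighted_step(render, poses, batch, conf_of,
                             opt_nerf, opt_pose):
    """One joint update in the spirit of "when pose is good, learn the
    NeRF; when the NeRF is good, learn pose".

    conf_of returns a per-ray confidence in [0, 1]; it is detached so the
    weights steer the gradients without being optimized away themselves.
    """
    pred = render(poses, batch["rays"])                  # differentiable render
    per_ray = ((pred - batch["rgb"]) ** 2).mean(dim=-1)  # photometric error
    weights = conf_of(batch["rays"]).detach()
    loss = (weights * per_ray).mean()                    # confidence-reweighted

    # High-confidence rays drive both the radiance field and the poses;
    # low-confidence rays contribute little gradient to either.
    opt_nerf.zero_grad()
    opt_pose.zero_grad()
    loss.backward()
    opt_nerf.step()
    opt_pose.step()
    return loss.item()
```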

Confidence Measure

A key component of ICON is its 'Neural Confidence Field'. This field is superimposed on the NeRF and encodes a confidence value at each point in 3D space. Confidence for poses is derived from pixel-level photometric error: the lower the error, the higher the confidence. The model uses this metric to weight the optimization of both the camera poses and the NeRF itself. If a pose estimate never gains sufficient confidence, ICON reinitializes it, akin to re-registering a failed image in traditional incremental SfM pipelines.
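As a deliberately simplified illustration, the sketch below turns photometric error into a confidence signal and flags frames for reinitialization. Both the exponential mapping and the threshold are assumptions for illustration, not values taken from the paper.

```python
import torch

def confidence_from_error(photo_err: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """Map per-pixel photometric error to a confidence in (0, 1].

    Lower error -> higher confidence; the exponential form and the
    temperature are assumed for illustration.
    """
    return torch.exp(-photo_err / temperature)

def needs_reinit(frame_confidence: float, threshold: float = 0.2) -> bool:
    """Flag a frame whose pose never gained enough confidence, analogous
    to re-registering a failed image in incremental SfM. The threshold is
    a hypothetical value."""
    return frame_confidence < threshold
```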

Evaluation and Applications

ICON's joint pose-and-3D reconstruction outperforms other RGB-only methods and even holds up against state-of-the-art RGB-D methods, despite forgoing depth input. Quantitative evaluation was carried out on datasets such as CO3D and HO3D. ICON is particularly strong in object-centric scenarios, estimating precise poses and delivering high-fidelity view synthesis. Notably, it also works well beyond strictly object-centric settings, as demonstrated on forward-facing scenes. This flexibility suggests applicability across a broad range of scenarios, from VR to robotics.

Future Directions

Despite its success, ICON has room for improvement. Its reliance on photometric loss makes it less robust to photometric inconsistencies caused by motion, reflective surfaces, lighting changes, and transparency. The long training times inherent to NeRF also suggest that pairing ICON with faster 3D scene representations could improve efficiency and performance. A logical next step is to incorporate robust feature representations and faster rendering techniques into ICON's framework.

In conclusion, ICON is an innovative advance in how NeRFs can be trained for 3D reconstruction from 2D video frames, opening promising avenues for further research and application in this domain.