ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization (2401.08937v1)
Abstract: Neural Radiance Fields (NeRF) exhibit remarkable performance for Novel View Synthesis (NVS) given a set of 2D images. However, NeRF training requires an accurate camera pose for each input view, typically obtained by Structure-from-Motion (SfM) pipelines. Recent works have attempted to relax this constraint, but they still often rely on decent initial poses, which they then refine. Here we aim to remove the requirement for pose initialization. We present Incremental CONfidence (ICON), an optimization procedure for training NeRFs from 2D video frames. ICON assumes only smooth camera motion to estimate an initial guess for poses. Further, ICON introduces "confidence": an adaptive measure of model quality used to dynamically reweight gradients. ICON relies on high-confidence poses to learn NeRF, and on high-confidence 3D structure (as encoded by NeRF) to learn poses. We show that ICON, without prior pose initialization, achieves superior performance on both CO3D and HO3D compared with methods that use SfM poses.
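The abstract names two ingredients without giving formulas: a smooth-motion initial guess for each new pose, and a confidence value that reweights gradients. The sketch below is only an illustration of those two ideas under simple assumptions and is not the paper's actual procedure. The helper names `init_next_pose` and `confidence_weighted_loss`, the choice to reuse the previous pose as the initial guess, and the normalized weighting scheme are all hypothetical.

```python
import numpy as np


def init_next_pose(poses):
    """Hypothetical smooth-motion initialization: reuse the latest pose.

    Assumption: consecutive video frames have nearby camera poses, so the
    previous 4x4 pose is a reasonable starting point for the next frame.
    """
    return poses[-1].copy()


def confidence_weighted_loss(per_frame_loss, confidence):
    """Hypothetical confidence weighting of per-frame losses.

    Weights are the confidences normalized to sum to one, so frames with
    low confidence contribute less to the loss (and thus to its gradient).
    """
    per_frame_loss = np.asarray(per_frame_loss, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    weights = confidence / confidence.sum()
    return float(np.sum(weights * per_frame_loss))


# Toy usage: start from an identity pose and add one new frame.
poses = [np.eye(4)]
poses.append(init_next_pose(poses))

# Two frames: the first is trusted (0.9), the second is not (0.1).
loss = confidence_weighted_loss([0.2, 0.8], [0.9, 0.1])
```

In this toy setting the poorly fit but low-confidence frame barely moves the total loss, which is the intuition behind relying on high-confidence poses to learn the radiance field and vice versa.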