Visual Geometry Grounded Deep Structure From Motion (2312.04563v1)
Abstract: Structure-from-motion (SfM) is a long-standing problem in the computer vision community, which aims to reconstruct the camera poses and 3D structure of a scene from a set of unconstrained 2D images. Classical frameworks solve this problem incrementally by detecting and matching keypoints, registering images, triangulating 3D points, and conducting bundle adjustment. Recent work has mostly used deep learning to improve individual components (e.g., keypoint matching) while still relying on the original, non-differentiable pipeline. Instead, we propose a new deep pipeline, VGGSfM, in which each component is fully differentiable and can therefore be trained end-to-end. To this end, we introduce new mechanisms and simplifications. First, we build on recent advances in deep 2D point tracking to extract reliable pixel-accurate tracks, which eliminates the need for chaining pairwise matches. Second, we recover all cameras simultaneously from the image and track features instead of registering cameras gradually. Finally, we optimise the cameras and triangulate 3D points via a differentiable bundle adjustment layer. We attain state-of-the-art performance on three popular datasets: CO3D, IMC Phototourism, and ETH3D.
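To make the triangulation and bundle adjustment steps mentioned in the abstract concrete, here is a minimal NumPy sketch. It is not the VGGSfM implementation: it uses the textbook linear (DLT) triangulation of a single point from a two-view track, and computes the reprojection residual that bundle adjustment would minimise. The toy cameras, intrinsics, and function names (`project`, `triangulate_dlt`, `reprojection_residuals`) are assumptions made for illustration only.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X (3,) with a 3x4 camera matrix P; returns pixel coords (2,)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate_dlt(Ps, xs):
    """Linear (DLT) triangulation of one 3D point from N >= 2 views.

    Each observation (u, v) under camera P contributes two rows to A,
    and the homogeneous point is the right singular vector of A with
    the smallest singular value.
    """
    rows = []
    for P, (u, v) in zip(Ps, xs):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]
    return X_h[:3] / X_h[3]

def reprojection_residuals(Ps, xs, X):
    """Stacked 2D reprojection errors: the quantity bundle adjustment minimises."""
    return np.concatenate([project(P, X) - np.asarray(x) for P, x in zip(Ps, xs)])

# Hypothetical toy setup: shared intrinsics K, second camera shifted 1 unit along +x.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 4.0])
track = [project(P1, X_true), project(P2, X_true)]  # a noise-free two-view track

X_est = triangulate_dlt([P1, P2], track)
res = reprojection_residuals([P1, P2], track, X_est)
print("estimated point:", X_est)
print("max reprojection error (px):", np.abs(res).max())
```

With noise-free tracks the estimate matches the true point up to numerical precision. With noisy tracks the residuals are non-zero, and a bundle adjustment step would jointly refine cameras and points to reduce them.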
Authors: Jianyuan Wang, Nikita Karaev, Christian Rupprecht, David Novotny