DUSt3R: Geometric 3D Vision Made Easy

Published 21 Dec 2023 in cs.CV (arXiv:2312.14132v3)

Abstract: Multi-view stereo (MVS) reconstruction in the wild requires first estimating the camera parameters, e.g., the intrinsic and extrinsic parameters. These are usually tedious and cumbersome to obtain, yet they are mandatory for triangulating corresponding pixels in 3D space, which is at the core of all best-performing MVS algorithms. In this work, we take an opposite stance and introduce DUSt3R, a radically novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction of arbitrary image collections, i.e., one that operates without prior information about camera calibration or viewpoint poses. We cast the pairwise reconstruction problem as a regression of pointmaps, relaxing the hard constraints of the usual projective camera models. We show that this formulation smoothly unifies the monocular and binocular reconstruction cases. When more than two images are provided, we further propose a simple yet effective global alignment strategy that expresses all pairwise pointmaps in a common reference frame. We base our network architecture on standard Transformer encoders and decoders, allowing us to leverage powerful pretrained models. Our formulation directly provides a 3D model of the scene as well as depth information, and, interestingly, we can seamlessly recover from it pixel matches and relative and absolute camera poses. Exhaustive experiments on all these tasks show that DUSt3R unifies various 3D vision tasks and sets new states of the art on monocular/multi-view depth estimation as well as relative pose estimation. In summary, DUSt3R makes many geometric 3D vision tasks easy.


Summary

  • The paper introduces DUSt3R, which eliminates the need for explicit camera calibration by inferring parameters directly from image contents.
  • It employs a regression approach on pointmaps to directly decode rich geometric details for accurate multi-view depth estimation.
  • The architecture integrates multiple 3D vision tasks into one pipeline, achieving state-of-the-art performance on benchmarks like DTU, Tanks and Temples, and ETH3D.

Introduction to DUSt3R

The paper introduces DUSt3R, a novel approach aimed at simplifying the complex tasks involved in geometric 3D vision. Traditional multi-view stereo (MVS) pipelines typically require careful estimation of camera parameters, which is cumbersome and error-prone. DUSt3R stands out by not needing such prior information: it operates directly on the image contents to infer camera parameters, pixel correspondences, and depth maps, and to produce a full 3D reconstruction.

Unconstrained 3D Reconstruction

DUSt3R tackles the challenge of 3D reconstruction from images without any predefined information about camera calibration or viewpoint poses. The system formulates pairwise reconstruction as regression of pointmaps, per-pixel maps of 3D coordinates that relax the hard constraints of standard projective camera models. The network decodes these pointmaps, rich in geometric detail, directly from pairs of images, simplifying the extraction of detailed scene geometry.
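
To make the pointmap formulation concrete, here is a minimal NumPy sketch of a confidence-weighted pointmap regression objective in the spirit of the paper's training loss. The function name, the `alpha` regularization weight, and the exact normalization scheme are illustrative assumptions for this sketch, not the paper's reference implementation.

```python
import numpy as np

def pointmap_regression_loss(pred, gt, conf, alpha=0.2):
    """Confidence-weighted 3D regression loss over a pointmap (sketch).

    `pred` and `gt` are (H, W, 3) pointmaps mapping each pixel to a 3D
    point; `conf` is an (H, W) map of per-pixel confidences (> 0).
    Each pointmap is normalized by its mean distance to the origin,
    which removes the global scale ambiguity of unconstrained
    reconstruction before the pointwise error is measured.
    """
    pred_n = pred / np.linalg.norm(pred, axis=-1).mean()
    gt_n = gt / np.linalg.norm(gt, axis=-1).mean()
    # Per-pixel Euclidean error between scale-normalized 3D points.
    err = np.linalg.norm(pred_n - gt_n, axis=-1)
    # Confidence weighting; the -log term keeps confidences from
    # collapsing to zero everywhere.
    return (conf * err - alpha * np.log(conf)).mean()
```

With a perfect prediction and unit confidences the loss is zero; lowering the confidence at a pixel discounts its error term at the cost of the log penalty, letting the network down-weight ill-defined regions such as the sky.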

Seamless Integration of Multiple Tasks

A notable feature of DUSt3R is its ability to unify various 3D vision tasks, traditionally handled separately, into a single, simplified pipeline. The architecture leverages pretrained models and a fully data-driven approach to learn powerful geometric and shape priors. The result is a direct 3D model of the scene that also lends itself to downstream tasks such as camera pose estimation and monocular and multi-view depth estimation.
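
As an illustration of how camera parameters can be read back out of the output, the sketch below estimates a single focal length from a predicted pointmap using Weiszfeld-style iteratively reweighted least squares, assuming a centered principal point. The function name and iteration count are assumptions made for this example; it is a sketch of the kind of recovery step the paper describes, not its exact implementation.

```python
import numpy as np

def estimate_focal(pointmap, n_iters=10):
    """Estimate one focal length f from an (H, W, 3) pointmap X(u, v).

    Solves min_f sum_i || (u_i, v_i) - f * (x_i/z_i, y_i/z_i) || with a
    robust (Weiszfeld-style) reweighted least-squares iteration, taking
    the principal point to lie at the image center.
    """
    h, w, _ = pointmap.shape
    u, v = np.meshgrid(np.arange(w) - (w - 1) / 2,
                       np.arange(h) - (h - 1) / 2)
    x, y, z = pointmap[..., 0], pointmap[..., 1], pointmap[..., 2]
    a = np.stack([x / z, y / z], -1).reshape(-1, 2)  # projected rays
    b = np.stack([u, v], -1).reshape(-1, 2)          # pixel coordinates
    f = 1.0
    for _ in range(n_iters):
        # Down-weight pixels with large reprojection residuals.
        r = np.linalg.norm(b - f * a, axis=-1)
        wgt = 1.0 / np.maximum(r, 1e-8)
        f = (wgt * (a * b).sum(-1)).sum() / (wgt * (a * a).sum(-1)).sum()
    return f
```

Given the focal length, per-pixel depth follows directly from the z-coordinate of the pointmap, and relative poses between images can be obtained by aligning their pointmaps.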

Flexibility and State-of-the-Art Performance

Experiments on datasets such as DTU, Tanks and Temples, and ETH3D demonstrate that DUSt3R works without known camera parameters. The network handles monocular reconstruction and aligns multiple image pairs in a common reference frame. These evaluations position DUSt3R as setting new standards, achieving state-of-the-art results across a range of tasks including multi-view depth estimation and camera pose estimation.
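
The paper's global alignment jointly optimizes all pairwise pointmaps, their poses, and per-pair scale factors in a common frame. As a simplified two-view stand-in, the sketch below registers one point set onto another with a closed-form Umeyama similarity alignment (scale, rotation, translation); the helper name is hypothetical, and this is only a toy illustration of expressing one pointmap in another's reference frame.

```python
import numpy as np

def similarity_align(src, dst):
    """Closed-form Umeyama alignment of (N, 3) point sets.

    Returns scale s, rotation R, and translation t minimizing
    || dst - (s * R @ src + t) || in the least-squares sense.
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    sc, dc = src - mu_s, dst - mu_d
    # Cross-covariance between the centered point sets.
    U, S, Vt = np.linalg.svd(dc.T @ sc / len(src))
    # Reflection guard: force det(R) = +1.
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0] * (src.shape[1] - 1) + [d])
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / sc.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t
```

In the full method, such rigid registrations would only initialize the alignment; the paper optimizes all pairs jointly rather than chaining two-view fits.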

Conclusion

The paper presents DUSt3R as a substantial advancement in geometric 3D vision, offering a marked simplification over traditional methods. The results underline its versatility in handling diverse 3D vision challenges without the meticulous steps of estimating and calibrating camera parameters, making it a significant stepping stone for future developments in the field.
