The More You See in 2D, the More You Perceive in 3D (2404.03652v1)
Abstract: Humans can infer 3D structure from 2D images of an object based on past experience, and their 3D understanding improves as they see more images. Inspired by this behavior, we introduce SAP3D, a system for 3D reconstruction and novel view synthesis from an arbitrary number of unposed images. Given a few unposed images of an object, we jointly adapt a pre-trained view-conditioned diffusion model and estimate the camera poses of the images via test-time fine-tuning. The adapted diffusion model and the recovered camera poses are then used as instance-specific priors for 3D reconstruction and novel view synthesis. We show that the performance of our approach improves as the number of input images increases, bridging the gap between optimization-based prior-less 3D reconstruction methods and single-image-to-3D diffusion-based methods. We demonstrate our system on real images as well as standard synthetic benchmarks. Our ablation studies confirm that this adaptation behavior is key to more accurate 3D understanding.
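To make the test-time adaptation step concrete, below is a minimal, self-contained sketch (not the authors' code) of the core idea: per-image camera poses are treated as learnable parameters and optimized jointly with a view-conditioned denoiser on the input views. The `ViewConditionedDenoiser` stub, the (azimuth, elevation) pose parameterization, and the simplified noising schedule are all assumptions made for illustration; the actual system fine-tunes a pre-trained Zero-1-to-3-style diffusion model.

```python
# Hedged sketch of SAP3D-style test-time adaptation (illustrative only).
# Assumptions: the view-conditioned denoiser is a small stub network standing
# in for a pre-trained Zero-1-to-3-style model, poses are (azimuth, elevation)
# angles per image, and the noising schedule is a toy linear interpolation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ViewConditionedDenoiser(nn.Module):
    """Stub for a pre-trained view-conditioned diffusion model.

    Predicts the noise added to a target view, conditioned on a source view,
    the relative camera pose, and the diffusion timestep.
    """

    def __init__(self, image_dim: int = 3 * 32 * 32, pose_dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * image_dim + pose_dim + 1, 256),
            nn.ReLU(),
            nn.Linear(256, image_dim),
        )

    def forward(self, noisy_target, source, rel_pose, t):
        x = torch.cat([noisy_target, source, rel_pose, t], dim=-1)
        return self.net(x)


def test_time_adapt(images, num_steps: int = 200, lr: float = 1e-4):
    """Jointly fine-tune the denoiser and estimate per-image camera poses.

    images: (N, D) tensor holding N flattened unposed views of one object.
    Returns the adapted denoiser and the optimized poses.
    """
    n, d = images.shape
    denoiser = ViewConditionedDenoiser(image_dim=d)
    # Learnable per-image poses (azimuth, elevation), initialized at zero.
    poses = nn.Parameter(torch.zeros(n, 2))
    opt = torch.optim.Adam(list(denoiser.parameters()) + [poses], lr=lr)

    for _ in range(num_steps):
        # Sample a random (source, target) pair of input views.
        i, j = torch.randint(0, n, (2,)).tolist()
        target, source = images[i : i + 1], images[j : j + 1]
        rel_pose = (poses[i] - poses[j]).unsqueeze(0)

        # Standard denoising objective: add noise, predict it back,
        # conditioned on the source view and the current pose estimate.
        t = torch.rand(1, 1)
        noise = torch.randn_like(target)
        noisy_target = (1 - t) * target + t * noise
        pred_noise = denoiser(noisy_target, source, rel_pose, t)

        loss = F.mse_loss(pred_noise, noise)
        opt.zero_grad()
        loss.backward()
        opt.step()

    return denoiser, poses.detach()


if __name__ == "__main__":
    # Toy example: five random 32x32 RGB "views" of an object.
    views = torch.randn(5, 3 * 32 * 32)
    model, cams = test_time_adapt(views)
    print("estimated poses:\n", cams)
```

In the full system, the adapted model and poses would then serve as instance-specific priors for downstream NeRF-style reconstruction and novel view synthesis; the sketch only covers the joint adaptation loop.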