MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation (2404.03656v1)
Abstract: We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images. While recent methods pursuing 3D inference advocate learning novel-view generative models, these generations are not 3D-consistent and require a distillation process to produce a 3D output. We instead cast the task of 3D inference as directly generating mutually consistent multiple views, building on the insight that additionally inferring depth provides a mechanism for enforcing this consistency. Specifically, we train a denoising diffusion model to generate multi-view RGB-D images given a single RGB input image, and leverage the (intermediate noisy) depth estimates to obtain reprojection-based conditioning that maintains multi-view consistency. We train our model on the large-scale synthetic Objaverse dataset as well as the real-world CO3D dataset, which comprises generic camera viewpoints. We demonstrate that our approach yields more accurate synthesis than recent state-of-the-art methods, including distillation-based 3D inference and prior multi-view generation approaches. We also evaluate the geometry induced by our multi-view depth predictions and find that it yields a more accurate representation than other direct 3D inference approaches.
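The reprojection-based conditioning described in the abstract can be illustrated with a minimal sketch: given a (possibly noisy) depth estimate for a target view, each pixel is unprojected to 3D and reprojected into a source view to gather conditioning features. The function names, nearest-neighbor sampling, and use of plain numpy here are illustrative assumptions, not the paper's actual implementation (which operates on diffusion features within the network).

```python
import numpy as np

def unproject(depth, K):
    """Lift a depth map to 3D points in the camera frame.

    depth: (H, W) depth values; K: (3, 3) pinhole intrinsics.
    Returns (H*W, 3) points.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T           # per-pixel viewing rays
    return rays * depth.reshape(-1, 1)        # scale rays by depth

def reproject_features(feat_src, depth_tgt, K, T_tgt_to_src):
    """Warp source-view features to the target view using the target's
    (possibly noisy) depth estimate — a sketch of reprojection conditioning.

    feat_src:     (H, W, C) features observed in the source view.
    depth_tgt:    (H, W) depth estimate for the target view.
    T_tgt_to_src: (4, 4) rigid transform from target to source camera.
    Returns (H, W, C) warped features; out-of-view pixels get zeros.
    """
    H, W, C = feat_src.shape
    pts = unproject(depth_tgt, K)                              # target-frame 3D points
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    pts_src = (pts_h @ T_tgt_to_src.T)[:, :3]                  # move into source frame
    proj = pts_src @ K.T
    uv = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)       # perspective divide
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (pts_src[:, 2] > 0)
    out = np.zeros((H * W, C))
    out[valid] = feat_src[v[valid], u[valid]]                  # nearest-neighbor gather
    return out.reshape(H, W, C)
```

Because the warp is driven by the intermediate depth estimate, views that agree on geometry receive consistent conditioning, which is the mechanism the paper uses to enforce multi-view consistency during denoising.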
Authors: Hanzhe Hu, Zhizhuo Zhou, Varun Jampani, Shubham Tulsiani