Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views (2308.14078v2)
Abstract: Reconstructing 3D objects from extremely sparse views is a long-standing and challenging problem. While recent techniques employ image diffusion models to generate plausible images at novel viewpoints or to distill pre-trained diffusion priors into 3D representations using score distillation sampling (SDS), these methods often struggle to simultaneously achieve high-quality, consistent, and detailed results for both novel-view synthesis (NVS) and geometry. In this work, we present Sparse3D, a novel 3D reconstruction method tailored for sparse-view inputs. Our approach distills robust priors from a multiview-consistent diffusion model to refine a neural radiance field. Specifically, we employ a controller that harnesses epipolar features from the input views, guiding a pre-trained diffusion model, such as Stable Diffusion, to produce novel-view images that maintain 3D consistency with the input. By tapping into 2D priors from powerful image diffusion models, our integrated model consistently delivers high-quality results, even when faced with open-world objects. To address the blurriness introduced by conventional SDS, we introduce category-score distillation sampling (C-SDS) to enhance detail. We conduct experiments on CO3DV2, a multi-view dataset of real-world objects. Both quantitative and qualitative evaluations demonstrate that our approach outperforms previous state-of-the-art methods on both NVS and geometry reconstruction metrics.
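For context, conventional SDS perturbs a rendered image with noise and uses a pretrained diffusion model's noise prediction to form a gradient on the underlying 3D representation. The sketch below is a minimal, hedged illustration of that standard SDS step, not the paper's C-SDS variant or its epipolar controller; `noise_pred_fn` is a hypothetical stand-in for the pretrained denoiser, which Sparse3D additionally conditions on epipolar features from the input views.

```python
import numpy as np

def sds_gradient(rendered, noise_pred_fn, alphas_cumprod, rng):
    """One conventional score-distillation-sampling (SDS) step.

    rendered       : (H, W, C) image from the differentiable renderer
    noise_pred_fn  : callable (noisy_image, t) -> predicted noise; a stand-in
                     for a pretrained diffusion model's epsilon-network
    alphas_cumprod : (T,) cumulative noise schedule of that model
    rng            : numpy random Generator
    """
    T = len(alphas_cumprod)
    # Sample a timestep away from the schedule's extremes, as is common.
    t = int(rng.integers(low=int(0.02 * T), high=int(0.98 * T)))
    a = alphas_cumprod[t]
    eps = rng.standard_normal(rendered.shape)               # injected noise
    noisy = np.sqrt(a) * rendered + np.sqrt(1.0 - a) * eps  # forward diffusion
    w = 1.0 - a                                             # common weighting choice
    # SDS gradient: weighted difference between predicted and injected noise,
    # to be backpropagated through the renderer into the NeRF parameters.
    return w * (noise_pred_fn(noisy, t) - eps)
```

Because this gradient averages over many noise draws, plain SDS tends toward blurry, mode-seeking solutions — the failure the paper's C-SDS is designed to mitigate.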
Authors: Zi-Xin Zou, Weihao Cheng, Yan-Pei Cao, Shi-Sheng Huang, Ying Shan, Song-Hai Zhang