3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting (2405.18424v1)
Abstract: Scene image editing is crucial for entertainment, photography, and advertising design. Existing methods focus solely on either 2D editing of individual objects or 3D editing of the global scene, so there is no unified approach for controlling and manipulating scenes in 3D at different levels of granularity. In this work, we propose 3DitScene, a novel and unified scene editing framework leveraging language-guided disentangled Gaussian Splatting that enables seamless editing from 2D to 3D, allowing precise control over scene composition and individual objects. We first incorporate 3D Gaussians that are refined through generative priors and optimization techniques. Language features from CLIP then introduce semantics into the 3D geometry for object disentanglement. With the disentangled Gaussians, 3DitScene allows manipulation at both the global and individual levels, broadening creative expression and empowering control over scenes and objects. Experimental results demonstrate the effectiveness and versatility of 3DitScene in scene image editing. Code and an online demo can be found at our project homepage: https://zqh0253.github.io/3DitScene/.
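The core idea — attaching language features to Gaussians so that an object can be selected and manipulated by a text query — can be illustrated with a minimal sketch. This is not the paper's implementation: the per-Gaussian feature vectors here are stand-ins for distilled CLIP embeddings, and `select_gaussians` / `translate_object` are hypothetical helper names; the actual distillation and rendering pipeline is omitted.

```python
import numpy as np

def select_gaussians(features, text_embedding, threshold=0.7):
    """Select Gaussians whose language feature aligns with a text query.

    `features`: (N, D) per-Gaussian language embeddings (stand-ins for
    CLIP features distilled into the scene). Returns a boolean mask
    over the N Gaussians via cosine similarity against the query.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    sim = f @ t                  # cosine similarity per Gaussian
    return sim >= threshold

def translate_object(positions, mask, offset):
    """Rigidly move the selected object's Gaussians; the rest stay put."""
    out = positions.copy()
    out[mask] += offset
    return out

# Toy example: 4 Gaussians with 2D features; the first two match the query.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
query = np.array([1.0, 0.0])
mask = select_gaussians(feats, query)
pos = np.zeros((4, 3))
moved = translate_object(pos, mask, np.array([0.5, 0.0, 0.0]))
```

Once a mask is available, any rigid or non-rigid edit (translation, rotation, removal) applies to just that subset of Gaussians while the rest of the scene is untouched, which is what "disentangled" editing amounts to at the representation level.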