
3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting (2405.18424v1)

Published 28 May 2024 in cs.CV

Abstract: Scene image editing is crucial for entertainment, photography, and advertising design. Existing methods solely focus on either 2D individual object or 3D global scene editing. This results in a lack of a unified approach to effectively control and manipulate scenes at the 3D level with different levels of granularity. In this work, we propose 3DitScene, a novel and unified scene editing framework leveraging language-guided disentangled Gaussian Splatting that enables seamless editing from 2D to 3D, allowing precise control over scene composition and individual objects. We first incorporate 3D Gaussians that are refined through generative priors and optimization techniques. Language features from CLIP then introduce semantics into 3D geometry for object disentanglement. With the disentangled Gaussians, 3DitScene allows for manipulation at both the global and individual levels, revolutionizing creative expression and empowering control over scenes and objects. Experimental results demonstrate the effectiveness and versatility of 3DitScene in scene image editing. Code and online demo can be found at our project homepage: https://zqh0253.github.io/3DitScene/.

References (59)
  1. Generative novel view synthesis with 3d-aware diffusion models. ICCV (2023).
  2. Scenetex: High-quality texture synthesis for indoor scenes via diffusion priors. arXiv preprint arXiv:2311.17261 (2023).
  3. Anydoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481 (2023).
  4. Scenedreamer: Unbounded 3d scene generation from 2d image collections. arXiv preprint arXiv:2302.01330 (2023).
  5. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv preprint arXiv:2311.13384 (2023).
  6. Disentangled 3D Scene Generation with Layout Learning. arXiv preprint arXiv:2402.16936 (2024).
  7. Deepview: View synthesis with learned gradient descent. In CVPR.
  8. GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image. arXiv preprint arXiv:2403.12013 (2024).
  9. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022).
  10. Nerfdiff: Single-image view synthesis with nerf-guided distillation from 3d-aware diffusion. In ICML.
  11. Single-view view synthesis in the wild with learned adaptive multiplane images. In ACM SIGGRAPH Conference Proceedings.
  12. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022).
  13. Style aligned image generation via shared attention. arXiv preprint arXiv:2312.02133 (2023).
  14. Denoising diffusion probabilistic models. In NeurIPS (2020).
  15. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022).
  16. Text2room: Extracting textured 3d meshes from 2d text-to-image models. arXiv preprint arXiv:2303.11989 (2023).
  17. LRM: Large Reconstruction Model for Single Image to 3D. arXiv preprint arXiv:2311.04400 (2023).
  18. Worldsheet: Wrapping the world in a 3d sheet for view synthesis from a single image. In ICCV.
  19. OpenCLIP. https://doi.org/10.5281/zenodo.5143773
  20. On the "steerability" of generative adversarial networks. arXiv preprint arXiv:1907.07171 (2019).
  21. Alias-Free Generative Adversarial Networks. In NeurIPS (2021).
  22. A style-based generator architecture for generative adversarial networks. In CVPR.
  23. Imagic: Text-based real image editing with diffusion models. In CVPR.
  24. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics 42, 4 (2023).
  25. Lerf: Language embedded radiance fields. In CVPR. 19729–19739.
  26. Diffusionclip: Text-guided diffusion models for robust image manipulation. In CVPR.
  27. Segment anything. arXiv preprint arXiv:2304.02643 (2023).
  28. Mine: Towards continuous depth mpi with nerf for novel view synthesis. In ICCV.
  29. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9298–9309.
  30. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713 (2023).
  31. ShowRoom3D: Text to High-Quality 3D Room Generation Using 3D Priors. arXiv preprint arXiv:2312.13324 (2023).
  32. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073 (2021).
  33. Object 3dit: Language-guided 3d-aware image editing. Advances in Neural Information Processing Systems 36 (2024).
  34. Styleclip: Text-driven manipulation of stylegan imagery. In CVPR.
  35. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022).
  36. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843 (2023).
  37. LangSplat: 3D Language Gaussian Splatting. In CVPR.
  38. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
  39. High-Resolution Image Synthesis With Latent Diffusion Models. In CVPR.
  40. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR.
  41. InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs. IEEE TPAMI (2020).
  42. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020).
  43. Dual diffusion implicit bridges for image-to-image translation. arXiv preprint arXiv:2203.08382 (2022).
  44. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653 (2023).
  45. Single-view view synthesis with multiplane images. In CVPR.
  46. Synsin: End-to-end view synthesis from a single image. In CVPR.
  47. GPT-4V (ision) is a Human-Aligned Evaluator for Text-to-3D Generation. arXiv preprint arXiv:2401.04092 (2024).
  48. Generative Hierarchical Features from Synthesizing Images. In CVPR.
  49. Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model. arXiv preprint arXiv:2311.09217 (2023).
  50. Semantic hierarchy emerges in deep generative representations for scene synthesis. IJCV (2021).
  51. Depth anything: Unleashing the power of large-scale unlabeled data. arXiv preprint arXiv:2401.10891 (2024).
  52. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101 (2023).
  53. Image Sculpting: Precise Object Editing with 3D Geometry Control. arXiv preprint arXiv:2401.01702 (2024).
  54. pixelnerf: Neural radiance fields from one or few images. In CVPR.
  55. WonderJourney: Going from Anywhere to Everywhere. arXiv preprint arXiv:2312.03884 (2023).
  56. Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. arXiv preprint arXiv:2306.14289 (2023).
  57. Scenewiz3d: Towards text-guided 3d scene composition. arXiv preprint arXiv:2312.08885 (2023).
  58. In-domain gan inversion for real image editing. In ECCV.
  59. Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers. arXiv preprint arXiv:2312.09147 (2023).

Summary

  • The paper introduces a unified framework that converts 2D images into detailed 3D Gaussian representations using monocular depth estimation and diffusion-based optimization.
  • It employs CLIP embeddings and the Segment Anything Model for semantic segmentation, enabling flexible, object-level scene manipulation.
  • Quantitative studies and comparisons demonstrate enhanced visual fidelity and 3D consistency over prior methods, broadening creative control in scene editing.

Language-guided Disentangled Gaussian Splatting for 3D-aware Scene Image Editing

The research presented in "3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting" addresses the pervasive limitations of current methods for scene image editing, which are often confined to either 2D object manipulation or 3D scene transformation. The authors introduce a unified framework, termed 3DitScene, that leverages language-guided disentangled Gaussian Splatting for comprehensive and precise control over both 2D and 3D scene elements.

Methodology

3D Gaussian Splatting from Single Image:

The core methodology of the paper relies on the extension and refinement of 3D Gaussian Splatting (3DGS). A given 2D image is lifted into 3D space through monocular depth estimation, yielding an initial set of 3D Gaussians that are rendered by rasterization and subsequently optimized using generative priors. Unlike previous methods, which often produce inconsistent 3D geometry, combining Stable Diffusion's score distillation sampling (SDS) loss with a reconstruction loss yields improved results. Additionally, the authors employ a novel 3D inpainting method informed by diffusion-based depth estimation to handle novel views, addressing previous limitations in depth alignment and occlusion artifacts.
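To make the lifting step concrete, here is a minimal sketch of unprojecting an image into per-pixel Gaussians via a monocular depth map. The function name and the Gaussian parameterization are illustrative assumptions rather than the authors' actual API; a real implementation would also estimate covariances and camera pose.

```python
# Hypothetical sketch of the image-to-Gaussians lifting step, not the
# paper's implementation. Assumes a precomputed monocular depth map.
import torch

def unproject_to_gaussians(image, depth, K):
    """Lift each pixel of an (H, W, 3) image to a 3D Gaussian center.

    image: (H, W, 3) RGB in [0, 1]
    depth: (H, W) monocular depth estimate (e.g. from a depth model)
    K:     (3, 3) camera intrinsics
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # (H, W, 3)
    rays = pix @ torch.linalg.inv(K).T       # back-projected camera rays
    xyz = rays * depth.unsqueeze(-1)         # scale rays by depth -> 3D centers
    return {
        "xyz": xyz.reshape(-1, 3),               # Gaussian means
        "rgb": image.reshape(-1, 3),             # initial colors
        "scale": torch.full((H * W, 3), 1e-3),   # small isotropic scales
        "opacity": torch.ones(H * W, 1),
    }
```

These initial Gaussians would then serve as the starting point for the SDS-plus-reconstruction optimization described above.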

Language-guided Disentangled Gaussian Splatting:

This method introduces semantic understanding into the 3D Gaussians using CLIP embeddings, enabling the scene to be disentangled into individual semantic components. Using the Segment Anything Model (SAM) for initial object segmentation, semantic features are distilled into the Gaussians, allowing flexible object-level manipulation. This multi-stage embedding not only aids accurate object identification but also enhances scene-layout augmentation during optimization, smoothing out occluded regions and further improving rendered scene quality.
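A minimal sketch of the distillation and text-query ideas follows. It assumes a differentiable feature rasterizer (`render_features`) and per-pixel CLIP targets precomputed from SAM masks; both names and the cosine-similarity choice are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of language-feature distillation into per-Gaussian features,
# under assumed helper names (render_features is hypothetical).
import torch
import torch.nn.functional as F

def distillation_loss(gaussian_feats, clip_targets, render_features, camera):
    """gaussian_feats: (N, D) learnable per-Gaussian semantic features.
    clip_targets: (H, W, D) CLIP embedding of the SAM region each pixel
    belongs to, precomputed for this view."""
    rendered = render_features(gaussian_feats, camera)  # (H, W, D) feature map
    # Cosine similarity, a common choice for CLIP-space distillation.
    return 1.0 - F.cosine_similarity(rendered, clip_targets, dim=-1).mean()

def query_by_text(gaussian_feats, text_embed, threshold=0.25):
    """Select Gaussians whose feature matches a CLIP text embedding."""
    sim = F.cosine_similarity(gaussian_feats, text_embed[None, :], dim=-1)
    return sim > threshold  # (N,) boolean mask over the Gaussians
```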

Training and Inference

The training process is orchestrated with three critical loss functions (reconstruction, SDS, and distillation), balancing visual fidelity against semantic accuracy. The ability to query objects with textual or bounding-box prompts at inference time provides unprecedented control over scene editing, allowing users to reposition, rescale, or remove objects within a complex scene while maintaining 3D consistency.
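As a rough illustration of how these pieces could compose, the sketch below combines the three loss terms with placeholder weights and applies an object-level edit by rigidly translating the Gaussians selected by a query; all names and weights are hypothetical, not taken from the paper.

```python
# Illustrative composition of the objective and an object-level edit;
# loss weights are placeholder values, not the paper's settings.
import torch

def total_loss(l_recon, l_sds, l_distill,
               w_recon=1.0, w_sds=0.1, w_distill=0.5):
    """Weighted sum of reconstruction, SDS, and distillation terms."""
    return w_recon * l_recon + w_sds * l_sds + w_distill * l_distill

def translate_object(xyz, mask, offset):
    """Reposition a queried object by rigidly shifting its Gaussians.

    xyz:    (N, 3) Gaussian centers
    mask:   (N,) boolean selection from a text or bounding-box query
    offset: (3,) translation in world space
    """
    xyz = xyz.clone()
    xyz[mask] += offset  # all other Gaussians stay fixed, preserving the scene
    return xyz
```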

Results and Comparisons

The experimental evaluations demonstrate meaningful improvements over existing methods such as AnyDoor, Object 3DIT, Image Sculpting, AdaMPI, and LucidDreamer. Quantitative user studies validate that 3DitScene outperforms these baselines in both consistency and visual quality. Crucially, the flexibility and control afforded by the disentangled 3D representation substantially enhance the creative potential of editing tasks.

Implications and Future Work

This research carries significant theoretical and practical implications. Theoretically, it advances representation techniques for 3D-aware semantic understanding in scene composition. Practically, it offers robust tools for industries reliant on visual content creation, such as film, photography, and marketing, allowing unprecedented levels of detail and creative control.

Looking forward, extensions of this framework could integrate more sophisticated generative models to handle extreme edge cases, improve real-time performance for interactive applications, and apply the methodology to more complex dynamic scenes. Despite the state-of-the-art results of 3DitScene, challenges remain in achieving lifelike texture transformations and in handling highly complex interactions between multiple objects.

In conclusion, this paper provides a comprehensive framework that effectively bridges the gap between 2D and 3D scene editing, leveraging both language embeddings and a novel 3D Gaussian Splatting methodology. The results significantly enhance current capabilities in scene image editing, presenting both theoretical advancements and practical applications across several domains.
