
3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting (2405.18424v1)

Published 28 May 2024 in cs.CV

Abstract: Scene image editing is crucial for entertainment, photography, and advertising design. Existing methods solely focus on either 2D individual object or 3D global scene editing. This results in a lack of a unified approach to effectively control and manipulate scenes at the 3D level with different levels of granularity. In this work, we propose 3DitScene, a novel and unified scene editing framework leveraging language-guided disentangled Gaussian Splatting that enables seamless editing from 2D to 3D, allowing precise control over scene composition and individual objects. We first incorporate 3D Gaussians that are refined through generative priors and optimization techniques. Language features from CLIP then introduce semantics into 3D geometry for object disentanglement. With the disentangled Gaussians, 3DitScene allows for manipulation at both the global and individual levels, revolutionizing creative expression and empowering control over scenes and objects. Experimental results demonstrate the effectiveness and versatility of 3DitScene in scene image editing. Code and online demo can be found at our project homepage: https://zqh0253.github.io/3DitScene/.

References (59)
  1. Generative novel view synthesis with 3d-aware diffusion models. ICCV (2023).
  2. Scenetex: High-quality texture synthesis for indoor scenes via diffusion priors. arXiv preprint arXiv:2311.17261 (2023).
  3. Anydoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481 (2023).
  4. Scenedreamer: Unbounded 3d scene generation from 2d image collections. arXiv preprint arXiv:2302.01330 (2023).
  5. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv preprint arXiv:2311.13384 (2023).
  6. Disentangled 3D Scene Generation with Layout Learning. arXiv preprint arXiv:2402.16936 (2024).
  7. Deepview: View synthesis with learned gradient descent. In CVPR.
  8. GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image. arXiv preprint arXiv:2403.12013 (2024).
  9. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022).
  10. Nerfdiff: Single-image view synthesis with nerf-guided distillation from 3d-aware diffusion. In ICML.
  11. Single-view view synthesis in the wild with learned adaptive multiplane images. In ACM SIGGRAPH Conference Proceedings.
  12. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022).
  13. Style aligned image generation via shared attention. arXiv preprint arXiv:2312.02133 (2023).
  14. Denoising diffusion probabilistic models. In NeurIPS (2020).
  15. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022).
  16. Text2room: Extracting textured 3d meshes from 2d text-to-image models. arXiv preprint arXiv:2303.11989 (2023).
  17. LRM: Large Reconstruction Model for Single Image to 3D. arXiv preprint arXiv:2311.04400 (2023).
  18. Worldsheet: Wrapping the world in a 3d sheet for view synthesis from a single image. In ICCV.
  19. OpenCLIP. https://doi.org/10.5281/zenodo.5143773
  20. On the "steerability" of generative adversarial networks. arXiv preprint arXiv:1907.07171 (2019).
  21. Alias-Free Generative Adversarial Networks. In NeurIPS (2021).
  22. A style-based generator architecture for generative adversarial networks. In CVPR.
  23. Imagic: Text-based real image editing with diffusion models. In CVPR.
  24. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics 42, 4 (2023).
  25. Lerf: Language embedded radiance fields. In CVPR. 19729–19739.
  26. Diffusionclip: Text-guided diffusion models for robust image manipulation. In CVPR.
  27. Segment anything. arXiv preprint arXiv:2304.02643 (2023).
  28. Mine: Towards continuous depth mpi with nerf for novel view synthesis. In ICCV.
  29. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9298–9309.
  30. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713 (2023).
  31. ShowRoom3D: Text to High-Quality 3D Room Generation Using 3D Priors. arXiv preprint arXiv:2312.13324 (2023).
  32. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073 (2021).
  33. Object 3dit: Language-guided 3d-aware image editing. Advances in Neural Information Processing Systems 36 (2024).
  34. Styleclip: Text-driven manipulation of stylegan imagery. In CVPR.
  35. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022).
  36. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843 (2023).
  37. LangSplat: 3D Language Gaussian Splatting. In CVPR.
  38. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
  39. High-Resolution Image Synthesis With Latent Diffusion Models. In CVPR.
  40. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR.
  41. InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs. IEEE TPAMI (2020).
  42. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020).
  43. Dual diffusion implicit bridges for image-to-image translation. arXiv preprint arXiv:2203.08382 (2022).
  44. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653 (2023).
  45. Single-view view synthesis with multiplane images. In CVPR.
  46. Synsin: End-to-end view synthesis from a single image. In CVPR.
  47. GPT-4V (ision) is a Human-Aligned Evaluator for Text-to-3D Generation. arXiv preprint arXiv:2401.04092 (2024).
  48. Generative Hierarchical Features from Synthesizing Images. In CVPR.
  49. Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model. arXiv preprint arXiv:2311.09217 (2023).
  50. Semantic hierarchy emerges in deep generative representations for scene synthesis. IJCV (2021).
  51. Depth anything: Unleashing the power of large-scale unlabeled data. arXiv preprint arXiv:2401.10891 (2024).
  52. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101 (2023).
  53. Image Sculpting: Precise Object Editing with 3D Geometry Control. arXiv preprint arXiv:2401.01702 (2024).
  54. pixelnerf: Neural radiance fields from one or few images. In CVPR.
  55. WonderJourney: Going from Anywhere to Everywhere. arXiv preprint arXiv:2312.03884 (2023).
  56. Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. arXiv preprint arXiv:2306.14289 (2023).
  57. Scenewiz3d: Towards text-guided 3d scene composition. arXiv preprint arXiv:2312.08885 (2023).
  58. In-domain gan inversion for real image editing. In ECCV.
  59. Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers. arXiv preprint arXiv:2312.09147 (2023).

Summary

  • The paper introduces a unified framework that converts 2D images into detailed 3D Gaussian representations using monocular depth estimation and diffusion-based optimization.
  • It employs CLIP embeddings and the Segment Anything Model for semantic segmentation, enabling flexible, object-level scene manipulation.
  • Quantitative studies and comparisons demonstrate enhanced visual fidelity and 3D consistency over prior methods, broadening creative control in scene editing.

Language-guided Disentangled Gaussian Splatting for 3D-aware Scene Image Editing

The research presented in "3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting" addresses the pervasive limitations of current methods for scene image editing, which are often confined to either 2D object manipulation or 3D scene transformation. The authors introduce a unified framework, termed 3DitScene, that leverages language-guided disentangled Gaussian Splatting for comprehensive and precise control over both 2D and 3D scene elements.

Methodology

3D Gaussian Splatting from Single Image:

The core methodology of the paper relies on the extension and refinement of 3D Gaussian Splatting (3DGS). A given 2D image is lifted into 3D space through monocular depth estimation, yielding an initial set of 3D Gaussians that are rendered by rasterization and subsequently optimized using generative priors. Unlike previous methods, which often produce inconsistent 3D geometry, combining Stable Diffusion's score distillation sampling (SDS) loss with a reconstruction loss yields improved results. Additionally, the authors employ a novel 3D inpainting method informed by diffusion-based depth estimation to handle novel views, addressing previous limitations in depth alignment and occlusion artifacts.
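To make the lifting step concrete, here is a minimal sketch of unprojecting an image into per-pixel Gaussians via a monocular depth map. The function name and the Gaussian parameterization are illustrative assumptions rather than the authors' actual API; a real implementation would also estimate covariances and camera pose.

```python
# Hypothetical sketch of the image-to-Gaussians lifting step, not the
# paper's implementation. Assumes a precomputed monocular depth map.
import torch

def unproject_to_gaussians(image, depth, K):
    """Lift each pixel of an (H, W, 3) image to a 3D Gaussian center.

    image: (H, W, 3) RGB in [0, 1]
    depth: (H, W) monocular depth estimate (e.g. from a depth model)
    K:     (3, 3) camera intrinsics
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # (H, W, 3)
    rays = pix @ torch.linalg.inv(K).T       # back-projected camera rays
    xyz = rays * depth.unsqueeze(-1)         # scale rays by depth -> 3D centers
    return {
        "xyz": xyz.reshape(-1, 3),               # Gaussian means
        "rgb": image.reshape(-1, 3),             # initial colors
        "scale": torch.full((H * W, 3), 1e-3),   # small isotropic scales
        "opacity": torch.ones(H * W, 1),
    }
```

These initial Gaussians would then serve as the starting point for the SDS-plus-reconstruction optimization described above.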

Language-guided Disentangled Gaussian Splatting:

This method introduces semantic understanding into the 3D Gaussians using CLIP embeddings, enabling the scene to be disentangled into individual semantic components. Using the Segment Anything Model (SAM) for initial object segmentation, semantic features are distilled into the Gaussians, allowing flexible object-level manipulation. This multi-stage embedding not only aids accurate object identification but also enhances scene-layout augmentation during optimization, smoothing out occluded regions and further improving rendered scene quality.
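A minimal sketch of the distillation and text-query ideas follows. It assumes a differentiable feature rasterizer (`render_features`) and per-pixel CLIP targets precomputed from SAM masks; both names and the cosine-similarity choice are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of language-feature distillation into per-Gaussian features,
# under assumed helper names (render_features is hypothetical).
import torch
import torch.nn.functional as F

def distillation_loss(gaussian_feats, clip_targets, render_features, camera):
    """gaussian_feats: (N, D) learnable per-Gaussian semantic features.
    clip_targets: (H, W, D) CLIP embedding of the SAM region each pixel
    belongs to, precomputed for this view."""
    rendered = render_features(gaussian_feats, camera)  # (H, W, D) feature map
    # Cosine similarity, a common choice for CLIP-space distillation.
    return 1.0 - F.cosine_similarity(rendered, clip_targets, dim=-1).mean()

def query_by_text(gaussian_feats, text_embed, threshold=0.25):
    """Select Gaussians whose feature matches a CLIP text embedding."""
    sim = F.cosine_similarity(gaussian_feats, text_embed[None, :], dim=-1)
    return sim > threshold  # (N,) boolean mask over the Gaussians
```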

Training and Inference

The training process is orchestrated with three critical loss functions (reconstruction, SDS, and distillation), balancing visual fidelity against semantic accuracy. The ability to query objects with textual or bounding-box prompts at inference time provides unprecedented control over scene editing, allowing users to reposition, rescale, or remove objects within a complex scene while maintaining 3D consistency.
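As a rough illustration of how these pieces could compose, the sketch below combines the three loss terms with placeholder weights and applies an object-level edit by rigidly translating the Gaussians selected by a query; all names and weights are hypothetical, not taken from the paper.

```python
# Illustrative composition of the objective and an object-level edit;
# loss weights are placeholder values, not the paper's settings.
import torch

def total_loss(l_recon, l_sds, l_distill,
               w_recon=1.0, w_sds=0.1, w_distill=0.5):
    """Weighted sum of reconstruction, SDS, and distillation terms."""
    return w_recon * l_recon + w_sds * l_sds + w_distill * l_distill

def translate_object(xyz, mask, offset):
    """Reposition a queried object by rigidly shifting its Gaussians.

    xyz:    (N, 3) Gaussian centers
    mask:   (N,) boolean selection from a text or bounding-box query
    offset: (3,) translation in world space
    """
    xyz = xyz.clone()
    xyz[mask] += offset  # all other Gaussians stay fixed, preserving the scene
    return xyz
```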

Results and Comparisons

The experimental evaluations demonstrate meaningful improvements over existing methods such as AnyDoor, Object 3DIT, Image Sculpting, AdaMPI, and LucidDreamer. Quantitative user studies validate that 3DitScene outperforms these baselines in both consistency and visual quality. Crucially, the flexibility and control afforded by the disentangled 3D representation substantially enhance the creative potential of editing tasks.

Implications and Future Work

This research carries significant theoretical and practical implications. Theoretically, it advances representation techniques for 3D-aware semantic understanding in scene composition. Practically, it offers robust tools for industries reliant on visual content creation, such as film, photography, and marketing, allowing unprecedented levels of detail and creative control.

Looking forward, extensions of this framework could integrate more sophisticated generative models to handle extreme edge cases, improve real-time performance for interactive applications, and apply the methodology to more complex dynamic scenes. Despite the state-of-the-art results of 3DitScene, challenges remain in achieving lifelike texture transformations and in handling highly complex interactions between multiple objects.

In conclusion, this paper provides a comprehensive framework that effectively bridges the gap between 2D and 3D scene editing, leveraging both language embeddings and a novel 3D Gaussian Splatting methodology. The results significantly enhance current capabilities in scene image editing, presenting both theoretical advancements and practical applications across several domains.
