- The paper presents a framework that accelerates 3D scene generation by introducing Fast LAyered Gaussian Surfels (FLAGS) and guided depth diffusion.
- The method achieves scene generation in under 10 seconds while maintaining geometric consistency and reducing boundary distortions.
- Evaluations demonstrate that WonderWorld outperforms baselines in both quality and speed, enabling practical applications in VR and game development.
The paper "WonderWorld: Interactive 3D Scene Generation from a Single Image" (2406.09394) presents a novel framework for generating connected 3D scenes interactively from a single input image. The core motivation is to overcome the limitations of existing methods, which are typically slow, offline processes that require tens of minutes to hours to generate a fixed scene, making them unsuitable for applications that demand user interaction and low latency, such as game development and virtual reality.
The paper identifies two main challenges for interactive 3D scene generation:
- Slow generation speed: Existing methods often require generating numerous views to cover occluded regions and involve time-consuming optimization of complex 3D representations like NeRFs or 3D Gaussian Splatting.
- Geometric distortion: Scenes generated sequentially can exhibit misalignment and distortion at the boundaries, creating noticeable seams when connected.
To address these challenges, WonderWorld introduces two key technical contributions:
- Fast LAyered Gaussian Surfels (FLAGS): A novel 3D scene representation and an algorithm to generate it rapidly from a single view.
- Guided Depth Diffusion: A method to improve geometric consistency between newly generated scenes and existing ones.
WonderWorld Framework Overview:
The system starts from a single input image and lets the user interactively control the generation process: the user specifies where to generate new content by moving a rendering camera and what to generate by providing text prompts. The system then generates a new scene in under 10 seconds and connects it to the existing scene graph. A control loop asynchronously handles real-time rendering for user interaction and scene generation driven by the user's prompt and camera position.
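A minimal sketch of this asynchronous control loop is below. All names (SceneGraph, render_view, generate_scene, display) are illustrative placeholders rather than the authors' API; the point is only that rendering and generation run on separate threads so interaction stays real-time while a new scene is being built.

```python
import threading
import queue

class SceneGraph:
    """Holds all generated FLAGS scenes and a lock for thread-safe updates."""
    def __init__(self):
        self.scenes = []
        self.lock = threading.Lock()

def render_loop(scene_graph, get_camera, render_view, display):
    """Rendering thread: continuously rasterizes the current scene graph."""
    while True:
        camera = get_camera()                         # user-controlled camera pose
        with scene_graph.lock:
            frame = render_view(scene_graph, camera)  # fast 3DGS-style rasterization
        display(frame)

def generation_loop(scene_graph, requests, generate_scene):
    """Generation thread: builds one new FLAGS scene per user request (<10 s)."""
    while True:
        camera, prompt = requests.get()               # blocks until the user asks
        new_scene = generate_scene(scene_graph, camera, prompt)  # outpaint + FLAGS
        with scene_graph.lock:
            scene_graph.scenes.append(new_scene)      # connect to the existing graph

# Usage (illustrative): start both loops with threading.Thread(target=...) and
# push (camera, prompt) pairs into a queue.Queue() as the user explores.
```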
Fast LAyered Gaussian Surfels (FLAGS):
FLAGS represent a scene as a union of three layers: foreground (Lfg), background (Lbg), and sky (Lsky). Each layer comprises a set of surfels, parameterized by 3D position, orientation (a quaternion), 2D scales, opacity, and color. FLAGS are described as a variant of 3D Gaussian Splatting in which the Z-axis scale of each Gaussian kernel is shrunk (flattening it into a surfel) and view-dependent colors are removed, which allows reusing the efficient 3DGS rendering pipeline.
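To make the parameterization concrete, here is a minimal sketch of the per-surfel attributes, assuming a PyTorch tensor layout; the class and field names are illustrative, not the authors' code.

```python
from dataclasses import dataclass
import torch

@dataclass
class SurfelLayer:
    """One FLAGS layer (foreground, background, or sky) with N surfels."""
    positions: torch.Tensor   # (N, 3) surfel centers in world space
    rotations: torch.Tensor   # (N, 4) unit quaternions giving surfel orientation
    scales: torch.Tensor      # (N, 2) in-plane extents; the normal-axis extent is ~0
    opacities: torch.Tensor   # (N, 1) per-surfel opacity
    colors: torch.Tensor      # (N, 3) view-independent RGB (no spherical harmonics)

@dataclass
class FLAGSScene:
    """A FLAGS scene is the union of three layers, rendered back to front."""
    sky: SurfelLayer
    background: SurfelLayer
    foreground: SurfelLayer
```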
The fast generation is achieved through:
- Single-view layer generation: Instead of relying on multi-view synthesis to inpaint occluded regions, WonderWorld generates scene content from a single view using a text-guided diffusion model (like Stable Diffusion Inpaint) and an LLM (like GPT-4) to generate structured prompts. The scene image is then decomposed into foreground, background, and sky layers using depth edges and object segmentation (e.g., with OneFormer). Occluded areas within layers are filled using inpainting.
- Geometry-based initialization: This is crucial for reducing optimization time. Surfels are initialized from pixel-aligned estimated geometry: for each valid pixel in a layer image, a surfel is spawned with its attributes set as follows (a sketch follows this list).
- Color: Initialized directly from the pixel color.
- Position: Unprojected from the pixel coordinates using the camera matrix and estimated monocular depth (e.g., from Marigold Depth).
- Orientation: Derived from an estimated normal map (e.g., from Marigold Normal).
- Scales: Calculated to provide seamless coverage without excessive overlap, based on estimated depth, focal length, and the angle between the surfel normal and the image plane normal.
- Optimization: Layers are optimized sequentially from back to front (sky, then background, then foreground) using a photometric loss against the target layer image. Starting from the geometry-based initialization, the optimization fine-tunes opacity, orientation, and scales, typically requiring only 100 Adam iterations and no densification steps (which are common in 3DGS), which significantly speeds up the process (see the optimization sketch below).
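The geometry-based initialization referenced above can be sketched as follows, assuming a pinhole camera and per-pixel depth and normal maps (e.g., from Marigold). The scale formula here is an assumption based on the description: each surfel is sized to roughly one pixel's footprint at its depth and enlarged as it tilts away from the viewing direction, so neighboring surfels cover the image seamlessly without excessive overlap. Converting the normal into the quaternion used by the rasterizer is omitted.

```python
import torch

def init_surfels(image, depth, normals, mask, fx, fy, cx, cy):
    """image: (H, W, 3), depth: (H, W), normals: (H, W, 3), mask: (H, W) bool."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    u, v, d = u[mask].float(), v[mask].float(), depth[mask]

    # Position: unproject each valid pixel with the pinhole camera model.
    x = (u - cx) / fx * d
    y = (v - cy) / fy * d
    positions = torch.stack([x, y, d], dim=-1)                  # (N, 3)

    # Color: taken directly from the layer image.
    colors = image[mask]                                        # (N, 3)

    # Orientation: the surfel plane is aligned with the estimated normal.
    n = torch.nn.functional.normalize(normals[mask], dim=-1)    # (N, 3)

    # Scales: one-pixel footprint at depth d, grown when the surfel is tilted
    # relative to the viewing direction (assumed formula, not the paper's exact one).
    view_dir = torch.nn.functional.normalize(positions, dim=-1)
    cos_tilt = (n * view_dir).sum(-1).abs().clamp(min=0.1)
    scales = (d / fx / cos_tilt).unsqueeze(-1).expand(-1, 2)    # (N, 2)

    opacities = torch.ones(positions.shape[0], 1)               # start fully opaque
    return positions, colors, n, scales, opacities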
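The per-layer optimization can then be sketched as a short fine-tuning loop. This is an assumed structure: `rasterize` stands in for a 3DGS-style surfel rasterizer (not a real library call), the learning rate is a guess, and only opacity, orientation, and scales are updated while positions and colors keep their geometry-based initialization.

```python
import torch

def optimize_layer(layer_params, target_image, rasterize, camera, n_iters=100):
    opacities = layer_params["opacities"].clone().requires_grad_(True)
    rotations = layer_params["rotations"].clone().requires_grad_(True)
    scales = layer_params["scales"].clone().requires_grad_(True)
    optimizer = torch.optim.Adam([opacities, rotations, scales], lr=1e-2)

    for _ in range(n_iters):                        # ~100 iterations, no densification
        optimizer.zero_grad()
        rendered = rasterize(layer_params["positions"], rotations, scales,
                             opacities, layer_params["colors"], camera)
        loss = (rendered - target_image).abs().mean()   # photometric (L1) loss
        loss.backward()
        optimizer.step()

    return {"opacities": opacities.detach(), "rotations": rotations.detach(),
            "scales": scales.detach(),
            "positions": layer_params["positions"], "colors": layer_params["colors"]}
```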
Guided Depth Diffusion:
To ensure geometric consistency at the boundaries between existing and newly generated scenes, WonderWorld employs a guided depth diffusion mechanism. An off-the-shelf latent depth diffusion model estimates the depth map for the newly generated scene image. This process is guided by incorporating the rendered depth of visible existing content at the outpainting viewpoint. This guidance encourages the generated depth map to align with the known geometry, mitigating seams and distortions. The guidance is applied in the latent space during the denoising process, modifying the predicted noise based on the discrepancy between the decoded depth prediction and the guide depth in masked regions. The framework is flexible enough to incorporate additional constraints, such as enforcing a flat ground plane.
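A minimal sketch of one guided denoising step is below, in the spirit of classifier guidance. The guidance weight and exact update rule are assumptions, and `unet`, `decode_depth`, and `scheduler` are placeholders for the actual model components rather than a real API.

```python
import torch

def guided_denoise_step(unet, decode_depth, scheduler, z_t, t, cond,
                        guide_depth, guide_mask, guidance_weight=1.0):
    """z_t: noisy depth latent; guide_depth/guide_mask: rendered depth of
    visible existing content at the outpainting viewpoint and where it is valid."""
    with torch.enable_grad():
        z_t = z_t.detach().requires_grad_(True)
        eps = unet(z_t, t, cond)                        # predicted noise

        # Estimate the clean latent from (z_t, eps), decode it to a depth map,
        # and measure disagreement with the known geometry inside the guide mask.
        z0_hat = scheduler.predict_clean(z_t, eps, t)
        depth_hat = decode_depth(z0_hat)
        loss = ((depth_hat - guide_depth)[guide_mask] ** 2).mean()
        grad = torch.autograd.grad(loss, z_t)[0]

    # Nudge the predicted noise so the denoised depth agrees with the existing scene.
    eps_guided = eps + guidance_weight * grad
    return scheduler.step(z_t, eps_guided, t)           # standard denoising update
```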
Implementation Details:
The paper mentions using specific models like Stable Diffusion Inpaint for outpainting and inpainting, OneFormer for segmentation, Marigold Depth and Normal for geometry estimation, and GPT-4/GPT-4V for prompt generation and initial captioning. Scene images are 512×512. Camera focal length is fixed at 960 pixels. Depth post-processing using SAM is also applied.
Results and Evaluation:
WonderWorld is evaluated against methods such as WonderJourney [yu2023wonderjourney], LucidDreamer [chung2023luciddreamer], and Text2Room [hollein2023text2room]. The key result is generation speed: WonderWorld generates a scene in around 9.5 seconds on a single A6000 GPU, significantly faster than the baselines, which take over 700 seconds. Qualitative comparisons and human studies show that WonderWorld produces higher-quality, more consistent, and less distorted scenes, especially at boundaries, leading to overwhelming preference in two-alternative forced choice (2AFC) tests. Quantitative metrics (CLIP score, CLIP consistency, image quality, aesthetic score) also favor WonderWorld. Ablation studies confirm the critical contributions of geometry-based initialization, the layered structure, and guided depth diffusion to the framework's performance.
Limitations:
The authors acknowledge limitations, including:
- The generated scenes mainly contain frontal-facing surfaces, which limits view synthesis primarily to viewpoints around the camera position and prevents the user from moving behind objects.
- Handling of detailed objects like trees can be challenging, potentially resulting in holes or floaters.
- The framework is presented as an interactive prototyping tool rather than a final high-fidelity solution, suggesting potential future work in refining the generated scenes with slower but higher-fidelity models.
Overall, WonderWorld demonstrates a significant step towards practical, interactive 3D scene generation by focusing on speed and geometric coherence through its novel FLAGS representation and guided depth diffusion approach.