Interactive3D: Create What You Want by Interactive 3D Generation (2404.16510v1)

Published 25 Apr 2024 in cs.GR and cs.CV

Abstract: 3D object generation has undergone significant advancements, yielding high-quality results. However, current methods fall short of achieving precise user control, often yielding results that do not align with user expectations, thus limiting their applicability. User-envisioned 3D object generation faces significant challenges in realizing its concepts using current generative models due to limited interaction capabilities. Existing methods mainly offer two approaches: (i) interpreting textual instructions with constrained controllability, or (ii) reconstructing 3D objects from 2D images. Both of them limit customization to the confines of the 2D reference and potentially introduce undesirable artifacts during the 3D lifting process, restricting the scope for direct and versatile 3D modifications. In this work, we introduce Interactive3D, an innovative framework for interactive 3D generation that grants users precise control over the generative process through extensive 3D interaction capabilities. Interactive3D is constructed in two cascading stages, utilizing distinct 3D representations. The first stage employs Gaussian Splatting for direct user interaction, allowing modifications and guidance of the generative direction at any intermediate step through (i) Adding and Removing components, (ii) Deformable and Rigid Dragging, (iii) Geometric Transformations, and (iv) Semantic Editing. Subsequently, the Gaussian splats are transformed into InstantNGP. We introduce a novel (v) Interactive Hash Refinement module to further add details and extract the geometry in the second stage. Our experiments demonstrate that Interactive3D markedly improves the controllability and quality of 3D generation. Our project webpage is available at \url{https://interactive-3d.github.io/}.

References (39)
  1. Learning representations and generative models for 3d point clouds. In ICML, 2018.
  2. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In ICCV, 2023a.
  3. Learning implicit fields for generative shape modeling. In CVPR, 2019.
  4. Text-to-3d using gaussian splatting. arXiv preprint arXiv:2309.16585, 2023b.
  5. 3d shape induction from 2d views of multiple objects. In 3DV, 2017.
  6. Get3d: A generative model of high quality 3d textured shapes learned from images. NeurIPS, 2022.
  7. Escaping plato’s cave: 3d shape from adversarial rendering. In ICCV, 2019.
  8. Denoising diffusion probabilistic models. NeurIPS, 2020.
  9. Zero-shot text-guided object generation with dream fields. In CVPR, 2022.
  10. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
  11. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.
  12. Segment anything. In ICCV, 2023.
  13. Magic3d: High-resolution text-to-3d content creation. In CVPR, 2023.
  14. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. NeurIPS, 2024.
  15. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023.
  16. Sparseneus: Fast generalizable neural surface reconstruction from sparse views. In ECCV, 2022.
  17. Inverse graphics gan: Learning to generate 3d shapes from unstructured 2d data. arXiv preprint arXiv:2002.12674, 2020.
  18. Scalable 3d captioning with pretrained models. NeurIPS, 2024.
  19. Occupancy networks: Learning 3d reconstruction in function space. In CVPR, 2019.
  20. Latent-nerf for shape-guided generation of 3d shapes and textures. In CVPR, 2023.
  21. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  22. Structurenet: Hierarchical graph networks for 3d shape generation. arXiv preprint arXiv:1908.00575, 2019.
  23. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics, 41(4):102:1–102:15, 2022.
  24. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  25. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
  26. Drag your gan: Interactive point-based manipulation on the generative image manifold. In ACM SIGGRAPH 2023 Conference Proceedings, 2023.
  27. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023.
  28. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  29. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  30. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022.
  31. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  32. Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. NeurIPS, 2021.
  33. Improved adversarial systems for 3d object generation and reconstruction. In CoRL, 2017.
  34. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR, 2023.
  35. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. NeurIPS, 2024.
  36. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. NeurIPS, 2016.
  37. Pointflow: 3d point cloud generation with continuous normalizing flows. In ICCV, 2019.
  38. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
  39. Image gans meet differentiable rendering for inverse graphics and interpretable 3d neural rendering. arXiv preprint arXiv:2010.09125, 2020.

Summary

  • The paper presents a two-stage approach that integrates interactive Gaussian Splatting with InstantNGP refinement for precise 3D object generation.
  • It enables direct manipulation through part addition, transformation, and semantic editing, effectively translating user intent into detailed 3D forms.
  • Quantitative results show higher CLIP R-Precision (up to 0.94) and improved efficiency, demonstrating its potential for interactive 3D modeling applications.

Interactive3D is a novel framework for controllable and high-quality 3D object generation that addresses the limitations of current methods, which often lack precise user control and struggle to translate user intent accurately into 3D shapes. The paper "Interactive3D: Create What You Want by Interactive 3D Generation" (arXiv 2404.16510, 25 Apr 2024) introduces a two-stage approach that leverages different 3D representations to facilitate user interaction throughout the generation process.

The core idea is to allow users to directly modify intermediate 3D outputs, guiding the generative process towards their desired outcome. This is achieved through a cascading structure utilizing Gaussian Splatting in the first stage for flexible interaction and InstantNGP in the second stage for detailed refinement and geometry extraction.

Stage I: Interaction with Gaussian Splatting

In the initial stage, the 3D object is represented as a set of Gaussian blobs. This explicit representation is chosen because it is amenable to direct manipulation by users. Users can interact with the Gaussian blobs in several ways:

  • Adding and Removing Parts: Users can combine existing Gaussian blob sets to add parts or delete blobs within a specified region to remove parts. Part selection can be done using 2D segmentation masks derived from rendered views or by directly selecting points in 3D space.
  • Geometry Transformation: Selected parts (defined by a bounding box) can undergo standard transformations such as rotation, translation, and stretching. The transformation is applied to the Gaussian blobs within the bounding box, and the modified set is combined with the rest of the object.
  • Deformable and Rigid Dragging: Inspired by DragGAN, users can select a source point, a target point, and a local region radius around the source point. The Gaussian blobs within this region are iteratively moved towards the target point during optimization using a motion supervision loss ($L_{\text{motion}}$). Deformable dragging allows the local structure to change, while rigid dragging maintains the local structure using a rigid constraint loss ($L_{\text{rigid}}$). This enables operations like pulling out new features or repositioning parts without deformation; a minimal sketch of these two losses follows this list.
  • Semantic Editing: Users can apply text prompts to selected parts. An Interactive SDS loss is used to optimize the Gaussian blobs in the selected region to match the new semantic description (e.g., making wings appear aflame).
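
The dragging operations above can be made concrete with a small loss sketch. The following is a minimal illustration, assuming the Gaussian centers are stored as an (N, 3) PyTorch tensor; the function name, step size, and exact loss forms are assumptions for illustration, not the paper's verbatim formulation.

```python
import torch

def dragging_losses(centers, init_centers, region_mask, source, target, rigid=False):
    """Illustrative motion-supervision and rigid-constraint losses for dragging.

    centers      : (N, 3) current Gaussian centers being optimized
    init_centers : (N, 3) centers captured when the drag was issued (detached)
    region_mask  : (N,)  boolean mask of blobs inside the user-selected radius
    source, target : (3,) user-selected source and target points
    """
    step_dir = target - source
    step_dir = step_dir / (step_dir.norm() + 1e-8)            # unit drag direction

    selected = centers[region_mask]
    selected_init = init_centers[region_mask]

    # Motion supervision: nudge the selected blobs a small step toward the
    # target by supervising against a detached, shifted copy of themselves.
    step = 0.01                                               # assumed step size
    motion_target = selected.detach() + step * step_dir
    L_motion = (selected - motion_target).abs().mean()

    # Rigid constraint: preserve the internal structure of the selected region
    # by keeping offsets relative to the region centroid unchanged.
    L_rigid = centers.new_zeros(())
    if rigid:
        offsets = selected - selected.mean(dim=0, keepdim=True)
        offsets_init = selected_init - selected_init.mean(dim=0, keepdim=True)
        L_rigid = (offsets - offsets_init).abs().mean()

    return L_motion, L_rigid
```

In practice, each drag iteration would combine these terms with the Interactive SDS loss described next and update the full set of Gaussian parameters.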

To enhance efficiency during interaction-guided optimization, the framework employs an Interactive SDS loss. This involves an adaptive camera zoom-in strategy that focuses rendering and SDS loss computation on the modified region, and allows adjusting the denoising step $t$ of the diffusion model used for SDS according to the generation stage, as sketched below.
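
A schematic of what such an SDS update with a user-chosen timestep could look like is given below; `unet` and `scheduler` are stand-ins for an arbitrary 2D latent-diffusion prior, and the zoomed-in rendering is assumed to have already been encoded into `render_latents` (all names are illustrative, not the paper's code).

```python
import torch

def interactive_sds_loss(render_latents, text_emb, unet, scheduler, t):
    """Schematic SDS objective with a user-chosen denoising step t.

    render_latents : latents of a view rendered with the camera zoomed in on
                     the edited region (differentiable w.r.t. the 3D params)
    unet, scheduler: placeholders for a 2D latent-diffusion prior; not a
                     specific library API
    """
    noise = torch.randn_like(render_latents)
    alpha_bar = scheduler.alphas_cumprod[t]
    noisy = alpha_bar.sqrt() * render_latents + (1.0 - alpha_bar).sqrt() * noise

    with torch.no_grad():                          # no gradients through the prior
        noise_pred = unet(noisy, t, text_emb)      # predicted noise for the prompt

    w = 1.0 - alpha_bar                            # a common SDS weighting choice
    grad = w * (noise_pred - noise)                # desired gradient w.r.t. latents
    # Standard SDS trick: a surrogate loss whose gradient w.r.t. render_latents
    # equals `grad`, so backpropagation flows only into the 3D representation.
    return (grad.detach() * render_latents).sum()
```

One natural use, consistent with the description above, is a larger $t$ early in generation to permit coarse structural changes and a smaller $t$ later to preserve existing content while adding detail.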

Stage II: Refinement with InstantNGP

After Stage I, the generated Gaussian blobs, while flexible for interaction, may contain artifacts and are not ideal for high-quality geometry extraction (like meshes). Therefore, the Gaussian representation is converted to InstantNGP.

  • NeRF Distillation: A simple distillation process is used to convert the Gaussian blobs to InstantNGP. This involves supervising renderings from the InstantNGP representation with renderings from the Gaussian Splatting representation using an L1 loss ($L_{\text{distill}}$); a minimal sketch of this step follows this list.
  • Interactive Hash Refinement: Standard InstantNGP can suffer from limited capacity and hash conflicts, making localized refinement difficult. Interactive3D introduces a novel module that freezes the original InstantNGP and adds new learnable residual features for user-selected regions of interest. This refinement uses part-specific multi-level hash tables and lightweight MLPs to add residual densities and colors, focusing detail enhancement on local surface areas. This allows precise control over refinement while avoiding negative impacts on other parts of the object.
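
A minimal sketch of the distillation step in the first bullet, assuming renderer hooks `render_gaussians` (frozen Stage I output) and `render_instant_ngp` (trainable Stage II field) that each return an image for a given camera:

```python
import torch

def distill_step(camera, render_gaussians, render_instant_ngp, optimizer):
    """One distillation step: supervise an InstantNGP rendering with a frozen
    Gaussian Splatting rendering via an L1 loss (renderer hooks are assumed)."""
    with torch.no_grad():
        target = render_gaussians(camera)      # (H, W, 3) image from Stage I
    pred = render_instant_ngp(camera)          # (H, W, 3) image from Stage II

    L_distill = (pred - target).abs().mean()   # L1 distillation loss
    optimizer.zero_grad()
    L_distill.backward()
    optimizer.step()
    return L_distill.item()
```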

The Interactive Hash Refinement module allows users to control the refinement process by selecting areas and adjusting the levels and capacities of the refinement hash tables. Optimization in this stage also uses the Interactive SDS loss, focusing on the selected region.
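
Conceptually, the residual refinement can be pictured as a frozen base field plus a part-specific residual branch that is only evaluated inside the user-selected region. The sketch below is an assumption-laden illustration: `base_field` and `hash_encoding` are placeholders for the frozen InstantNGP and a part-specific multi-level hash encoding, and the residual composition is simplified to additive density/color offsets.

```python
import torch
import torch.nn as nn

class ResidualHashRefinement(nn.Module):
    """Conceptual residual refinement: the base field is frozen, and a
    part-specific hash encoding plus a lightweight MLP adds residual density
    and color only inside a user-selected bounding box. `base_field` and
    `hash_encoding` are assumed components, not a specific library API."""

    def __init__(self, base_field, hash_encoding, bbox_min, bbox_max, hidden=64):
        super().__init__()
        self.base_field = base_field.eval()            # frozen Stage-II field
        for p in self.base_field.parameters():
            p.requires_grad_(False)
        self.encoding = hash_encoding                  # part-specific multi-level table
        self.mlp = nn.Sequential(
            nn.Linear(hash_encoding.out_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                      # residual (density, rgb)
        )
        self.register_buffer("bbox_min", bbox_min)     # (3,) tensors bounding the
        self.register_buffer("bbox_max", bbox_max)     # user-selected region

    def forward(self, x):                              # x: (N, 3) sample points
        density, color = self.base_field(x)            # frozen base prediction
        inside = ((x > self.bbox_min) & (x < self.bbox_max)).all(dim=-1)
        if inside.any():
            res = self.mlp(self.encoding(x[inside]))
            density[inside] = density[inside] + res[:, :1]
            color[inside] = color[inside] + res[:, 1:]
        return density, color
```

Because only the residual encoding and MLP are optimized, refining one part cannot degrade the rest of the object, which matches the behavior described above.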

Implementation Details and Considerations:

The framework is implemented in two stages using existing techniques for Gaussian Splatting (Text-to-3D using Gaussian Splatting, 2023) and InstantNGP (Instant Neural Graphics Primitives with a Multiresolution Hash Encoding, 2022) as foundations. Key details include (a configuration sketch summarizing them follows this list):

  • Training is performed on NVIDIA A100 GPUs, typically for 20k steps (10k per stage).
  • The AdamW optimizer is used.
  • Stage I initializes with points sampled from a pre-trained 3D model distribution (e.g., Shap-E (Shap-E: Generating Conditional 3D Implicit Functions, 2023)).
  • A 2D diffusion model (e.g., Stable Diffusion 2.1 (High-Resolution Image Synthesis with Latent Diffusion Models, 2022)) serves as the prior for the SDS loss in both stages.
  • Specific learning rates are used for different Gaussian parameters in Stage I.
  • The Hash Refinement module in Stage II uses multi-level hash tables (default 8 levels) with a feature dimension of 2 per position and a capacity of $2^{19}$ per table.
  • Rendering resolution during training is 256x256.
  • An orientation loss is used as a regularizer.
  • The Hash Refinement module is implemented in parallel using CUDA for efficiency.
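
For reference, the reported settings can be collected into a small configuration object; the field names below are illustrative, while the values follow the implementation details listed above.

```python
from dataclasses import dataclass

@dataclass
class HashRefinementConfig:
    """Hedged summary of the reported Stage-II refinement settings; field names
    are illustrative, values follow the paper's stated implementation details."""
    num_levels: int = 8              # multi-level hash tables (default 8 levels)
    features_per_level: int = 2      # feature dimension of 2 per position
    table_capacity: int = 2 ** 19    # capacity per hash table
    render_resolution: int = 256     # training renders at 256x256
    optimizer: str = "AdamW"
    total_steps: int = 20_000        # ~10k steps per stage
```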

Real-World Applications and Performance:

Interactive3D enables applications where users need to create specific, customized 3D models that cannot be easily generated with text prompts alone. Examples include:

  • Designing characters with specific poses or features by dragging and transforming parts.
  • Modifying existing generative outputs to fix artifacts or add details.
  • Combining components from different generated or pre-existing models.
  • Applying semantic edits (like making parts look aflame or metallic) locally.

The paper presents both qualitative and quantitative results demonstrating the effectiveness of Interactive3D. Qualitatively, it shows diverse examples of complex 3D objects generated with high fidelity and precise control through various interactions (dragging body parts, opening a dragon's mouth, creating a watermelon monster, refining details on a robot). Quantitatively, compared to methods like DreamFusion [poole2022dreamfusion] and ProlificDreamer [wang2023prolificdreamer], Interactive3D achieves a higher CLIP R-Precision (0.94 vs 0.67/0.83), indicating better alignment with textual descriptions, and is more efficient, with a lower average generation time (50 minutes vs 1.1h/3.4h). User studies indicated a strong preference for Interactive3D (95.5%) and showed that fewer attempts were needed (1.4 vs 2.3) compared to text-only methods.
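
For context, CLIP R-Precision is commonly computed by rendering views of the generated object and checking whether the conditioning prompt is the top-1 CLIP match among a set of candidate prompts. The sketch below uses the Hugging Face `transformers` CLIP interface as one possible implementation; it is a generic illustration of the metric, not the paper's evaluation code.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_r_precision(images, true_idx, prompts, model_name="openai/clip-vit-base-patch32"):
    """Fraction of rendered views for which the true prompt (prompts[true_idx])
    is the top-1 CLIP match among all candidate prompts."""
    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)

    inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits_per_image = model(**inputs).logits_per_image   # (num_views, num_prompts)

    top1 = logits_per_image.argmax(dim=-1)                    # best prompt per view
    return (top1 == true_idx).float().mean().item()
```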

Limitations:

The paper acknowledges that Interactive3D can be susceptible to failure under excessive or unreasonable user manipulation. Furthermore, it inherits some limitations from the underlying generative models and representations it uses, such as potential issues with color saturation.
