Interactive3D: Create What You Want by Interactive 3D Generation (2404.16510v1)

Published 25 Apr 2024 in cs.GR and cs.CV

Abstract: 3D object generation has undergone significant advancements, yielding high-quality results. However, current methods fall short of achieving precise user control, often yielding results that do not align with user expectations, thus limiting their applicability. User-envisioned 3D object generation faces significant challenges in realizing its concepts using current generative models due to limited interaction capabilities. Existing methods mainly offer two approaches: (i) interpreting textual instructions with constrained controllability, or (ii) reconstructing 3D objects from 2D images. Both of them limit customization to the confines of the 2D reference and potentially introduce undesirable artifacts during the 3D lifting process, restricting the scope for direct and versatile 3D modifications. In this work, we introduce Interactive3D, an innovative framework for interactive 3D generation that grants users precise control over the generative process through extensive 3D interaction capabilities. Interactive3D is constructed in two cascading stages, utilizing distinct 3D representations. The first stage employs Gaussian Splatting for direct user interaction, allowing modifications and guidance of the generative direction at any intermediate step through (i) Adding and Removing components, (ii) Deformable and Rigid Dragging, (iii) Geometric Transformations, and (iv) Semantic Editing. Subsequently, the Gaussian splats are transformed into InstantNGP. We introduce a novel (v) Interactive Hash Refinement module to further add details and extract the geometry in the second stage. Our experiments demonstrate that Interactive3D markedly improves the controllability and quality of 3D generation. Our project webpage is available at \url{https://interactive-3d.github.io/}.

References (39)
  1. Learning representations and generative models for 3d point clouds. In ICML, 2018.
  2. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In ICCV, 2023a.
  3. Learning implicit fields for generative shape modeling. In CVPR, 2019.
  4. Text-to-3d using gaussian splatting. arXiv preprint arXiv:2309.16585, 2023b.
  5. 3d shape induction from 2d views of multiple objects. In 3DV, 2017.
  6. Get3d: A generative model of high quality 3d textured shapes learned from images. NeurIPS, 2022.
  7. Escaping plato’s cave: 3d shape from adversarial rendering. In ICCV, 2019.
  8. Denoising diffusion probabilistic models. NeurIPS, 2020.
  9. Zero-shot text-guided object generation with dream fields. In CVPR, 2022.
  10. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
  11. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.
  12. Segment anything. In ICCV, 2023.
  13. Magic3d: High-resolution text-to-3d content creation. In CVPR, 2023.
  14. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. NeurIPS, 2024.
  15. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023.
  16. Sparseneus: Fast generalizable neural surface reconstruction from sparse views. In ECCV, 2022.
  17. Inverse graphics gan: Learning to generate 3d shapes from unstructured 2d data. arXiv preprint arXiv:2002.12674, 2020.
  18. Scalable 3d captioning with pretrained models. NeurIPS, 2024.
  19. Occupancy networks: Learning 3d reconstruction in function space. In CVPR, 2019.
  20. Latent-nerf for shape-guided generation of 3d shapes and textures. In CVPR, 2023.
  21. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  22. Structurenet: Hierarchical graph networks for 3d shape generation. arXiv preprint arXiv:1908.00575, 2019.
  23. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics, 41(4):102:1–102:15, 2022.
  24. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  25. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
  26. Drag your gan: Interactive point-based manipulation on the generative image manifold. In ACM SIGGRAPH 2023 Conference Proceedings, 2023.
  27. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023.
  28. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  29. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  30. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022.
  31. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  32. Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. NeurIPS, 2021.
  33. Improved adversarial systems for 3d object generation and reconstruction. In CoRL, 2017.
  34. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR, 2023.
  35. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. NeurIPS, 2024.
  36. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. NeurIPS, 2016.
  37. Pointflow: 3d point cloud generation with continuous normalizing flows. In ICCV, 2019.
  38. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
  39. Image gans meet differentiable rendering for inverse graphics and interpretable 3d neural rendering. arXiv preprint arXiv:2010.09125, 2020.

Summary

  • The paper presents a two-stage approach that integrates interactive Gaussian Splatting with InstantNGP refinement for precise 3D object generation.
  • It enables direct manipulation through part addition, transformation, and semantic editing, effectively translating user intent into detailed 3D forms.
  • Quantitative results show higher CLIP R-Precision (up to 0.94) and improved efficiency, demonstrating its potential for interactive 3D modeling applications.

Interactive3D is a novel framework for controllable and high-quality 3D object generation that addresses the limitations of current methods, which often lack precise user control and struggle to translate user intent accurately into 3D shapes. The paper "Interactive3D: Create What You Want by Interactive 3D Generation" (arXiv 2404.16510, 25 Apr 2024) introduces a two-stage approach that leverages different 3D representations to facilitate user interaction throughout the generation process.

The core idea is to allow users to directly modify intermediate 3D outputs, guiding the generative process towards their desired outcome. This is achieved through a cascading structure utilizing Gaussian Splatting in the first stage for flexible interaction and InstantNGP in the second stage for detailed refinement and geometry extraction.

Stage I: Interaction with Gaussian Splatting

In the initial stage, the 3D object is represented as a set of Gaussian blobs. This explicit representation is chosen because it is amenable to direct manipulation by users. Users can interact with the Gaussian blobs in several ways:

  • Adding and Removing Parts: Users can combine existing Gaussian blob sets to add parts or delete blobs within a specified region to remove parts. Part selection can be done using 2D segmentation masks derived from rendered views or by directly selecting points in 3D space.
  • Geometry Transformation: Selected parts (defined by a bounding box) can undergo standard transformations such as rotation, translation, and stretching. The transformation is applied to the Gaussian blobs within the bounding box, and the modified set is combined with the rest of the object.
  • Deformable and Rigid Dragging: Inspired by DragGAN, users can select a source point, a target point, and a local region radius around the source point. The Gaussian blobs within this region are iteratively moved towards the target point during optimization using a motion supervision loss ($L_{\text{motion}}$). Deformable dragging allows the local structure to change, while rigid dragging maintains the local structure using a rigid constraint loss ($L_{\text{rigid}}$). This enables operations like pulling out new features or repositioning parts without deformation; a minimal sketch of these two losses follows this list.
  • Semantic Editing: Users can apply text prompts to selected parts. An Interactive SDS loss is used to optimize the Gaussian blobs in the selected region to match the new semantic description (e.g., making wings appear aflame).
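
The dragging operations above can be made concrete with a small loss sketch. The following is a minimal illustration, assuming the Gaussian centers are stored as an (N, 3) PyTorch tensor; the function name, step size, and exact loss forms are assumptions for illustration, not the paper's verbatim formulation.

```python
import torch

def dragging_losses(centers, init_centers, region_mask, source, target, rigid=False):
    """Illustrative motion-supervision and rigid-constraint losses for dragging.

    centers      : (N, 3) current Gaussian centers being optimized
    init_centers : (N, 3) centers captured when the drag was issued (detached)
    region_mask  : (N,)  boolean mask of blobs inside the user-selected radius
    source, target : (3,) user-selected source and target points
    """
    step_dir = target - source
    step_dir = step_dir / (step_dir.norm() + 1e-8)            # unit drag direction

    selected = centers[region_mask]
    selected_init = init_centers[region_mask]

    # Motion supervision: nudge the selected blobs a small step toward the
    # target by supervising against a detached, shifted copy of themselves.
    step = 0.01                                               # assumed step size
    motion_target = selected.detach() + step * step_dir
    L_motion = (selected - motion_target).abs().mean()

    # Rigid constraint: preserve the internal structure of the selected region
    # by keeping offsets relative to the region centroid unchanged.
    L_rigid = centers.new_zeros(())
    if rigid:
        offsets = selected - selected.mean(dim=0, keepdim=True)
        offsets_init = selected_init - selected_init.mean(dim=0, keepdim=True)
        L_rigid = (offsets - offsets_init).abs().mean()

    return L_motion, L_rigid
```

In practice, each drag iteration would combine these terms with the Interactive SDS loss described next and update the full set of Gaussian parameters.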

To enhance efficiency during interaction-guided optimization, the framework employs an Interactive SDS loss. This involves an adaptive camera zoom-in strategy that focuses rendering and SDS loss computation on the modified region, and allows adjusting the denoising step $t$ of the diffusion model used for SDS according to the generation stage, as sketched below.
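
A schematic of what such an SDS update with a user-chosen timestep could look like is given below; `unet` and `scheduler` are stand-ins for an arbitrary 2D latent-diffusion prior, and the zoomed-in rendering is assumed to have already been encoded into `render_latents` (all names are illustrative, not the paper's code).

```python
import torch

def interactive_sds_loss(render_latents, text_emb, unet, scheduler, t):
    """Schematic SDS objective with a user-chosen denoising step t.

    render_latents : latents of a view rendered with the camera zoomed in on
                     the edited region (differentiable w.r.t. the 3D params)
    unet, scheduler: placeholders for a 2D latent-diffusion prior; not a
                     specific library API
    """
    noise = torch.randn_like(render_latents)
    alpha_bar = scheduler.alphas_cumprod[t]
    noisy = alpha_bar.sqrt() * render_latents + (1.0 - alpha_bar).sqrt() * noise

    with torch.no_grad():                          # no gradients through the prior
        noise_pred = unet(noisy, t, text_emb)      # predicted noise for the prompt

    w = 1.0 - alpha_bar                            # a common SDS weighting choice
    grad = w * (noise_pred - noise)                # desired gradient w.r.t. latents
    # Standard SDS trick: a surrogate loss whose gradient w.r.t. render_latents
    # equals `grad`, so backpropagation flows only into the 3D representation.
    return (grad.detach() * render_latents).sum()
```

One natural use, consistent with the description above, is a larger $t$ early in generation to permit coarse structural changes and a smaller $t$ later to preserve existing content while adding detail.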

Stage II: Refinement with InstantNGP

After Stage I, the generated Gaussian blobs, while flexible for interaction, may contain artifacts and are not ideal for high-quality geometry extraction (like meshes). Therefore, the Gaussian representation is converted to InstantNGP.

  • NeRF Distillation: A simple distillation process is used to convert the Gaussian blobs to InstantNGP. This involves supervising renderings from the InstantNGP representation with renderings from the Gaussian Splatting representation using an L1 loss ($L_{\text{distill}}$); a minimal sketch of this step follows this list.
  • Interactive Hash Refinement: Standard InstantNGP can suffer from limited capacity and hash conflicts, making localized refinement difficult. Interactive3D introduces a novel module that freezes the original InstantNGP and adds new learnable residual features for user-selected regions of interest. This refinement uses part-specific multi-level hash tables and lightweight MLPs to add residual densities and colors, focusing detail enhancement on local surface areas. This allows precise control over refinement while avoiding negative impacts on other parts of the object.
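
A minimal sketch of the distillation step in the first bullet, assuming renderer hooks `render_gaussians` (frozen Stage I output) and `render_instant_ngp` (trainable Stage II field) that each return an image for a given camera:

```python
import torch

def distill_step(camera, render_gaussians, render_instant_ngp, optimizer):
    """One distillation step: supervise an InstantNGP rendering with a frozen
    Gaussian Splatting rendering via an L1 loss (renderer hooks are assumed)."""
    with torch.no_grad():
        target = render_gaussians(camera)      # (H, W, 3) image from Stage I
    pred = render_instant_ngp(camera)          # (H, W, 3) image from Stage II

    L_distill = (pred - target).abs().mean()   # L1 distillation loss
    optimizer.zero_grad()
    L_distill.backward()
    optimizer.step()
    return L_distill.item()
```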

The Interactive Hash Refinement module allows users to control the refinement process by selecting areas and adjusting the levels and capacities of the refinement hash tables. Optimization in this stage also uses the Interactive SDS loss, focusing on the selected region.
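
Conceptually, the residual refinement can be pictured as a frozen base field plus a part-specific residual branch that is only evaluated inside the user-selected region. The sketch below is an assumption-laden illustration: `base_field` and `hash_encoding` are placeholders for the frozen InstantNGP and a part-specific multi-level hash encoding, and the residual composition is simplified to additive density/color offsets.

```python
import torch
import torch.nn as nn

class ResidualHashRefinement(nn.Module):
    """Conceptual residual refinement: the base field is frozen, and a
    part-specific hash encoding plus a lightweight MLP adds residual density
    and color only inside a user-selected bounding box. `base_field` and
    `hash_encoding` are assumed components, not a specific library API."""

    def __init__(self, base_field, hash_encoding, bbox_min, bbox_max, hidden=64):
        super().__init__()
        self.base_field = base_field.eval()            # frozen Stage-II field
        for p in self.base_field.parameters():
            p.requires_grad_(False)
        self.encoding = hash_encoding                  # part-specific multi-level table
        self.mlp = nn.Sequential(
            nn.Linear(hash_encoding.out_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                      # residual (density, rgb)
        )
        self.register_buffer("bbox_min", bbox_min)     # (3,) tensors bounding the
        self.register_buffer("bbox_max", bbox_max)     # user-selected region

    def forward(self, x):                              # x: (N, 3) sample points
        density, color = self.base_field(x)            # frozen base prediction
        inside = ((x > self.bbox_min) & (x < self.bbox_max)).all(dim=-1)
        if inside.any():
            res = self.mlp(self.encoding(x[inside]))
            density[inside] = density[inside] + res[:, :1]
            color[inside] = color[inside] + res[:, 1:]
        return density, color
```

Because only the residual encoding and MLP are optimized, refining one part cannot degrade the rest of the object, which matches the behavior described above.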

Implementation Details and Considerations:

The framework is implemented in two stages using existing techniques for Gaussian Splatting (Text-to-3D using Gaussian Splatting, 2023) and InstantNGP (Instant Neural Graphics Primitives with a Multiresolution Hash Encoding, 2022) as foundations. Key details include (a configuration sketch summarizing them follows this list):

  • Training is performed on NVIDIA A100 GPUs, typically for 20k steps (10k per stage).
  • The AdamW optimizer is used.
  • Stage I initializes with points sampled from a pre-trained 3D model distribution (e.g., Shap-E (Shap-E: Generating Conditional 3D Implicit Functions, 2023)).
  • A 2D diffusion model (e.g., Stable Diffusion 2.1 (High-Resolution Image Synthesis with Latent Diffusion Models, 2022)) serves as the prior for the SDS loss in both stages.
  • Specific learning rates are used for different Gaussian parameters in Stage I.
  • The Hash Refinement module in Stage II uses multi-level hash tables (default 8 levels) with a feature dimension of 2 per position and a capacity of $2^{19}$ per table.
  • Rendering resolution during training is 256x256.
  • An orientation loss is used as a regularizer.
  • The Hash Refinement module is implemented in parallel using CUDA for efficiency.
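
For reference, the reported settings can be collected into a small configuration object; the field names below are illustrative, while the values follow the implementation details listed above.

```python
from dataclasses import dataclass

@dataclass
class HashRefinementConfig:
    """Hedged summary of the reported Stage-II refinement settings; field names
    are illustrative, values follow the paper's stated implementation details."""
    num_levels: int = 8              # multi-level hash tables (default 8 levels)
    features_per_level: int = 2      # feature dimension of 2 per position
    table_capacity: int = 2 ** 19    # capacity per hash table
    render_resolution: int = 256     # training renders at 256x256
    optimizer: str = "AdamW"
    total_steps: int = 20_000        # ~10k steps per stage
```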

Real-World Applications and Performance:

Interactive3D enables applications where users need to create specific, customized 3D models that cannot be easily generated with text prompts alone. Examples include:

  • Designing characters with specific poses or features by dragging and transforming parts.
  • Modifying existing generative outputs to fix artifacts or add details.
  • Combining components from different generated or pre-existing models.
  • Applying semantic edits (like making parts look aflame or metallic) locally.

The paper presents both qualitative and quantitative results demonstrating the effectiveness of Interactive3D. Qualitatively, it shows diverse examples of complex 3D objects generated with high fidelity and precise control through various interactions (dragging body parts, opening a dragon's mouth, creating a watermelon monster, refining details on a robot). Quantitatively, compared to methods like DreamFusion [poole2022dreamfusion] and ProlificDreamer [wang2023prolificdreamer], Interactive3D achieves a higher CLIP R-Precision (0.94 vs 0.67/0.83), indicating better alignment with textual descriptions, and is more efficient, with a lower average generation time (50 minutes vs 1.1h/3.4h). User studies indicated a strong preference for Interactive3D (95.5%) and showed that fewer attempts were needed (1.4 vs 2.3) compared to text-only methods.
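
For context, CLIP R-Precision is commonly computed by rendering views of the generated object and checking whether the conditioning prompt is the top-1 CLIP match among a set of candidate prompts. The sketch below uses the Hugging Face `transformers` CLIP interface as one possible implementation; it is a generic illustration of the metric, not the paper's evaluation code.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_r_precision(images, true_idx, prompts, model_name="openai/clip-vit-base-patch32"):
    """Fraction of rendered views for which the true prompt (prompts[true_idx])
    is the top-1 CLIP match among all candidate prompts."""
    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)

    inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits_per_image = model(**inputs).logits_per_image   # (num_views, num_prompts)

    top1 = logits_per_image.argmax(dim=-1)                    # best prompt per view
    return (top1 == true_idx).float().mean().item()
```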

Limitations:

The paper acknowledges that Interactive3D can be susceptible to failure under excessive or unreasonable user manipulation. Furthermore, it inherits some limitations from the underlying generative models and representations it uses, such as potential issues with color saturation.
