Wonder3D: Single Image to 3D using Cross-Domain Diffusion
(2310.15008v3)
Published 23 Oct 2023 in cs.CV
Abstract: In this work, we introduce Wonder3D, a novel method for efficiently generating high-fidelity textured meshes from single-view images. Recent methods based on Score Distillation Sampling (SDS) have shown the potential to recover 3D geometry from 2D diffusion priors, but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast, certain works directly produce 3D information via fast network inferences, but their results are often of low quality and lack geometric details. To holistically improve the quality, consistency, and efficiency of image-to-3D tasks, we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure consistency, we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. Lastly, we introduce a geometry-aware normal fusion algorithm that extracts high-quality surfaces from the multi-view 2D representations. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results, robust generalization, and reasonably good efficiency compared to prior works.
Wonder3D is a novel method for generating high-fidelity textured meshes from a single input image. It addresses limitations of previous single-view 3D reconstruction techniques: Score Distillation Sampling (SDS) methods (Wang et al., 2023; Chen et al., 2023; Raj et al., 2023; Dar et al., 2022) require time-consuming per-shape optimization and can suffer from inconsistent geometry (such as the Janus problem), while methods that directly predict 3D information via fast network inference (Nichol et al., 2022; Jun et al., 2023; Liu et al., 2023) often produce low-quality results that lack geometric detail. Wonder3D aims for improved quality, consistency, generalizability, and efficiency.
The core idea is to bypass direct 3D generation or slow per-shape optimization by generating consistent multi-view 2D representations – specifically, both normal maps and color images – using a novel cross-domain diffusion model. These 2D outputs are then used to reconstruct a 3D textured mesh via a robust normal fusion algorithm.
Here's a breakdown of the practical implementation and application of the Wonder3D approach:
1. Multi-view Cross-Domain Diffusion Model
The method extends a pre-trained 2D diffusion model (specifically, the Stable Diffusion Image Variations Model, fine-tuned for image conditions) to handle multiple views and two distinct domains (normal maps and color images) simultaneously. This allows it to leverage the powerful priors learned by large 2D diffusion models while generating geometrically relevant information (normals) alongside appearance (color).
Consistent Multi-view Generation: To ensure consistency across different views, the model incorporates a multi-view attention mechanism. Standard self-attention layers within the diffusion UNet are extended to connect keys and values from different views, allowing information exchange and implicitly encoding multi-view dependencies. This helps the model generate images that look consistent from various angles.
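A minimal sketch of this idea, under assumed tensor shapes and module names (this is not the authors' exact implementation): features from all views are flattened into a single token sequence so that the keys and values of every view are visible to every query.

```python
import torch
import torch.nn as nn

class MultiViewSelfAttention(nn.Module):
    """Self-attention whose keys/values span all views of the same object.

    Illustrative sketch only: shapes and module layout are assumptions,
    not the exact Wonder3D implementation.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, tokens, dim) -- latent features for V views
        b, v, n, c = x.shape
        # Flatten views into one long token sequence so attention
        # exchanges information across all views at once.
        x = x.reshape(b, v * n, c)
        out, _ = self.attn(x, x, x)   # queries, keys, values all cover every view
        return out.reshape(b, v, n, c)

# Example: 6 views, 256 tokens per view, 320-dim features
feats = torch.randn(2, 6, 256, 320)
print(MultiViewSelfAttention(320)(feats).shape)  # torch.Size([2, 6, 256, 320])
```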
Cross-Domain Diffusion: Handling both normals and colors within one diffusion model is achieved using a "domain switcher." This is a one-dimensional vector, encoded using positional encoding and concatenated with the time embedding, that acts as a conditioning input to the UNet. By training the model to generate normal maps when conditioned on one domain switcher value (s_n) and color images on another (s_c), the model learns to operate on both domains.
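Conceptually, the switcher can be treated as a scalar label (e.g., 0 for normals, 1 for color) passed through a sinusoidal positional encoding and concatenated with the diffusion timestep embedding. The sketch below uses a generic sinusoidal encoding and hypothetical dimensions; the paper's exact encoding may differ.

```python
import math
import torch

def sinusoidal_embedding(x: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding of one scalar per batch element."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    angles = x[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

# Hypothetical conditioning: timestep t and domain switcher s (0 = normal, 1 = color)
t = torch.tensor([500, 500])   # diffusion timesteps
s = torch.tensor([0, 1])       # generate normals for sample 0, colors for sample 1

t_emb = sinusoidal_embedding(t, 320)
s_emb = sinusoidal_embedding(s, 320)

# Concatenate and feed into the UNet's time-embedding MLP (not shown here).
cond = torch.cat([t_emb, s_emb], dim=-1)
print(cond.shape)  # torch.Size([2, 640])
```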
Cross-domain Attention: To ensure consistency between the generated normal maps and color images for the same view, a cross-domain attention layer is introduced. This layer, placed before the cross-attention in transformer blocks, combines keys and values from both the normal and color domains, facilitating information flow and correlation between the two outputs.
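To illustrate the idea, a sketch in which each domain's queries attend over the concatenated keys/values of both domains for the same view; names and shapes are assumptions, not the paper's exact layer design.

```python
import torch
import torch.nn as nn

class CrossDomainAttention(nn.Module):
    """Attention in which each domain's queries see keys/values from both the
    normal and color branches of the same view (illustrative sketch only)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x_normal: torch.Tensor, x_color: torch.Tensor):
        # x_normal, x_color: (batch, tokens, dim) features of the same view
        kv = torch.cat([x_normal, x_color], dim=1)   # shared keys/values
        out_n, _ = self.attn(x_normal, kv, kv)       # normal queries attend to both domains
        out_c, _ = self.attn(x_color, kv, kv)        # color queries attend to both domains
        return out_n, out_c

x_n = torch.randn(2, 256, 320)
x_c = torch.randn(2, 256, 320)
out_n, out_c = CrossDomainAttention(320)(x_n, x_c)
print(out_n.shape, out_c.shape)  # (2, 256, 320) each
```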
The model is trained on multi-view normal maps and color images rendered from 3D assets (using the LVIS subset of the Objaverse dataset (Skorokhodov et al., 2023) with random rotations for diversity). The fine-tuning process takes about 30,000 steps, requiring substantial computational resources (e.g., 3 days on 8 Nvidia Tesla A800 GPUs).
2. Textured Mesh Extraction (Geometry-aware Normal Fusion)
Once the multi-view normal maps and color images are generated (typically from 6 views: front, back, left, right, front-right, front-left), they are used to reconstruct a 3D mesh. Instead of traditional multi-view stereo or simple image-based reconstruction, Wonder3D optimizes a neural implicit Signed Distance Field (SDF) (Wang et al., 2021).
Why SDF? SDFs are chosen for their compactness and differentiability, which make them well suited to gradient-based optimization.
Challenges with Generated Data: Standard SDF reconstruction methods like NeuS (Wang et al., 2021) assume dense, accurate real-world views. The generated views from a diffusion model are sparser and may contain subtle inaccuracies. Directly applying standard methods leads to distorted geometry, outliers, and incompleteness.
Geometry-aware Optimization Scheme: To overcome these issues, a novel optimization objective is proposed. The optimization uses randomly sampled rays from all views and includes standard losses (RGB loss, mask loss from segmented object masks, Eikonal regularization, sparsity regularization, 3D smoothness regularization) along with a specialized geometry-aware normal loss.
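As one concrete example of the standard regularizers, here is a minimal sketch of an SDF network with an Eikonal term. The toy MLP below is purely illustrative; Wonder3D builds on an instant-NGP-style hash-grid SDF rather than this architecture.

```python
import torch
import torch.nn as nn

class SDFNetwork(nn.Module):
    """Tiny MLP mapping 3D points to signed distances (illustrative sketch)."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def eikonal_loss(sdf_net: SDFNetwork, points: torch.Tensor) -> torch.Tensor:
    """Encourage |grad SDF| = 1 at sampled points (one of the regularizers listed above)."""
    points = points.requires_grad_(True)
    sdf = sdf_net(points)
    grad = torch.autograd.grad(sdf.sum(), points, create_graph=True)[0]
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()

net = SDFNetwork()
pts = torch.rand(512, 3) * 2 - 1   # random points in [-1, 1]^3
print(eikonal_loss(net, pts))
```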
Geometry-aware Normal Loss: This loss maximizes the similarity between the normal vector derived from the optimized SDF (ĝ) and the generated normal map value (g) at each sampled point. Crucially, it introduces a geometry-aware weight w_k that prioritizes normals that are strongly opposed to the viewing ray. A normal facing directly back at the camera makes an angle of 180 degrees with the viewing ray (cosine −1), whereas normals whose angle with the viewing ray is close to 90 degrees (cosine 0) lie near silhouettes and are less reliable. The weighting function w_k = exp(|cos(v_k, g_k)|), applied when cos(v_k, g_k) ≤ ε (a negative threshold close to zero, i.e., the normal roughly points away from the camera), therefore assigns higher weight to normals that are more directly opposed to the viewing direction. This makes the optimization more robust to inaccuracies in generated normals from views where the surface is nearly parallel to the ray.
The per-ray normal error for a sampled ray k is e_k = 1 − cos(ĝ_k, g_k), and the total normal loss is the weighted average L_normal = (1 / Σ_k w_k) · Σ_k w_k · e_k.
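Putting the two formulas together, a sketch of the weighted normal loss follows. The threshold value and the choice to give zero weight to rays above the threshold are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def geometry_aware_normal_loss(pred_n, gen_n, view_dirs, eps=-0.1):
    """Weighted normal consistency loss (illustrative sketch).

    pred_n:    (R, 3) normals derived from the optimized SDF (ĝ)
    gen_n:     (R, 3) normals from the generated normal maps (g)
    view_dirs: (R, 3) unit viewing-ray directions (v)
    eps:       negative threshold close to zero; rays whose generated normal is
               nearly perpendicular to the view ray get zero weight (assumption).
    """
    cos_vg = F.cosine_similarity(view_dirs, gen_n, dim=-1)          # cos(v_k, g_k)
    w = torch.where(cos_vg <= eps, torch.exp(cos_vg.abs()),
                    torch.zeros_like(cos_vg))                        # w_k
    e = 1.0 - F.cosine_similarity(pred_n, gen_n, dim=-1)             # e_k
    return (w * e).sum() / w.sum().clamp_min(1e-8)                   # (1/Σ w_k) Σ w_k e_k

# Example with random rays
R = 1024
loss = geometry_aware_normal_loss(torch.randn(R, 3), torch.randn(R, 3),
                                  F.normalize(torch.randn(R, 3), dim=-1))
print(loss)
```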
Outlier-dropping Losses: To handle inaccuracies in generated masks and color images, an "outlier-dropping" strategy is applied. During optimization, when calculating the color or mask loss, the errors for all sampled rays are sorted, and a predefined percentage of the largest errors are discarded. This prevents noisy or erroneous generated data from significantly degrading the mesh quality.
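The outlier-dropping idea amounts to a truncated mean over per-ray errors: sort, discard the largest fraction, and average the rest. The drop ratio below is a hypothetical value, not the one used in the paper.

```python
import torch

def outlier_dropping_loss(per_ray_error: torch.Tensor, drop_ratio: float = 0.1) -> torch.Tensor:
    """Average per-ray errors after discarding the largest `drop_ratio` fraction
    (illustrative sketch of the outlier-dropping strategy)."""
    n = per_ray_error.numel()
    keep = max(1, int(n * (1.0 - drop_ratio)))
    sorted_err, _ = torch.sort(per_ray_error.flatten())   # ascending
    return sorted_err[:keep].mean()                       # drop the largest errors

# Example: per-ray RGB errors with a few large outliers
err = torch.cat([torch.rand(1000) * 0.05, torch.tensor([5.0, 7.0, 9.0])])
print(outlier_dropping_loss(err, drop_ratio=0.01))
```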
The mesh extraction process is built on an instant-NGP-based SDF reconstruction [instant-nsr-pl] and takes only a few minutes; the paper reports roughly 2-3 minutes end-to-end, including both multi-view generation and reconstruction.
Real-World Applications and Implementation Considerations:
3D Content Creation: Wonder3D provides a fast and high-quality method for generating 3D models from single images, useful for artists, game developers, and VR/AR content creators. Its ability to reconstruct detailed geometry and textures from diverse image styles (like cartoons and sketches) is particularly valuable for creative workflows.
Virtual Reality and Gaming: Generated 3D assets can be integrated into virtual environments or games.
E-commerce and Product Visualization: Creating 3D models of products from single images for online stores or augmented reality previews.
Robotics: Reconstructing 3D geometry of objects from camera input can aid in tasks like object recognition and grasping.
Computational Requirements: Training the model requires a powerful GPU cluster. Inference (single image to multi-view generation) is much faster but still involves the diffusion model's iterative sampling. The subsequent mesh extraction step is relatively quick because it is based on accelerated neural rendering techniques like instant-NGP.
Data Requirements: Fine-tuning requires a dataset of multi-view color images and corresponding normal maps of 3D objects. Creating such datasets involves rendering normalized 3D assets from multiple viewpoints with known camera parameters.
Limitations: The current implementation uses a limited number of views (6). This can still pose challenges for objects with complex geometry, thin structures, or severe occlusions, as unseen regions must be hallucinated. Scaling to more views would require more efficient multi-view attention mechanisms to manage computational costs.
Generalization: While the method shows good generalization to diverse object categories and styles, performance may vary depending on the object's complexity and similarity to the training data distribution.
In summary, Wonder3D translates the power of 2D diffusion models into efficient, high-quality single-view 3D reconstruction by learning to generate consistent multi-view normal maps and colors, and then leveraging these 2D signals within a specialized 3D reconstruction framework robust to potential inconsistencies in the generated views.