- The paper introduces CMD, a framework that reformulates 3D generation as a conditional multiview diffusion task to enable precise local edits.
- It employs a two-stage approach: Conditional Multiview Generation with MVControlNet, followed by incremental, differentiable mesh reconstruction for efficient 3D updates.
- CMD achieves rapid local editing (around 20 seconds per edit) and progressive 3D generation by integrating global context for consistent part synthesis.
Here is a summary of the paper "CMD: Controllable Multiview Diffusion for 3D Editing and Progressive Generation" (2505.07003), focusing on its practical implementation and applications.
The paper introduces CMD (Controllable Multiview Diffusion), a novel framework designed to address the limitations of existing 3D generation methods regarding flexible local editing and the generation of complex 3D assets. Traditional methods often require regenerating an entire 3D model even for minor changes to the input, lacking granular control. CMD formulates 3D generation and editing as a conditional multiview diffusion task, allowing for part-by-part generation and localized editing guided by image modifications.
The core of CMD consists of two main stages:
- Conditional Multiview Generation (CondMV):
- This is a key component built upon a cross-modality multiview diffusion model. It extends existing multiview diffusion models (which typically take only a single image or text prompt) to accept multiview conditions.
- Inputs: CondMV takes two types of input:
- A single-view target image $x$, which is typically an edited version of one rendered view of the 3D model.
- A set of multiview conditions $\{c_i\}_{i=1}^N$, consisting of color and normal maps ($\{p_i\}_{i=1}^N$, $\{n_i\}_{i=1}^N$) rendered from the existing 3D model $\Gamma$. These conditions represent the original state of the 3D model from different viewpoints.
- Output: The model generates a new set of multiview color and normal maps $\{c'_i\}_{i=1}^N$ from predefined viewpoints. The crucial aspect is that these generated views are consistent with the original model's views in unedited areas but reflect the changes introduced in the target image $x$ in the edited region.
- MVControlNet Implementation: To achieve this conditional generation, CMD uses a Multiview ControlNet (MVControlNet), inspired by the original ControlNet. The MVControlNet mirrors the structure of the base diffusion model's UNet. Multiview conditions $\{c_i\}_{i=1}^N$ are processed by a small CNN and then injected into the UNet's encoder and decoder layers via zero-convolution layers. Unlike standard ControlNet applications, CMD finetunes both the base UNet and the MVControlNet simultaneously using the diffusion loss. This allows the model to learn to automatically identify and modify only the regions corresponding to the edits in the target image while preserving others, without explicit masking or 3D guidance (a minimal sketch of this conditioning follows).
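The conditioning mechanism can be pictured with a short PyTorch sketch. This is not the authors' implementation: the class name `MVControlNetSketch`, the condition-encoder layers, and the channel sizes are illustrative assumptions; only the idea of encoding the multiview color/normal conditions with a small CNN and adding them to the UNet features through zero-initialized convolutions follows the description above.

```python
import torch.nn as nn
import torch.nn.functional as F

def zero_conv(in_ch, out_ch):
    """1x1 convolution initialized to zero, so the control branch contributes
    nothing at the start of finetuning (the ControlNet trick)."""
    conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class MVControlNetSketch(nn.Module):
    """Illustrative control branch: multiview color+normal conditions are
    encoded by a small CNN and injected into the base UNet's encoder/decoder
    features via zero-convolutions at each resolution."""
    def __init__(self, cond_channels=6, hidden=256, unet_channels=(320, 640, 1280)):
        super().__init__()
        self.cond_encoder = nn.Sequential(
            nn.Conv2d(cond_channels, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, hidden, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.zero_convs = nn.ModuleList(zero_conv(hidden, c) for c in unet_channels)

    def forward(self, cond_views, unet_features):
        # cond_views:    (B*N, 6, H, W) color + normal maps of the N condition views
        # unet_features: list of per-resolution activations from the base UNet
        h = self.cond_encoder(cond_views)
        injected = []
        for feat, zc in zip(unet_features, self.zero_convs):
            h_resized = F.interpolate(h, size=feat.shape[-2:], mode="bilinear",
                                      align_corners=False)
            injected.append(feat + zc(h_resized))  # residual control signal
        return injected
```

Because the zero-convolutions start at exactly zero, the finetuned model initially behaves like the base multiview diffusion model and gradually learns how much of the condition signal to inject.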
- Incremental Reconstruction:
- After generating the edited multiview color and normal maps $\{c'_i\}_{i=1}^N$, these 2D images are lifted back to a 3D mesh $\Gamma'$.
- CMD employs a differentiable rendering-based approach using continuous remeshing.
- Optimization: The process initializes with the original mesh $\Gamma$ and optimizes its geometry by minimizing a loss function that compares the rendered views of the current mesh state with the generated target multiviews (a sketch of this step follows the list):
$L_{\text{recon}} = L_2(n_i', \hat{n}_i') + L_2(\alpha_i, \hat{\alpha}_i) + \lambda L_{\text{smooth}}$
where $n_i'$ and $\hat{n}_i'$ are the generated and rendered normal maps, $\alpha_i$ and $\hat{\alpha}_i$ are the generated and rendered foreground masks, and $L_{\text{smooth}}$ is a Laplacian regularization to maintain mesh quality.
- Incremental Strategy: This approach is incremental. Instead of reconstructing the entire mesh from scratch, it starts from the previous mesh and iteratively refines it, primarily affecting vertices and faces in the edited region. The topology is adaptively refined through face splitting and merging during optimization.
- Texturing: Once the geometry is reconstructed, the generated color maps $\{p_i'\}_{i=1}^N$ are baked onto the mesh to produce the final textured 3D model.
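A minimal sketch of this reconstruction objective and one refinement step is given below, assuming PyTorch tensors, with the differentiable renderer and Laplacian regularizer supplied as callables (for example from nvdiffrast or PyTorch3D); the paper's actual optimizer additionally performs continuous remeshing with face splitting and merging, which is omitted here, and the `lambda_smooth` default is illustrative rather than the paper's value.

```python
def reconstruction_loss(gen_normals, ren_normals, gen_masks, ren_masks,
                        laplacian_term, lambda_smooth=0.1):
    """L_recon = L2 on normal maps + L2 on foreground masks
    + lambda * Laplacian smoothness."""
    l_normal = ((gen_normals - ren_normals) ** 2).mean()
    l_mask = ((gen_masks - ren_masks) ** 2).mean()
    return l_normal + l_mask + lambda_smooth * laplacian_term

def refine_step(verts, faces, renderer, laplacian_fn,
                gen_normals, gen_masks, optimizer, lambda_smooth=0.1):
    """One gradient step of the incremental refinement. `renderer` and
    `laplacian_fn` are assumed differentiable callables; `verts` is
    initialized from the previous mesh Gamma, so updates concentrate on
    the edited region."""
    optimizer.zero_grad()
    ren_normals, ren_masks = renderer(verts, faces)   # render current mesh state
    loss = reconstruction_loss(gen_normals, ren_normals, gen_masks, ren_masks,
                               laplacian_fn(verts, faces), lambda_smooth)
    loss.backward()
    optimizer.step()
    return loss.item()
```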
Practical Applications
CMD is applied to two main real-world scenarios:
- Local 3D Editing:
- Given an existing textured mesh, users can select a rendered view (e.g., the 0° frontal view), edit it using any standard 2D image editing tool (e.g., adding details, changing color, removing parts), and use this edited view as the target image $x$ (a sketch of the overall workflow follows this list).
- The original mesh's multiview renderings serve as conditions.
- CMD then generates consistent multiviews and reconstructs the edited mesh. This process is notably efficient, completing edits in approximately 20 seconds according to the paper, significantly faster than many SDS-based methods that take tens of minutes.
- This allows intuitive, image-based control over 3D geometry and texture modifications without needing manual 3D selection or explicit 3D guidance.
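As a rough picture of how the stages connect during an edit, the hypothetical glue function below strings them together; every callable, the number of views, and the argument layout are placeholder assumptions, with only the four-step flow and the 20 DDIM sampling steps taken from the paper's description.

```python
def edit_mesh(mesh, edited_view, render_fn, condmv_fn, reconstruct_fn, bake_fn,
              n_views=6, ddim_steps=20):
    """Hypothetical glue code for one local edit; the four callables stand in
    for the stages summarized above and are not the paper's actual API."""
    # 1. Render color + normal conditions from the existing mesh (Gamma)
    cond_colors, cond_normals = render_fn(mesh, n_views)
    # 2. CondMV forward pass: propagate the 2D edit to all predefined views
    new_colors, new_normals = condmv_fn(edited_view, cond_colors, cond_normals,
                                        ddim_steps)
    # 3. Incremental reconstruction starting from the original geometry
    edited_mesh = reconstruct_fn(mesh, new_colors, new_normals)
    # 4. Bake the generated color maps onto the refined mesh
    return bake_fn(edited_mesh, new_colors)
```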
- Progressive 3D Generation:
- CMD enables generating complex 3D models from a single image by breaking the task down into generating parts sequentially.
- Workflow: The process starts by segmenting the input single image into multiple parts (e.g., using a model like SAM).
- Generation proceeds step-by-step. The first part is generated from scratch (using white images as initial conditions). Subsequent parts are generated or added based on the previously generated/reconstructed parts, which serve as multiview conditions for the next step.
- Global Condition Implementation: To ensure spatial consistency and correct part placement/size across steps, CMD introduces a "Global Condition". The latent representation of the entire final target image is provided as an additional condition at each step of the progressive generation. This provides global context, preventing errors in part layout that could occur if only the current step's partial image was used. This is implemented by concatenating the latent features of the current step's image and the global target image and feeding them into the diffusion model.
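The global condition itself amounts to a small latent-space operation. The sketch below assumes a Stable-Diffusion-style VAE latent and channel-wise concatenation, which is one plausible reading of "concatenating the latent features"; the exact fusion used by the authors may differ.

```python
import torch

def build_progressive_condition(step_latent, global_latent):
    """Concatenate the current step's image latent with the latent of the full
    target image so the model sees the global part layout at every step.
    Channel-wise concatenation is an assumption about the exact fusion."""
    return torch.cat([step_latent, global_latent], dim=1)

# illustrative Stable-Diffusion-style latent shapes: (B, 4, 32, 32)
step_latent = torch.randn(1, 4, 32, 32)    # latent of the current part image
global_latent = torch.randn(1, 4, 32, 32)  # latent of the entire target image
cond = build_progressive_condition(step_latent, global_latent)
print(cond.shape)                          # torch.Size([1, 8, 32, 32])
```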
Implementation Considerations and Performance
- Training Data: The model is trained on a large dataset derived from Objaverse, augmented with part-level manipulation and object composition to teach the model about local changes and occlusion.
- Efficiency: A major practical advantage is the speed. The inference involves a single diffusion forward pass (20 DDIM steps) followed by fast incremental reconstruction (continuous remeshing), taking roughly 20 seconds total. This makes iterative editing much more feasible.
- Resource Requirements: Training requires significant resources (e.g., 4 H800 GPUs), common for large diffusion models. Inference is relatively efficient on standard hardware capable of running diffusion models.
- Limitations: The paper notes limitations, including reliance on external image editing tools (artifacts in the input edit can propagate) and the fact that the incremental reconstruction might not perfectly preserve the exact topology of the original mesh, even while maintaining geometric consistency. Future work could explore automatically localizing editable regions.
In summary, CMD provides a practical framework for controllable 3D content creation, offering efficient local editing guided by simple 2D image edits and enabling the generation of complex 3D structures progressively. Its key technical contributions lie in the conditional multiview diffusion model leveraging MVControlNet for image-based control and the incremental reconstruction strategy for efficient 3D updates.