- The paper presents Zero123++, which generates 3D-consistent multi-view images from a single input image by tiling six views into one unified grid.
- It improves image quality by modifying the noise schedule and scaling reference attention to better align outputs with the input image.
- It adds global conditioning via FlexDiffuse and a depth ControlNet, improving generalization and controllability in out-of-domain scenarios.
The paper "Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model" presents a novel approach to generating 3D-consistent multi-view images from a single input image. It builds upon the existing Zero-1-to-3 model but addresses its limitations, particularly the inconsistency between generated views and the underutilization of Stable Diffusion's capabilities. The core idea is to train a diffusion model to generate a tiled image containing multiple views of the object from different angles, thus modeling the joint distribution of these views.
Here's a breakdown of the key contributions and techniques used:
- Multi-view Generation via Tiling: Instead of generating each view independently, Zero123++ arranges six views in a 3x2 grid within a single image, allowing the model to learn the correlations between viewpoints and enforce consistency (a minimal tiling sketch follows this list). The novel views use fixed absolute elevation angles and relative azimuth angles, which eliminates orientation ambiguity without requiring an extra elevation-estimation module.
- Improved Consistency and Stability through Noise Schedule Modification: The paper argues that Stable Diffusion's original scaled-linear noise schedule hinders the model's ability to adapt to global consistency requirements. It shows that a linear noise schedule, which spends more steps at low Signal-to-Noise Ratios (SNR), improves multi-view consistency (see the numerical comparison after this list). The paper also notes that the lower resolution used in Zero-1-to-3 can itself be interpreted as a noise-schedule modification. The authors build on the Stable Diffusion 2 v-prediction model because of its robustness to noise-schedule changes.
- Enhanced Local Conditioning with Scaled Reference Attention: To better incorporate information from the input image, the paper introduces a scaled version of Reference Attention. A noised latent of the input image is fed through the UNet, and the resulting key and value matrices from its self-attention layers guide the denoising of the target views (sketched after this list). Scaling the reference latent further improves consistency with the input image.
- Global Conditioning with FlexDiffuse: The paper employs a trainable variant of FlexDiffuse to integrate global image conditioning. Exploiting the alignment between CLIP's image and text embedding spaces, the CLIP image embedding is blended into the prompt embeddings (even when the prompt is empty), letting the model infer the global semantics of the object, especially for unseen regions (see the mix-in sketch below).
- Training Strategy: The model is trained on the Objaverse dataset with a phased schedule: first the self-attention layers and the key/value matrices of the cross-attention layers, then the full UNet at a very conservative learning rate. Min-SNR weighting is used to improve training efficiency (a sketch of the weighting follows this list).
- Depth ControlNet Integration: The paper also demonstrates the feasibility of training a ControlNet on top of Zero123++ for enhanced control over generation via depth maps (see the initialization sketch below).
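
To make the tiling concrete, here is a minimal sketch of composing six rendered views into the 3x2 grid; the 320x320 per-view resolution and the row-major view ordering are assumptions for illustration.

```python
from PIL import Image

VIEW_SIZE = 320  # per-view resolution; an assumption for illustration

def tile_views(views):
    """Arrange six view images into one 3-row x 2-column grid (row-major)."""
    assert len(views) == 6
    grid = Image.new("RGB", (2 * VIEW_SIZE, 3 * VIEW_SIZE))
    for i, view in enumerate(views):
        row, col = divmod(i, 2)
        grid.paste(view.resize((VIEW_SIZE, VIEW_SIZE)), (col * VIEW_SIZE, row * VIEW_SIZE))
    return grid
```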
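The SNR argument can be checked numerically. The sketch below compares Stable Diffusion's scaled-linear betas with a plain linear schedule and counts how many of the 1000 training steps fall below an (arbitrary) SNR threshold; the beta endpoints follow common Stable Diffusion defaults.

```python
import numpy as np

T = 1000
beta_start, beta_end = 0.00085, 0.012  # common Stable Diffusion defaults

# Stable Diffusion's scaled-linear schedule vs. a plain linear schedule.
betas_scaled_linear = np.linspace(beta_start**0.5, beta_end**0.5, T) ** 2
betas_linear = np.linspace(beta_start, beta_end, T)

def snr(betas):
    """Per-timestep signal-to-noise ratio: alpha_bar / (1 - alpha_bar)."""
    alpha_bar = np.cumprod(1.0 - betas)
    return alpha_bar / (1.0 - alpha_bar)

# The linear schedule spends noticeably more steps in the low-SNR regime,
# where global structure (and thus cross-view consistency) is decided.
for name, betas in [("scaled_linear", betas_scaled_linear), ("linear", betas_linear)]:
    print(f"{name}: {(snr(betas) < 0.1).sum()} of {T} steps have SNR < 0.1")
```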
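For the reference attention, here is a sketch of the attention step, assuming the reference latent has already been noised, scaled, and passed through the UNet once to cache its self-attention keys and values; the shapes and head count are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def scaled_reference_attention(q, k, v, k_ref, v_ref, num_heads=8):
    """Self-attention over the denoising latent that additionally attends to
    key/value tokens cached from a UNet pass over the (noised, scaled)
    reference latent. Shapes: q, k, v are (B, N, D); k_ref, v_ref are (B, M, D).
    """
    b, n, d = q.shape
    k = torch.cat([k, k_ref], dim=1)  # append reference keys
    v = torch.cat([v, v_ref], dim=1)  # append reference values

    def heads(x):  # (B, T, D) -> (B, H, T, D // H)
        return x.view(b, -1, num_heads, d // num_heads).transpose(1, 2)

    out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
    return out.transpose(1, 2).reshape(b, n, d)
```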
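For the global conditioning, a sketch of a trainable FlexDiffuse-style mix-in, where a learned per-token weight blends the CLIP image embedding into the prompt embeddings; the linear-ramp initialization mirrors FlexDiffuse's global weighting, though the exact initialization used in the paper is an assumption here.

```python
import torch
import torch.nn as nn

class GlobalConditioning(nn.Module):
    """Trainable FlexDiffuse-style mixing of the CLIP image embedding into
    the (possibly empty) prompt embeddings."""

    def __init__(self, seq_len=77):
        super().__init__()
        # Per-token mixing weights, initialized to a linear ramp following
        # FlexDiffuse's global weighting (the exact init is an assumption).
        self.weights = nn.Parameter(torch.linspace(0.0, 1.0, seq_len))

    def forward(self, prompt_embeds, image_embed):
        # prompt_embeds: (B, L, D) text-encoder output; image_embed: (B, D)
        # CLIP image embedding projected to the same dimensionality.
        return prompt_embeds + self.weights[None, :, None] * image_embed[:, None, :]
```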
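Min-SNR weighting clamps the per-timestep loss weight so that easy high-SNR steps do not dominate training. A minimal sketch, assuming the v-prediction variant (dividing by SNR + 1) with the commonly used gamma = 5:

```python
import torch

def min_snr_loss_weight(timesteps, alphas_cumprod, gamma=5.0):
    """Min-SNR-gamma per-timestep loss weights. gamma = 5 follows the
    Min-SNR paper; dividing by (SNR + 1) is the form commonly paired with
    v-prediction models, which is an assumption here."""
    alpha_bar = alphas_cumprod[timesteps]
    snr = alpha_bar / (1.0 - alpha_bar)
    return torch.clamp(snr, max=gamma) / (snr + 1.0)
```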
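Finally, a sketch of how a depth ControlNet branch could be initialized from the base UNet using the diffusers API; the checkpoint path is hypothetical and the paper's actual training setup may differ.

```python
from diffusers import ControlNetModel, UNet2DConditionModel

# Hypothetical checkpoint path; the real Zero123++ weights/layout may differ.
unet = UNet2DConditionModel.from_pretrained("path/to/zero123plusplus", subfolder="unet")

# Initialize the ControlNet branch from the base UNet's weights. Training
# would then condition it on depth maps of the six target views, tiled into
# the same 3x2 layout as the RGB output, while the base UNet stays frozen.
controlnet = ControlNetModel.from_unet(unet)
```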
The paper presents qualitative and quantitative results showing that Zero123++ outperforms existing methods such as Zero-1-to-3 and SyncDreamer in multi-view consistency and image quality. It also generalizes to out-of-domain inputs such as AI-generated images and 2D illustrations. The text-to-multi-view results, which use SDXL to generate the initial image, also compare favorably with MVDream and Zero-1-to-3 XL.
Finally, the paper outlines potential future work, including the use of a two-stage generate-refine pipeline (leveraging an epsilon-parameterized model for refinement), scaling up training to larger datasets like Objaverse-XL, and exploring the use of Zero123++ for mesh reconstruction.