ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion (2310.10343v1)

Published 16 Oct 2023 in cs.CV

Abstract: Given a single image of a 3D object, this paper proposes a novel method (named ConsistNet) that is able to generate multiple images of the same object, as if they were captured from different viewpoints, while effectively exploiting the 3D (multi-view) consistencies among those generated images. Central to our method is a multi-view consistency block which enables information exchange across multiple single-view diffusion processes based on the underlying multi-view geometry principles. ConsistNet is an extension to the standard latent diffusion model, and consists of two sub-modules: (a) a view aggregation module that unprojects multi-view features into global 3D volumes and infers consistency, and (b) a ray aggregation module that samples and aggregates 3D-consistent features back to each view to enforce consistency. Our approach departs from previous methods in multi-view image generation, in that it can be easily dropped into pre-trained LDMs without requiring explicit pixel correspondences or depth prediction. Experiments show that our method effectively learns 3D consistency over a frozen Zero123 backbone and can generate 16 surrounding views of the object within 40 seconds on a single A100 GPU. Our code will be made available on https://github.com/JiayuYANG/ConsistNet

References (48)
  1. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
  2. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
  3. GeNVS: Generative novel view synthesis with 3D-aware diffusion models. In arXiv, 2023.
  4. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14124–14133, 2021.
  5. Explicit correspondence matching for generalizable neural radiance fields. arXiv preprint arXiv:2304.12294, 2023.
  6. Hierarchical integration diffusion model for realistic image deblurring. arXiv preprint arXiv:2305.12966, 2023.
  7. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.
  8. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  9. Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items, 2022.
  10. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
  11. Implicit diffusion models for continuous super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10021–10030, 2023.
  12. The lumigraph. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 453–464. 2023.
  13. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
  14. Delta denoising score. 2023.
  15. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  16. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  17. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  18. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  19. Light field rendering. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 441–452. 2023.
  20. Magicedit: High-fidelity and temporally coherent video editing. arXiv preprint arXiv:2308.14749, 2023.
  21. Zero-1-to-3: Zero-shot one image to 3d object. arXiv preprint arXiv:2303.11328, 2023.
  22. Syncdreamer: Learning to generate multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023.
  23. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  24. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  25. Dreamfusion: Text-to-3d using 2d diffusion, 2022.
  26. Volrecon: Volume rendering of signed ray distance functions for generalizable multi-view reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16685–16695, 2023.
  27. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  28. U-net: Convolutional networks for biomedical image segmentation, 2015.
  29. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 2022.
  30. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  31. Self-supervised visibility learning for novel view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9675–9684, 2021.
  32. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
  33. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  34. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
  35. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  36. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  37. Mvdiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion. arXiv preprint arXiv:2307.01097, 2023.
  38. Consistent view synthesis with pose-guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16773–16783, 2023.
  39. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
  40. Is attention all nerf needs? arXiv preprint arXiv:2207.13298, 2022.
  41. Ibrnet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2021.
  42. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  43. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
  44. Deblurring via stochastic refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16293–16303, 2022.
  45. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
  46. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  47. Inversion-based style transfer with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10146–10156, 2023.
  48. Sparse3d: Distilling multiview-consistent diffusion for object reconstruction from sparse views. arXiv preprint arXiv:2308.14078, 2023.
Authors (5)
  1. Jiayu Yang (32 papers)
  2. Ziang Cheng (10 papers)
  3. Yunfei Duan (3 papers)
  4. Pan Ji (53 papers)
  5. Hongdong Li (172 papers)
Citations (46)

Summary

Overview of ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion

The paper "ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion" addresses the challenge of generating 3D-consistent images of an object from a single input view using diffusion models. It introduces ConsistNet, a plug-in module that enforces 3D consistency in multi-view image generation without requiring explicit pixel correspondences or depth prediction. The model is evaluated with standard image-quality metrics, and its integration into existing pre-trained latent diffusion models (LDMs) demonstrates significant improvements in 3D consistency over prior methods.

Technical Contributions

The central contribution of the paper is the development of ConsistNet, a plug-in module that enhances the capabilities of existing diffusion models, such as Zero123, to produce 3D consistent multi-view images. The architecture of ConsistNet is built around two core sub-modules: the view aggregation module and the ray aggregation module.

  • View Aggregation Module: This module unprojects each viewpoint's feature maps into a shared 3D volume using inverse camera projection, encodes positional information, and applies multi-headed self-attention across the aggregated volumes so that information exchange follows the multi-view geometric constraints.
  • Ray Aggregation Module: This module samples the 3D-consistent features back along each viewpoint's rays and fuses them into the per-view features through cross-attention. It acts as a feedback path, re-projecting refined 3D-consistent features into each image's latent space and thereby influencing the subsequent denoising steps. A minimal sketch of both modules follows this list.
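
The following PyTorch-style sketch illustrates how such a two-stage consistency block could be organized. It is a minimal sketch under simplifying assumptions, not the authors' implementation: the class `ConsistNetBlock` and its methods `unproject` and `sample_rays` are hypothetical names, and the actual inverse camera projection and ray sampling are replaced with simple placeholders.

```python
import torch
import torch.nn as nn


class ConsistNetBlock(nn.Module):
    """Illustrative two-stage consistency block (not the authors' code).

    V = number of views, C = channels, H x W = latent feature map size.
    The real inverse camera projection and ray sampling are replaced by
    simple placeholders so the sketch stays self-contained.
    """

    def __init__(self, channels: int, heads: int = 8):
        super().__init__()
        # (a) Self-attention across features aggregated from all views.
        self.view_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # (b) Cross-attention from 3D-consistent ray samples back to each view.
        self.ray_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Zero-initialized output layer: the block starts as an identity
        # around the frozen backbone and only gradually learns to intervene.
        self.out_proj = nn.Linear(channels, channels)
        nn.init.zeros_(self.out_proj.weight)
        nn.init.zeros_(self.out_proj.bias)

    def unproject(self, feats, cameras):
        # Placeholder for inverse camera projection: the real module lifts
        # each view's features into a shared 3D volume using `cameras` and
        # positional encodings. Here we simply flatten pixels into tokens.
        V, C, H, W = feats.shape
        return feats.permute(0, 2, 3, 1).reshape(V, H * W, C)

    def sample_rays(self, volume, cameras, num_views):
        # Placeholder for ray sampling: the real module casts rays per view
        # and gathers consistent features along them. Here we broadcast the
        # fused tokens back to every view.
        return volume.expand(num_views, -1, -1)

    def forward(self, feats, cameras):
        """feats: (V, C, H, W) latent features, one slice per view."""
        V, C, H, W = feats.shape
        # View aggregation: put all views' tokens in one sequence so that
        # self-attention exchanges information across viewpoints.
        vox = self.unproject(feats, cameras).reshape(1, V * H * W, C)
        fused, _ = self.view_attn(vox, vox, vox)
        fused = fused.reshape(V, H * W, C).mean(dim=0, keepdim=True)  # global volume
        # Ray aggregation: pull 3D-consistent features back into each view
        # via cross-attention (pixel tokens query the ray samples).
        rays = self.sample_rays(fused, cameras, V)                    # (V, H*W, C)
        pix = feats.permute(0, 2, 3, 1).reshape(V, H * W, C)
        out, _ = self.ray_attn(pix, rays, rays)
        # Residual update; zero init keeps the pre-trained UNet output
        # unchanged at the start of training.
        delta = self.out_proj(out).reshape(V, H, W, C).permute(0, 3, 1, 2)
        return feats + delta
```

For example, with 8 views and 320-channel latents at 32x32 resolution, `ConsistNetBlock(320)(feats, cameras)` returns features of the same shape, so a block of this kind can be slotted between existing UNet layers of each view's diffusion process.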

ConsistNet is designed to be lightweight: its trainable weights are initialized to zero, so the plug-in can be trained quickly on top of existing networks without tuning the pre-trained components of backbone models such as Zero123, as the sketch below illustrates.
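
A minimal sketch of how such a plug-in could be trained against a frozen backbone, assuming the `ConsistNetBlock` from the previous sketch; the `unet` object stands in for the Zero123 UNet and is hypothetical, as is the choice of optimizer and hyperparameters.

```python
import torch


def make_plugin_optimizer(unet, consist_blocks, lr=1e-4):
    """Freeze the pre-trained backbone and train only the plug-in blocks."""
    for p in unet.parameters():
        p.requires_grad_(False)          # backbone stays fixed
    trainable = [p for blk in consist_blocks for p in blk.parameters()]
    # Because the blocks' output layers start at zero, the combined model
    # initially reproduces the frozen backbone exactly, which keeps early
    # training stable while the consistency blocks learn.
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=1e-2)
```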

Empirical Evaluation

The method's effectiveness is demonstrated through evaluations on the Objaverse and Google Scanned Objects datasets, using standard metrics (LPIPS, SSIM, and PSNR) across different elevation angles. The experiments show that ConsistNet outperforms conventional single-view diffusion baselines in perceptual and structural image quality, as measured by LPIPS (with both AlexNet and VGG backbones) and SSIM, indicating a stronger ability to maintain geometric consistency across multiple views.
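
As a concrete illustration of these metrics, the following sketch computes PSNR, SSIM, and LPIPS (with AlexNet and VGG backbones) between one generated view and its ground-truth rendering, using the `lpips` and `scikit-image` packages; it is a minimal example, not the authors' evaluation script.

```python
import numpy as np
import torch
import lpips                                      # pip install lpips
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

lpips_alex = lpips.LPIPS(net='alex')              # AlexNet backbone
lpips_vgg = lpips.LPIPS(net='vgg')                # VGG backbone


def evaluate_view(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: HxWx3 float arrays in [0, 1] for one viewpoint."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_t = lambda x: torch.from_numpy(x).float().permute(2, 0, 1)[None] * 2 - 1
    with torch.no_grad():
        lp_alex = lpips_alex(to_t(pred), to_t(gt)).item()
        lp_vgg = lpips_vgg(to_t(pred), to_t(gt)).item()
    return {"PSNR": psnr, "SSIM": ssim,
            "LPIPS(Alex)": lp_alex, "LPIPS(VGG)": lp_vgg}
```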

Comparative Analysis

Compared against the vanilla Zero123 baseline and related approaches such as DreamFusion combined with Zero123 and SyncDreamer, ConsistNet consistently performs better, with the gains most pronounced at low elevation angles (0 and 15 degrees). It maintains this performance while remaining efficient, generating multi-view images significantly faster than DreamFusion-style optimization.

Implications and Future Work

The practical and theoretical implications of ConsistNet are significant, especially for applications in virtual and augmented reality, where 3D consistency is vital. This method not only enhances the generation of 3D assets for these applications but also paves the way for more advanced research in image-based 3D reconstruction using diffusion models. The authors also suggest potential future research directions, such as further optimizations for computational efficiency and the integration of 3D mesh reconstruction capabilities during diffusion processes.

ConsistNet represents a methodical stride toward improving the quality and consistency of multi-view image generation and offers a scalable way to integrate 3D consistency into existing systems without overhauling pre-trained architectures. These advances set the stage for ongoing research on the visual coherence and realism of generated 3D environments.
