ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion (2310.10343v1)

Published 16 Oct 2023 in cs.CV

Abstract: Given a single image of a 3D object, this paper proposes a novel method (named ConsistNet) that is able to generate multiple images of the same object, as if they were captured from different viewpoints, while effectively exploiting the 3D (multi-view) consistencies among those generated images. Central to our method is a multi-view consistency block which enables information exchange across multiple single-view diffusion processes based on the underlying multi-view geometry principles. ConsistNet is an extension to the standard latent diffusion model, and consists of two sub-modules: (a) a view aggregation module that unprojects multi-view features into global 3D volumes and infers consistency, and (b) a ray aggregation module that samples and aggregates 3D-consistent features back to each view to enforce consistency. Our approach departs from previous methods in multi-view image generation, in that it can be easily dropped into pre-trained LDMs without requiring explicit pixel correspondences or depth prediction. Experiments show that our method effectively learns 3D consistency over a frozen Zero123 backbone and can generate 16 surrounding views of the object within 40 seconds on a single A100 GPU. Our code will be made available on https://github.com/JiayuYANG/ConsistNet

References (48)
  1. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
  2. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
  3. GeNVS: Generative novel view synthesis with 3D-aware diffusion models. In arXiv, 2023.
  4. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14124–14133, 2021.
  5. Explicit correspondence matching for generalizable neural radiance fields. arXiv preprint arXiv:2304.12294, 2023.
  6. Hierarchical integration diffusion model for realistic image deblurring. arXiv preprint arXiv:2305.12966, 2023.
  7. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.
  8. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  9. Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items, 2022.
  10. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
  11. Implicit diffusion models for continuous super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10021–10030, 2023.
  12. The lumigraph. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 453–464. 2023.
  13. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
  14. Delta denoising score. 2023.
  15. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  16. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  17. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  18. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  19. Light field rendering. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 441–452. 2023.
  20. Magicedit: High-fidelity and temporally coherent video editing. arXiv preprint arXiv:2308.14749, 2023.
  21. Zero-1-to-3: Zero-shot one image to 3d object. arXiv preprint arXiv:2303.11328, 2023.
  22. Syncdreamer: Learning to generate multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023.
  23. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  24. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  25. Dreamfusion: Text-to-3d using 2d diffusion, 2022.
  26. Volrecon: Volume rendering of signed ray distance functions for generalizable multi-view reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16685–16695, 2023.
  27. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  28. U-net: Convolutional networks for biomedical image segmentation, 2015.
  29. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 2022.
  30. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  31. Self-supervised visibility learning for novel view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9675–9684, 2021.
  32. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
  33. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  34. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
  35. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  36. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  37. Mvdiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion. arXiv preprint arXiv:2307.01097, 2023.
  38. Consistent view synthesis with pose-guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16773–16783, 2023.
  39. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
  40. Is attention all nerf needs? arXiv preprint arXiv:2207.13298, 2022.
  41. Ibrnet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2021.
  42. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  43. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
  44. Deblurring via stochastic refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16293–16303, 2022.
  45. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
  46. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  47. Inversion-based style transfer with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10146–10156, 2023.
  48. Sparse3d: Distilling multiview-consistent diffusion for object reconstruction from sparse views. arXiv preprint arXiv:2308.14078, 2023.
Authors (5)
  1. Jiayu Yang (32 papers)
  2. Ziang Cheng (10 papers)
  3. Yunfei Duan (3 papers)
  4. Pan Ji (53 papers)
  5. Hongdong Li (172 papers)
Citations (46)

Summary

Overview of ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion

The paper "ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion" addresses the challenge of generating 3D-consistent images of an object from a single input view using diffusion models. It introduces ConsistNet, a plug-in module that enforces 3D consistency in multi-view image generation without requiring explicit pixel correspondences or depth prediction. The model is evaluated with standard image-quality metrics, and its integration into existing pre-trained latent diffusion models (LDMs) demonstrates significant improvements in 3D consistency over prior methods.

Technical Contributions

The central contribution of the paper is the development of ConsistNet, a plug-in module that enhances the capabilities of existing diffusion models, such as Zero123, to produce 3D consistent multi-view images. The architecture of ConsistNet is built around two core sub-modules: the view aggregation module and the ray aggregation module.

  • View Aggregation Module: This module unprojects each viewpoint's feature maps into a shared 3D volume using inverse camera projection, encodes positional information, and applies multi-headed self-attention across the aggregated volumes so that information exchange follows the multi-view geometric constraints.
  • Ray Aggregation Module: This module samples the 3D-consistent features back along each viewpoint's rays and fuses them into the per-view features through cross-attention. It acts as a feedback path, re-projecting refined 3D-consistent features into each image's latent space and thereby influencing the subsequent denoising steps. A minimal sketch of both modules follows this list.
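
The following PyTorch-style sketch illustrates how such a two-stage consistency block could be organized. It is a minimal sketch under simplifying assumptions, not the authors' implementation: the class `ConsistNetBlock` and its methods `unproject` and `sample_rays` are hypothetical names, and the actual inverse camera projection and ray sampling are replaced with simple placeholders.

```python
import torch
import torch.nn as nn


class ConsistNetBlock(nn.Module):
    """Illustrative two-stage consistency block (not the authors' code).

    V = number of views, C = channels, H x W = latent feature map size.
    The real inverse camera projection and ray sampling are replaced by
    simple placeholders so the sketch stays self-contained.
    """

    def __init__(self, channels: int, heads: int = 8):
        super().__init__()
        # (a) Self-attention across features aggregated from all views.
        self.view_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # (b) Cross-attention from 3D-consistent ray samples back to each view.
        self.ray_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Zero-initialized output layer: the block starts as an identity
        # around the frozen backbone and only gradually learns to intervene.
        self.out_proj = nn.Linear(channels, channels)
        nn.init.zeros_(self.out_proj.weight)
        nn.init.zeros_(self.out_proj.bias)

    def unproject(self, feats, cameras):
        # Placeholder for inverse camera projection: the real module lifts
        # each view's features into a shared 3D volume using `cameras` and
        # positional encodings. Here we simply flatten pixels into tokens.
        V, C, H, W = feats.shape
        return feats.permute(0, 2, 3, 1).reshape(V, H * W, C)

    def sample_rays(self, volume, cameras, num_views):
        # Placeholder for ray sampling: the real module casts rays per view
        # and gathers consistent features along them. Here we broadcast the
        # fused tokens back to every view.
        return volume.expand(num_views, -1, -1)

    def forward(self, feats, cameras):
        """feats: (V, C, H, W) latent features, one slice per view."""
        V, C, H, W = feats.shape
        # View aggregation: put all views' tokens in one sequence so that
        # self-attention exchanges information across viewpoints.
        vox = self.unproject(feats, cameras).reshape(1, V * H * W, C)
        fused, _ = self.view_attn(vox, vox, vox)
        fused = fused.reshape(V, H * W, C).mean(dim=0, keepdim=True)  # global volume
        # Ray aggregation: pull 3D-consistent features back into each view
        # via cross-attention (pixel tokens query the ray samples).
        rays = self.sample_rays(fused, cameras, V)                    # (V, H*W, C)
        pix = feats.permute(0, 2, 3, 1).reshape(V, H * W, C)
        out, _ = self.ray_attn(pix, rays, rays)
        # Residual update; zero init keeps the pre-trained UNet output
        # unchanged at the start of training.
        delta = self.out_proj(out).reshape(V, H, W, C).permute(0, 3, 1, 2)
        return feats + delta
```

For example, with 8 views and 320-channel latents at 32x32 resolution, `ConsistNetBlock(320)(feats, cameras)` returns features of the same shape, so a block of this kind can be slotted between existing UNet layers of each view's diffusion process.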

ConsistNet is designed to be lightweight: its trainable weights are initialized to zero, so the plug-in can be trained quickly on top of existing networks without tuning the pre-trained components of backbone models such as Zero123, as the sketch below illustrates.
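
A minimal sketch of how such a plug-in could be trained against a frozen backbone, assuming the `ConsistNetBlock` from the previous sketch; the `unet` object stands in for the Zero123 UNet and is hypothetical, as is the choice of optimizer and hyperparameters.

```python
import torch


def make_plugin_optimizer(unet, consist_blocks, lr=1e-4):
    """Freeze the pre-trained backbone and train only the plug-in blocks."""
    for p in unet.parameters():
        p.requires_grad_(False)          # backbone stays fixed
    trainable = [p for blk in consist_blocks for p in blk.parameters()]
    # Because the blocks' output layers start at zero, the combined model
    # initially reproduces the frozen backbone exactly, which keeps early
    # training stable while the consistency blocks learn.
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=1e-2)
```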

Empirical Evaluation

The method's effectiveness is demonstrated through evaluations on the Objaverse and Google Scanned Objects datasets, using standard metrics (LPIPS, SSIM, and PSNR) across different elevation angles. The experiments show that ConsistNet outperforms conventional single-view diffusion baselines in perceptual and structural image quality, as measured by LPIPS (with both AlexNet and VGG backbones) and SSIM, indicating a stronger ability to maintain geometric consistency across multiple views.
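
As a concrete illustration of these metrics, the following sketch computes PSNR, SSIM, and LPIPS (with AlexNet and VGG backbones) between one generated view and its ground-truth rendering, using the `lpips` and `scikit-image` packages; it is a minimal example, not the authors' evaluation script.

```python
import numpy as np
import torch
import lpips                                      # pip install lpips
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

lpips_alex = lpips.LPIPS(net='alex')              # AlexNet backbone
lpips_vgg = lpips.LPIPS(net='vgg')                # VGG backbone


def evaluate_view(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: HxWx3 float arrays in [0, 1] for one viewpoint."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_t = lambda x: torch.from_numpy(x).float().permute(2, 0, 1)[None] * 2 - 1
    with torch.no_grad():
        lp_alex = lpips_alex(to_t(pred), to_t(gt)).item()
        lp_vgg = lpips_vgg(to_t(pred), to_t(gt)).item()
    return {"PSNR": psnr, "SSIM": ssim,
            "LPIPS(Alex)": lp_alex, "LPIPS(VGG)": lp_vgg}
```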

Comparative Analysis

Compared against the vanilla Zero123 baseline and related approaches such as DreamFusion combined with Zero123 and SyncDreamer, ConsistNet consistently performs better, with the gains most pronounced at low elevation angles (0 and 15 degrees). It maintains this performance while remaining efficient, generating multi-view images significantly faster than DreamFusion-style optimization.

Implications and Future Work

The practical and theoretical implications of ConsistNet are significant, especially for applications in virtual and augmented reality, where 3D consistency is vital. This method not only enhances the generation of 3D assets for these applications but also paves the way for more advanced research in image-based 3D reconstruction using diffusion models. The authors also suggest potential future research directions, such as further optimizations for computational efficiency and the integration of 3D mesh reconstruction capabilities during diffusion processes.

ConsistNet represents a methodical stride toward improving the quality and consistency of multi-view image generation and offers a scalable way to integrate 3D consistency into existing systems without overhauling pre-trained architectures. These advances set the stage for ongoing research on the visual coherence and realism of generated 3D environments.
