ImageDream: Image-Prompt Multi-view Diffusion for 3D Generation (2312.02201v1)

Published 2 Dec 2023 in cs.CV

Abstract: We introduce "ImageDream," an innovative image-prompt, multi-view diffusion model for 3D object generation. ImageDream stands out for its ability to produce 3D models of higher quality compared to existing state-of-the-art, image-conditioned methods. Our approach utilizes a canonical camera coordination for the objects in images, improving visual geometry accuracy. The model is designed with various levels of control at each block inside the diffusion model based on the input image, where global control shapes the overall object layout and local control fine-tunes the image details. The effectiveness of ImageDream is demonstrated through extensive evaluations using a standard prompt list. For more information, visit our project page at https://Image-Dream.github.io.

References (53)
  1. stable-diffusion-xl-base-1.0. https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0. Accessed: 2023-08-29.
  2. Stable diffusion image variation. https://huggingface.co/spaces/lambdalabs/stable-diffusion-image-variations.
  3. Efficient geometry-aware 3d generative adversarial networks. In CVPR, 2022.
  4. GeNVS: Generative novel view synthesis with 3D-aware diffusion models. arXiv preprint, 2023.
  5. Shapenet: An information-rich 3d model repository. arXiv:1512.03012, 2015.
  6. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. arXiv:2303.13873, 2023.
  7. Objaverse-xl: A universe of 10m+ 3d objects. 2023a.
  8. Objaverse: A universe of annotated 3d objects. In CVPR, pages 13142–13153, 2023b.
  9. Gram: Generative radiance manifolds for 3d-aware image generation. In CVPR, pages 10673–10683, 2022.
  10. Get3d: A generative model of high quality 3d textured shapes learned from images. NeurIPS, 2022.
  11. Learning single-image 3d reconstruction by generative modelling of shape, pose and shading. International Journal of Computer Vision, 2020.
  12. Leveraging 2d data to learn textured 3d mesh generation. In CVPR, 2020.
  13. Dreamtime: An improved optimization strategy for text-to-3d content creation. arXiv:2306.12422, 2023.
  14. Shap-e: Generating conditional 3d implicit functions. arXiv:2305.02463, 2023.
  15. Holodiffusion: Training a 3d diffusion model using 2d images. In CVPR, 2023.
  16. Auto-encoding variational bayes. In ICLR, 2014.
  17. Magic3d: High-resolution text-to-3d content creation. In CVPR, 2023a.
  18. Common diffusion noise schedules and sample steps are flawed. arXiv:2305.08891, 2023b.
  19. Zero-1-to-3: Zero-shot one image to 3d object. arXiv:2303.11328, 2023a.
  20. Syncdreamer: Learning to generate multiview-consistent images from a single-view image. arXiv:2309.03453, 2023b.
  21. Wonder3d: Single image to 3d using cross-domain diffusion, 2023.
  22. Realfusion: 360deg reconstruction of any object from a single image. In CVPR, 2023.
  23. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2021.
  24. Hologan: Unsupervised learning of 3d representations from natural images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
  25. Blockgan: Learning 3d object-aware scene representations from unlabelled images. NeurIPS, 2020.
  26. Point-e: A system for generating 3d point clouds from complex prompts. arXiv:2212.08751, 2022.
  27. Giraffe: Representing scenes as compositional generative neural feature fields. In CVPR, 2021.
  28. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In CVPR, 2022.
  29. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv:2307.01952, 2023.
  30. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023.
  31. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv:2306.17843, 2023.
  32. Learning transferable visual models from natural language supervision. In ICML, 2021.
  33. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  34. Improved techniques for training gans. NeurIPS, 2016.
  35. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 2022.
  36. Zero123++: a single image to consistent multi-view diffusion base model. arXiv:2310.15110, 2023a.
  37. Mvdream: Multi-view diffusion for 3d generation. arXiv:2308.16512, 2023b.
  38. 3d neural field generation using triplane diffusion. In CVPR, 2023.
  39. Scene representation networks: Continuous 3d-structure-aware neural scene representations. NeurIPS, 32, 2019.
  40. Dreamcraft3d: Hierarchical 3d generation with bootstrapped diffusion prior, 2023.
  41. Viewset diffusion: (0-)image-conditioned 3d generative models from 2d data, 2023.
  42. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. arXiv:2303.14184, 2023.
  43. Textmesh: Generation of realistic 3d meshes from text prompts. arXiv:2304.12439, 2023.
  44. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR, 2023a.
  45. Rodin: A generative model for sculpting 3d digital avatars using diffusion. In CVPR, 2023b.
  46. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv:2305.16213, 2023c.
  47. Novel view synthesis with diffusion models. In ICLR, 2023.
  48. Multiview compressive coding for 3d reconstruction. In CVPR, 2023.
  49. Sinnerf: Training neural radiance fields on complex scenes from a single image. 2022a.
  50. Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360° views. 2022b.
  51. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv:2308.06721, 2023a.
  52. Consistent-1-to-3: Consistent image to 3d view synthesis via geometry-aware diffusion models. arXiv:2310.03020, 2023b.
  53. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In CVPR, 2023.
Authors (2)
  1. Peng Wang (832 papers)
  2. Yichun Shi (40 papers)
Citations (109)

Summary

ImageDream: Image-Prompt Multi-view Diffusion for 3D Generation

The paper "ImageDream: Image-Prompt Multi-view Diffusion for 3D Generation" investigates generating high-quality 3D models from a single image input. It introduces ImageDream, a multi-view diffusion model that emphasizes precise 3D geometry and addresses limitations of state-of-the-art (SoTA) image-conditioned approaches such as Magic123 and MVDream.

Methodological Advancements

ImageDream rests on three main components: canonical camera coordination, a multi-level image-prompt controller, and integration of the diffusion network with Neural Radiance Fields (NeRF). Its methodology includes:

  1. Canonical Camera Coordination: Unlike many existing models that rely on relative camera setups, ImageDream employs canonical coordination, simplifying the mapping from 2D images to 3D models and enhancing geometric accuracy.
  2. Multi-level Image-Prompt Controller: Global, local, and pixel controllers work in tandem to extract image features and inject them into the diffusion process at different granularities: the global controller manages the overall layout, while the local and pixel controllers preserve textural fidelity (an illustrative sketch follows this list).
  3. Integration with NeRF: Score Distillation Sampling (SDS) with multi-view consistency refines the 3D outputs, yielding robust and geometrically faithful 3D assets.
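
The sketch below illustrates, in broad strokes, how a multi-level image-prompt controller of this kind might be wired into a diffusion UNet block. It is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the class name, feature dimensions, and the split into a pooled global CLIP embedding plus per-token local CLIP features are illustrative choices.

```python
import torch
import torch.nn as nn


class ImagePromptBlock(nn.Module):
    """Illustrative UNet sub-block with global + local image-prompt control.

    global_feat: (B, d_img) pooled CLIP image embedding -> coarse layout control.
    local_feats: (B, n_tokens, d_img) CLIP token features -> fine-detail control
                 injected via cross-attention. Names and shapes are assumptions.
    """

    def __init__(self, dim: int, d_img: int = 768, n_heads: int = 8):
        super().__init__()
        self.global_proj = nn.Linear(d_img, dim)      # global controller
        self.kv_proj = nn.Linear(d_img, dim)          # local controller (keys/values)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, global_feat, local_feats):
        # x: (B, L, dim) latent tokens for one view at one UNet resolution.
        x = x + self.global_proj(global_feat).unsqueeze(1)       # shape overall layout
        kv = self.kv_proj(local_feats)
        attn_out, _ = self.cross_attn(self.norm1(x), kv, kv)     # fine-tune image details
        x = x + attn_out
        return x + self.ff(self.norm2(x))
```

For the NeRF refinement stage, the Score Distillation Sampling step can be sketched in the same hedged spirit. Here `nerf.render`, `add_noise`, `weight`, and the `mv_diffusion` call are placeholders for a differentiable renderer, the DDPM forward process, the timestep weighting w(t), and the frozen multi-view diffusion model; none of these names come from the paper.

```python
def sds_step(nerf, mv_diffusion, cameras, image_prompt):
    """One SDS update: render views, denoise with the frozen diffusion model,
    and form a surrogate loss whose gradient pushes the NeRF toward the prior."""
    views = nerf.render(cameras)                       # (B, C, H, W), requires grad
    t = torch.rand(views.shape[0], device=views.device) * 0.96 + 0.02
    noise = torch.randn_like(views)
    noisy = add_noise(views, noise, t)                 # DDPM forward process q(x_t | x_0)
    with torch.no_grad():
        eps_pred = mv_diffusion(noisy, t, cameras, image_prompt)
    grad = weight(t).view(-1, 1, 1, 1) * (eps_pred - noise)
    # Surrogate loss: its gradient w.r.t. the NeRF parameters equals the SDS gradient.
    return (grad.detach() * views).sum() / views.shape[0]
```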

Numerical and Qualitative Analysis

The paper delivers robust empirical evidence supporting ImageDream's superiority. Quantitative assessments, such as the Quality-only Inception Score (QIS) and CLIP scores, indicate substantial gains in image alignment and generation quality over its predecessors. Zero123-XL scores highest on certain metrics, but it was trained on a much larger dataset; ImageDream achieves comparably high scores with significantly less training data.
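
For concreteness, a CLIP-based image-alignment score of this kind can be computed as follows. This is a minimal sketch using the Hugging Face transformers CLIP API; the checkpoint choice and file names are assumptions, not details of the paper's evaluation setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Standard CLIP checkpoint (the specific checkpoint here is an assumption).
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_image_score(rendered_path: str, reference_path: str) -> float:
    """Cosine similarity between CLIP embeddings of a rendered view and the input image."""
    images = [Image.open(rendered_path), Image.open(reference_path)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float((feats[0] * feats[1]).sum())

# Usage (file names are placeholders):
# print(clip_image_score("rendered_view.png", "input_image.png"))
```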

Qualitative analyses through user studies corroborate these findings, illustrating that ImageDream is consistently favored over alternatives for generating geometrically accurate and visually appealing 3D models from images.

Implications and Future Work

ImageDream makes significant strides toward overcoming common pitfalls in 3D generation, such as multi-view inconsistency and lack of detail. Its enhanced ability to faithfully translate single 2D images into coherent 3D models opens up practical applications in industries like gaming, film, and virtual reality.

Future developments may focus on improving detailed texture synthesis and accommodating more variable inputs through enhanced training methodologies or novel architectural frameworks. Additionally, adapting the pipeline to recent base models such as SDXL could yield even finer results, indicative of the continuous evolution in AI-driven 3D content creation.

Conclusion

ImageDream stands out as a significant contribution in the domain of 3D generation from images, refining the integration of visual data into structured 3D outputs. The research sets a promising foundation for future explorations into image-prompt driven 3D creation, paving the way for more nuanced and realistic digital environments.