
Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of Video and Multi-view Diffusion Models (2404.02148v4)

Published 2 Apr 2024 in cs.CV

Abstract: Recent advancements in 3D generation are predominantly propelled by improvements in 3D-aware image diffusion models. These models are pretrained on Internet-scale image data and fine-tuned on massive 3D data, offering the capability of producing highly consistent multi-view images. However, due to the scarcity of synchronized multi-view video data, it remains challenging to adapt this paradigm directly to 4D generation. Nevertheless, the available video and 3D data are adequate for separately training video and multi-view diffusion models, which can provide satisfactory dynamic and geometric priors respectively. To take advantage of both, this paper presents Diffusion$^2$, a novel framework for dynamic 3D content creation that reconciles the knowledge of geometric consistency and temporal smoothness from these models to directly sample dense multi-view, multi-frame images, which can then be employed to optimize a continuous 4D representation. Specifically, we design a simple yet effective denoising strategy via score composition of pretrained video and multi-view diffusion models, based on the probability structure of the target image array. To alleviate potential conflicts between the two heterogeneous scores, we further introduce variance-reducing sampling via interpolated steps, facilitating smooth and stable generation. Owing to the high parallelism of the proposed image generation process and the efficiency of modern 4D reconstruction pipelines, our framework can generate 4D content within a few minutes. Notably, our method circumvents the reliance on expensive and hard-to-scale 4D data, and thereby stands to benefit from the scaling of foundation video and multi-view diffusion models. Extensive experiments demonstrate the efficacy of our proposed framework in generating highly seamless and consistent 4D assets under various types of conditions.
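The core idea of the denoising strategy can be sketched in a few lines: at each reverse-diffusion step, a video model supplies a temporal (per-view) noise estimate and a multi-view model supplies a geometric (per-frame) estimate, and the two are blended into a single update over the full frame-by-view image array. This is a minimal illustrative sketch, not the paper's exact formulation: the two score functions are hypothetical placeholders, and the weighted-sum combination and toy update rule stand in for the paper's probability-structure-based composition and sampler.

```python
import numpy as np

def video_score(x, t):
    # Hypothetical stand-in for a pretrained video diffusion model's noise
    # prediction, applied along the frame axis for each view (temporal prior).
    return np.zeros_like(x)

def multiview_score(x, t):
    # Hypothetical stand-in for a pretrained multi-view diffusion model's noise
    # prediction, applied along the view axis for each frame (geometric prior).
    return np.zeros_like(x)

def composed_denoising(x_T, timesteps, w_video=0.5, w_mv=0.5):
    """Blend the two heterogeneous scores at every denoising step.

    The weighted sum and the fixed step size below are illustrative
    assumptions; a real sampler would follow the model's noise schedule.
    """
    x = x_T
    for t in timesteps:
        eps = w_video * video_score(x, t) + w_mv * multiview_score(x, t)
        x = x - 0.1 * eps  # toy update in place of a DDIM/DDPM step
    return x

# The target is a dense image array indexed by (frame, view, H, W, C);
# all entries are denoised jointly, which is what makes the process parallel.
x0 = composed_denoising(np.random.randn(8, 4, 16, 16, 3), range(50, 0, -10))
print(x0.shape)  # (8, 4, 16, 16, 3)
```

The resulting multi-frame, multi-view array would then be fed to a 4D reconstruction pipeline (e.g. dynamic Gaussian splatting) to fit a continuous representation.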

Authors (4)
  1. Zeyu Yang (27 papers)
  2. Zijie Pan (14 papers)
  3. Chun Gu (15 papers)
  4. Li Zhang (690 papers)
Citations (5)