Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models (2312.13763v2)

Published 21 Dec 2023 in cs.CV and cs.LG

Abstract: Text-guided diffusion models have revolutionized image and video generation and have also been successfully used for optimization-based 3D object synthesis. Here, we instead focus on the underexplored text-to-4D setting and synthesize dynamic, animated 3D objects using score distillation methods with an additional temporal dimension. Compared to previous work, we pursue a novel compositional generation-based approach, and combine text-to-image, text-to-video, and 3D-aware multiview diffusion models to provide feedback during 4D object optimization, thereby simultaneously enforcing temporal consistency, high-quality visual appearance and realistic geometry. Our method, called Align Your Gaussians (AYG), leverages dynamic 3D Gaussian Splatting with deformation fields as 4D representation. Crucial to AYG is a novel method to regularize the distribution of the moving 3D Gaussians and thereby stabilize the optimization and induce motion. We also propose a motion amplification mechanism as well as a new autoregressive synthesis scheme to generate and combine multiple 4D sequences for longer generation. These techniques allow us to synthesize vivid dynamic scenes, outperform previous work qualitatively and quantitatively and achieve state-of-the-art text-to-4D performance. Due to the Gaussian 4D representation, different 4D animations can be seamlessly combined, as we demonstrate. AYG opens up promising avenues for animation, simulation and digital content creation as well as synthetic data generation.

References (110)
  1. Latent-shift: Latent diffusion with temporal shift for efficient text-to-video generation. arXiv preprint arXiv:2304.08477, 2023.
  2. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  3. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. arXiv preprint arXiv:2311.17984, 2023.
  4. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  5. ediff-i: Text-to-image diffusion models with ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
  6. GAUDI: A neural architect for immersive 3d scene generation. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  7. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  8. Neural surface reconstruction of dynamic scenes with monocular rgb-d camera. In Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS), 2022.
  9. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  10. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023a.
  11. It3d: Improved text-to-3d generation with explicit view synthesis. arXiv preprint arXiv:2308.11473, 2023b.
  12. Text-to-3d using gaussian splatting. arXiv preprint arXiv:2309.16585, 2023c.
  13. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv preprint arXiv:2311.13384, 2023.
  14. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023.
  15. Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023a.
  16. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023b.
  17. Nerdi: Single-view nerf synthesis with language-guided diffusion as general image priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  18. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, 2021.
  19. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. In Proceedings of the 40th International Conference on Machine Learning, 2023.
  20. Metadreamer: Efficient text-to-3d creation with disentangling geometry and texture. arXiv preprint arXiv:2311.10123, 2023a.
  21. Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023b.
  22. Preserve your own correlation: A noise prior for video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  23. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
  24. Matryoshka diffusion models. arXiv preprint arXiv:2310.15111, 2023.
  25. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
  26. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  27. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  28. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  29. Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400, 2023.
  30. Simple diffusion: End-to-end diffusion for high resolution images. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.
  31. Dreamtime: An improved optimization strategy for text-to-3d content creation. arXiv preprint arXiv:2306.12422, 2023.
  32. Zero-shot text-guided object generation with dream fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  33. Consistent4d: Consistent 360° dynamic object generation from monocular video. arXiv preprint arXiv:2311.02848, 2023.
  34. Tetrahedral diffusion models for 3d shape generation. arXiv preprint arXiv:2211.13220, 2022.
  35. Noise-free score distillation. arXiv preprint arXiv:2310.17590, 2023.
  36. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
  37. Isaac Kerlow. The Art of 3D Computer Animation and Effects. Wiley Publishing, 4th edition, 2009.
  38. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.
  39. Neuralfield-ldm: Scene generation with hierarchical latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  40. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  41. One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d diffusion. arXiv preprint arXiv:2311.07885, 2023a.
  42. Compositional visual generation with composable diffusion models. In Computer Vision – ECCV 2022, 2022.
  43. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023b.
  44. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023c.
  45. Meshdiffusion: Score-based generative 3d mesh modeling. In International Conference on Learning Representations (ICLR), 2023d.
  46. Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint arXiv:2310.15008, 2023.
  47. Att3d: Amortized text-to-3d object synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  48. An iterative image registration technique with an application to stereo vision. In IJCAI’81: 7th international joint conference on Artificial intelligence, 1981.
  49. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713, 2023.
  50. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  51. Latent-nerf for shape-guided generation of 3d shapes and textures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  52. ResFields: Residual neural fields for spatiotemporal signals. arXiv preprint arXiv:2309.03160, 2023.
  53. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  54. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
  55. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
  56. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.
  57. Benchmark for compositional text-to-image synthesis. In NeurIPS Datasets and Benchmarks, 2021a.
  58. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021b.
  59. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph., 40(6), 2021c.
  60. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  61. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representations (ICLR), 2023.
  62. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  63. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843, 2023.
  64. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.
  65. Dreambooth3d: Subject-driven text-to-3d generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  66. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  67. Trace and pace: Controllable pedestrian animation via guided trajectory diffusion. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  68. Sticking the landing: Simple, lower-variance gradient estimators for variational inference. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  69. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  70. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  71. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  72. Laion-5b: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  73. Wildfusion: Learning 3d-aware latent diffusion models in view space. arXiv preprint arXiv:2311.13570, 2023.
  74. Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023a.
  75. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023b.
  76. 3d neural field generation using triplane diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  77. Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations (ICLR), 2023a.
  78. Text-to-4d dynamic scene generation. In Proceedings of the 40th International Conference on Machine Learning, 2023b.
  79. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning (ICML), 2015.
  80. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021.
  81. Dreamcraft3d: Hierarchical 3d generation with bootstrapped diffusion prior. arXiv preprint arXiv:2310.16818, 2023.
  82. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023.
  83. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  84. Textmesh: Generation of realistic 3d meshes from text prompts. In International conference on 3D vision (3DV), 2024.
  85. Score-based generative modeling in latent space. In Neural Information Processing Systems (NeurIPS), 2021.
  86. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  87. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023a.
  88. Rodin: A generative model for sculpting 3d digital avatars using diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023b.
  89. Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation. arXiv preprint arXiv:2305.10874, 2023c.
  90. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023d.
  91. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023e.
  92. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023a.
  93. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023b.
  94. Physgaussian: Physics-integrated 3d gaussians for generative dynamics. arXiv preprint arXiv:2311.12198, 2023.
  95. Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360° views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023a.
  96. Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model. arXiv preprint arXiv:2311.09217, 2023b.
  97. Raphael: Text-to-image generation via large mixture of diffusion paths. arXiv preprint arXiv:2305.18295, 2023.
  98. Banmo: Building animatable 3d neural models from many casual videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  99. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101, 2023.
  100. Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. arXiv preprint arXiv:2310.08529, 2023.
  101. Text-to-3d with classifier score distillation. arXiv preprint arXiv:2310.19415, 2023.
  102. Physdiff: Physics-guided human motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  103. Lion: Latent point diffusion models for 3d shape generation. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  104. Animate124: Animating one image to 4d dynamic scene. arXiv preprint arXiv:2311.14603, 2023.
  105. A unified approach for text- and image-guided 4d scene generation. arXiv preprint arXiv:2311.16854, 2023.
  106. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2023.
  107. 3d shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  108. Hifa: High-fidelity text-to-3d with advanced diffusion guidance. arXiv preprint arXiv:2305.18766, 2023.
  109. Drivable 3d gaussian avatars. arXiv preprint arXiv:2311.08581, 2023.
  110. Ewa volume splatting. In Proceedings Visualization (VIS ’01), 2001.
Authors (5)
  1. Huan Ling
  2. Seung Wook Kim
  3. Antonio Torralba
  4. Sanja Fidler
  5. Karsten Kreis
Citations (84)

Summary

Align Your Gaussians: Advances in Text-to-4D Dynamic Scene Synthesis

The paper "Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models" presents an innovative approach to generating dynamic 4D content, significantly broadening the applicability of text-guided diffusion models. This research utilizes dynamic 3D Gaussian splatting as the cornerstone of its 4D representation, overlaying it with deformation fields to encapsulate temporal dynamics, thus effectively adding a fourth dimension to previously static 3D models. The resulting method, named Align Your Gaussians (AYG), capitalizes on the potential of compositional generation, blending feedback from text-to-image, text-to-video, and 3D-aware multiview diffusion models to synthesize realistic and temporally coherent 4D dynamic scenes.

Fundamentally, the paper shows how combining these models yields a more capable generative pipeline than any single prior alone. The multiview image diffusion model, MVDream, serves as a prior for optimizing the static 3D assets that initialize the 4D synthesis. The dynamic component leverages a newly trained text-to-video diffusion model with frame-rate conditioning, allowing AYG to produce smooth, diverse, and contextually rich animations. Throughout the process, the system draws on the respective strengths of the employed models: the video model enforces temporally coherent motion, while the image model maintains high visual fidelity in each frame.
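
The compositional feedback can be pictured as a weighted blend of score-distillation gradients, one per diffusion prior. The sketch below is schematic: the priors are placeholder callables, whereas a real prior would noise the rendering, predict the noise, and return the resulting score-distillation signal.

```python
# Schematic sketch of compositional score distillation: query each diffusion
# prior for an SDS-style gradient on the rendered frames and blend the
# results. The priors and weights here are illustrative placeholders.
from typing import Callable, Sequence
import torch

def composed_sds_grad(
    renders: torch.Tensor,                                  # (frames, C, H, W)
    priors: Sequence[Callable[[torch.Tensor], torch.Tensor]],
    weights: Sequence[float],
) -> torch.Tensor:
    """Weighted sum of per-prior SDS gradients w.r.t. the rendered frames."""
    grad = torch.zeros_like(renders)
    for prior, w in zip(priors, weights):
        grad = grad + w * prior(renders)
    return grad

# Dummy stand-ins for the text-to-image and text-to-video priors; a real
# prior would noise the input, denoise it, and return (eps_hat - eps).
image_prior = lambda x: torch.randn_like(x)
video_prior = lambda x: torch.randn_like(x)

frames = torch.randn(8, 3, 64, 64, requires_grad=True)
g = composed_sds_grad(frames, [image_prior, video_prior], [0.3, 0.7])
# In a full pipeline `frames` comes from a differentiable renderer, so this
# backpropagates the blended gradient into the 4D Gaussian parameters.
frames.backward(g)
```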

In evaluations, AYG outperforms existing methods both qualitatively and quantitatively, as evidenced by user studies and detailed comparisons against MAV3D, the previous state of the art in text-to-4D synthesis. Its autoregressive scheme extends animations beyond the temporal limits of baseline approaches, and its dynamic 3D Gaussian representation makes different 4D animations easy to compose into larger digital scenes.

Several core innovations merit attention. First, the motion amplification mechanism boosts scene motion while maintaining stability during optimization. Combined with a JSD-based regularization scheme that controls the distribution of the moving 3D Gaussians throughout the deformation, it yields clearer and more vibrant animations: the regularizer suppresses global translations and scale changes, steering the model toward realistic local motion. Finally, the proposed autoregressive synthesis scheme extends 4D sequences, enabling prolonged animations with changing text prompts, a first in the literature.
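
One way to picture the regularization idea is the simplified sketch below: fit a single Gaussian (mean and covariance) to the undeformed and deformed sets of Gaussian centers and penalize drift between them, which discourages global translations and rescaling while leaving local motion free. This is a plausible simplification under stated assumptions, not the paper's exact JSD-based formulation.

```python
# Simplified sketch of a distribution regularizer in the spirit of AYG:
# summarize each set of Gaussian centers by its mean and covariance and
# penalize drift under deformation. This discourages whole-object
# translations and rescaling while permitting local, motion-inducing
# rearrangements. Not the paper's exact JSD-based formulation.
import torch

def moments(points: torch.Tensor):
    """Mean and covariance of an (N, 3) point cloud."""
    mu = points.mean(dim=0)
    centered = points - mu
    cov = centered.T @ centered / (points.shape[0] - 1)
    return mu, cov

def distribution_drift(static_pts: torch.Tensor,
                       deformed_pts: torch.Tensor) -> torch.Tensor:
    mu0, cov0 = moments(static_pts)
    mu1, cov1 = moments(deformed_pts)
    return (mu1 - mu0).pow(2).sum() + (cov1 - cov0).pow(2).sum()

# Usage: a small, local deformation incurs only a small penalty.
static = torch.randn(10_000, 3)
deformed = static + 0.01 * torch.randn_like(static)
reg = distribution_drift(static, deformed)   # added to the training loss
```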

From a theoretical perspective, AYG broadens the scope of score distillation sampling by using multiple text-conditioned diffusion models simultaneously. The authors argue convincingly for this compositional use of priors and validate it with state-of-the-art results on text-to-4D generation tasks. From a practical standpoint, the flexibility and quality of AYG's generated animations have substantial implications for digital content creation, particularly for assembling complex virtual scenes and for generating synthetic data for machine learning.
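
For reference, the standard score distillation sampling gradient from DreamFusion, which AYG extends by composing several such terms (one per diffusion prior), reads:

```latex
% SDS gradient for a rendering x = g(\theta) with noised version x_t, text
% prompt y, noise prediction \hat{\epsilon}_\phi, and weighting w(t).
% AYG composes several such terms, one per diffusion prior.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]
```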

Looking forward, researchers and developers could extend this work by addressing its current limitations, for example by accommodating topological changes in the synthesized 3D shapes or by moving beyond object-centric synthesis toward full dynamic scene generation. Future research might also integrate image-to-3D techniques to bring personalized 3D representations into AYG's dynamic synthesis pipeline, potentially expanding the system's generative capabilities into personalized virtual and augmented reality environments.

In conclusion, "Align Your Gaussians" advances the frontiers of text-driven 4D content generation, proposing innovations that significantly enhance the robustness and realism of animated digital scenes. Its creative yet analytically rigorous use of dynamic 3D Gaussian-based representations, along with meticulously composed diffusion models, represents a substantial stride in AI-driven generative modeling.
