4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency (2312.17225v3)

Published 28 Dec 2023 in cs.CV

Abstract: Aided by text-to-image and text-to-video diffusion models, existing 4D content creation pipelines utilize score distillation sampling to optimize the entire dynamic 3D scene. However, as these pipelines generate 4D content from text or image inputs directly, they are constrained by limited motion capabilities and depend on unreliable prompt engineering for desired results. To address these problems, this work introduces 4DGen, a novel framework for grounded 4D content creation. We identify monocular video sequences as a key component in constructing the 4D content. Our pipeline facilitates controllable 4D generation, enabling users to specify the motion via monocular video or adopt image-to-video generations, thus offering superior control over content creation. Furthermore, we construct our 4D representation using dynamic 3D Gaussians, which permits efficient, high-resolution supervision through rendering during training, thereby facilitating high-quality 4D generation. Additionally, we employ spatial-temporal pseudo labels on anchor frames, along with seamless consistency priors implemented through 3D-aware score distillation sampling and smoothness regularizations. Compared to existing video-to-4D baselines, our approach yields superior results in faithfully reconstructing input signals and realistically inferring renderings from novel viewpoints and timesteps. More importantly, compared to previous image-to-4D and text-to-4D works, 4DGen supports grounded generation, offering users enhanced control and improved motion generation capabilities, a feature difficult to achieve with previous methods. Project page: https://vita-group.github.io/4DGen/

Introduction to 4D Content Generation

The creation of dynamic 3D content, often referred to as 4D content, has become a pivotal area of research due to the increasing demand for content with both spatial and temporal dimensions. Traditional methods generally rely on intensive prompt engineering and incur high computational costs, both of which are significant obstacles in practical applications. Acknowledging these limitations, this paper introduces a new approach to 4D content generation that aims to streamline and enhance the overall process.

A Novel Multi-Stage 4D Generation Pipeline

At the heart of this method lies a multi-stage generation pipeline that breaks down the complexity of creating 4D content. By decomposing the process into distinct stages, the method builds on static 3D assets and monocular video sequences as the core components for constructing the 4D scene. This design gives users direct control over the geometry and motion of the content: appearance can be specified through a static 3D asset, and motion through a monocular video that is either supplied by the user or produced by an image-to-video model.
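A high-level sketch of this staged flow is given below. This is a minimal illustration, not the authors' implementation: the function names (generate_driving_video, make_anchor_pseudo_labels, optimize_dynamic_gaussians) and their signatures are hypothetical placeholders for the stages described above.

```python
# Hypothetical sketch of the multi-stage, grounded 4D generation flow.
# All stage functions are placeholders, not the authors' actual code.
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    """One RGB frame of the driving monocular video (placeholder)."""
    rgb: list  # H x W x 3 pixel values

def generate_driving_video(image: Frame, num_frames: int = 24) -> List[Frame]:
    # Stage 1 (optional): obtain the grounding video, e.g. from an
    # off-the-shelf image-to-video model, or simply take a user-provided clip.
    return [image] * num_frames  # placeholder: repeat the input image

def make_anchor_pseudo_labels(video: List[Frame], stride: int = 8) -> List[List[Frame]]:
    # Stage 2: pick anchor frames and generate multi-view pseudo labels for
    # each of them with a pre-trained diffusion model (placeholder output).
    anchors = video[::stride]
    return [[frame] for frame in anchors]

def optimize_dynamic_gaussians(video: List[Frame],
                               pseudo_labels: List[List[Frame]]) -> dict:
    # Stage 3: fit a dynamic 3D Gaussian representation against the input
    # video, the pseudo labels, score-distillation guidance, and smoothness
    # regularizers (see the later sketches).
    return {"num_frames": len(video), "representation": "dynamic 3D Gaussians"}

if __name__ == "__main__":
    user_image = Frame(rgb=[])
    driving_video = generate_driving_video(user_image)
    labels = make_anchor_pseudo_labels(driving_video)
    scene = optimize_dynamic_gaussians(driving_video, labels)
    print(scene)
```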

The method further adopts dynamic 3D Gaussians as its 4D representation, which permits efficient, high-resolution supervision through rendering during training. Spatial-temporal pseudo labels and consistency priors are also integrated into the framework, improving the plausibility of renderings from any viewpoint at any point in time.
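The sketch below, assuming PyTorch, illustrates one plausible form of such a representation: canonical per-Gaussian parameters plus a small deformation network that displaces each Gaussian over time. The parameterization and network sizes are illustrative assumptions, and the differentiable splatting renderer that would supervise it is omitted.

```python
# Minimal sketch of a dynamic 3D Gaussian representation (assumed design):
# static canonical Gaussians plus a time-conditioned deformation MLP.
import torch
import torch.nn as nn

class DynamicGaussians(nn.Module):
    def __init__(self, num_points: int = 10_000):
        super().__init__()
        # Canonical (time-independent) Gaussian parameters.
        self.xyz = nn.Parameter(torch.randn(num_points, 3) * 0.5)   # centers
        self.log_scale = nn.Parameter(torch.full((num_points, 3), -3.0))
        rot = torch.zeros(num_points, 4)
        rot[:, 0] = 1.0                                             # identity quaternions
        self.rotation = nn.Parameter(rot)
        self.opacity = nn.Parameter(torch.zeros(num_points, 1))
        self.color = nn.Parameter(torch.rand(num_points, 3))
        # Deformation MLP: (canonical position, time) -> positional offset.
        self.deform = nn.Sequential(
            nn.Linear(4, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 3),
        )

    def positions_at(self, t: float) -> torch.Tensor:
        # Append a normalized timestep to every canonical center and predict
        # a per-Gaussian displacement for that moment in time.
        time = torch.full((self.xyz.shape[0], 1), float(t))
        offset = self.deform(torch.cat([self.xyz, time], dim=-1))
        return self.xyz + offset

if __name__ == "__main__":
    model = DynamicGaussians(num_points=1_000)
    print(model.positions_at(0.5).shape)  # torch.Size([1000, 3])
```

In a full pipeline, the deformed Gaussians at each timestep would be passed to a differentiable splatting renderer, and the rendered images compared against the supervision signals described next.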

Embracing Spatial-Temporal Consistency

Generating content that is not only visually appealing but also consistent across time and space is a central challenge, and the authors combine several techniques to address it. Pseudo labels on anchor frames, produced by a pre-trained diffusion model, supervise the representation along both spatial and temporal dimensions, while consistency priors implemented through 3D-aware score distillation sampling and unsupervised smoothness regularization reinforce the temporal coherence of intermediate-frame renderings.
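The fragment below gives a hedged illustration, assuming PyTorch, of how such a composite objective could be assembled: a photometric term on pseudo-labeled anchor views, a score-distillation guidance term (treated here as a precomputed placeholder tensor), and a temporal smoothness regularizer on Gaussian positions. The loss weights and exact terms are assumptions for illustration, not the paper's formulation.

```python
# Illustrative composite training objective (assumed form, not the paper's exact losses).
import torch
import torch.nn.functional as F

def reconstruction_loss(rendered: torch.Tensor, pseudo_label: torch.Tensor) -> torch.Tensor:
    # Photometric supervision on anchor frames whose multi-view pseudo labels
    # were produced by a pre-trained diffusion model.
    return F.l1_loss(rendered, pseudo_label)

def temporal_smoothness(positions_t: torch.Tensor, positions_t1: torch.Tensor) -> torch.Tensor:
    # Unsupervised regularizer: penalize large per-Gaussian displacement
    # between consecutive timesteps to keep motion coherent.
    return (positions_t1 - positions_t).pow(2).mean()

def total_loss(rendered, pseudo_label, positions_t, positions_t1, sds_term,
               w_rec: float = 1.0, w_sds: float = 0.1, w_smooth: float = 0.01) -> torch.Tensor:
    # sds_term stands in for 3D-aware score distillation guidance computed on
    # novel-view renderings of intermediate frames.
    return (w_rec * reconstruction_loss(rendered, pseudo_label)
            + w_sds * sds_term
            + w_smooth * temporal_smoothness(positions_t, positions_t1))

if __name__ == "__main__":
    rendered = torch.rand(1, 3, 64, 64)
    label = torch.rand(1, 3, 64, 64)
    pos_t, pos_t1 = torch.rand(1_000, 3), torch.rand(1_000, 3)
    sds = torch.tensor(0.0)  # placeholder for the SDS guidance term
    print(total_loss(rendered, label, pos_t, pos_t1, sds))
```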

Advancements and Experimental Results

The proposed framework outperforms existing video-to-4D baselines on both spatial and temporal metrics, yielding more detailed renderings with smoother transitions across frames. Experiments across various datasets validate its ability to faithfully reconstruct input signals and to deliver plausible synthesis for unseen viewpoints and timesteps.
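As a concrete illustration of what spatial and temporal metrics can look like, the snippet below scores rendered frame sequences with LPIPS perceptual distance (using the lpips package): distance to reference frames as a spatial fidelity proxy and distance between consecutive frames as a temporal smoothness proxy. This is only one plausible evaluation recipe; the paper's exact protocol may differ.

```python
# Hedged evaluation sketch: LPIPS-based spatial and temporal scores.
import torch
import lpips  # pip install lpips

def evaluate(rendered: torch.Tensor, reference: torch.Tensor) -> dict:
    """rendered, reference: (T, 3, H, W) tensors scaled to [-1, 1]."""
    metric = lpips.LPIPS(net="alex")
    with torch.no_grad():
        # Spatial fidelity: perceptual distance to the reference frames.
        spatial = metric(rendered, reference).mean().item()
        # Temporal smoothness: perceptual distance between consecutive frames.
        temporal = metric(rendered[:-1], rendered[1:]).mean().item()
    return {"lpips_to_reference": spatial, "lpips_consecutive": temporal}
```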

In summary, 4DGen enhances user control and simplifies the content generation process, marking a notable step forward in dynamic 3D asset generation.

Authors (5)
  1. Yuyang Yin
  2. Dejia Xu
  3. Zhangyang Wang
  4. Yao Zhao
  5. Yunchao Wei