Animate124: Animating One Image to 4D Dynamic Scene (2311.14603v2)

Published 24 Nov 2023 in cs.CV

Abstract: We introduce Animate124 (Animate-one-image-to-4D), the first work to animate a single in-the-wild image into 3D video through textual motion descriptions, an underexplored problem with significant applications. Our 4D generation leverages an advanced 4D grid dynamic Neural Radiance Field (NeRF) model, optimized in three distinct stages using multiple diffusion priors. Initially, a static model is optimized using the reference image, guided by 2D and 3D diffusion priors, which serves as the initialization for the dynamic NeRF. Subsequently, a video diffusion model is employed to learn the motion specific to the subject. However, the object in the 3D videos tends to drift away from the reference image over time. This drift is mainly due to the misalignment between the text prompt and the reference image in the video diffusion model. In the final stage, a personalized diffusion prior is therefore utilized to address the semantic drift. As the pioneering image-text-to-4D generation framework, our method demonstrates significant advancements over existing baselines, evidenced by comprehensive quantitative and qualitative assessments.
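The abstract describes optimizing a NeRF against diffusion priors across stages. The core mechanism in this family of methods is score-distillation-style guidance: render a view, add noise, ask a pretrained diffusion model to predict the noise, and use the prediction error as a gradient on the rendering. The sketch below is a minimal, assumption-laden illustration of that update, not the paper's implementation: the diffusion prior is mocked by a stub function, the NeRF is reduced to a raw image array, and the weighting `w(t)` is a placeholder constant.

```python
import numpy as np

rng = np.random.default_rng(0)

def mock_diffusion_eps(noisy_image, t):
    # Stand-in for a pretrained diffusion model's noise prediction.
    # A real prior (2D image, 3D-aware, or video diffusion, as in the paper)
    # would be conditioned on the text prompt and the timestep t.
    return 0.9 * noisy_image / (np.linalg.norm(noisy_image) + 1e-8)

def sds_gradient(rendered, t, alpha_bar, weight=1.0):
    """Score-distillation-style gradient w(t) * (eps_pred - eps).

    In a full pipeline this gradient is backpropagated through the
    renderer into the NeRF parameters; here we stop at the image."""
    eps = rng.standard_normal(rendered.shape)          # sampled noise
    noisy = np.sqrt(alpha_bar) * rendered + np.sqrt(1.0 - alpha_bar) * eps
    eps_pred = mock_diffusion_eps(noisy, t)            # prior's denoising guess
    return weight * (eps_pred - eps)

rendered = rng.standard_normal((8, 8, 3))  # toy stand-in for a rendered view
grad = sds_gradient(rendered, t=500, alpha_bar=0.5)
print(grad.shape)
```

Under this reading, the three stages of the paper differ mainly in which prior supplies `eps_pred` (2D/3D image priors for the static stage, a video prior for motion, a personalized prior to correct semantic drift), while the distillation update keeps the same shape.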
