DreamVideo: Composing Your Dream Videos with Customized Subject and Motion (2312.04433v1)

Published 7 Dec 2023 in cs.CV

Abstract: Customized generation with diffusion models has made impressive progress in image generation, but remains unsatisfactory in the more challenging video generation task, which requires controllability over both subjects and motions. To that end, we present DreamVideo, a novel approach to generating personalized videos from a few static images of the desired subject and a few videos of the target motion. DreamVideo decouples this task into two stages, subject learning and motion learning, by leveraging a pre-trained video diffusion model. Subject learning aims to accurately capture the fine appearance of the subject from the provided images, achieved by combining textual inversion with fine-tuning of our carefully designed identity adapter. In motion learning, we design a motion adapter and fine-tune it on the given videos to effectively model the target motion pattern. Combining these two lightweight and efficient adapters allows flexible customization of any subject with any motion. Extensive experimental results demonstrate the superior performance of DreamVideo over state-of-the-art methods for customized video generation. Our project page is at https://dreamvideo-t2v.github.io.
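
The two lightweight adapters are the core of the method: an identity adapter fine-tuned during subject learning and a motion adapter fine-tuned during motion learning, both attached to a frozen pre-trained video diffusion model. As a minimal sketch of the residual bottleneck adapter pattern this kind of design builds on (an illustration, not the authors' released code; the class name, placement, and dimensions are assumptions), such a module in PyTorch might look like:

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Hypothetical lightweight residual adapter: down-project, apply a
    nonlinearity, up-project, and add back to the input. DreamVideo's
    identity and motion adapters are small trainable modules of roughly
    this flavor inserted into a frozen video diffusion backbone."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        # Zero-init the up-projection so the adapter starts as an identity
        # mapping and does not perturb the pre-trained model at step 0.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

Training then proceeds in two stages: subject learning optimizes a textual-inversion embedding together with the identity adapter on the subject images, while motion learning fine-tunes only the motion adapter on the reference videos. At inference, the two adapters are combined to render any learned subject performing any learned motion.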

Authors (9)
  1. Yujie Wei (24 papers)
  2. Shiwei Zhang (179 papers)
  3. Zhiwu Qing (29 papers)
  4. Hangjie Yuan (36 papers)
  5. Zhiheng Liu (22 papers)
  6. Yu Liu (786 papers)
  7. Yingya Zhang (43 papers)
  8. Jingren Zhou (198 papers)
  9. Hongming Shan (91 papers)
Citations (53)
