MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing (2311.17338v3)
Abstract: Diffusion models are widely used for either video generation or video editing. Because each field has its own task-specific problems, it is difficult to develop a single diffusion model that handles both tasks simultaneously. A video diffusion model relying solely on the text prompt can, in principle, unify the two tasks, but it lacks the capacity to align heterogeneous modalities between text and image, leading to various misalignment problems. In this work, we are the first to propose a unified Multi-alignment Diffusion, dubbed MagDiff, for both high-fidelity video generation and editing. MagDiff introduces three types of alignment: subject-driven alignment, adaptive prompts alignment, and high-fidelity alignment. In particular, subject-driven alignment trades off the image and text prompts, serving as a unified foundation generative model for both tasks. Adaptive prompts alignment emphasizes the different strengths of homogeneous and heterogeneous alignments by assigning different weights to the image and text prompts. High-fidelity alignment further enhances the fidelity of both video generation and editing by taking the subject image as an additional model input. Experimental results on four benchmarks show that our method outperforms previous methods on each task.
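The abstract gives only a high-level description of the three alignments. As a rough mental model, the sketch below shows one plausible reading of the adaptive prompts alignment (a learned weighting of image and text prompt embeddings) and the high-fidelity alignment (the subject image concatenated as an extra model input); every class name, layer, and tensor shape here is an assumption for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class AdaptivePromptAlignment(nn.Module):
    """Learned per-sample weighting of image and text prompt embeddings.

    Illustrative assumption only: the abstract says different weights are
    assigned to the image and text prompts, but does not specify how, so
    this gating network and all dimensions are hypothetical.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Predict two normalized weights from the concatenated embeddings.
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))

    def forward(self, text_emb: torch.Tensor, img_emb: torch.Tensor) -> torch.Tensor:
        # text_emb, img_emb: (batch, dim) pooled prompt embeddings.
        w = self.gate(torch.cat([text_emb, img_emb], dim=-1))  # (batch, 2)
        # Weighted sum: a larger w[:, 0] stresses homogeneous (image-image)
        # alignment, a larger w[:, 1] stresses heterogeneous (text-image).
        return w[:, :1] * img_emb + w[:, 1:] * text_emb


def high_fidelity_input(noisy_latent: torch.Tensor,
                        subject_latent: torch.Tensor) -> torch.Tensor:
    """Feed the subject image as an extra model input (assumed here to be
    channel-wise latent concatenation, a common image-to-video choice)."""
    # noisy_latent: (B, C, T, H, W); subject_latent: (B, C, H, W).
    # Broadcast the subject latent across time, then concatenate channels.
    subject = subject_latent.unsqueeze(2).expand(-1, -1, noisy_latent.shape[2], -1, -1)
    return torch.cat([noisy_latent, subject], dim=1)  # (B, 2C, T, H, W)


if __name__ == "__main__":
    fuse = AdaptivePromptAlignment(dim=768)
    text = torch.randn(4, 768)   # e.g., a pooled text-encoder embedding
    image = torch.randn(4, 768)  # e.g., a pooled image-encoder embedding
    cond = fuse(text, image)     # fused conditioning for the video denoiser
    x = high_fidelity_input(torch.randn(4, 4, 16, 32, 32),
                            torch.randn(4, 4, 32, 32))
    print(cond.shape, x.shape)   # (4, 768) and (4, 8, 16, 32, 32)
```

In this reading, the softmax gate keeps the two prompt weights non-negative and summing to one, which makes the trade-off between homogeneous and heterogeneous alignment explicit per sample.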
Authors: Haoyu Zhao, Tianyi Lu, Jiaxi Gu, Xing Zhang, Zuxuan Wu, Hang Xu, Yu-Gang Jiang, Qingping Zheng