FateZero: Fusing Attentions for Zero-shot Text-based Video Editing (2303.09535v3)

Published 16 Mar 2023 in cs.CV

Abstract: The diffusion-based generative models have achieved remarkable success in text-based image generation. However, since it contains enormous randomness in generation progress, it is still challenging to apply such models for real-world visual content editing, especially in videos. In this paper, we propose FateZero, a zero-shot text-based editing method on real-world videos without per-prompt training or use-specific mask. To edit videos consistently, we propose several techniques based on the pre-trained models. Firstly, in contrast to the straightforward DDIM inversion technique, our approach captures intermediate attention maps during inversion, which effectively retain both structural and motion information. These maps are directly fused in the editing process rather than generated during denoising. To further minimize semantic leakage of the source video, we then fuse self-attentions with a blending mask obtained by cross-attention features from the source prompt. Furthermore, we have implemented a reform of the self-attention mechanism in denoising UNet by introducing spatial-temporal attention to ensure frame consistency. Yet succinct, our method is the first one to show the ability of zero-shot text-driven video style and local attribute editing from the trained text-to-image model. We also have a better zero-shot shape-aware editing ability based on the text-to-video model. Extensive experiments demonstrate our superior temporal consistency and editing capability than previous works.

Overview of FateZero: A Zero-Shot Text-Driven Video Editing Method

The paper introduces FateZero, a novel zero-shot text-driven video editing framework designed to enhance the capabilities of pre-trained diffusion models for consistent and high-quality video editing. This approach marks a significant contribution to the field of video content editing by leveraging diffusion-based generative models, which have primarily been successful in text-based image generation.

Key Contributions

FateZero primarily addresses the challenge of maintaining temporal consistency in the face of the inherent randomness of diffusion models when they are applied to video editing. The proposed method diverges from the conventional two-stage pipeline of independent inversion and generation: instead, it captures intermediate attention maps during DDIM inversion, which effectively retain structure and motion information, and fuses these maps back into the attention layers during the generation process rather than regenerating them during denoising.
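Below is a minimal PyTorch sketch of this capture-and-fuse idea. All names here (`AttentionStore`, `attention_with_fusion`, the layer label) are hypothetical simplifications for illustration, not the authors' implementation: attention probabilities computed at each inversion step are cached and then substituted, or blended, back in during the editing pass.

```python
import torch


class AttentionStore:
    """Caches attention maps per (timestep, layer) during inversion and
    serves them back during the editing/denoising pass."""

    def __init__(self):
        self.maps = {}  # (timestep, layer_name) -> attention tensor on CPU

    def save(self, t, layer, attn):
        self.maps[(t, layer)] = attn.detach().cpu()

    def fetch(self, t, layer):
        return self.maps[(t, layer)]


def attention_with_fusion(q, k, v, store, t, layer, mode="invert", blend=None):
    """Scaled dot-product attention that records its map during inversion
    ('invert') and fuses in the stored source map during editing ('edit')."""
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)

    if mode == "invert":
        store.save(t, layer, attn)  # keep structure and motion cues
    elif mode == "edit":
        src = store.fetch(t, layer).to(attn.device)
        # blend is None -> replace entirely with the cached source attention;
        # otherwise blend in [0, 1] keeps that fraction of the edited map.
        attn = src if blend is None else blend * attn + (1 - blend) * src
    return attn @ v


# toy usage: invert once, then edit with the cached map
store = AttentionStore()
q = k = v = torch.randn(1, 16, 8)  # (batch, tokens, dim)
_ = attention_with_fusion(q, k, v, store, t=500, layer="up_0", mode="invert")
out = attention_with_fusion(q, k, v, store, t=500, layer="up_0", mode="edit", blend=0.5)
```

In the actual pipeline the blending weight for self-attention would be derived from cross-attention maps of the source prompt (see item 2 in the list below); here `blend` is just a placeholder scalar or mask in [0, 1].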

The method involves several unique techniques:

  1. Intermediate Attention Maps: FateZero enhances the editing quality by capturing intermediate attention maps during the inversion process. These maps offer better structure and motion information retention throughout the editing process.
  2. Attention Map Blending: To reduce semantic leakage from the source video and enhance editing quality, the approach fuses the temporally causal self-attention maps with a blending mask obtained from the cross-attention features of the source prompt.
  3. Spatial-Temporal Attention Reform: The framework replaces the self-attention in the denoising UNet with spatial-temporal attention, so that frames attend to shared reference frames and frame-to-frame consistency is preserved (see the sketch after this list).
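A hypothetical PyTorch sketch of this kind of cross-frame (spatial-temporal) attention is given below. The function name and the particular choice of reference frames (current, first, and previous) are illustrative assumptions; the actual layer operates on latent UNet features with multi-head attention and may select reference frames differently.

```python
import torch


def spatial_temporal_attention(frame_feats, proj_q, proj_k, proj_v):
    """Toy cross-frame attention: tokens of each frame attend to the tokens of
    the current, first, and previous frames so appearance stays consistent.

    frame_feats: (num_frames, N, C) features, N spatial tokens of dimension C.
    proj_q/k/v:  linear projections shared across frames.
    """
    num_frames, N, C = frame_feats.shape
    outputs = []
    for f in range(num_frames):
        q = proj_q(frame_feats[f])                               # (N, C)
        ref = torch.cat([frame_feats[f],                         # current frame
                         frame_feats[0],                         # first frame
                         frame_feats[max(f - 1, 0)]], dim=0)     # previous frame
        k, v = proj_k(ref), proj_v(ref)                          # (3N, C) each
        attn = torch.softmax(q @ k.T * C ** -0.5, dim=-1)        # (N, 3N)
        outputs.append(attn @ v)                                 # (N, C)
    return torch.stack(outputs)                                  # (num_frames, N, C)


# toy usage with random per-frame features
frames = torch.randn(8, 64, 32)
wq, wk, wv = (torch.nn.Linear(32, 32) for _ in range(3))
print(spatial_temporal_attention(frames, wq, wk, wv).shape)      # torch.Size([8, 64, 32])
```

Because the keys and values of every frame include a shared reference (the first frame), the attended features, and hence the edited appearance, are tied together across time.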

Numerical Results and Claims

FateZero achieves stronger temporal consistency and editing capability than prior methods in the authors' experiments. Built on a pre-trained text-to-image model, it is presented as the first method to demonstrate zero-shot text-driven video style and local attribute editing; built on a text-to-video model, it additionally shows improved zero-shot shape-aware editing. The approach also transfers to zero-shot image editing, indicating its versatility across both video and image domains.

Implications and Future Directions

The implications of FateZero are substantial within the generative model space, particularly in application areas requiring seamless video editing capabilities without the need for per-prompt training or specific masks. By exploiting pre-trained diffusion models for video editing, it broadens the scope of applications and advances the usability of these models in practical scenarios.

From a theoretical standpoint, FateZero underscores the potential of utilizing intermediate attention maps in the inversion process to enhance generative model outputs. Future research could explore extending these concepts to even more complex editing tasks or integrating them with other sophisticated generative paradigms to further refine video editing quality.

FateZero sets a precedent for zero-shot content manipulation, offering a robust framework that meaningfully advances generative-model-based video editing.

Authors (7)
  1. Chenyang Qi (17 papers)
  2. Xiaodong Cun (61 papers)
  3. Yong Zhang (660 papers)
  4. Chenyang Lei (27 papers)
  5. Xintao Wang (132 papers)
  6. Ying Shan (252 papers)
  7. Qifeng Chen (187 papers)
Citations (263)