FateZero: Fusing Attentions for Zero-shot Text-based Video Editing (2303.09535v3)
Abstract: Diffusion-based generative models have achieved remarkable success in text-based image generation. However, because the generation process involves substantial randomness, applying such models to real-world visual content editing remains challenging, especially for videos. In this paper, we propose FateZero, a zero-shot text-based editing method for real-world videos that requires neither per-prompt training nor a user-specified mask. To edit videos consistently, we propose several techniques built on pre-trained models. First, in contrast to the straightforward DDIM inversion technique, our approach captures intermediate attention maps during inversion, which effectively retain both structural and motion information. These maps are fused directly into the editing process rather than regenerated during denoising. To further minimize semantic leakage from the source video, we fuse self-attentions with a blending mask obtained from the cross-attention features of the source prompt. Furthermore, we reform the self-attention mechanism of the denoising UNet into spatial-temporal attention to ensure frame consistency. Despite its simplicity, our method is the first to demonstrate zero-shot text-driven video style and local attribute editing from a trained text-to-image model, and it also achieves stronger zero-shot shape-aware editing with a text-to-video model. Extensive experiments demonstrate superior temporal consistency and editing capability compared with previous works.
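The core mechanism described in the abstract, caching attention maps during DDIM inversion and fusing them back in during editing, can be illustrated with a minimal sketch. The code below is not the authors' implementation; the `AttentionStore` class, the `mode` argument, and the 0.5 mask threshold are illustrative assumptions that show how source self-attention can be reused and how a cross-attention-derived mask can blend edited and source attention.

```python
# Minimal sketch (assumptions, not the authors' code) of attention fusion:
# attention maps are recorded during an inversion pass and swapped back in,
# optionally blended with a mask, during the editing (denoising) pass.
import torch


class AttentionStore:
    """Caches attention maps keyed by (timestep, layer) during inversion."""
    def __init__(self):
        self.maps = {}

    def save(self, t, layer, attn):
        self.maps[(t, layer)] = attn.detach()

    def get(self, t, layer):
        return self.maps[(t, layer)]


def fused_attention(q, k, v, store, t, layer, mode="invert", blend_mask=None):
    # Standard scaled dot-product attention over flattened spatial tokens.
    attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)

    if mode == "invert":
        # Inversion pass: record the source attention map for this step/layer.
        store.save(t, layer, attn)
    elif mode == "edit":
        src_attn = store.get(t, layer)
        if blend_mask is None:
            # Self-attention fusion: reuse the source map to keep structure/motion.
            attn = src_attn
        else:
            # Blended fusion: keep edited attention only inside the edit region.
            attn = blend_mask * attn + (1.0 - blend_mask) * src_attn

    return attn @ v


def cross_attention_mask(cross_attn, token_idx, threshold=0.5):
    """Builds a binary editing mask from the source prompt's cross-attention
    weights for one text token (e.g. the word being edited)."""
    m = cross_attn[..., token_idx]                  # (batch, spatial_tokens)
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)  # normalize to [0, 1]
    return (m > threshold).float().unsqueeze(-1)    # broadcast over key dim
```

In a full pipeline, functions like these would be hooked into the UNet's attention layers and run inside the DDIM inversion and denoising loops; the spatial-temporal attention mentioned in the abstract additionally lets each frame attend to tokens from other frames to keep the edit temporally consistent.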
Authors: Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, Qifeng Chen