
VidToMe: Video Token Merging for Zero-Shot Video Editing (2312.10656v2)

Published 17 Dec 2023 in cs.CV

Abstract: Diffusion models have made significant advances in generating high-quality images, but their application to video generation has remained challenging due to the complexity of temporal motion. Zero-shot video editing offers a solution by utilizing pre-trained image diffusion models to translate source videos into new ones. Nevertheless, existing methods struggle to maintain strict temporal consistency and efficient memory consumption. In this work, we propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames. By aligning and compressing temporally redundant tokens across frames, our method improves temporal coherence and reduces memory consumption in self-attention computations. The merging strategy matches and aligns tokens according to the temporal correspondence between frames, facilitating natural temporal consistency in generated video frames. To manage the complexity of video processing, we divide videos into chunks and develop intra-chunk local token merging and inter-chunk global token merging, ensuring both short-term video continuity and long-term content consistency. Our video editing approach seamlessly extends the advancements in image editing to video editing, rendering favorable results in temporal consistency over state-of-the-art methods.

Overview

Artificial intelligence research has long sought to improve how machines interpret and manipulate visual media. While diffusion models have made significant strides in image generation, video generation remains challenging because of the intricacies of temporal motion. The paper introduces "VidToMe," a method that improves temporal consistency in video editing without requiring training on large video datasets. The technique targets zero-shot video editing, in which a pre-trained image diffusion model translates a source video into a new one while retaining the original motion.
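To make the zero-shot setting concrete, the sketch below edits a video frame by frame with a pre-trained image diffusion model through the diffusers img2img pipeline. The model id, prompt, strength, and frame paths are placeholder choices, and this per-frame baseline is precisely the setting that flickers without cross-frame information; VidToMe's token merging is applied inside such a model's self-attention layers to restore consistency.

```python
# A minimal per-frame zero-shot editing baseline (a sketch, not VidToMe itself).
# The model id, prompt, strength, and frame paths are placeholders.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

frames = [Image.open(f"frames/{i:04d}.png").convert("RGB") for i in range(16)]
edited = [
    pipe(prompt="a watercolor painting of the scene",
         image=frame, strength=0.6, guidance_scale=7.5).images[0]
    for frame in frames
]
# Edited independently, these frames tend to flicker; VidToMe merges self-attention
# tokens across frames inside the diffusion model to keep them consistent.
```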

Temporal Coherence

One of the main issues with current video generation techniques is ensuring strict temporal consistency: existing models often produce frames whose details drift over time, degrading perceived quality. VidToMe addresses this directly by aligning and compressing tokens (the units processed by the self-attention layers of diffusion models) across frames, enhancing temporal coherence. Tokens are matched according to the temporal correspondence between frames, so content that persists across frames is generated consistently.
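As a rough illustration of this merging step, the PyTorch sketch below matches each token of one frame to its most similar token in a reference frame by cosine similarity and averages the most redundant matches into the reference. The function name, the top-k selection rule, and the merge ratio are assumptions made for this sketch, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def merge_tokens_across_frames(ref_tokens, tgt_tokens, merge_ratio=0.5):
    """Illustrative cross-frame token merging (a sketch, not VidToMe's exact algorithm).

    ref_tokens: (N, C) self-attention tokens of a reference frame
    tgt_tokens: (M, C) tokens of another frame, partially merged into the reference
    """
    # Proxy for temporal correspondence: cosine similarity between all token pairs.
    sim = F.normalize(tgt_tokens, dim=-1) @ F.normalize(ref_tokens, dim=-1).T  # (M, N)
    best_sim, best_ref = sim.max(dim=-1)          # best reference match per target token

    # Merge the most redundant target tokens (highest similarity to the reference).
    num_merge = int(merge_ratio * tgt_tokens.shape[0])
    merge_idx = best_sim.topk(num_merge).indices
    keep_mask = torch.ones(tgt_tokens.shape[0], dtype=torch.bool, device=tgt_tokens.device)
    keep_mask[merge_idx] = False

    # Average each merged token into its matched reference token.
    merged = ref_tokens.clone()
    counts = torch.ones(ref_tokens.shape[0], 1, device=ref_tokens.device, dtype=ref_tokens.dtype)
    merged.index_add_(0, best_ref[merge_idx], tgt_tokens[merge_idx])
    counts.index_add_(0, best_ref[merge_idx], torch.ones_like(tgt_tokens[merge_idx][:, :1]))
    merged = merged / counts

    # Unmerged tokens are kept; (merge_idx, best_ref) lets a caller scatter results back.
    return merged, tgt_tokens[keep_mask], (merge_idx, best_ref[merge_idx])
```

Because self-attention then runs over the smaller merged token set, memory use in the attention layers drops alongside the gain in consistency.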

Computational Efficiency

Processing video involves a tremendous amount of data, making computational efficiency a key challenge. VidToMe addresses this by dividing the video into chunks and introducing intra-chunk local token merging and inter-chunk global token merging. Local merging ensures short-term continuity within each chunk, while global merging maintains long-term content consistency across the whole video. Operating on chunks also keeps memory consumption and the cost of self-attention manageable, as sketched below.
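The chunked scheme can be sketched on top of the previous helper. The chunk size, the choice of each chunk's first frame as the local reference, and the running global token bank below are assumptions made for illustration rather than the paper's exact design; merge_tokens_across_frames is the illustrative function from the previous snippet.

```python
import torch

def merge_video_tokens(frame_tokens, chunk_size=4, local_ratio=0.5, global_ratio=0.25):
    """Two-level merging over a list of per-frame token tensors, each of shape (N, C).
    A sketch of intra-chunk local merging plus inter-chunk global merging, reusing
    merge_tokens_across_frames from the previous snippet."""
    global_bank = None            # long-term tokens shared across chunks
    merged_chunks = []

    for start in range(0, len(frame_tokens), chunk_size):
        chunk = frame_tokens[start:start + chunk_size]

        # Intra-chunk local merging: fold every frame of the chunk into its first
        # frame, keeping only tokens without a redundant counterpart.
        local = chunk[0]
        for tokens in chunk[1:]:
            local, kept, _ = merge_tokens_across_frames(local, tokens, local_ratio)
            local = torch.cat([local, kept], dim=0)
        merged_chunks.append(local)

        # Inter-chunk global merging: align the chunk with a running global token
        # bank so content stays consistent across distant parts of the video.
        if global_bank is None:
            global_bank = local
        else:
            global_bank, kept, _ = merge_tokens_across_frames(global_bank, local, global_ratio)
            global_bank = torch.cat([global_bank, kept], dim=0)

    return merged_chunks, global_bank
```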

Integration and Performance

The proposed approach carries advances in image-editing diffusion models over to video: VidToMe can be combined with existing image editing methods to produce text-aligned, temporally consistent video edits. In comprehensive experiments, it outperforms state-of-the-art methods in temporal consistency while remaining faithful to the editing prompts.
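Because the merging operates purely on self-attention tokens, it can in principle be attached to an existing image-editing pipeline by wrapping its attention layers. The wrapper below (reusing the merge_tokens_across_frames sketch from above) is an illustrative assumption rather than VidToMe's actual module: it expects a self-attention block that maps a (batch, tokens, channels) tensor to the same shape, merges a chunk's frames before attending, and, for simplicity, broadcasts the shared output back to every frame instead of performing the exact unmerge bookkeeping.

```python
import torch
import torch.nn as nn

class MergedFrameAttention(nn.Module):
    """Illustrative wrapper (not VidToMe's actual module): merge tokens across the
    frames of a chunk, attend once over the merged set, and share the result."""

    def __init__(self, attn: nn.Module, merge_ratio: float = 0.5):
        super().__init__()
        self.attn = attn              # assumed: maps (B, T, C) -> (B, T, C)
        self.merge_ratio = merge_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (frames, tokens, channels) -- the frames of one chunk.
        frames, tokens, _ = x.shape
        merged = x[0]                 # the first frame acts as the reference
        for frame_tokens in x[1:]:
            merged, kept, _ = merge_tokens_across_frames(merged, frame_tokens, self.merge_ratio)
            merged = torch.cat([merged, kept], dim=0)

        # One attention pass over the compressed, cross-frame token set.
        out = self.attn(merged.unsqueeze(0)).squeeze(0)

        # Simplified "unmerge": every frame receives the shared reference tokens.
        return out[:tokens].unsqueeze(0).expand(frames, -1, -1)
```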

Contributions

The paper outlines three main contributions of VidToMe to the field of AI-based video editing:

  • A novel method for enhancing temporal consistency in video generation by merging self-attention tokens across frames.
  • A dual strategy for local and global token merging, facilitating both short-term and long-term consistency in videos.
  • Demonstrated superiority in maintaining temporal consistency and computational efficiency compared to state-of-the-art zero-shot video editing methods.

In conclusion, "VidToMe: Video Token Merging for Zero-Shot Video Editing" presents a substantial advance in zero-shot video editing, extending the capabilities of AI in understanding and manipulating temporal media. With its improved consistency and efficiency, the method sets a strong reference point for future research and applications in video generation and editing.

Authors (4)
  1. Xirui Li
  2. Chao Ma
  3. Xiaokang Yang
  4. Ming-Hsuan Yang