VACE: All-in-One Video Creation and Editing (2503.07598v2)

Published 10 Mar 2025 in cs.CV

Abstract: Diffusion Transformer has demonstrated powerful capability and scalability in generating high-quality images and videos. Further pursuing the unification of generation and editing tasks has yielded significant progress in the domain of image content creation. However, due to the intrinsic demands for consistency across both temporal and spatial dynamics, achieving a unified approach for video synthesis remains challenging. We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. These tasks include reference-to-video generation, video-to-video editing, and masked video-to-video editing. Specifically, we effectively integrate the requirements of various tasks by organizing video task inputs, such as editing, reference, and masking, into a unified interface referred to as the Video Condition Unit (VCU). Furthermore, by utilizing a Context Adapter structure, we inject different task concepts into the model using formalized representations of temporal and spatial dimensions, allowing it to handle arbitrary video synthesis tasks flexibly. Extensive experiments demonstrate that the unified model of VACE achieves performance on par with task-specific models across various subtasks. Simultaneously, it enables diverse applications through versatile task combinations. Project page: https://ali-vilab.github.io/VACE-Page/.

Summary

  • The paper introduces VACE, an all-in-one framework using a Diffusion Transformer to unify diverse video creation and editing tasks into a single model.
  • VACE utilizes a Video Condition Unit (VCU) to integrate various task inputs (text, image, video, mask) into a consistent format and employs Concept Decoupling with a Context Adapter for enhanced visual concept processing.
  • The framework achieves competitive performance across multiple subtasks and supports complex task composition, demonstrating the efficiency and flexibility of the unified approach.

The VACE paper introduces an all-in-one framework for video creation and editing, designed to unify several video tasks into a single model. This approach aims to handle diverse video synthesis tasks with both flexibility and efficiency.

Capabilities:

VACE supports a range of video tasks:

  • Reference-to-Video Generation (R2V): Generates videos from reference images, ensuring specific content is reflected in the output video.
  • Video-to-Video Editing (V2V): Edits an entire input video using control signals represented as RGB videos (e.g., depth, pose, style).
  • Masked Video-to-Video Editing (MV2V): Edits specific regions of interest within an input video, using spatiotemporal masks.
  • Text-to-Video Generation (T2V): A basic video creation task using text as the only input.
  • Task Composition: Combines the above tasks for more complex and controllable video synthesis (a sketch of the shared input format follows this list).
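All of these tasks reduce to one shared input signature: a text prompt plus optional frames and masks (the Video Condition Unit described under Methods below). The minimal PyTorch sketch below shows one way this could look; the `make_vcu` helper, tensor shapes, and default values (zero frames, all-ones masks) are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def make_vcu(prompt, frames=None, masks=None, num_frames=16, h=64, w=64):
    """Build a VCU triple (T, F, M); missing inputs fall back to assumed
    defaults: zero frames ("nothing given to edit") and all-ones masks
    ("every pixel may be regenerated")."""
    if frames is None:                      # e.g. pure text-to-video (T2V)
        frames = torch.zeros(num_frames, 3, h, w)
    if masks is None:                       # e.g. whole-video editing (V2V)
        masks = torch.ones(num_frames, 1, h, w)
    return prompt, frames, masks

# T2V: text only -- both defaults kick in.
T, F, M = make_vcu("a cat surfing at sunset")

# V2V: a control video (e.g. depth maps rendered as RGB) edits every pixel.
control = torch.rand(16, 3, 64, 64)
T, F, M = make_vcu("a cat surfing at sunset", frames=control)

# MV2V: an input video plus a spatiotemporal mask restricting the edit region.
video = torch.rand(16, 3, 64, 64)
region = (torch.rand(16, 1, 64, 64) > 0.5).float()
T, F, M = make_vcu("replace the surfboard with a skateboard", video, region)

# Task composition: concatenate frames and masks along the frame axis.
T2, F2, M2 = make_vcu("another clip", frames=control)
F_combo, M_combo = torch.cat([F, F2]), torch.cat([M, M2])
```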

Methods:

VACE uses a Diffusion Transformer (DiT) architecture, building upon pre-trained text-to-video generation models. Key methodological components include:

  1. Video Condition Unit (VCU): A unified interface that integrates the different task inputs (editing, reference, masking) into the consistent format $V = [T; F; M]$, where $T$ is a text prompt, $F$ is a sequence of context video frames, and $M$ is a sequence of masks. Default values ensure that tasks lacking specific inputs (e.g., T2V) can still be processed, and task composition is achieved by concatenating frames and masks.
  2. Concept Decoupling: Uses the masks to separate the visual concepts within the input frames $F$ into reactive frames $F_c$ (pixels to be changed) and inactive frames $F_k$ (pixels to be kept), as follows:
    • $F_c = F \times M$
    • $F_k = F \times (1 - M)$
  3. Context Latent Encoding: Maps the decoupled frame sequences ($F_c$, $F_k$) and masks ($M$) into a high-dimensional latent space, aligning them spatiotemporally with the DiT's noisy video latents. Video VAEs encode the frames, and the masks are reshaped and interpolated to match the latent-space dimensions.
  4. Context Embedder: Tokenizes the encoded context information ($F_c$, $F_k$, $M$) into context tokens, which are fed into the DiT model. The weights for tokenizing $F_c$ and $F_k$ are initialized by copying the original video embedder's weights, while the mask tokenization weights are initialized to zero.
  5. Training Strategies:
    • Fully Fine-Tuning: Updates all DiT parameters and the new Context Embedder during training.
    • Context Adapter Tuning: Freezes the original DiT parameters and introduces Context Blocks (Transformer Blocks copied from the DiT) that process the context and text tokens; their outputs are added back into the corresponding DiT blocks as residual signals (see the sketches below). This approach converges faster and keeps the context branch pluggable.
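Concept Decoupling (item 2) is a pixelwise split. A minimal sketch, assuming binary masks and illustrative tensor shapes:

```python
import torch

F = torch.rand(16, 3, 64, 64)                   # context frames (T, C, H, W)
M = (torch.rand(16, 1, 64, 64) > 0.5).float()   # binary masks, broadcast over C

F_c = F * M          # reactive frames: pixels the model should change
F_k = F * (1 - M)    # inactive frames: pixels the model should keep

assert torch.allclose(F_c + F_k, F)  # with binary masks, the split is lossless
```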
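For Context Latent Encoding (item 3), the two frame streams would pass through the video VAE; the sketch below illustrates only the mask side, which the summary describes as reshaping and interpolating masks to match the latent dimensions. The 4x temporal and 8x spatial compression factors are assumptions, not values taken from the paper:

```python
import torch
import torch.nn.functional as nnf

M = (torch.rand(1, 1, 16, 64, 64) > 0.5).float()  # masks: (batch, 1, T, H, W)
t_down, s_down = 4, 8                             # assumed VAE strides

# Nearest-neighbor interpolation keeps the masks binary while aligning them
# with the (T/4, H/8, W/8) latent grid occupied by the encoded frames.
M_latent = nnf.interpolate(
    M, size=(16 // t_down, 64 // s_down, 64 // s_down), mode="nearest"
)
print(M_latent.shape)  # torch.Size([1, 1, 4, 8, 8])
```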
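The Context Embedder initialization (item 4) can be mimicked directly: copy the pretrained patch-embedding weights for the two frame streams, and zero-initialize the mask stream so it contributes nothing at the start of training. The channel counts and patch size here are illustrative assumptions:

```python
import copy
import torch.nn as nn

latent_c, mask_c, d_model, patch = 16, 4, 1024, 2
video_embed = nn.Conv3d(latent_c, d_model, kernel_size=patch, stride=patch)

fc_embed = copy.deepcopy(video_embed)   # F_c tokenizer: copied pretrained weights
fk_embed = copy.deepcopy(video_embed)   # F_k tokenizer: copied pretrained weights

mask_embed = nn.Conv3d(mask_c, d_model, kernel_size=patch, stride=patch)
nn.init.zeros_(mask_embed.weight)       # zero-init so training starts from the
nn.init.zeros_(mask_embed.bias)         # behavior of the unmodified base model
```

Zero-initializing the new branch is a common trick (used in ControlNet-style adapters) to avoid perturbing the pretrained model at step zero.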
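Finally, Context Adapter Tuning (item 5) resembles a residual side-branch over a frozen backbone. The block internals below are stand-ins; only the freeze-copy-add pattern reflects the summary, and the paper may insert Context Blocks at only a subset of layers:

```python
import copy
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in transformer block (not the actual DiT block)."""
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    def forward(self, x):
        x = x + self.attn(x, x, x)[0]
        return x + self.mlp(x)

d, depth = 256, 4
dit_blocks = nn.ModuleList(Block(d) for _ in range(depth))
ctx_blocks = nn.ModuleList(copy.deepcopy(b) for b in dit_blocks)  # trainable copies

for p in dit_blocks.parameters():
    p.requires_grad_(False)          # the pretrained DiT stays frozen

x = torch.rand(1, 32, d)             # noisy video tokens in the main stream
ctx = torch.rand(1, 32, d)           # context tokens from the Context Embedder

for dit_block, ctx_block in zip(dit_blocks, ctx_blocks):
    ctx = ctx_block(ctx)             # side branch processes the context stream
    x = dit_block(x) + ctx           # additive injection into the main stream
```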

Contributions:

  • Unified Video Synthesis Framework: VACE offers an all-in-one solution for various video creation and editing tasks, eliminating the need for separate task-specific models.
  • Video Condition Unit (VCU): A versatile input representation that can accommodate different input modalities (text, image, video, mask) and task requirements.
  • Concept Decoupling and Context Adapter: Enhances the model’s ability to understand and process different visual concepts within the input data.
  • Competitive Performance: VACE achieves performance comparable to or better than task-specific models on various subtasks, as shown by quantitative and qualitative evaluations.
  • Task Composition: Enables complex video editing workflows by combining multiple tasks within a single model.
  • VACE-Benchmark: A benchmark constructed to systematically evaluate the downstream video tasks covered by the framework.

In summary, VACE introduces a unified framework for video creation and editing by integrating different tasks into a single model. It leverages the Video Condition Unit (VCU) and Context Adapter to handle various inputs and enhance visual concept processing, achieving competitive performance and enabling complex task compositions.
