The VACE paper introduces an all-in-one framework for video creation and editing that unifies diverse video synthesis tasks within a single model, aiming for both flexibility and efficiency.
Capabilities:
VACE supports a range of video tasks:
- Reference-to-Video Generation (R2V): Generates videos from reference images, ensuring specific content is reflected in the output video.
- Video-to-Video Editing (V2V): Edits an entire input video using control signals represented as RGB videos (e.g., depth, pose, style).
- Masked Video-to-Video Editing (MV2V): Edits specific regions of interest within an input video, using spatiotemporal masks.
- Text-to-Video Generation (T2V): A basic video creation task using text as the only input.
- Task Composition: Combines the above tasks for more complex and controllable video synthesis.
Methods:
VACE uses a Diffusion Transformer (DiT) architecture, building upon pre-trained text-to-video generation models. Key methodological components include:
- Video Condition Unit (VCU): A unified interface that integrates different task inputs (editing, reference, masking) into a consistent format V = [T; F; M], where T is the text prompt, F is a sequence of context video frames, and M is a sequence of spatiotemporal masks. Default values ensure that tasks lacking specific inputs (e.g., T2V) can still be processed, and task composition is achieved by concatenating frames and masks (see the VCU sketch after this list).
- Concept Decoupling: Uses the masks to separate the visual concepts in the input frames F into reactive frames F_c = F × M (pixels to be changed) and inactive frames F_k = F × (1 − M) (pixels to be kept), where × denotes element-wise multiplication (see the decoupling sketch after this list).
- Context Latent Encoding: Maps the decoupled frame sequences (F_c, F_k) and the masks M into a high-dimensional latent space, spatiotemporally aligned with the DiT's noisy video latents. A video VAE encodes the frame sequences, while the masks are reshaped and interpolated to match the latent dimensions (see the encoding sketch after this list).
- Context Embedder: Tokenizes the encoded context information (F_c, F_k, M) into context tokens, which are fed into the DiT model. The weights for tokenizing F_c and F_k are initialized by copying those of the original video embedder, while the mask tokenization weights are initialized to zero (see the embedder sketch after this list).
- Training Strategies:
- Fully Fine-Tuning: Updates all DiT parameters and the new Context Embedder during training.
- Context Adapter Tuning: Freezes the original DiT parameters and introduces Context Blocks (Transformer Blocks copied from the DiT) to process the context and text tokens. The output of each Context Block is added back to the corresponding DiT block as an additive signal, allowing faster convergence and pluggable feature integration (see the adapter sketch after this list).
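To make the VCU concrete, here is a minimal sketch of how its inputs might be assembled for different tasks. The `VCU` dataclass, the helper functions, and the default conventions (zero-valued frames and all-ones masks when a task provides no visual input) are illustrative assumptions, not the paper's actual code.

```python
import torch
from dataclasses import dataclass


@dataclass
class VCU:
    """Video Condition Unit V = [T; F; M] (illustrative sketch, not the official API)."""
    text: str              # T: text prompt
    frames: torch.Tensor   # F: context frames, shape (n, 3, h, w)
    masks: torch.Tensor    # M: binary masks,   shape (n, 1, h, w); 1 = regenerate


def t2v_unit(prompt: str, n: int, h: int, w: int) -> VCU:
    # T2V supplies no visual conditions: default to zero frames and all-ones
    # masks, i.e. every pixel is to be generated from scratch (assumed defaults).
    return VCU(prompt, torch.zeros(n, 3, h, w), torch.ones(n, 1, h, w))


def v2v_unit(prompt: str, control_video: torch.Tensor) -> VCU:
    # V2V: a control signal rendered as an RGB video (depth, pose, ...) fills F,
    # and an all-ones mask marks the whole clip as editable.
    n, _, h, w = control_video.shape
    return VCU(prompt, control_video, torch.ones(n, 1, h, w))


def compose(prompt: str, units: list[VCU]) -> VCU:
    # Task composition: concatenate the frame and mask sequences of several units.
    return VCU(prompt,
               torch.cat([u.frames for u in units], dim=0),
               torch.cat([u.masks for u in units], dim=0))
```

Under these assumptions, R2V could be expressed by prepending reference images as frames whose masks are all zeros, marking them as content to keep rather than regenerate.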
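A minimal sketch of the concept-decoupling step, assuming F and M are already spatially and temporally aligned tensors with mask values in {0, 1}:

```python
import torch


def decouple(frames: torch.Tensor, masks: torch.Tensor):
    """Split context frames F into reactive and inactive parts.

    frames: (n, 3, h, w); masks: (n, 1, h, w) with 1 = pixel to be changed.
    """
    reactive = frames * masks          # F_c: pixels the model should regenerate
    inactive = frames * (1.0 - masks)  # F_k: pixels the model should preserve
    return reactive, inactive
```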
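The next sketch illustrates the alignment idea behind Context Latent Encoding. The `vae.encode` interface, the latent layout (c, n', h', w'), and the use of nearest-neighbour interpolation for the masks are assumptions made for illustration; the paper only specifies that frames go through a video VAE and that masks are reshaped and interpolated to the latent dimensions.

```python
import torch
import torch.nn.functional as nnf


def encode_context(vae, reactive: torch.Tensor, inactive: torch.Tensor,
                   masks: torch.Tensor):
    """Encode decoupled frames with a video VAE and align masks to the latent grid."""
    z_reactive = vae.encode(reactive)   # assumed to return latents of shape (c, n', h', w')
    z_inactive = vae.encode(inactive)
    _, n_lat, h_lat, w_lat = z_reactive.shape

    # Masks carry no appearance information, so they bypass the VAE and are
    # simply resized to the latent spatiotemporal resolution.
    m = masks.permute(1, 0, 2, 3).unsqueeze(0)                   # (1, 1, n, h, w)
    m = nnf.interpolate(m, size=(n_lat, h_lat, w_lat), mode="nearest")
    return z_reactive, z_inactive, m.squeeze(0)                  # (1, n', h', w')
```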
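A sketch of the Context Embedder initialization described above. Treating both the original video embedder and the new context embedder as Conv3d patchify layers is an assumption; the key point is that the F_c/F_k channels copy the pretrained weights while the mask channels start at zero, so the pretrained model's behaviour is unchanged at initialization.

```python
import torch
import torch.nn as nn


def init_context_embedder(video_embedder: nn.Conv3d, mask_channels: int) -> nn.Conv3d:
    """Build a context embedder for concatenated (F_c, F_k, M) inputs."""
    c = video_embedder.in_channels
    ctx = nn.Conv3d(2 * c + mask_channels, video_embedder.out_channels,
                    kernel_size=video_embedder.kernel_size,
                    stride=video_embedder.stride)
    with torch.no_grad():
        ctx.weight.zero_()
        ctx.weight[:, :c] = video_embedder.weight        # F_c channels: copied weights
        ctx.weight[:, c:2 * c] = video_embedder.weight   # F_k channels: copied weights
        # Mask channels (the remaining ones) stay zero-initialized.
        if video_embedder.bias is not None:
            ctx.bias.copy_(video_embedder.bias)
    return ctx
```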
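Finally, a sketch of Context Adapter Tuning under simplifying assumptions: the block call signature `block(tokens, text)`, the choice of copying every other DiT block, and adding the context output directly to the hidden states are illustrative, not the paper's exact configuration.

```python
import copy
import torch.nn as nn


class ContextAdapterDiT(nn.Module):
    """Frozen DiT blocks plus trainable Context Blocks whose outputs are added
    back to the main branch as an additive (residual) signal."""

    def __init__(self, dit_blocks: nn.ModuleList, adapter_every: int = 2):
        super().__init__()
        self.adapter_every = adapter_every
        self.dit_blocks = dit_blocks
        for p in self.dit_blocks.parameters():
            p.requires_grad_(False)  # the original DiT stays frozen
        # Context Blocks are initialized as copies of the corresponding DiT blocks.
        self.context_blocks = nn.ModuleList(
            copy.deepcopy(b) for i, b in enumerate(dit_blocks)
            if i % adapter_every == 0
        )

    def forward(self, x, ctx, text):
        j = 0
        for i, block in enumerate(self.dit_blocks):
            if i % self.adapter_every == 0:
                ctx = self.context_blocks[j](ctx, text)  # trainable context branch
                x = x + ctx                              # additive signal into the main branch
                j += 1
            x = block(x, text)                           # frozen main branch
        return x
```

Because the base model is untouched, the trained Context Blocks and Context Embedder can be kept as a pluggable add-on to the pretrained DiT.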
Contributions:
- Unified Video Synthesis Framework: VACE offers an all-in-one solution for various video creation and editing tasks, eliminating the need for separate task-specific models.
- Video Condition Unit (VCU): A versatile input representation that can accommodate different input modalities (text, image, video, mask) and task requirements.
- Concept Decoupling and Context Adapter: Enhance the model's ability to understand and process distinct visual concepts within the input data.
- Competitive Performance: VACE achieves performance comparable to or better than task-specific models on various subtasks, as shown by quantitative and qualitative evaluations.
- Task Composition: Enables complex video editing workflows by combining multiple tasks within a single model.
- VACE-Benchmark: A benchmark constructed for systematically evaluating video downstream tasks.
In summary, VACE introduces a unified framework for video creation and editing by integrating different tasks into a single model. It leverages the Video Condition Unit (VCU) and Context Adapter to handle various inputs and enhance visual concept processing, achieving competitive performance and enabling complex task compositions.