VACE Video Generation Model: Diffusion Transformer
- The paper introduces a unified diffusion transformer framework that integrates video generation and editing via a structured Video Condition Unit and Context Adapter mechanism.
- It employs both full fine-tuning and parameter-efficient Res-Tuning to support reference-to-video, video-to-video, and masked video editing tasks in a single pipeline.
- The approach achieves competitive performance in quality, consistency, and temporal coherence, streamlining adaptation to diverse video synthesis challenges.
VACE (Video All-in-one Creation and Editing) is a unified framework for video generation and editing based on a Diffusion Transformer (DiT) backbone. Designed to handle diverse video synthesis and manipulation tasks—including reference-to-video generation, video-to-video editing, and masked video-to-video editing—VACE organizes multimodal conditioning signals into a structured Video Condition Unit (VCU) and leverages a Context Adapter mechanism for efficient parameter tuning and flexible task injection. The system achieves specialist-level performance in a range of tasks, while requiring only a single set of weights and inference pipeline (Jiang et al., 10 Mar 2025).
1. System Architecture
VACE utilizes a pre-trained DiT model, extended to consume a multimodal VCU and integrate context signals using two core parameter-tuning methodologies:
- Full Fine-Tuning: All DiT parameters and context-processing layers are jointly trained. Context tokens are concatenated to the DiT input.
- Context-Adapter (Res-Tuning): The pre-trained DiT weights are frozen. Lightweight Context Blocks process VCU tokens, and the resulting context representations are injected additively into selected DiT layers.
The canonical inference (or training) pipeline is defined as follows:
- VCU Assembly: $V = [T; F; M]$ (text prompt, frames, masks).
- Concept Decoupling: Frames are split into $F_c = F \odot M$ (edit region) and $F_k = F \odot (1 - M)$ (keep region).
- Latent Projection: $F_c$, $F_k$, and $M$ are encoded into the VAE latent space.
- Context Embedding: Latent tensors are tokenized into context tokens by an embedder with linear and positional encodings.
- DiT Diffusion: Standard diffusion is run, integrating context tokens and text tokens, either at every DiT layer (full tuning) or at select ones via Context Blocks (Res-Tuning).
- Noise Prediction and Loss: The model predicts the noise $\epsilon_\theta(z_t, t, T, c)$, computes the diffusion loss $\mathcal{L} = \mathbb{E}\big[\lVert \epsilon - \epsilon_\theta(z_t, t, T, c) \rVert_2^2\big]$, and the optimizer step updates parameters as appropriate.
A block diagram summarizes the dataflow, with VCU inputs feeding into context encoding, and conditioning context tokens injected into the DiT blocks at prescribed points.
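To make this dataflow concrete, the following PyTorch-style sketch traces one training step under the full fine-tuning setting. It is a minimal illustration under stated assumptions, not the released implementation: the `vae`, `embedder`, and `dit` objects and their methods (`encode`, `add_noise`) are hypothetical placeholders standing in for the actual modules.

```python
import torch
import torch.nn.functional as F

def vace_training_step(text_tokens, frames, masks, vae, embedder, dit, optimizer):
    """Hedged sketch of one VACE training step: VCU -> decoupling -> latents -> DiT loss."""
    # Concept decoupling: split frames into edit (masked) and keep (unmasked) regions.
    frames_edit = frames * masks          # F_c = F * M
    frames_keep = frames * (1.0 - masks)  # F_k = F * (1 - M)

    # Latent projection: encode both streams into the VAE latent space; resize masks to match.
    z_edit = vae.encode(frames_edit)
    z_keep = vae.encode(frames_keep)
    z_mask = F.interpolate(masks, size=z_edit.shape[-3:])    # nearest-neighbor to latent grid

    # Context embedding: patchify + linear projection (positional encodings inside the embedder).
    context_tokens = embedder(torch.cat([z_edit, z_keep, z_mask], dim=1))

    # Diffusion: noise the clean latents and predict the noise with the DiT.
    z0 = vae.encode(frames)
    t = torch.randint(0, 1000, (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    zt = dit.add_noise(z0, noise, t)                          # forward process q(z_t | z_0)
    pred = dit(zt, t, text_tokens, context_tokens)            # context tokens concatenated to input

    # Denoising loss and optimizer step.
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the Res-Tuning setting the same step applies, except that only the Context Block and projection parameters receive gradient updates.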
2. Video Condition Unit (VCU) Abstraction
The VCU formalizes the input space for all video synthesis tasks, represented as $V = [T; F; M]$, where:
- $T$: Tokenized text prompt
- $F$: Sequence of RGB frames
- $M$: Sequence of binary masks
Task-specific VCU assembly unifies task interfaces:
- Text-to-Video (T2V): $F$ is all-zero, $M$ is all-ones.
- Reference-to-Video (R2V): $F$ contains reference frames followed by zeros; $M$ indicates reference vs. generated regions.
- Video-to-Video (V2V): $F$ covers all frames and $M$ is all-ones.
- Masked Video-to-Video (MV2V): $F$ and $M$ specify the original frames and arbitrary spatial/temporal masks for edit regions.
Concept decoupling splits $F$ into an edit component $F_c = F \odot M$ and a keep component $F_k = F \odot (1 - M)$ via the mask, allowing explicit control and precision over edit localization.
Spatial and temporal positional encodings (3D sinusoids) are added after linear patchification, facilitating spatiotemporal correspondence. The encoding follows the standard sinusoidal form inherited from DiT, $PE(p, 2i) = \sin\!\big(p / 10000^{2i/d}\big)$ and $PE(p, 2i+1) = \cos\!\big(p / 10000^{2i/d}\big)$, applied independently along the frame, height, and width axes.
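As an illustration of how this single interface covers the tasks above, the sketch below builds VCUs for the T2V and R2V cases and applies concept decoupling; the `VCU` dataclass and helper names are assumptions for exposition, not the paper's code.

```python
from dataclasses import dataclass
import torch

@dataclass
class VCU:
    """Video Condition Unit V = [T; F; M]: prompt, frame sequence, mask sequence."""
    text: str
    frames: torch.Tensor  # F: (T, C, H, W) RGB frames; zeros where nothing is provided
    masks: torch.Tensor   # M: (T, 1, H, W) binary masks; 1 = generate/edit, 0 = keep

def t2v_vcu(prompt: str, num_frames: int, c: int = 3, h: int = 480, w: int = 832) -> VCU:
    # Text-to-video: F is all-zero, M is all-ones (everything is generated).
    # Default h, w are an assumed 480p-scale resolution for illustration only.
    return VCU(prompt, torch.zeros(num_frames, c, h, w), torch.ones(num_frames, 1, h, w))

def r2v_vcu(prompt: str, refs: torch.Tensor, num_frames: int) -> VCU:
    # Reference-to-video: reference frames first, then zeros; masks mark the frames to generate.
    n_ref, c, h, w = refs.shape
    frames = torch.cat([refs, torch.zeros(num_frames - n_ref, c, h, w)])
    masks = torch.cat([torch.zeros(n_ref, 1, h, w), torch.ones(num_frames - n_ref, 1, h, w)])
    return VCU(prompt, frames, masks)

def decouple(vcu: VCU):
    # Concept decoupling: F_c = F * M (edit region), F_k = F * (1 - M) (keep region).
    return vcu.frames * vcu.masks, vcu.frames * (1.0 - vcu.masks)
```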
3. Context Adapter Mechanism
For parameter-efficient tuning and modularity, the Context Adapter introduces a parallel branch of Transformer blocks (Context Blocks) that process only context tokens (and optionally, text tokens). At DiT block $i$:

$$x_{i+1} = \mathrm{DiT}_i(x_i) + \phi_i\big(\mathrm{CB}_i(c_i)\big), \qquad c_{i+1} = \mathrm{CB}_i(c_i),$$

where $\mathrm{DiT}_i$ and $\mathrm{CB}_i$ are the $i$-th DiT and Context Blocks respectively, and $\phi_i$ is a learned gating/projection. Only the Context Blocks and projections are trained in Res-Tuning mode; the DiT backbone remains frozen, substantially reducing the optimization footprint.
The number and placement of Context Blocks is configurable (e.g., 14 among 28 DiT layers), enabling a flexible trade-off between model capacity and computational efficiency.
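A minimal PyTorch sketch of the Res-Tuning idea described above: a frozen stack of DiT blocks with trainable Context Blocks inserted at every other layer, whose outputs are added into the main stream through learned zero-initialized gates. The block internals (`nn.TransformerEncoderLayer` here) and the assumption that context tokens are aligned with the main token sequence are simplifications, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ContextAdapterDiT(nn.Module):
    """Frozen DiT backbone + trainable Context Blocks injected at selected depths (Res-Tuning)."""

    def __init__(self, dit_blocks: nn.ModuleList, dim: int, inject_every: int = 2):
        super().__init__()
        self.dit_blocks = dit_blocks
        for p in self.dit_blocks.parameters():
            p.requires_grad_(False)                       # the pre-trained backbone stays frozen
        self.inject_every = inject_every
        n_ctx = len(dit_blocks) // inject_every           # e.g. 14 Context Blocks for 28 DiT layers
        self.context_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(n_ctx)
        )
        # Zero-initialized gates so training starts from the frozen backbone's behavior.
        self.gates = nn.ParameterList(nn.Parameter(torch.zeros(1)) for _ in range(n_ctx))

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        ctx_idx = 0
        for i, block in enumerate(self.dit_blocks):
            x = block(x)                                  # timestep/text conditioning omitted here
            if i % self.inject_every == 0 and ctx_idx < len(self.context_blocks):
                context = self.context_blocks[ctx_idx](context)
                x = x + self.gates[ctx_idx] * context     # additive injection into the main stream
                ctx_idx += 1
        return x
```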
4. Task Specialization via VCU and Loss
VACE's unified backbone supports various tasks solely through specialization of the VCU:
- Reference-to-Video Generation (R2V): $F$ holds reference frames followed by zero frames; $M$ marks the frames to be generated.
- Video-to-Video Editing (V2V): $F$ holds the full source (or control) video; $M$ is all-ones.
- Masked Video-to-Video Editing (MV2V): $F$ holds the source video; $M$ marks the spatial/temporal regions to edit.
- Compositional Tasks: Arbitrary VCU construction enables mixed reference, masking, and generation by combining $F$ and $M$ accordingly.
For all tasks, the objective is the standard denoising diffusion loss $\mathcal{L} = \mathbb{E}\big[\lVert \epsilon - \epsilon_\theta(z_t, t, T, c) \rVert_2^2\big]$. No auxiliary adversarial or reconstruction losses are required; spatial and temporal consistency are implicitly enforced by architectural constraints and joint training.
5. Training Regimen and Implementation
Data Preparation:
- Curated approximately 480 diverse video shots spanning text-to-video, inpainting, outpainting, extension, depth, pose, flow, layout, and face/reference scenarios.
- Automated shot slicing, resolution filtering, and aesthetic/motion scoring.
- Instance masks via Grounding DINO for detection and SAM2 for propagation.
- Precomputed control signals: depth (MiDaS), pose (OpenPose), flow (RAFT), and scribble cues (InfDraw).
- Augmented masking for inpaint/outpaint (LaMa style); reference tasks involved cropping and augmentation.
Model Configuration:
- Example (LTX-Video-based): 28 DiT layers, 14 Context Blocks (placed evenly).
- Training: 16×A100 GPUs, effective batch size of 8, AdamW optimizer with weight decay.
- 200k steps, spatial/temporal resolution 480p at 8 FPS (4,992 video tokens).
- Sampling: 40 steps, Flow-Euler sampler, classifier-free guidance (CFG) scale 3.0 (illustrated in the sketch after this list).
- Alternative Wan-T2V variant: up to 40 layers, higher resolutions (720p), modified learning rate and steps.
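The sampling settings above (40 Euler steps with CFG scale 3.0) can be illustrated with the following hedged sketch of classifier-free guidance under a flow/velocity parameterization. The `model(z, t, text, context)` interface is a hypothetical placeholder, the time discretization is assumed, and whether the unconditional branch drops only the text is also an assumption.

```python
import torch

@torch.no_grad()
def flow_euler_sample(model, z, text_tokens, context_tokens, steps=40, cfg_scale=3.0):
    """Euler integration of a learned velocity field with classifier-free guidance (sketch)."""
    ts = torch.linspace(1.0, 0.0, steps + 1, device=z.device)   # from noise (t=1) to data (t=0)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        v_cond = model(z, t, text_tokens, context_tokens)        # text + VCU conditioned prediction
        v_uncond = model(z, t, None, context_tokens)             # text dropped for the null branch
        v = v_uncond + cfg_scale * (v_cond - v_uncond)           # classifier-free guidance
        z = z + (t_next - t) * v                                 # Euler step along the flow
    return z                                                     # final latents, decoded by the VAE
```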
Consistency Mechanisms:
- All input contexts and masks are represented in the same latent grid as the noisy video latents, with shared 3D sinusoidal encodings (see the sketch after this list).
- Concept decoupling aligns edit and preserve signals precisely.
- Diverse shot types and sequence durations during training reinforce continuity.
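A standard 3D sinusoidal positional encoding of the kind referenced above can be sketched as follows; the embedding dimension is assumed divisible by 6, and the exact frequency layout in VACE is inherited from the underlying DiT and may differ.

```python
import math
import torch

def sinusoidal_3d_pos_enc(t: int, h: int, w: int, dim: int) -> torch.Tensor:
    """3D sinusoidal positional encoding over (frame, height, width); dim must be divisible by 6."""
    def enc_1d(n: int, d: int) -> torch.Tensor:
        pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)                    # (n, 1)
        freqs = torch.exp(-math.log(10000.0) * torch.arange(0, d, 2).float() / d)  # (d/2,)
        angles = pos * freqs                                                       # (n, d/2)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)            # (n, d)

    d = dim // 3
    pe_t = enc_1d(t, d)[:, None, None, :].expand(t, h, w, d)   # varies along frames
    pe_h = enc_1d(h, d)[None, :, None, :].expand(t, h, w, d)   # varies along height
    pe_w = enc_1d(w, d)[None, None, :, :].expand(t, h, w, d)   # varies along width
    return torch.cat([pe_t, pe_h, pe_w], dim=-1)               # (t, h, w, dim)
```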
6. Experimental Evaluation
VACE's performance was evaluated against specialized task-specific models using VBench++ automated metrics and human mean opinion scores (MOS).
6.1 Quantitative Metrics
Metrics encompass:
- Video Quality: Aesthetic Quality (AQ), Imaging Quality (IQ), Dynamic Degree (DD)
- Video Consistency: Motion Smoothness (MS), Temporal Flicker (TF), Subject Consistency (SC), among others
- Normalized Average: Mean of eight normalized submetric scores
Selected task-wise results:
| Task | Method | AQ | IQ | DD | MS | Consist. | Norm Avg. |
|---|---|---|---|---|---|---|---|
| I2V | LTX-Video | 56.1 | 62.7 | 35.0 | 24.9 | 92.8% | 2.95 |
| I2V | VACE (LTX-based) | 57.5 | 68.0 | 45.0 | 25.1 | 93.6% | 3.20 |
| Inpaint | ProPainter | 44.7 | 61.6 | 50.0 | 18.5 | 92.9% | 2.35 |
| Inpaint | VACE | 51.3 | 60.4 | 50.0 | 21.1 | 94.6% | 2.40 |
| Depth | Control-A-Video | 50.6 | 67.8 | 70.0 | 24.5 | 88.1% | 2.70 |
| Depth | VACE | 56.7 | 66.4 | 60.0 | 25.3 | 94.1% | 3.10 |
| R2V | Vidu2.0 (closed) | 64.3 | 67.0 | 35.0 | 26.5 | 96.7% | 3.90 |
| R2V | VACE | 63.3 | 72.3 | 30.0 | 25.9 | 98.5% | 3.47 |
6.2 Human Mean Opinion Scores
Humans rated Prompt Following, Temporal Consistency, and Video Quality (scale 1–5):
| Task | Method | Prompt | Temp Cons. | Quality | Avg |
|---|---|---|---|---|---|
| I2V | LTX-Video | 2.28 | 2.28 | 2.50 | 2.35 |
| I2V | VACE | 4.00 | 2.54 | 3.24 | 3.26 |
| Depth | ControlVid | 2.50 | 1.82 | 2.29 | 2.20 |
| Depth | VACE | 3.92 | 2.66 | 3.23 | 3.27 |
6.3 Qualitative Analysis
- Reference-based generation preserves identity consistently across video frames.
- Masked editing performs seamless, flicker-free spatial–temporal compositing.
- Composite tasks (object motion, scene extension, or swaps) exhibit flexibility due to VCU and Context Adapter design.
A plausible implication is that the VCU and Context Adapter abstraction decouples input construction from the underlying network, enabling rapid extension to new tasks without modifying the core diffusion transformer.
7. Significance and Applicability
VACE demonstrates that a single DiT-based unified architecture, equipped with structured multimodal conditioning (VCU) and modular context adaptation (Context Adapter), can achieve performance rivaling that of highly specialized models across a suite of core video synthesis and editing tasks. No auxiliary adversarial or specialized reconstruction losses are required; all consistency and alignment emerge from architectural design and data construction. The combination of compositional VCU inputs, latent concept decoupling, and efficient Res-Tuning enables scaling and adaptation to future video editing paradigms, suggesting robust applicability in broad multimodal video generation domains (Jiang et al., 10 Mar 2025).