
VideoCanvas: Unified Video Generation

Updated 11 October 2025
  • VideoCanvas is a unified framework for arbitrary spatio-temporal video completion that blends tasks like inpainting, interpolation, and outpainting.
  • It employs a hybrid conditioning strategy using spatial zero-padding and temporal RoPE interpolation to overcome temporal ambiguity in causal VAEs.
  • Benchmark results on VideoCanvasBench show improved intra-scene fidelity and inter-scene creativity, highlighting its practical use in video editing and creative content generation.

VideoCanvas is a unified generative framework for arbitrary spatio-temporal video completion: it synthesizes coherent video sequences from user-specified pixel patches placed at any spatial location and timestamp, analogous to painting on a video canvas. This paradigm subsumes prior tasks such as inpainting, outpainting, interpolation, and image-to-video generation under a single model, leveraging a hybrid conditioning strategy to overcome the temporal ambiguity present in causal VAEs.

1. Formulation and Scope

VideoCanvas defines the video generation problem in terms of arbitrary spatio-temporal conditioning. Given a video $X = \{x_0, x_1, \dots, x_{T-1}\}$ and a set of user conditions $\mathcal{P} = \{(p_i, m_i, t_i)\}_{i=1}^{M}$, where each $p_i$ is an image patch, $m_i$ is a spatial mask specifying its placement, and $t_i$ is a time index, the generative objective is to produce a video $\hat{X}$ whose generated pixels satisfy $\hat{X}[t_i] \odot m_i \approx p_i$ for all $i = 1, \dots, M$.
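As a concrete illustration of this formulation, the following minimal sketch (hypothetical names, not from the paper's released code; numpy assumed) represents each condition as a (patch, mask, timestamp) triple and checks the masked-agreement objective against a generated video.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Condition:
    patch: np.ndarray   # (H, W, 3) pixel content, zeros outside the mask
    mask: np.ndarray    # (H, W) binary spatial mask giving the placement
    t: int              # pixel-frame timestamp in [0, T-1]

def satisfies_conditions(video: np.ndarray, conditions: list[Condition],
                         tol: float = 1e-2) -> bool:
    """Check the objective: generated pixels match every patch in its masked region."""
    for c in conditions:
        frame = video[c.t]                                   # (H, W, 3)
        err = np.abs(frame - c.patch) * c.mask[..., None]    # error only inside the mask
        if err.max() > tol:
            return False
    return True
```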

This generalization unifies existing video generation tasks, each of which reduces to a particular choice of condition set (see the sketch after the list):

  • First-frame image-to-video (single frame at $t = 0$).
  • Video inpainting and outpainting (patches at arbitrary space-time indices).
  • Inter-frame interpolation, extension, and cross-scene transition (multiple non-homologous patches placed across time).
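Continuing the hypothetical `Condition` sketch above, each of these tasks corresponds to a different condition set; the frame arrays here are placeholders standing in for user-provided pixels.

```python
# Dummy frames stand in for user-provided pixels.
H, W, T = 256, 256, 81
first_frame = np.zeros((H, W, 3), dtype=np.float32)
end_frame   = np.zeros((H, W, 3), dtype=np.float32)
full_mask   = np.ones((H, W), dtype=np.float32)

# First-frame image-to-video: a single full-frame condition at t = 0.
i2v = [Condition(patch=first_frame, mask=full_mask, t=0)]

# Interpolation: full-frame conditions at the first and last timestamps.
interp = [
    Condition(patch=first_frame, mask=full_mask, t=0),
    Condition(patch=end_frame, mask=full_mask, t=T - 1),
]

# Inpainting/outpainting: a sparse patch at an arbitrary space-time location.
patch_mask = np.zeros((H, W), dtype=np.float32)
patch_mask[64:128, 64:128] = 1.0
inpaint = [Condition(patch=first_frame * patch_mask[..., None], mask=patch_mask, t=17)]
```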

2. Technical Challenges in Latent Video Diffusion Models

Modern latent video diffusion models employ causal VAEs that compress multiple consecutive pixel frames into single latent slots via a fixed temporal stride $N$. This architecture introduces temporal ambiguity: several pixel frames share a latent representation, impeding precise frame-accurate conditioning.

A plausible implication is that any conditioning mechanism targeting specific pixel frames is inherently coarse unless the stride is reduced, which in turn severely impacts computational efficiency.
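To see why a stride-$N$ causal VAE blurs frame identity, the toy mapping below (an illustration, not the model's actual tokenizer) shows several pixel timestamps collapsing onto the same latent slot.

```python
N = 4  # temporal stride of the causal VAE

def latent_slot(t: int, stride: int = N) -> int:
    """Integer latent index that pixel frame t is compressed into."""
    return t // stride

# Pixel frames 40, 41, 42, 43 all land in latent slot 10: a condition
# attached to frame 41 cannot be distinguished from one attached to 43.
print([latent_slot(t) for t in (40, 41, 42, 43)])   # -> [10, 10, 10, 10]
```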

3. Hybrid Conditioning Strategy

VideoCanvas introduces a hybrid conditioning strategy, decoupling spatial and temporal control for arbitrarily placed patches:

Spatial Zero-Padding:

Each input patch $p_i$ is embedded into its spatial location on a full-frame canvas using mask $m_i$, resulting in $x_{\text{prep},i} = m_i \odot p_i$. Encoding with a frozen pre-trained VAE ensures that only the conditioned region contains defined content, while the remainder is neutral, preventing out-of-distribution artifacts.
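A minimal sketch of the padding step, using numpy; the subsequent VAE encode is only indicated in a comment, via a hypothetical `vae.encode` interface rather than the released implementation.

```python
import numpy as np

def zero_pad_canvas(patch: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Place a patch on a neutral full-frame canvas: x_prep = mask * patch.

    `patch` is (H, W, 3) with defined content only inside `mask`; everything
    outside the mask stays zero, so the frozen VAE sees defined content only
    where the user actually provided pixels.
    """
    return mask[..., None] * patch

# The padded frame would then be encoded by the frozen, pre-trained VAE,
# e.g. z_cond = vae.encode(zero_pad_canvas(patch, mask)) in a hypothetical API.
```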

Temporal RoPE Interpolation:

For models with latent compression, temporal ambiguity is resolved by assigning each conditional latent a fractional temporal position:

$$\text{pos}_t(z_{\text{cond},i}) = \frac{t_i}{N}$$

where $z_{\text{cond},i}$ is the conditional latent, $t_i$ its pixel timestamp, and $N$ the VAE stride. For example, $t_i = 41$ and $N = 4$ yields an interpolated index of $10.25$.
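The fractional position assignment is straightforward to sketch: below, standard rotary-embedding angles for a conditional latent are built from the real-valued position $t_i / N$ rather than an integer slot index (the angle parameterization is the usual RoPE form, assumed here rather than taken from the paper).

```python
import numpy as np

def rope_angles(position: float, dim: int, base: float = 10000.0) -> np.ndarray:
    """Rotary-embedding angles for a (possibly fractional) temporal position."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return position * inv_freq          # shape (dim // 2,)

N = 4                                   # causal VAE temporal stride
t_i = 41                                # pixel timestamp of the condition
pos = t_i / N                           # fractional latent position: 10.25

# The condition latent is rotated with angles for position 10.25, placing it
# between the integer latent positions 10 and 11 on the RoPE timeline.
angles = rope_angles(pos, dim=64)
```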

4. In-Context Conditioning (ICC) Mechanism

Building upon the ICC paradigm, all conditional tokens (from user patches/frames) and the source video latent tokens are concatenated into a single input sequence for the frozen transformer diffusion backbone. Unlike latent replacement or channel concatenation, which disrupts structural context, ICC concatenation preserves all provided conditional content while enabling the attention mechanism to leverage full context for coherent synthesis.

The combined latent sequence is

$$z = \text{Concat}\big(\{z_{\text{cond},i}\}_{i=1}^{M},\; z_{\text{source}}\big)$$

so that attention operates jointly over the conditional regions and the global video context.
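A schematic of the ICC token layout, assuming latents are already flattened to (num_tokens, channels); concatenation along the sequence axis lets the frozen diffusion transformer attend over conditions and the noisy source jointly. The tensor shapes are illustrative only.

```python
import numpy as np

C = 16                                   # latent channels (illustrative)

# Tokenized condition latents from the user's patches/frames (M = 3 conditions).
z_cond = [np.random.randn(64, C) for _ in range(3)]
# Tokenized (noisy) source video latent.
z_source = np.random.randn(1024, C)

# In-context conditioning: one sequence, so attention sees the full context.
z = np.concatenate(z_cond + [z_source], axis=0)   # (3*64 + 1024, C)
```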

5. Benchmarking and Comparative Performance

VideoCanvas introduces VideoCanvasBench, the first benchmark specialized for arbitrary spatio-temporal video completion. Evaluation facets include:

  • Intra-Scene Fidelity: Quantified via PSNR and FVD, measuring match to ground-truth and temporal consistency.
  • Inter-Scene Creativity: Scores for Aesthetic Quality, Imaging Quality, Temporal Coherence, Dynamic Degree, and human forced-choice on Visual/Semantic Quality and Overall Preference.

According to experimental results, VideoCanvas demonstrates improved intra-scene fidelity and inter-scene creativity relative to latent replacement and channel concatenation baselines, with higher PSNR, lower FVD, and superior perceptual metrics.
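For reference, the per-frame PSNR used for intra-scene fidelity can be computed as below; this is the standard definition, and averaging over frames is an assumption about the benchmark's aggregation.

```python
import numpy as np

def video_psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    """Mean per-frame PSNR between a generated video and its ground truth."""
    psnrs = []
    for p, g in zip(pred, gt):                      # iterate over frames
        mse = np.mean((p - g) ** 2)
        psnrs.append(10.0 * np.log10(max_val ** 2 / max(mse, 1e-12)))
    return float(np.mean(psnrs))
```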

6. Practical Applications

The generality of VideoCanvas's conditioning is leveraged for several key tasks:

  • Any-Timestamp Full-Frame Conditioning (AnyI2V): Allows scenario generation with full frames at arbitrary times, supporting diverse narrative flows.
  • Patch-to-Video Generation (AnyP2V): Sparse, non-contiguous regions at any timestamp and placement can be painted to steer generation, enabling granular control over local synthesis.
  • Video Transitions and Boundary Effects: By placing patches of distinct scenes at chosen spatial-temporal points, the framework smoothly transitions, inpaints, outpaints, or extends content beyond initial boundaries.
  • Camera Control and Creative Effects: Spatial and temporal placement supports progressive translation, scaling, and effects that mimic arbitrary camera motion, all without retraining.
  • Long-Duration Generation: Iterative autoregressive use and prompt integration enable significant extension of sequence length, potentially supporting looping or extended narratives (see the sketch after this list).
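A high-level sketch of that iterative use, reusing the hypothetical `Condition` structure from the earlier sketch: the tail frames of each generated chunk become full-frame conditions for the next chunk. `generate_chunk` is a placeholder standing in for a VideoCanvas inference call, not an actual API.

```python
import numpy as np

def generate_chunk(conditions, chunk_len: int, h: int = 256, w: int = 256) -> np.ndarray:
    """Hypothetical stand-in for one VideoCanvas inference pass."""
    return np.zeros((chunk_len, h, w, 3), dtype=np.float32)   # placeholder frames

def extend_video(seed_frame: np.ndarray, num_chunks: int, chunk_len: int = 33,
                 overlap: int = 4) -> np.ndarray:
    """Autoregressive long-video generation: condition each chunk on the last
    `overlap` frames of the previous one, then stitch the results together."""
    full_mask = np.ones(seed_frame.shape[:2], dtype=np.float32)
    video = seed_frame[None]                        # (1, H, W, 3)
    for _ in range(num_chunks):
        tail = video[-overlap:]                     # frames carried into the next chunk
        conditions = [Condition(patch=f, mask=full_mask, t=i) for i, f in enumerate(tail)]
        chunk = generate_chunk(conditions, chunk_len)
        video = np.concatenate([video, chunk[overlap:]], axis=0)
    return video
```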

7. Implications and Prospects

By resolving temporal ambiguity and supporting fine-grained control without retraining or parameter addition, VideoCanvas marks an advancement toward unified, flexible, and controllable video synthesis. Its decoupled control over spatial and temporal dimensions using hybrid conditioning and ICC addresses limitations of prior models, supporting expansion into postproduction tools, content creation workflows, and systems needing reconstruction from partial data.

The method's architecture suggests scalability to further arbitrary conditioning paradigms and higher-dimensional generative settings, contingent on future refinements in VAE temporal encoding and transformer context capacity. Evaluations on VideoCanvasBench substantiate the claim that ICC-based hybrid conditioning materially improves performance for arbitrary spatio-temporal control in video generation (Cai et al., 9 Oct 2025).
