VideoCanvasBench: Spatio-Temporal Video Benchmark
- VideoCanvasBench is a benchmarking framework that evaluates video generative models under arbitrary spatio-temporal conditioning.
- It is used to evaluate techniques such as In-Context Conditioning with Temporal RoPE Interpolation, which enhance temporal alignment and output fidelity.
- The framework supports varied tasks such as image-to-video, inpainting, and creative scene transitions while addressing compute and quality trade-offs.
VideoCanvasBench is a benchmarking framework developed to systematically evaluate arbitrary spatio-temporal video completion. Unlike traditional benchmarks that rely on rigid or single-frame conditioning scenarios, VideoCanvasBench enables the assessment of video generative models under conditions where user-specified spatial patches or full frames may be applied at any timestamp and location within a video sequence. This approach unifies diverse controllable video generation tasks—including image-to-video, inpainting, extension, interpolation, and creative cross-scene transitions—within a cohesive evaluation paradigm, providing comprehensive coverage for both intra-scene fidelity and inter-scene creativity.
1. Conceptual Foundation and Motivation
The core motivation behind VideoCanvasBench is to address the limitations of existing evaluations for video generative models, which typically operate only within fixed conditioning paradigms, such as first-frame conditioning or rigid inpainting masks. VideoCanvasBench enables "painting" on a video canvas—i.e., placing arbitrary content patches at any location and any timestamp within a video sequence. The benchmark's scope comprehensively covers tasks including:
- AnyP2V (Any-Timestamp Patch-to-Video): Sparse spatial patch conditions at selected anchors.
- AnyI2V (Any-Timestamp Image-to-Video): Complete frames supplied at non-fixed timestamps.
- AnyV2V (Video-to-Video): Full video-level scenarios including inpainting, outpainting, and creative inter-scene transitions.
This generalized framework directly challenges video models to maintain temporal consistency, spatial coherence, and semantic integrity across arbitrary conditioning schemes.
2. Task Formulation and Categories
In VideoCanvasBench, a video sequence is formalized as $X = \{x_t\}_{t=1}^{T}$, and the set of user-specified conditions is given as:

$$\mathcal{C} = \{(p_i, m_i, t_i)\}_{i=1}^{N},$$

where $p_i$ is the content patch, $m_i$ is its spatial mask, and $t_i$ is the conditioning timestamp. The benchmark defines the objective for arbitrary spatio-temporal completion as generating a video $\hat{X}$ satisfying:

$$m_i \odot \hat{x}_{t_i} = m_i \odot p_i \quad \text{for all } i,$$

with $\odot$ denoting the masked pixelwise operation. Benchmarked task categories include:
| Task Category | Conditioning Form | Purpose |
|---|---|---|
| AnyP2V | Sparse patches at anchors | Interpolation, spatially sparse completions |
| AnyI2V | Full frames at arbitrary times | Generalized image-to-video completion |
| AnyV2V | Video or multi-frame cues | Inpainting, outpainting, scene transition |
This structure facilitates controlled evaluation of both local reconstruction and global video coherence.
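To make the formulation concrete, the following Python sketch (illustrative only; the `Condition` class and `satisfies_conditions` helper are hypothetical names, not part of the benchmark's released code) encodes a set of spatio-temporal conditions and checks the masked pixelwise constraint:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Condition:
    """One user-specified condition: a content patch placed at a timestamp."""
    patch: np.ndarray   # condition content p_i, zero-padded to full frame size (H, W, 3)
    mask: np.ndarray    # binary spatial mask m_i, broadcastable to the frame shape
    timestamp: int      # conditioning timestamp t_i (pixel-frame index)

def satisfies_conditions(video: np.ndarray, conditions: list[Condition],
                         tol: float = 1e-3) -> bool:
    """Check the masked pixelwise constraint m_i * x_{t_i} == m_i * p_i."""
    for c in conditions:
        frame = video[c.timestamp]             # generated frame at t_i
        diff = c.mask * (frame - c.patch)      # error counted only inside the mask
        if np.abs(diff).max() > tol:
            return False
    return True
```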
3. Methodology and Evaluation Metrics
The experimental methodology leverages a latent video diffusion model with a Diffusion Transformer (DiT) backbone, fine-tuned for 20,000 steps on approximately 650,000 high-quality video clips, each 5 seconds long at a fixed resolution. Conditioning paradigms under comparison include Latent Replacement, Channel Concatenation, and the proposed In-Context Conditioning (ICC) with Temporal RoPE Interpolation.
Temporal RoPE Interpolation enables fractional assignment of conditional token positions within the latent video sequence. A condition supplied at pixel-frame timestamp $t_i$ is assigned the fractional latent position

$$\tau_i = \frac{t_i}{s},$$

where $s$ is the VAE's temporal stride, thus resolving the ambiguity induced by causal VAEs that compress multiple pixel frames into a single latent.
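As a rough illustration, the sketch below computes fractional temporal positions for conditioning tokens and the corresponding rotary angle tables, assuming a standard sinusoidal RoPE formulation (the `temporal_rope` helper and its defaults are assumptions, not the paper's exact implementation):

```python
import numpy as np

def fractional_positions(timestamps, stride):
    """Map pixel-frame timestamps t_i to fractional latent positions t_i / s."""
    return np.asarray(timestamps, dtype=np.float32) / stride

def temporal_rope(positions, dim, base=10000.0):
    """Rotary angle tables for (possibly fractional) temporal positions.

    Returns cos/sin arrays of shape (len(positions), dim // 2) that can be
    applied to the temporal axis of query/key tokens.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    angles = np.outer(positions, inv_freq)
    return np.cos(angles), np.sin(angles)

# Example: causal VAE with temporal stride 4; conditions at pixel frames 0, 10, 30.
pos = fractional_positions([0, 10, 30], stride=4)   # -> [0.0, 2.5, 7.5]
cos_tab, sin_tab = temporal_rope(pos, dim=64)
```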
Evaluation metrics comprise both automated and human-assessed criteria:
| Metric | Scope |
|---|---|
| PSNR | Pixel-level fidelity (conditioned regions) |
| Fréchet Video Distance (FVD) | Temporal and spatial perceptual similarity |
| Aesthetic Quality | Artistic and visual appeal |
| Imaging Quality | Artifact and distortion quantification |
| Temporal Coherence | Motion smoothness, intra-scene consistency |
| Dynamic Degree | Strength and intensity of motion |
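For the pixel-level criterion, a minimal sketch of PSNR restricted to conditioned regions (an illustrative formulation assuming frames normalized to [0, 1]; not necessarily the benchmark's exact implementation) is shown below:

```python
import numpy as np

def masked_psnr(generated: np.ndarray, reference: np.ndarray,
                mask: np.ndarray, max_val: float = 1.0) -> float:
    """PSNR computed only over pixels selected by a binary mask.

    generated, reference: frames of identical shape (H, W, C), values in [0, max_val].
    mask: binary array broadcastable to the frame shape; 1 marks a conditioned pixel.
    """
    weights = np.broadcast_to(mask, generated.shape).astype(np.float64)
    mse = np.sum(weights * (generated - reference) ** 2) / max(weights.sum(), 1.0)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```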
User studies further rate outputs along Visual Quality, Semantic Quality, and Overall Preference, with a test set exceeding 2,000 benchmark cases and 25 human evaluators.
4. Technical Innovations
VideoCanvasBench directly evaluates the efficacy of In-Context Conditioning (ICC), which concatenates independently encoded conditional tokens with the latent video inputs, requiring no new model parameters. This hybrid strategy decouples spatial placement, handled by zero-padding the condition onto a full-size canvas before encoding, from temporal alignment, handled by Temporal RoPE Interpolation.
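A minimal PyTorch sketch of these two steps (function names, tensor layouts, and the top-left placement default are illustrative assumptions, not the paper's implementation):

```python
import torch

def pad_condition_to_canvas(patch, mask, canvas_hw, top_left=(0, 0)):
    """Place a condition patch onto a zero-valued full-size canvas.

    patch: (C, h, w) pixel content; mask: (1, h, w) binary placement mask;
    canvas_hw: (H, W) full-frame resolution. Spatial placement is expressed
    purely through zero-padding, so the encoder sees a standard full-size frame.
    """
    C, h, w = patch.shape
    H, W = canvas_hw
    y, x = top_left
    canvas = torch.zeros(C, H, W)
    canvas[:, y:y + h, x:x + w] = patch * mask
    return canvas

def build_in_context_sequence(video_tokens, cond_tokens):
    """Concatenate independently encoded condition tokens with video tokens.

    video_tokens: (B, N_video, D) tokens of the (noisy) video latent.
    cond_tokens:  (B, N_cond, D) tokens of the zero-padded, separately encoded
                  condition frames. No new parameters are introduced; the DiT
                  simply attends over the longer joint sequence.
    """
    return torch.cat([cond_tokens, video_tokens], dim=1)
```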
The training objective for the DiT is a masked denoising loss of the form:

$$\mathcal{L} = \mathbb{E}_{z_0,\, \epsilon,\, t}\Big[ \big\| (1 - M) \odot \big( \epsilon_\theta(z_t, t, \mathcal{C}) - \epsilon \big) \big\|_2^2 \Big],$$

where $z_t$ encodes the combination of conditioned latent and noise, $M$ is the union of condition masks, and the loss is computed only over non-conditioned regions.
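For illustration, a simplified PyTorch sketch of such a masked objective, assuming an epsilon-prediction parameterization and a `noise_schedule` helper returning broadcastable $(\alpha_t, \sigma_t)$ factors (both assumptions, not the paper's exact formulation):

```python
import torch

def masked_denoising_loss(model, z0, cond_tokens, cond_mask, timesteps, noise_schedule):
    """Masked diffusion loss: supervise only non-conditioned video tokens.

    z0:          (B, N, D) clean video latent tokens.
    cond_tokens: (B, N_c, D) in-context condition tokens (kept noise-free).
    cond_mask:   (B, N) binary mask, 1 where a video token is conditioned.
    """
    noise = torch.randn_like(z0)
    alpha, sigma = noise_schedule(timesteps)              # broadcastable (B, 1, 1) factors
    z_t = alpha * z0 + sigma * noise                      # noised video latent
    pred = model(torch.cat([cond_tokens, z_t], dim=1), timesteps)
    pred = pred[:, cond_tokens.shape[1]:]                 # keep predictions for video tokens
    weight = (1.0 - cond_mask).unsqueeze(-1)              # zero out conditioned regions
    denom = (weight.sum() * z0.shape[-1]).clamp(min=1.0)
    return (weight * (pred - noise) ** 2).sum() / denom
```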
Ablation studies reveal that omitting RoPE Interpolation results in shifted PSNR peaks and temporal misalignment, whereas ICC with RoPE achieves precise alignment and high-fidelity motion. Compared to Latent Replacement (yielding static outputs) and Channel Concatenation (incurring artifacts), ICC consistently delivers superior perceptual and motion coherence.
5. Comparative Results and Analysis
Quantitative and qualitative analyses show that models employing ICC and Temporal RoPE Interpolation substantially outperform alternatives, as measured by improvements in PSNR, reductions in FVD, higher dynamic degree (natural motion), and superior aesthetic/imaging quality. Human evaluation statistics indicate that ICC-based completions are preferred across tasks, with strengths in preserving identity and event continuity even under sparsely conditioned scenarios.
Dense conditioning does incur a compute cost, since concatenating multiple context tokens lengthens the attention sequence, but the resulting gains in quality and controllability remain substantial.
6. Future Directions and Extensions
Potential future developments identified include:
- Data-centric enhancements, such as incorporating zero-padded conditioning during model pre-training to improve VAE robustness.
- Hybrid mechanisms that balance computational efficiency and fine-grained control, for instance token pruning or selective encoding.
- Broader application to domains such as video conferencing (recovery from frame loss), advanced video editing, and long-form synthesis.
Architectural innovation areas include further refinement of RoPE-based temporal alignment and investigation into latent ambiguity solutions that do not require retraining or growth in model parameters.
7. Significance Within the Broader Landscape
By enabling rigorous, unified evaluation across arbitrary spatio-temporal completion tasks, VideoCanvasBench advances benchmarking for controllable video synthesis beyond the constraints of prior frameworks. The benchmark’s quantitative and perceptual criteria, together with its technical innovations (parameter-free ICC and RoPE temporal alignment), establish a new standard for evaluating flexible video generation. Plausibly, the paradigm proposed by VideoCanvasBench will inform future architectural and data-centric research directions in video generative modeling, and may serve as a reference for application-driven benchmarks requiring creative and context-aware video completion.