VideoCanvasBench: Spatio-Temporal Video Benchmark
- VideoCanvasBench is a benchmarking framework that evaluates video generative models under arbitrary spatio-temporal conditioning.
- It is used to evaluate techniques such as In-Context Conditioning with Temporal RoPE Interpolation, which enhance temporal alignment and output fidelity.
- The framework supports varied tasks such as image-to-video, inpainting, and creative scene transitions while addressing compute and quality trade-offs.
VideoCanvasBench is a benchmarking framework developed to systematically evaluate arbitrary spatio-temporal video completion. Unlike traditional benchmarks that rely on rigid or single-frame conditioning scenarios, VideoCanvasBench enables the assessment of video generative models under conditions where user-specified spatial patches or full frames may be applied at any timestamp and location within a video sequence. This approach unifies diverse controllable video generation tasks—including image-to-video, inpainting, extension, interpolation, and creative cross-scene transitions—within a cohesive evaluation paradigm, providing comprehensive coverage for both intra-scene fidelity and inter-scene creativity.
1. Conceptual Foundation and Motivation
The core motivation behind VideoCanvasBench is to address the limitations of existing evaluations for video generative models, which typically operate only within fixed conditioning paradigms, such as first-frame conditioning or rigid inpainting masks. VideoCanvasBench enables "painting" on a video canvas—i.e., placing arbitrary content patches at any location and any timestamp within a video sequence. The benchmark's scope comprehensively covers tasks including:
- AnyP2V (Any-Timestamp Patch-to-Video): Sparse spatial patch conditions at selected anchors.
- AnyI2V (Any-Timestamp Image-to-Video): Complete frames supplied at non-fixed timestamps.
- AnyV2V (Video-to-Video): Full video-level scenarios including inpainting, outpainting, and creative inter-scene transitions.
This generalized framework directly challenges video models to maintain temporal consistency, spatial coherence, and semantic integrity across arbitrary conditioning schemes.
2. Task Formulation and Categories
In VideoCanvasBench, a video sequence is formalized as $X = \{x_t\}_{t=1}^{T}$, and the set of user-specified conditions is given as:

$$\mathcal{C} = \{(p_i, m_i, t_i)\}_{i=1}^{N},$$

where $p_i$ is the content patch, $m_i$ is its spatial mask, and $t_i$ is the conditioning timestamp. The benchmark defines the objective for arbitrary spatio-temporal completion as generating a video $\hat{X}$ satisfying:

$$m_i \odot \hat{x}_{t_i} = m_i \odot p_i \quad \text{for all } i,$$

with $\odot$ denoting the masked pixelwise operation. Benchmarked task categories include:
| Task Category | Conditioning Form | Purpose |
|---|---|---|
| AnyP2V | Sparse patches at anchors | Interpolation, spatially sparse completions |
| AnyI2V | Full frames at arbitrary times | Generalized image-to-video completion |
| AnyV2V | Video or multi-frame cues | Inpainting, outpainting, scene transition |
This structure facilitates controlled evaluation of both local reconstruction and global video coherence.
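To make the formulation concrete, the following Python sketch (illustrative only; the `Condition` class and `satisfies_conditions` helper are hypothetical names, not part of the benchmark's released code) encodes a set of spatio-temporal conditions and checks the masked pixelwise constraint:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Condition:
    """One user-specified condition: a content patch placed at a timestamp."""
    patch: np.ndarray   # condition content p_i, zero-padded to full frame size (H, W, 3)
    mask: np.ndarray    # binary spatial mask m_i, broadcastable to the frame shape
    timestamp: int      # conditioning timestamp t_i (pixel-frame index)

def satisfies_conditions(video: np.ndarray, conditions: list[Condition],
                         tol: float = 1e-3) -> bool:
    """Check the masked pixelwise constraint m_i * x_{t_i} == m_i * p_i."""
    for c in conditions:
        frame = video[c.timestamp]             # generated frame at t_i
        diff = c.mask * (frame - c.patch)      # error counted only inside the mask
        if np.abs(diff).max() > tol:
            return False
    return True
```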
3. Methodology and Evaluation Metrics
The experimental methodology leverages a latent video diffusion model with a Diffusion Transformer (DiT) backbone, fine-tuned for 20,000 steps on approximately 650,000 high-quality video clips, each 5 seconds long at a fixed resolution. Conditioning paradigms under comparison include Latent Replacement, Channel Concatenation, and the proposed In-Context Conditioning (ICC) with Temporal RoPE Interpolation.
Temporal RoPE Interpolation enables fractional assignment of conditional token positions within the latent video sequence. A condition supplied at pixel-frame timestamp $t_i$ is assigned the fractional latent position

$$\tau_i = \frac{t_i}{s},$$

where $s$ is the VAE's temporal stride, thus resolving the ambiguity induced by causal VAEs that compress multiple pixel frames into a single latent.
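As a rough illustration, the sketch below computes fractional temporal positions for conditioning tokens and the corresponding rotary angle tables, assuming a standard sinusoidal RoPE formulation (the `temporal_rope` helper and its defaults are assumptions, not the paper's exact implementation):

```python
import numpy as np

def fractional_positions(timestamps, stride):
    """Map pixel-frame timestamps t_i to fractional latent positions t_i / s."""
    return np.asarray(timestamps, dtype=np.float32) / stride

def temporal_rope(positions, dim, base=10000.0):
    """Rotary angle tables for (possibly fractional) temporal positions.

    Returns cos/sin arrays of shape (len(positions), dim // 2) that can be
    applied to the temporal axis of query/key tokens.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    angles = np.outer(positions, inv_freq)
    return np.cos(angles), np.sin(angles)

# Example: causal VAE with temporal stride 4; conditions at pixel frames 0, 10, 30.
pos = fractional_positions([0, 10, 30], stride=4)   # -> [0.0, 2.5, 7.5]
cos_tab, sin_tab = temporal_rope(pos, dim=64)
```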
Evaluation metrics comprise both automated and human-assessed criteria:
| Metric | Scope |
|---|---|
| PSNR | Pixel-level fidelity (conditioned regions) |
| Fréchet Video Distance (FVD) | Temporal and spatial perceptual similarity |
| Aesthetic Quality | Artistic and visual appeal |
| Imaging Quality | Artifact and distortion quantification |
| Temporal Coherence | Motion smoothness, intra-scene consistency |
| Dynamic Degree | Strength and intensity of motion |
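For the pixel-level criterion, a minimal sketch of PSNR restricted to conditioned regions (an illustrative formulation assuming frames normalized to [0, 1]; not necessarily the benchmark's exact implementation) is shown below:

```python
import numpy as np

def masked_psnr(generated: np.ndarray, reference: np.ndarray,
                mask: np.ndarray, max_val: float = 1.0) -> float:
    """PSNR computed only over pixels selected by a binary mask.

    generated, reference: frames of identical shape (H, W, C), values in [0, max_val].
    mask: binary array broadcastable to the frame shape; 1 marks a conditioned pixel.
    """
    weights = np.broadcast_to(mask, generated.shape).astype(np.float64)
    mse = np.sum(weights * (generated - reference) ** 2) / max(weights.sum(), 1.0)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```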
User studies further rate outputs along Visual Quality, Semantic Quality, and Overall Preference, with a test set exceeding 2,000 benchmark cases and 25 human evaluators.
4. Technical Innovations
VideoCanvasBench directly evaluates the efficacy of In-Context Conditioning (ICC), which concatenates independently encoded conditional tokens with the latent video inputs, requiring no new model parameters. This hybrid strategy decouples spatial placement, handled by zero-padding the condition onto a full-size canvas before encoding, from temporal alignment, handled by Temporal RoPE Interpolation.
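A minimal PyTorch sketch of these two steps (function names, tensor layouts, and the top-left placement default are illustrative assumptions, not the paper's implementation):

```python
import torch

def pad_condition_to_canvas(patch, mask, canvas_hw, top_left=(0, 0)):
    """Place a condition patch onto a zero-valued full-size canvas.

    patch: (C, h, w) pixel content; mask: (1, h, w) binary placement mask;
    canvas_hw: (H, W) full-frame resolution. Spatial placement is expressed
    purely through zero-padding, so the encoder sees a standard full-size frame.
    """
    C, h, w = patch.shape
    H, W = canvas_hw
    y, x = top_left
    canvas = torch.zeros(C, H, W)
    canvas[:, y:y + h, x:x + w] = patch * mask
    return canvas

def build_in_context_sequence(video_tokens, cond_tokens):
    """Concatenate independently encoded condition tokens with video tokens.

    video_tokens: (B, N_video, D) tokens of the (noisy) video latent.
    cond_tokens:  (B, N_cond, D) tokens of the zero-padded, separately encoded
                  condition frames. No new parameters are introduced; the DiT
                  simply attends over the longer joint sequence.
    """
    return torch.cat([cond_tokens, video_tokens], dim=1)
```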
The training objective for the DiT is a masked denoising loss of the form:

$$\mathcal{L} = \mathbb{E}_{z_0,\, \epsilon,\, t}\Big[ \big\| (1 - M) \odot \big( \epsilon_\theta(z_t, t, \mathcal{C}) - \epsilon \big) \big\|_2^2 \Big],$$

where $z_t$ encodes the combination of conditioned latent and noise, $M$ is the union of condition masks, and the loss is computed only over non-conditioned regions.
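For illustration, a simplified PyTorch sketch of such a masked objective, assuming an epsilon-prediction parameterization and a `noise_schedule` helper returning broadcastable $(\alpha_t, \sigma_t)$ factors (both assumptions, not the paper's exact formulation):

```python
import torch

def masked_denoising_loss(model, z0, cond_tokens, cond_mask, timesteps, noise_schedule):
    """Masked diffusion loss: supervise only non-conditioned video tokens.

    z0:          (B, N, D) clean video latent tokens.
    cond_tokens: (B, N_c, D) in-context condition tokens (kept noise-free).
    cond_mask:   (B, N) binary mask, 1 where a video token is conditioned.
    """
    noise = torch.randn_like(z0)
    alpha, sigma = noise_schedule(timesteps)              # broadcastable (B, 1, 1) factors
    z_t = alpha * z0 + sigma * noise                      # noised video latent
    pred = model(torch.cat([cond_tokens, z_t], dim=1), timesteps)
    pred = pred[:, cond_tokens.shape[1]:]                 # keep predictions for video tokens
    weight = (1.0 - cond_mask).unsqueeze(-1)              # zero out conditioned regions
    denom = (weight.sum() * z0.shape[-1]).clamp(min=1.0)
    return (weight * (pred - noise) ** 2).sum() / denom
```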
Ablation studies reveal that omitting RoPE Interpolation results in shifted PSNR peaks and temporal misalignment, whereas ICC with RoPE achieves precise alignment and high-fidelity motion. Compared to Latent Replacement (yielding static outputs) and Channel Concatenation (incurring artifacts), ICC consistently delivers superior perceptual and motion coherence.
5. Comparative Results and Analysis
Quantitative and qualitative analyses show that models employing ICC and Temporal RoPE Interpolation substantially outperform alternatives, as measured by improvements in PSNR, reductions in FVD, higher dynamic degree (natural motion), and superior aesthetic/imaging quality. Human evaluation statistics indicate that ICC-based completions are preferred across tasks, with strengths in preserving identity and event continuity even under sparsely conditioned scenarios.
Dense conditioning does incur a compute cost, since concatenating multiple context tokens lengthens the attention sequence, but the resulting gains in quality and controllability remain substantial.
6. Future Directions and Extensions
Potential future developments identified include:
- Data-centric enhancements, such as incorporating zero-padded conditioning during model pre-training to improve VAE robustness.
- Hybrid mechanisms that balance computational efficiency and fine-grained control, for instance token pruning or selective encoding.
- Broader application to domains such as video conferencing (recovery from frame loss), advanced video editing, and long-form synthesis.
Architectural innovation areas include further refinement of RoPE-based temporal alignment and investigation into latent ambiguity solutions that do not require retraining or growth in model parameters.
7. Significance Within the Broader Landscape
By enabling rigorous, unified evaluation across arbitrary spatio-temporal completion tasks, VideoCanvasBench advances benchmarking for controllable video synthesis beyond the constraints of prior frameworks. The benchmark’s quantitative and perceptual criteria, together with its technical innovations (parameter-free ICC and RoPE temporal alignment), establish a new standard for evaluating flexible video generation. Plausibly, the paradigm proposed by VideoCanvasBench will inform future architectural and data-centric research directions in video generative modeling, and may serve as a reference for application-driven benchmarks requiring creative and context-aware video completion.