VAP-Data: Unified Paired Video Dataset
- VAP-Data is a large-scale, unified paired video dataset designed to enable semantic-controlled video generation, with paired samples annotated across 100 diverse semantic conditions.
- The dataset supports general-purpose, controllable video synthesis by leveraging in-context generation with a diffusion-based VAP model architecture.
- Experiments reveal robust zero-shot performance and high user preference rates, establishing VAP-Data as a benchmark for semantic video editing and style transfer.
VAP-Data is a large-scale, unified paired video dataset developed to advance semantic-controlled video generation. It acts as the primary training resource for the Video-As-Prompt (VAP) model, which reframes video synthesis as an in-context generation task using a reference video as a direct semantic prompt. In contrast to prior approaches that rely on domain-specific data or pixel-level conditioning, VAP-Data enables the construction of general-purpose, controllable video generation models that can handle a diverse range of semantics. Its organization across 100 semantic conditions with over 100,000 paired video samples provides both scale and diversity, facilitating strong generalization and robust zero-shot performance for highly varied downstream video generation tasks (Bian et al., 23 Oct 2025).
1. Purpose and Motivation
VAP-Data was curated to address shortcomings in existing semantic video control datasets, which are often limited to structure-based controls (e.g., depth or flow maps) or specialized conditional scenarios. In structure-controlled methods, the conditioning modality enforces direct pixel-to-pixel priors with the target video—applicable for structural guidance but fundamentally incapable of capturing broader semantic transformations such as style, concept, or camera motion. In semantic-controlled generation, these pixelwise alignment constraints are inappropriate and restrict generalization. Existing methods typically suffer from fragmentation, requiring custom models or finetuning for each distinct condition, leading to non-unified systems unable to perform well under semantic variation.
VAP-Data was developed as the first large-scale, unified resource enabling single-model semantic control across concept, style, motion, and camera dynamics. The collection of over 100K high-quality video pairs, covering 100 named semantic conditions, addresses the need for standardization and comprehensive coverage, laying the foundation for unified, generalizable control in the video generation domain (Bian et al., 23 Oct 2025).
2. Dataset Structure and Semantic Condition Taxonomy
VAP-Data consists of more than 100,000 paired videos, each annotated with one of 100 semantic control conditions. The conditions are organized in four principal categories, with further subgroups:
| Category | Subcategory | Example Conditions |
|---|---|---|
| Concept | Entity Transformation | "get thinner", "hair swap", "cartoon doll" |
| Concept | Entity Interaction | "emoji figure", "couple walk", "gun shooting" |
| Style | — | "Ghibli", "american comic", "painting", "simpsons comic" |
| Motion | Human Motion Transfer | "shake it dance", "walking forward" |
| Motion | Non-human Motion | "spin360", "balloon flyaway", "paper fall" |
| Camera Movement | — | "dolly effect", "hitchcock zoom", "orbit", "zoom in/out" |
Each entry is a (reference video, target video) pair: the reference video expresses the desired semantic (e.g., a particular style or action), while the target video realizes the intended semantic transformation. The semantic condition is specified using industry API labels or community LoRA definitions, and each condition typically has several thousand paired samples.
This organization enables training on diverse transformations, including abstract concept transfers (e.g., transforming a human into a cartoon character), style application, motion pattern transfer, and camera operation simulation. The variety and size of VAP-Data allow for robust learning of generalized semantic correspondence between reference and target videos in a paired setting (Bian et al., 23 Oct 2025).
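As a concrete illustration, the sketch below shows one way a VAP-Data entry could be represented in code. The field names, manifest layout, and file conventions are assumptions made for exposition, not a published schema.

```python
# Hypothetical representation of one (reference, target) pair with its
# semantic condition label; field names and manifest format are illustrative.
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class VAPDataSample:
    condition: str         # one of the ~100 semantic conditions, e.g. "hitchcock zoom"
    category: str          # Concept / Style / Motion / Camera Movement
    reference_video: Path  # video expressing the desired semantic
    target_video: Path     # video realizing the semantic transformation
    caption: str           # textual caption paired with the sample


def load_manifest(manifest_path: str) -> list[VAPDataSample]:
    """Read a JSON-lines manifest into typed samples (hypothetical layout)."""
    samples = []
    with open(manifest_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            samples.append(
                VAPDataSample(
                    condition=record["condition"],
                    category=record["category"],
                    reference_video=Path(record["reference_video"]),
                    target_video=Path(record["target_video"]),
                    caption=record["caption"],
                )
            )
    return samples
```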
3. Methodological Integration in Video-As-Prompt Framework
In the VAP framework, VAP-Data is used to train a unified video generation model using in-context conditioning. Each data sample is encoded as follows:
- Both the reference and the target video are passed through a variational autoencoder (VAE) to obtain latent token sequences.
- Textual captions associated with each sample are concatenated to the visual tokens, enabling multimodal context for generation.
- The core generation backbone is a frozen Video Diffusion Transformer (DiT), responsible for mapping Gaussian-noised latent video representations to denoised ones.
- To inject semantic structure from the reference video, a plug-and-play Mixture-of-Transformers (MoT) expert is added—a trainable module that processes the latent tokens of the reference video and exchanges information with each pre-trained DiT layer via full cross-attention.
The combination of latent in-context token sequences, temporally biased Rotary Position Embedding (RoPE), and the MoT expert ensures that the generative process is guided by the abstract semantic properties of the prompt video without requiring direct pixel-wise alignment. This methodology leverages the diversity of semantic conditions in VAP-Data, allowing robust and contextually aware conditional generation (Bian et al., 23 Oct 2025).
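The following minimal PyTorch sketch illustrates the conditioning pattern described above: a frozen backbone block paired with a trainable expert block that injects reference-video semantics through cross-attention. Module names, dimensions, and the single-layer scope are simplifying assumptions for exposition, not the released VAP implementation.

```python
# Minimal sketch: frozen DiT-style block + trainable expert block exchanging
# information via cross-attention (illustrative, not the official code).
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Plain pre-norm self-attention + MLP block (stands in for one DiT layer)."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class MoTConditionedLayer(nn.Module):
    """Frozen backbone block plus trainable expert block with cross-attention exchange."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.backbone = TransformerBlock(dim, heads)   # pretrained DiT layer (kept frozen)
        self.expert = TransformerBlock(dim, heads)     # trainable MoT expert
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_norm = nn.LayerNorm(dim)
        for p in self.backbone.parameters():           # freeze the backbone pathway
            p.requires_grad_(False)

    def forward(self, target_tokens, ref_tokens):
        ref_tokens = self.expert(ref_tokens)           # process reference-video latents
        target_tokens = self.backbone(target_tokens)   # frozen denoising pathway
        # Inject reference semantics: target tokens attend to the expert's outputs.
        q = self.cross_norm(target_tokens)
        target_tokens = target_tokens + self.cross_attn(
            q, ref_tokens, ref_tokens, need_weights=False
        )[0]
        return target_tokens, ref_tokens


# Usage: latent token sequences for the noisy target video and the reference video.
layer = MoTConditionedLayer()
tgt = torch.randn(2, 256, 512)   # (batch, target tokens, dim)
ref = torch.randn(2, 256, 512)   # (batch, reference tokens, dim)
out, _ = layer(tgt, ref)
print(out.shape)                 # torch.Size([2, 256, 512])
```

This mirrors the plug-and-play design noted above: only the expert and cross-attention parameters receive gradients, so the pretrained backbone's behavior is preserved. In the full model, this exchange occurs at every DiT layer.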
4. Technical Specifications and Learning Objective
Training employs a diffusion-based loss defined as the mean-squared deviation between the network’s velocity prediction and the ground truth:
$$\mathcal{L} = \mathbb{E}_{x_0,\, x_1,\, t}\Big[\big\| v_\Theta(x_t, t, C) - \big(x_0 - (1 - \sigma_{\min})\, x_1\big)\big\|_2^2\Big]$$

where $v_\Theta$ corresponds to the stacked DiT–MoT architecture, $x_t$ is the noisy interpolation at timestep $t$, $x_0$ and $x_1$ are the less-noisy and noisy video latents, $\sigma_{\min}$ is a noise floor parameter, and $C$ denotes the in-context conditioning, including both video and text. The temporally biased RoPE ensures the system attends to semantic, not pixel, correspondences between prompt and target videos.
The MoT branch’s involvement at every DiT layer, with information flow via cross-attention, is central to effective semantic transfer. This architecture prevents catastrophic forgetting and spurious pixel-level priors, allowing efficient retrieval and imposition of abstract video semantics (Bian et al., 23 Oct 2025).
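For concreteness, a minimal sketch of this velocity-prediction objective is given below, assuming a linear interpolation schedule with noise floor $\sigma_{\min}$; the model call signature is a placeholder for the stacked DiT–MoT network, not the paper's exact training code.

```python
# Sketch of the flow-matching / velocity-prediction loss stated above,
# assuming a linear schedule with noise floor sigma_min (illustrative).
import torch


def velocity_loss(model, x_clean, condition, sigma_min: float = 1e-4):
    """MSE between the predicted velocity and the interpolation's ground-truth velocity."""
    noise = torch.randn_like(x_clean)                        # fully noisy latent (x_1)
    t = torch.rand(x_clean.shape[0], device=x_clean.device)  # per-sample timestep in (0, 1)
    t_ = t.view(-1, *([1] * (x_clean.dim() - 1)))
    # Interpolate between the noisy and less-noisy latents, keeping noise above the floor.
    x_t = (1.0 - (1.0 - sigma_min) * t_) * noise + t_ * x_clean
    target_velocity = x_clean - (1.0 - sigma_min) * noise    # d x_t / d t
    pred_velocity = model(x_t, t, condition)                 # stacked DiT-MoT forward pass
    return torch.mean((pred_velocity - target_velocity) ** 2)
```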
5. Benchmark Results and Empirical Analysis
Experiments on VAP-Data underpin the unified training of the VAP model, which outperforms existing open-source semantic-controlled video generators. Performance is measured via standard alignment and perceptual metrics, including CLIP score (for text-to-video relevance), motion smoothness, dynamic degree, aesthetic metrics, and a dedicated semantic alignment score assessed using Gemini-2.5-pro.
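As an example of the first metric, the sketch below computes a frame-averaged CLIP text-video relevance score with the Hugging Face transformers library; the checkpoint choice, frame sampling, and averaging scheme are generic assumptions, not the paper's exact evaluation protocol.

```python
# Generic frame-averaged CLIP text-video relevance score (illustrative only).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_text_video_score(frames: list[Image.Image], caption: str) -> float:
    """Cosine similarity between the caption and each sampled frame, averaged over frames."""
    inputs = processor(text=[caption], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()
```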
Notably, in a human preference study involving video generation researchers, the VAP model, trained solely on VAP-Data, achieved a 38.7% user preference rate, a result that rivals leading commercial condition-specific models. The model demonstrates strong compositional generalization (i.e., synthesizing new semantic transformations not explicitly seen during training) and supports zero-shot applications, an attribute directly enabled by the breadth and diversity of VAP-Data (Bian et al., 23 Oct 2025).
6. Applications and Future Research Considerations
VAP-Data catalyzes a range of novel downstream applications:
- Semantic video editing: Changing core concepts (e.g., human to cartoon) or applying motion/camera manipulations without pixel-level mapping.
- Style transfer and rendering: Applying one of several artistic or visual themes to arbitrary source videos.
- Motion transfer: Transposing movement patterns across both human and non-human actors or objects.
- Camera dynamics imitation: Generating videos reflecting cinematic effects such as dolly-zoom, orbit, or rapid focus shifts.
Limitations include possible inherited biases from template-driven synthetic generation (the dataset is built using commercial APIs and LoRA models), and the need for further expansion with more real-world data and improved captioning schemes. Enhancements such as moving beyond template-based captions toward instruction-driven formats may further improve alignment and control, suggesting promising directions for dataset and framework extension.
VAP-Data thus serves as both a comprehensive resource for current semantic control research and as a launchpad for future advancements in unified, open-domain, and creative video generation (Bian et al., 23 Oct 2025).