
VANS-Data-100K: Video Next-Event Benchmark

Updated 25 November 2025
  • VANS-Data-100K is a large-scale benchmark dataset for video next-event prediction, offering 100K multi-modal triplets combining visual, textual, and dynamic video answers.
  • The dataset employs a rigorous multi-stage pipeline—including shot splitting, clip selection, and manual verification—to ensure high-quality, clear action segments.
  • Evaluation protocols using metrics such as BLEU, ROUGE, and FVD, along with reinforcement learning fine-tuning, underscore its impact on advancing generative video models.

VANS-Data-100K is a large-scale benchmark dataset specifically designed for the task of Video-Next-Event Prediction (VNEP), which extends Next-Event Prediction from textual to video answers. Comprising 100,000 multi-modal triplets—each consisting of an input video, an instructional or predictive question, and an answer video paired with text—VANS-Data-100K enables systematic study and benchmarking of models that generate dynamic video responses conditioned on both visual context and language guidance. The dataset supports procedures such as reinforcement learning fine-tuning and structured evaluation across both procedural and predictive event classes, foregrounding visual and semantic consistency in generative video modeling (Cheng et al., 20 Nov 2025).

1. Data Collection and Construction Pipeline

VANS-Data-100K is sourced from a diverse array of public datasets and online video resources, targeting both procedural and predictive scenarios. Of the 100,000 samples, 30,000 are procedural (e.g., instructional steps), and 70,000 are predictive (e.g., forecasting physical outcomes). Sources include YouCook2 (9,000), COIN (21,000), Video-Holmes (10,000), ActivityNet (20,000), V1-33K (10,000), and a curated set of 30,000 YouTube videos representing complex, varied event dynamics.

The pipeline follows a multi-stage process:

  1. Shot splitting: Raw videos are segmented into event-level clips, isolating contiguous action.
  2. Clip selection: Event segments are filtered to ensure sufficient clarity, resolution (minimum 352×640 px after resizing), and focus on a single dominant action and camera shot.
  3. Question-Answer generation: Instructional or predictive questions are programmatically or manually associated with each input clip, along with ground-truth answer videos and corresponding reference text based on the subsequent event.
  4. Manual verification: A subset of 1,000 high-quality triplets is reserved as a human-verified set to facilitate reinforcement learning fine-tuning and robust evaluation.

All samples must meet strict filtering criteria: adequate resolution, clear human-action focus, and a ground-truth answer validated for both visual and semantic consistency. Manual validation ensures that this reserved set of 1,000 triplets is of especially high quality and utility for benchmarking (Cheng et al., 20 Nov 2025).
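A minimal sketch of the clip-level filter implied by these criteria is shown below. The `clip` attributes (width, height, shot and action counts) are illustrative assumptions, not the benchmark's released pipeline interface.

```python
# Hypothetical sketch of the clip-selection filter; the `clip` attributes
# used here are illustrative assumptions, not the released pipeline API.

TARGET_RES = (352, 640)  # fixed resolution after resizing (see above)

def keep_clip(clip) -> bool:
    """Return True if a candidate event segment passes the stated filters."""
    short, long = sorted((clip.width, clip.height))
    if short < min(TARGET_RES) or long < max(TARGET_RES):
        return False                 # too small to resize to 352×640 cleanly
    if clip.num_shots != 1:
        return False                 # must be a single camera shot
    if clip.num_dominant_actions != 1:
        return False                 # one clearly visible dominant action
    return True
```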

2. Dataset Composition and Characteristics

VANS-Data-100K includes two major scenario categories:

  • Procedural (N_proc = 30,000): Tasks requiring accurate depiction of sequential actions (YouCook2, COIN).
  • Predictive (N_pred = 70,000): Tasks emphasizing physical prediction, future state reasoning, or activity forecasting (Video-Holmes, ActivityNet, V1-33K, YouTube).

Video Statistics:

All videos are provided at a fixed resolution of 352×640 pixels. The average input clip length is 9.43 seconds, while the target “next event” answer video averages 3.76 seconds and is standardized to 33 frames for model input and evaluation. The frame rate is not explicitly specified; models may re-encode or subsample frames as needed.
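Because answer videos are fixed at 33 frames while source clips vary in length and frame rate, a uniform resampling step of the following kind is plausible; the exact policy is not stated in the source.

```python
import numpy as np

TARGET_FRAMES = 33  # fixed answer-video length used by the benchmark

def standardize_frames(frames: np.ndarray) -> np.ndarray:
    """Uniformly resample a (T, H, W, C) frame array to exactly 33 frames.

    Uniform index selection (with repeated indices for very short clips)
    is one plausible choice; the source does not specify the policy.
    """
    t = frames.shape[0]
    idx = np.linspace(0, t - 1, TARGET_FRAMES).round().astype(int)
    return frames[idx]
```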

Text Modality:

Each triplet includes a paired textual question, split into procedural (“What is the next cooking step?”) and predictive (“What happens next?”) templates. Ground-truth answers use a “reason-then-answer” format to support multi-modal reasoning, although specific statistics on caption length and vocabulary size are not detailed in the source. All answers are evaluated for semantic correctness and visual relevance to the input video context.
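The on-disk record format is not described in the source; the example below is a purely hypothetical illustration of one predictive triplet, showing how the question template and a “reason-then-answer” caption might be stored (all field names, paths, and caption text are invented).

```python
# Hypothetical illustration of a single VANS-Data-100K triplet; field
# names, paths, and the caption text are invented for this sketch.
example_triplet = {
    "input_video": "clips/predictive/000123_input.mp4",    # ~9 s context clip
    "question": "What happens next?",                       # predictive template
    "answer_video": "clips/predictive/000123_answer.mp4",   # 33-frame ground truth
    "answer_text": (
        "Reason: the skater approaches the rail at speed. "
        "Answer: the skater grinds along the rail and lands back on the ramp."
    ),
    "category": "predictive",  # or "procedural"
}
```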

3. Preprocessing, Formatting, and Features

Preprocessing standardizes audiovisual data and provides precomputed features for efficient model training and inference:

  • Clip Segmentation and Filtering: Automated shot detection is followed by curation, excluding segments with multiple concurrent actions or suboptimal quality.
  • Resizing and Encoding: All video content is resized to 352×640 and encoded as MP4/H.264.
  • Frame Sampling and Tokenization: For certain models (e.g., vision-language transformers or diffusion models), six reference frames are sampled and tokenized via a Variational Autoencoder (VAE) to yield discrete latent representations.
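A sketch of this frame-sampling and VAE-encoding step is given below. The benchmark does not name the VAE, so a generic Stable Diffusion image VAE from diffusers stands in purely for illustration; the actual video diffusion model (Wan-2.1) ships its own video VAE.

```python
# Sketch of sampling six reference frames and encoding them with a VAE.
# The specific VAE is an assumption: a generic Stable Diffusion image VAE
# is used here for illustration only (the actual model's VAE may differ).
import numpy as np
import torch
from diffusers import AutoencoderKL

def encode_reference_frames(frames: np.ndarray, num_refs: int = 6) -> torch.Tensor:
    """frames: (T, H, W, 3) uint8 array; returns latents for 6 sampled frames."""
    idx = np.linspace(0, frames.shape[0] - 1, num_refs).round().astype(int)
    refs = torch.from_numpy(frames[idx]).permute(0, 3, 1, 2).float()
    refs = refs / 127.5 - 1.0                       # scale pixels to [-1, 1]
    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
    with torch.no_grad():
        latents = vae.encode(refs).latent_dist.sample()
    return latents                                   # (6, 4, H/8, W/8)
```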

Input/Output Schema:

  • Input: raw video plus textual question
  • Output: answer video (33 frames, 352×640 px) plus textual caption

Precomputed Features:

Beyond the sampled reference frames and VAE latents described above, no additional per-sample metadata (such as bounding boxes or dense action labels) is provided, focusing the benchmark strictly on event prediction and generative competence (Cheng et al., 20 Nov 2025).

4. Benchmark Tasks and Evaluation Methodology

The principal benchmark task enabled by VANS-Data-100K is Video-Next-Event Prediction (VNEP). Given an input video X and question q, the model must generate a “next event” video Y and an interpretable intermediate caption s (which serves as a bridge in VLM→VDM orchestration).

Metrics used for evaluation:

  • BLEU@n: n-gram matching for the generated captions, penalized for brevity (BLEU@4 is reported).
  • ROUGE-L: Longest common subsequence F-measure between generated and ground-truth captions.
  • Fréchet Video Distance (FVD): Compares distributional statistics (means and covariances) of features from real and generated video samples:

$$\mathrm{FVD} = \|m_r - m_g\|^2 + \mathrm{Tr}\!\left(C_r + C_g - 2\,(C_r C_g)^{1/2}\right)$$

where $(m_r, C_r)$ and $(m_g, C_g)$ are the means and covariances of the real and generated feature distributions, respectively.

  • CLIP-V: Measures visual consistency by averaging the cosine similarity between CLIP image embeddings of each generated frame and its true counterpart.
  • CLIP-T: Assesses semantic consistency between the CLIP text embedding of the generated caption and the video embedding of the generated output.

These metrics jointly capture linguistic accuracy, visual fidelity, and semantic alignment.
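A direct implementation of the FVD formula above is sketched below; it assumes feature matrices have already been extracted with a pretrained video backbone (I3D is the conventional choice, though the source does not restate which extractor is used).

```python
import numpy as np
from scipy import linalg

def fvd(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """FVD between (N, D) feature matrices of real and generated videos."""
    m_r, m_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    c_r = np.cov(real_feats, rowvar=False)
    c_g = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(c_r @ c_g)      # matrix square root of C_r C_g
    if np.iscomplexobj(covmean):           # numerical noise can introduce
        covmean = covmean.real             # tiny imaginary components
    return float(np.sum((m_r - m_g) ** 2) + np.trace(c_r + c_g - 2.0 * covmean))
```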
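CLIP-V and CLIP-T can likewise be sketched as below, using the public openai/clip-vit-base-patch32 checkpoint as a stand-in; the paper's exact CLIP variant and frame-matching details are not restated in the source.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Stand-in CLIP backbone; the benchmark's exact variant is an assumption.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_v(gen_frames, real_frames) -> float:
    """Mean cosine similarity between matched generated/ground-truth frames."""
    g = processor(images=gen_frames, return_tensors="pt")
    r = processor(images=real_frames, return_tensors="pt")
    with torch.no_grad():
        eg = model.get_image_features(**g)
        er = model.get_image_features(**r)
    return torch.nn.functional.cosine_similarity(eg, er, dim=-1).mean().item()

def clip_t(caption: str, gen_frames) -> float:
    """Cosine similarity between the caption embedding and the mean
    embedding of the generated frames."""
    txt = processor(text=[caption], return_tensors="pt", padding=True)
    img = processor(images=gen_frames, return_tensors="pt")
    with torch.no_grad():
        et = model.get_text_features(input_ids=txt["input_ids"],
                                     attention_mask=txt["attention_mask"])
        ev = model.get_image_features(**img).mean(dim=0, keepdim=True)
    return torch.nn.functional.cosine_similarity(et, ev, dim=-1).mean().item()
```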

5. Baseline Results and Comparative Analysis

A range of generative video modeling baselines and recent models have been evaluated on VANS-Data-100K for both procedural and predictive settings. Representative results for the procedural task are summarized as follows:

| Model | BLEU@4 ↑ | ROUGE-L ↑ | FVD ↓ | CLIP-V ↑ | CLIP-T ↑ |
|---|---|---|---|---|---|
| Video-GPT | – | – | 105.32 | 0.7334 | 0.1997 |
| Omni-Video | 0.0008 | 0.1075 | 236.38 | 0.6293 | 0.2323 |
| Gemini-FilmWeaver | 0.0215 | 0.2802 | 110.54 | 0.7102 | 0.2773 |
| VANS (SFT) | 0.0233 | 0.2812 | 85.34 | 0.7655 | 0.3202 |
| VANS (Joint-GRPO) | 0.0987 | 0.3631 | 78.32 | 0.8021 | 0.3824 |

For predictive benchmarks, VANS (Joint-GRPO) achieves BLEU@4 = 0.0694, ROUGE-L = 0.3058, FVD = 86.85, CLIP-V = 0.7872, CLIP-T = 0.3759, indicating substantial improvement over prior models in both standard and reinforcement learning-augmented configurations.

6. Protocols, Design Choices, and Downloadable Artifacts

Supervised Fine-Tuning (SFT):

  • Vision-LLM (VLM): Qwen2.5-VL-3B with LoRA (rank = 8, α = 32), learning rate 5×10⁻⁵, trained for 10,000 steps.
  • Video Diffusion Model (VDM): Wan-2.1-1.3B, full fine-tuning, learning rate 5×10⁻⁵, trained for 20,000 steps.
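A minimal sketch of the stated VLM LoRA configuration using the peft library follows; the target modules and any other unlisted hyperparameters are assumptions, since only rank, alpha, learning rate, and step counts are given in the source.

```python
# Sketch of the VLM LoRA setup with the stated hyperparameters.
# target_modules is an assumption; only rank, alpha, learning rate,
# and step counts are specified in the source.
from peft import LoraConfig

VLM_LR, VLM_STEPS = 5e-5, 10_000      # Qwen2.5-VL-3B, LoRA fine-tuning
VDM_LR, VDM_STEPS = 5e-5, 20_000      # Wan-2.1-1.3B, full fine-tuning

lora_cfg = LoraConfig(
    r=8,                                  # LoRA rank
    lora_alpha=32,                        # LoRA scaling factor
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
```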

Joint-GRPO Post-Training:

  • Stage 1 (VLM co-steering): 800 steps, learning rate 5×10⁻⁵, reward function combines final, text, and vision rewards equally.
  • Stage 2 (VDM adaptation): 1,000 steps, anchors with ROUGE-L ≥ 0.6, reward is a weighted sum of vision and CLIP rewards.
  • KL cost β = 0.004, reward clipping of 1×10⁻³, group size G = 8 samples per prompt.
  • All models are required to output exactly 33-frame videos at 352×640 px for consistency in benchmarking.
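The reward bookkeeping described above can be sketched as follows; the individual reward functions, the Stage-2 weighting, and the precise role of the clipping constant are assumptions beyond what the source states.

```python
# Hedged sketch of the Joint-GRPO reward bookkeeping. Only the equal
# Stage-1 weighting, the ROUGE-L >= 0.6 anchor rule, beta, the clipping
# constant, and the group size are stated; the rest is assumed.
import numpy as np

BETA = 0.004          # KL cost coefficient
CLIP_CONST = 1e-3     # reward clipping constant (exact role assumed here)
GROUP_SIZE = 8        # G rollouts per prompt
ANCHOR_ROUGE_L = 0.6  # Stage-2 anchor threshold on caption ROUGE-L

def stage1_reward(r_final: float, r_text: float, r_vision: float) -> float:
    """Stage 1 (VLM co-steering): equal combination of the three rewards."""
    return (r_final + r_text + r_vision) / 3.0

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantages: normalize rewards within one group of G rollouts."""
    assert rewards.shape[0] == GROUP_SIZE
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```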

Code and downloading instructions are available at https://github.com/KlingTeam/VANS (Cheng et al., 20 Nov 2025).

7. Significance and Limitations

VANS-Data-100K addresses a previously under-explored need for large-scale, multi-modal benchmarks in event-conditional video generation—especially for tasks demanding both visual and linguistic reasoning across procedural and predictive domains. Its rigorous curation pipeline, category balance, and state-of-the-art evaluation metrics systematically support robust empirical assessment and improvement of next-event generative video models.

A plausible implication is that the dataset may accelerate research into instruction-following, physical commonsense reasoning, and multi-modal answer generation, as traditional text-only NEP fails to capture the richness of many real-world tasks. The absence of detailed caption statistics and the lack of explicit per-sample metadata limit certain forms of structured scene or action analysis, suggesting a potential avenue for future dataset enrichment.

VANS-Data-100K thus constitutes a pivotal infrastructure for training and evaluating vision-language and video generative models, defining baseline performance and stimulating methodological advances in video-as-answer paradigms (Cheng et al., 20 Nov 2025).
