VideoScience-Bench: Scientific Video Evaluation
- VideoScience-Bench is a large-scale benchmark that rigorously assesses scientific reasoning in video generative models using composite prompts.
- It integrates multiple undergraduate concepts from physics and chemistry to ensure videos accurately reflect experimental phenomena.
- The benchmark employs expert annotations and an automated evaluation suite to provide detailed, multi-dimensional performance metrics.
VideoScience-Bench is a large-scale, multi-dimensional benchmark developed to rigorously evaluate scientific reasoning and understanding in video generative models, with an explicit focus on zero-shot generation of undergraduate-level physical and chemical phenomena. Unlike prior video benchmarks constrained to physical commonsense or general perceptual quality, VideoScience-Bench requires models to integrate and reason over multiple scientific concepts, ensuring that generated videos not only exhibit strong visual fidelity but also manifest correct scientific outcomes. It combines meticulously curated composite prompts, exhaustive evaluation criteria, expert annotation, and an automated judge harness, setting a new standard for assessing scientific reasoning in video generation models (Hu et al., 2 Dec 2025).
1. Motivation and Scope
The primary motivation for VideoScience-Bench is the inadequacy of prior video-generation benchmarks, such as PhyGenBench and VideoPhy, which predominantly contain scenarios solvable with high-school-level intuition and focus on physical commonsense (e.g., falling objects, mirror reflections, bouncing balls). These tasks do not probe models' capacity to handle scenarios requiring genuine scientific theory integration at the undergraduate curriculum level (Hu et al., 2 Dec 2025). The goal is to force video-generation models to solve composite scientific scenarios: each prompt demands the integration of at least two distinct principles (e.g., combining Snell's law and diffusion for laser refraction in a gradient medium), requiring non-trivial scientific reasoning for correct generation.
VideoScience-Bench consists of 200 curated prompts: 160 text-to-video (T2V) and 40 image-to-video (I2V). The prompts cover 14 undergraduate topics spanning 9 physics and 5 chemistry areas, annotated with 103 granular concepts mapped to respective subfields such as Classical Mechanics, Optics, Redox Reactions, and Reaction Kinetics. Each prompt encodes a full experimental setup and expected phenomenon, designed to be unsolvable by superficial or commonsense heuristics.
2. Benchmark Construction and Prompt Design
Each prompt in VideoScience-Bench comprises a well-specified textual description of the scientific scenario, a constrained apparatus, and the desired phenomenon. Prompts are intentionally designed to invoke at least two scientific principles, deterring models from succeeding via template matching or memorized physical priors (Hu et al., 2 Dec 2025). For instance:
- Curved Refraction Gradient: “A transparent water tank is filled with sugar water layers of different concentrations. A laser beam enters at one side.” Expected phenomenon: “The beam bends at each interface (Snell’s law) and traces a smooth curve through the density gradient.”
- Aluminum–Iodine Reaction: The reaction is triggered only upon water addition, requiring models to capture redox reaction dynamics.
- Rotating Cups with Balls: Two joined cups containing balls are spun, testing whether the centrifugal effect is correctly realized.
Table: Topic and Concept Coverage
| High-Level Topic | #Prompts | Representative Concepts |
|---|---|---|
| Classical Mechanics | — | Newton’s Laws, Centrifugal Force |
| Optics | — | Refraction, Diffusion, Snell’s Law |
| Thermodynamics | — | Heat Transfer, Specific Heat Capacity |
| Redox Reactions | — | Catalyst Effect, Spontaneity |
| ... | — | ... |
Each I2V prompt parallels a T2V scenario and provides a first-frame image that serves as the setup cue; reference videos are attached to the I2V split to support reference-based evaluation.
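The benchmark's own reference-based scoring runs through the VideoScience-Judge harness described in Section 3. As a minimal, hypothetical illustration of what a frame-level reference comparison could look like, the sketch below samples frames from a generated and a reference video and compares CLIP image embeddings; the checkpoint name, sampling scheme, and cosine-similarity aggregation are illustrative assumptions, not the benchmark's protocol.

```python
# Hypothetical sketch of a frame-level reference comparison (NOT the official
# VideoScience-Bench protocol): sample frames from two videos, embed them with
# CLIP, and average the per-frame cosine similarity.
import cv2
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed checkpoint, for illustration
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def sample_frames(path: str, num_frames: int = 8) -> list[Image.Image]:
    """Uniformly sample RGB frames from a video file."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

def embed(frames: list[Image.Image]) -> torch.Tensor:
    """Return L2-normalized CLIP image embeddings, one row per frame."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def reference_similarity(generated: str, reference: str) -> float:
    """Mean cosine similarity between time-aligned sampled frames."""
    gen, ref = embed(sample_frames(generated)), embed(sample_frames(reference))
    n = min(len(gen), len(ref))
    return float((gen[:n] * ref[:n]).sum(dim=-1).mean())

# Example usage (paths are placeholders):
# print(reference_similarity("generated.mp4", "reference.mp4"))
```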
3. Evaluation Methodology and Metrics
Videos generated in response to the prompts are scored along five exhaustive, science-grounded dimensions, each critical for assessing scientific reasoning fidelity (Hu et al., 2 Dec 2025):
- Prompt Consistency (PCS): Conformity of the video setup and protocol to the textual description.
- Phenomenon Congruency (PCG): Presence and correctness of the intended scientific effect.
- Correct Dynamism (CDN): Adherence to secondary physical laws (motion consistency, force realism).
- Immutability (IMB): Objects remain unchanged when no transformation is expected.
- Spatio-Temporal Coherence (STC): Frame transition smoothness—absence of flicker, teleportation, or identity swaps.
All dimensions are ordinally rated by domain experts on a 1–4 Likert scale (1 = absent/contradictory, 4 = clearly correct). For quantitative comparison, Spearman’s ρ and Kendall’s τ are computed to correlate automated scores with human judgments.
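To make the correlation protocol concrete, the sketch below computes Spearman's ρ and Kendall's τ between per-video human ratings and automated scores. Only the five dimension abbreviations come from the rubric above; the score values, record layout, and the choice of SciPy are illustrative assumptions.

```python
# Sketch of correlating automated scores with human 1-4 Likert ratings
# (illustrative data; only the five dimension names come from the benchmark).
from scipy.stats import kendalltau, spearmanr

DIMENSIONS = ["PCS", "PCG", "CDN", "IMB", "STC"]

# Hypothetical per-video scores for one dimension across several videos.
human_scores = [3.5, 2.0, 4.0, 1.5, 3.0, 2.5]        # expert Likert means (1-4)
judge_scores = [0.88, 0.41, 0.97, 0.22, 0.73, 0.55]  # automated judge outputs

rho, rho_p = spearmanr(human_scores, judge_scores)
tau, tau_p = kendalltau(human_scores, judge_scores)
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3f})")
print(f"Kendall tau  = {tau:.2f} (p = {tau_p:.3f})")
```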
For automation at scale, the VideoScience-Judge harness is deployed. It combines a large vision–language model (GPT-5 pro backbone) with CV modules (e.g., Grounding DINO, RAFT, ByteTrack, CLIP4Clip) and a checklist-based evaluation template. This approach provides per-dimension evidence and explanations, with scores min–max normalized for standardization. Correlation between VideoScience-Judge and human scores reaches τ = 0.90 and ρ = 0.96, substantially surpassing other automated baselines.
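The full VideoScience-Judge pipeline (VLM backbone plus CV modules) is not reproduced here; the snippet below only sketches the min–max normalization step applied to per-dimension scores. The function name and the raw values are hypothetical.

```python
# Minimal sketch of min-max normalizing per-dimension judge scores across models
# (raw values are hypothetical; only the normalization step is described above).
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Map raw scores to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # degenerate case: all models scored identically
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}

# Hypothetical raw PCG judge scores for several models:
raw_pcg = {"model_a": 2.56, "model_b": 1.26, "model_c": 1.90}
print(min_max_normalize(raw_pcg))
# {'model_a': 1.0, 'model_b': 0.0, 'model_c': 0.492...}
```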
4. Dataset, Access, and Annotation Process
- Prompt and Phenomenon Data: The fully open-sourced prompt collection (T2V and I2V splits), expected-result descriptions, corresponding first-frame images, and reference videos (I2V) are provided in structured JSON formats (a hypothetical record is sketched after this list).
- Expert Annotation: Eight graduate-level researchers author prompts and act as primary annotators; another expert group rates generations, with all videos processed independently by at least two domain experts per dimension.
- Evaluation Toolkit: The VideoScience-Judge full codebase and prompt templates, along with CV pipeline integration scripts, are released for reproducibility and future benchmarking (Hu et al., 2 Dec 2025).
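The released JSON schema itself is not reproduced in this summary; the record below is a hypothetical illustration of the kind of fields a prompt entry could carry. All field names and paths are assumptions, while the prompt and phenomenon text are borrowed from the refraction example in Section 2.

```python
# Hypothetical prompt record (field names and paths are illustrative, not the
# released schema), serialized the way the benchmark's structured JSON might look.
import json

prompt_record = {
    "prompt_id": "optics_gradient_refraction_001",  # assumed identifier
    "split": "I2V",                                 # "T2V" or "I2V"
    "topic": "Optics",
    "concepts": ["Refraction", "Snell's Law", "Diffusion"],
    "prompt": ("A transparent water tank is filled with sugar water layers of "
               "different concentrations. A laser beam enters at one side."),
    "expected_phenomenon": ("The beam bends at each interface (Snell's law) and "
                            "traces a smooth curve through the density gradient."),
    "first_frame_image": "images/optics_gradient_refraction_001.png",  # I2V only
    "reference_video": "videos/optics_gradient_refraction_001.mp4",    # I2V only
}

print(json.dumps(prompt_record, indent=2))
```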
5. Model Performance and Failure Analysis
Across seven state-of-the-art video models (Sora-2, Veo-3, Seedance-1.0-Pro, Kling-v2.5-Turbo-Pro, Hailuo-2.3, Ray2, Wan-2.5-T2V-Preview), performance is systematically lower on scientific reasoning tasks compared to generic video benchmarks. For instance:
- Sora-2 (best performer): PCS 3.32/4 (83%), PCG 2.56/4 (64%), CDN 3.33/4 (83%), IMB 3.73/4 (93%), STC 3.71/4 (93%) (percentages give each mean score as a fraction of the 4-point maximum).
- Phenomenon Congruency remains the most challenging dimension: Sora-2 achieves only 64%, and other models score substantially lower (e.g., Ray2 reaches PCG = 1.26/4, roughly 31%).
Common error modes include:
- Omission or mis-rendering of key scientific effects (e.g., missed liquid layering, failed Newton's cradle swing).
- Setup inconsistencies or oversimplification—prompt compliance is frequently violated in multi-object or compositional tasks.
- While identity preservation (IMB) scores are typically high (above 0.8, presumably on the normalized scale), correct dynamism (CDN) lags; many models accurately avoid object swaps but break subtle physical motion constraints.
VideoScience-Judge enables rapid discovery of these error classes, surfacing the bottleneck in scientific reasoning and process fidelity, not visual realism.
6. Comparison with Related Benchmarks
VideoScience-Bench directly addresses the crucial scientific reasoning gap left by VBench (Huang et al., 2023), VBench++ (Huang et al., 20 Nov 2024), and domain-agnostic metrics. These earlier benchmarks decompose video quality into sixteen perceptual and compositional dimensions (e.g., Subject Consistency, Motion Smoothness, Frame-Wise Quality, Semantics, Style), validated by large-scale human preference annotation (Huang et al., 20 Nov 2024). However, none require explicit mechanism-level reasoning combining multiple undergraduate scientific concepts, nor do their prompt suites target experimental outcome realization under controlled manipulation.
VBench++ introduces a four-axis trustworthiness component (culture, gender, skin-tone fairness, safety) and a high-quality, adaptive aspect-ratio I2V suite. VideoScience-Bench, by contrast, focuses on scientific reasoning fidelity and causal understanding, with evaluation dimensions tailored to protocol adherence and scientific effect realization. A plausible implication is that, while VBench++ provides compositional and stylistic probe coverage, VideoScience-Bench is better positioned to drive fundamental advances in models capable of simulating and reasoning about real-world scientific laws (Hu et al., 2 Dec 2025; Huang et al., 20 Nov 2024).
7. Impact and Outlook
VideoScience-Bench establishes a new benchmark paradigm, demanding video models generate outputs rooted in mechanistic scientific understanding, not generic physical plausibility or perceptual quality. It exposes significant gaps in current model capabilities—even high-fidelity generators often fail at true scientific process execution and compositional reasoning. The explicit scoring along five scientific attributes, combined with the VideoScience-Judge automation stack, supports reproducible, large-scale evaluation.
Future directions include augmenting the benchmark with real-time experiment control scenarios, integrating more granular quantitative physical evaluation, and expanding to additional scientific domains. A plausible implication is that, as models are fine-tuned on such benchmarks, advances will propagate toward physically grounded video synthesis, automated scientific documentation, and intelligent experimental simulation (Hu et al., 2 Dec 2025). The open availability of prompts, annotation code, and evaluation tools is designed to catalyze reproducible progress and establish VideoScience-Bench as a central infrastructure for the field.