
SciVBench: Multimodal Science Video Benchmark

Updated 26 November 2025
  • SciVBench is a large-scale benchmark for multimodal scientific video assessment featuring 500 expert-verified QA pairs across physics, chemistry, and daily phenomena.
  • It utilizes rigorous expert verification and chain-of-thought protocols to ensure accurate evaluation of scientific reasoning and educational content generation.
  • Empirical results show that SciEducator outperforms leading models, establishing SciVBench as a critical tool in advancing AI for science education.

SciVBench is a large-scale, expert-verified benchmark for multimodal scientific video understanding and education, developed to evaluate the ability of AI systems to interpret and teach scientific concepts from video content. It provides a rigorous, literature-grounded test suite targeting scientific reasoning, knowledge grounding, and educational content generation for physical, chemical, and everyday science phenomena as depicted in videos. SciVBench serves as the central evaluation resource for the SciEducator multi-agent system introduced in the same work, but its design and construction principles have general implications for AI-based educational assessment and video comprehension (Xu et al., 22 Nov 2025).

1. Construction and Content Organization

SciVBench comprises 500 science question–answer (QA) pairs, each paired with a short scientific video carefully selected or curated from major educational platforms and science video repositories. Coverage spans three categories:

  • Physics experiments (54 videos, 160 QA pairs)
  • Chemistry experiments (54 videos, 148 QA pairs)
  • Daily life phenomena (103 videos, 192 QA pairs)

Each QA pair is annotated and verified by at least two domain experts, with a third expert adjudicating disagreements. All items are grounded strictly in the visual video content (neither subtitles nor narration are accessible to evaluators or systems), so solutions require evidence-based stepwise reasoning from video frames alone.

Five QA types are defined to span core science educational and cognitive processes:

  1. Terminology: Defining scientific terms visible in context
  2. Principle: Explaining the underlying physical or chemical laws depicted
  3. Prediction: Anticipating the outcome of counterfactual or parameterized changes within the video scene
  4. Reading: Interpreting observed data, such as graphs or numeric readouts shown in the video
  5. Design: Proposing feasible experimental modifications or related investigations

This structure ensures the benchmark evaluates not only object recognition and basic captioning but rigorous interpretation and reasoning specific to science curricula.
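The dataset's internal item format is not specified in this overview, but the category and QA-type structure above suggests a straightforward record layout. The following Python sketch is purely illustrative: the class and field names are assumptions, not the released SciVBench schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record layout for one SciVBench QA pair.
# Field names are illustrative assumptions, not the official format.
@dataclass
class SciVBenchItem:
    video_id: str            # identifier of the source science video
    domain: str              # "physics", "chemistry", or "daily_life"
    qa_type: str             # "terminology", "principle", "prediction", "reading", or "design"
    question: str            # question grounded in the visual content only
    reference_answer: str    # expert-authored reference solution
    rationale_steps: List[str] = field(default_factory=list)  # chain-of-thought steps used for grading
```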

2. Methodology and Expert Verification

Grounding in professional scientific knowledge and didactic rigor is achieved by the following protocol:

  • Item Authoring: Two domain experts independently author QA items and reference solutions for each video, cross-referencing relevant scientific literature and experimental manuals.
  • Adjudication: A third expert reviews for scientific validity, reasoning soundness, and pedagogical clarity, correcting or eliminating ambiguous or ill-posed items.
  • Reference Consistency: All QA content is cross-checked against published sources, ensuring alignment with canonical scientific explanations and real experimental procedures.

The design intentionally excludes reliance on audio tracks, forcing both human annotators and AI systems to extract relevant features through image and video analysis alone. This design uniquely stresses multimodal visual reasoning over text-literal matching.
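Because audio and subtitles are withheld, any system evaluated on SciVBench has to reason from sampled frames. The snippet below is a minimal frame-sampling sketch using OpenCV; the two-second stride and the hand-off of frames to a downstream multimodal model are assumptions for illustration, not part of the benchmark specification.

```python
import cv2  # OpenCV, used here only for video decoding

def sample_frames(video_path: str, every_n_seconds: float = 2.0):
    """Uniformly sample frames from a video for visual-only analysis.

    The 2-second stride is an arbitrary illustrative choice; SciVBench
    does not prescribe a sampling policy.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(fps * every_n_seconds), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)  # BGR array; would be passed to a multimodal model
        idx += 1
    cap.release()
    return frames
```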

3. Evaluation Protocol and Metrics

Evaluation using SciVBench occurs across two primary tasks:

(A) Scientific Video Understanding

System outputs (answer texts) are scored along two axes:

  • Relevance: Graded 0/0.5/1, measuring the pertinence and specificity to the targeted scientific subdomain.
  • Accuracy: Graded 0/0.5/1, measuring factual correctness and logic of the provided reasoning, referenced to expert-authored “chain-of-thought” rationales.

Aggregated results are reported by domain (physics, chemistry, daily life) and averaged over all five QA types.
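The per-item 0/0.5/1 grades aggregate naturally into the percentage-scale figures reported in Section 4. A minimal sketch of such an aggregation is shown below; the tuple layout of the input is a hypothetical convenience, not the official evaluation harness.

```python
from collections import defaultdict

def aggregate_by_domain(scores):
    """Average per-item grades (0, 0.5, or 1) by domain on a 0-100 scale.

    `scores` is assumed to be a list of (domain, qa_type, relevance, accuracy)
    tuples; this layout is illustrative only.
    """
    buckets = defaultdict(lambda: {"relevance": [], "accuracy": []})
    for domain, _qa_type, rel, acc in scores:
        buckets[domain]["relevance"].append(rel)
        buckets[domain]["accuracy"].append(acc)
    return {
        domain: {metric: 100 * sum(vals) / len(vals) for metric, vals in metrics.items()}
        for domain, metrics in buckets.items()
    }

# Example: two physics items graded as (relevance, accuracy)
print(aggregate_by_domain([("physics", "principle", 1.0, 0.5),
                           ("physics", "reading", 0.5, 0.5)]))
# {'physics': {'relevance': 75.0, 'accuracy': 50.0}}
```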

(B) Educational Content Generation

For a subset of 40 videos, models must produce instructional procedures and safety precautions. Evaluation is conducted with an advanced multimodal LLM (Qwen-VL-Plus), scoring outputs on:

  • Relevance
  • Instruction quality (IQ)
  • Attractiveness
  • Educational value (EV)

Scores are reported as win rates versus baseline and state-of-the-art systems.
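A win rate here is the fraction of per-video pairwise comparisons that the candidate system wins against a given baseline, as judged by the evaluator model. The sketch below assumes one judge verdict per video and counts ties as half a win; that tie convention is an assumption, not necessarily the paper's.

```python
def win_rate(verdicts):
    """Percentage of pairwise comparisons won by the candidate system.

    `verdicts` is a list of "win" / "tie" / "loss" strings, one per compared
    video. Counting a tie as half a win is an illustrative convention.
    """
    score = sum(1.0 if v == "win" else 0.5 if v == "tie" else 0.0 for v in verdicts)
    return 100 * score / len(verdicts)

# Example over a hypothetical 40-video generation subset
print(win_rate(["win"] * 31 + ["tie"] * 2 + ["loss"] * 7))  # 80.0
```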

4. Empirical Results and Benchmarks

A detailed quantitative comparison is provided in (Xu et al., 22 Nov 2025). SciEducator demonstrates superior performance over leading closed-source multimodal LLMs (MLLMs) and open multi-agent video agents:

Model               | Physics Rel | Physics Acc | Chem Rel | Chem Acc | Daily Rel | Daily Acc
GPT-4o              | 47.50       | 34.69       | 39.86    | 31.42    | 30.73     | 27.86
Gemini 2.0 Flash    | 52.81       | 38.75       | 46.96    | 36.15    | 34.64     | 31.25
VideoAgent          | 49.06       | 36.56       | 45.61    | 34.80    | 30.47     | 27.34
SciEducator (Ours)  | 81.88       | 65.31       | 73.97    | 64.86    | 64.58     | 62.24

(Rel = Relevance, Acc = Accuracy; higher is better.)

For educational procedure generation (40-video subset), SciEducator exhibits dominant win rates (all metrics ≥77.5%), evidencing both knowledge integration and multimodal presentation capabilities.

5. Design Significance and Research Impact

By enforcing expert verification, literature grounding, and chain-of-thought referencing, SciVBench advances evaluation beyond simple visual classification or superficial QA. It provides the first comprehensive testbed purpose-built for multimodal LLMs and video agents in science education domains. The coverage of multiple domains (physical, chemical, and everyday phenomena), varied cognitive QA types, and strict visual grounding enables robust benchmarking of AI systems' scientific reasoning and educational ability.

SciVBench has thus become a reference for iterative, feedback-driven multi-agent system development, robust multimodal benchmarking, and AI-mediated science instruction evaluation.

6. Open Challenges and Future Directions

Current limitations of SciVBench include coverage primarily of middle-school-level physics and chemistry, exclusion of advanced or rare domains, and restriction to the visual modality alone. Planned extensions include:

  • Integration of biology and earth sciences
  • Multilingual QA sets
  • Addition of multimodal references (e.g., integrating accessible audio/narrative cues)
  • Item generation for real-time or streaming scenarios
  • Construction of adaptive benchmarking protocols for personalized learning evaluation

Resource optimization and fact-grounded hallucination control in system outputs, as well as agent policy learning for tool invocation, remain core tasks for follow-up research leveraging SciVBench (Xu et al., 22 Nov 2025).
