VideoKR Corpus: Enhancing Video Reasoning
- VideoKR is a large-scale training corpus for multimodal video reasoning, containing 315,537 QA examples derived from expert-curated videos across diverse academic disciplines.
- It employs a human-in-the-loop, skill-oriented generation pipeline with progressive reasoning depth and integrated chain-of-thought rationales to enhance model performance.
- The VideoKR-Eval benchmark minimizes shortcut guessing by reducing single-frame answerability, ensuring genuine multi-frame, knowledge-intensive video analysis.
VideoKR is a large-scale training corpus developed to strengthen knowledge- and reasoning-intensive video understanding in multimodal LLMs (MLLMs). It comprises 315,537 video reasoning examples curated from 145,000 newly collected, Creative Commons–licensed expert-domain videos. VideoKR introduces a human-in-the-loop, skill-oriented example generation pipeline, stratifies data for progressive reasoning depth, and integrates comprehensive chain-of-thought (CoT) rationales. VideoKR-Eval, a dedicated evaluation benchmark, is built so that test items require genuine video understanding and cannot readily be answered from text alone or from a single frame. Under a standard pipeline of supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO), models post-trained on VideoKR improve on knowledge-intensive video reasoning (for Qwen2.5-VL-7B-Instruct, the benchmark average rises from 41.9% to 46.6%) and remain competitive on general video reasoning benchmarks (Fu et al., 3 Jun 2026).
1. Corpus Composition and Domain Coverage
VideoKR is constructed from 145,000 CC-licensed videos (average duration 344.1 seconds) sampled across 82 undergraduate subjects within four primary disciplines: Natural Sciences, Engineering, Healthcare, and Humanities & Social Sciences. Each video is associated with “knowledge points” curated into a four-layer hierarchical Domain Knowledge Bank (Subject → Course → Lecture → Knowledge Point), totaling 63,745 knowledge points.
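The four-layer hierarchy described above can be pictured as a small nested data structure. The following is a minimal sketch only: the class names, field names, and the toy entries are illustrative assumptions and do not reflect the released data schema.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Lecture:
    title: str
    knowledge_points: List[str] = field(default_factory=list)


@dataclass
class Course:
    title: str
    lectures: List[Lecture] = field(default_factory=list)


@dataclass
class Subject:
    name: str
    discipline: str  # one of the four primary disciplines
    courses: List[Course] = field(default_factory=list)


def iter_knowledge_points(subjects: List[Subject]) -> Iterator[Tuple[str, str, str, str]]:
    """Yield every (subject, course, lecture, knowledge point) path in the bank."""
    for subject in subjects:
        for course in subject.courses:
            for lecture in course.lectures:
                for point in lecture.knowledge_points:
                    yield (subject.name, course.title, lecture.title, point)


# Toy bank with invented entries, for illustration only.
bank = [
    Subject(
        name="Physics",
        discipline="Natural Sciences",
        courses=[
            Course(
                title="Classical Mechanics",
                lectures=[
                    Lecture(
                        title="Newton's Laws",
                        knowledge_points=["Second law: F = ma", "Free-body diagrams"],
                    )
                ],
            )
        ],
    )
]

paths = list(iter_knowledge_points(bank))
```

Flattening the tree into paths like this is one convenient way to attach a target knowledge point to a video while keeping its subject, course, and lecture context.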
A total of 315,537 question–answer (QA) examples are generated, divided into:
- VideoKR-SFT-201K for supervised fine-tuning (SFT)
- VideoKR-RL-114K for reinforcement learning (RL)
Videos are retrieved through scenario-guided search on YouTube and filtered for domain relevance and licensing. Examples are stratified into three skill categories:
- Basic Video Reasoning (VidR): perceptual and temporal reasoning
- Knowledge-enhanced Video Perception (KnowVid): domain-grounded object/term recognition
- Knowledge-Intensive Video Reasoning (KnowVidR): multi-hop inference integrating visual evidence with external knowledge
2. Human-in-the-Loop Skill-Oriented Example Generation
The annotation process employs a skill-oriented, human-in-the-loop pipeline to ensure increasing reasoning depth and high-quality CoT rationales:
- Seed Construction: 1,800 expert-curated seed examples (150 per skill per discipline), each thoroughly reviewed and supplemented with detailed CoT traces.
- Scaled Example Generation: For each video, frames are uniformly sampled at 0.2 fps. Frontier MLLMs such as GPT-5.2 and Claude-4.5-Sonnet are prompted with three in-domain, in-skill seed examples and a target knowledge point to generate two QA examples per skill.
- Example Validation and Filtering: Three major steps eliminate flawed examples:
- Self-Consistency Verification (model must reproduce original answer given generated question and frames)
- Video Dependency Filtering (using InternVL3.5-38B and Qwen3-VL-32B-Instruct, examples solved with single-frame context are discarded)
- CoT Rationale Validation (an independent verifier MLLM checks step-level evidence against the video and standard knowledge; unsupported reasoning is pruned)
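The sampling rate and the three-stage filter above can be sketched as follows. This is a simplified illustration, not the authors' code: the field names (`regenerated_answer`, `solved_from_single_frame`, `cot_steps_supported`) are hypothetical stand-ins for the outputs of the model calls, and the CoT stage is reduced to a pass/fail check even though the source describes pruning unsupported reasoning.

```python
from typing import Callable, Dict, List, Tuple


def sample_timestamps(duration_s: float, fps: float = 0.2) -> List[float]:
    """Uniformly spaced timestamps in seconds: one frame every 1/fps seconds."""
    step = 1.0 / fps
    n_frames = int(duration_s // step) + 1  # include the frame at t = 0
    return [i * step for i in range(n_frames)]


Example = Dict[str, object]
Check = Callable[[Example], bool]


def filter_examples(
    examples: List[Example], checks: List[Tuple[str, Check]]
) -> Tuple[List[Example], Dict[str, int]]:
    """Keep examples that pass every check; count drops at the first failing stage."""
    kept: List[Example] = []
    dropped = {name: 0 for name, _ in checks}
    for example in examples:
        for name, check in checks:
            if not check(example):
                dropped[name] += 1
                break
        else:
            kept.append(example)
    return kept, dropped


# Hypothetical checks that read precomputed fields instead of calling models.
checks = [
    ("self_consistency", lambda ex: ex["regenerated_answer"] == ex["answer"]),
    ("video_dependency", lambda ex: not ex["solved_from_single_frame"]),
    ("cot_validation", lambda ex: bool(ex["cot_steps_supported"])),
]

toy_examples = [
    {"answer": "B", "regenerated_answer": "B",
     "solved_from_single_frame": False, "cot_steps_supported": True},
    {"answer": "A", "regenerated_answer": "C",
     "solved_from_single_frame": False, "cot_steps_supported": True},
    {"answer": "D", "regenerated_answer": "D",
     "solved_from_single_frame": True, "cot_steps_supported": True},
]

kept, dropped = filter_examples(toy_examples, checks)
timestamps = sample_timestamps(344.1)  # the average video duration reported above
```

At 0.2 fps a video of average length (344.1 seconds) contributes 69 frames, one every 5 seconds starting at t = 0.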
Skill stratification (from VidR to KnowVidR) and human oversight in final model selection (≤3% error rate on a 100-instance pilot) mitigate over-reliance on specific model biases, fostering dataset diversity and reliability.
3. Annotation Format, CoT Rationales, and Quality Assurance
Annotation includes both multiple-choice and open-ended QA pairs. VideoKR-SFT examples retain validated CoT rationales, formatted as:
<think> … step-by-step reasoning grounded in frames and knowledge … </think>
<answer> … final answer … </answer>
VideoKR-RL examples retain only Q/A pairs for RL-style reward calculation. Quality control comprises:
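A response in this tagged format can be split into its rationale and final answer with a regular expression, and the extracted answer can then feed a reward. The sketch below is an assumption-laden illustration: the source does not specify the reward function, so the case-insensitive exact-match rule and the function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches one <think>...</think> block followed by one <answer>...</answer> block.
_RESPONSE_PATTERN = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL
)


def parse_response(text: str) -> Optional[Tuple[str, str]]:
    """Return (rationale, answer), or None if the response is malformed."""
    match = _RESPONSE_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def exact_match_reward(response: str, gold_answer: str) -> float:
    """Hypothetical reward: 1.0 for a well-formed response whose answer matches."""
    parsed = parse_response(response)
    if parsed is None:
        return 0.0
    _, answer = parsed
    return 1.0 if answer.lower() == gold_answer.strip().lower() else 0.0


sample = "<think> The beaker turns blue after heating. </think> <answer> B </answer>"
parsed_sample = parse_response(sample)
reward_ok = exact_match_reward(sample, "b")
reward_malformed = exact_match_reward("B", "B")
```

Returning zero for malformed responses is one simple way to make the reward depend on format as well as correctness; open-ended answers would likely need a softer matching rule than exact match.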
- Manual audit of 800 random SFT examples: 52 (6.5%) were flagged as solvable without visual input, and 32 (4.0%) contained CoT trace errors (17 affecting the answer, 15 making unsupported claims). These error rates are equivalent to those of the initial seed curation.
- Decontamination to ensure no overlap between train/eval sets, involving the removal of 131 videos with duplicate evaluation YouTube IDs plus 877 near-duplicates detected with frame-level perceptual hashing.
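Near-duplicate detection with perceptual hashing can be sketched as below. The source does not say which hash or threshold was used, so the average-hash variant, the 4x4 toy frames, and the `max_distance` value are all assumptions chosen for illustration.

```python
from typing import List


def average_hash(gray: List[List[int]]) -> int:
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean.

    `gray` is a small grayscale grid, e.g. a frame downsampled to 8x8.
    """
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def is_near_duplicate(h1: int, h2: int, max_distance: int = 5) -> bool:
    """Treat two frames as near-duplicates if their hashes differ in few bits."""
    return hamming(h1, h2) <= max_distance


# Toy 4x4 frames: same layout at two brightness levels, plus a mirrored frame.
frame_a = [[10, 10, 200, 200] for _ in range(4)]
frame_b = [[20, 20, 210, 210] for _ in range(4)]
frame_c = [[200, 200, 10, 10] for _ in range(4)]

hash_a, hash_b, hash_c = (average_hash(f) for f in (frame_a, frame_b, frame_c))
```

Because each bit records only whether a pixel is above the frame mean, a uniform brightness shift leaves the hash unchanged, which is why this family of hashes catches re-encoded or re-uploaded copies that exact ID matching misses.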
4. VideoKR-Eval: Knowledge- and Reasoning-Intensive Benchmark
VideoKR-Eval was designed to remedy the shortcut solvability present in prior benchmarks (VideoMMMU, MMVU, SciVideoBench), which exhibited single-frame answerability rates above 35%.
Construction proceeds as follows:
- Multi-Model Single-Frame Filtering: Of 2,900 initial examples, only those unsolved by three state-of-the-art models (Qwen3-VL-235B, Claude-4.5-Sonnet, GPT-5.2) after three single-frame trials are retained (1,254 examples).
- Expert Re-annotation: From the 1,646 items removed by this filter, domain experts produce 746 new, visually grounded QA pairs that require continuous video analysis.
The final VideoKR-Eval consists of 2,000 examples (multiple-choice and open-ended): the 1,254 retained items plus the 746 re-annotated ones. Single-frame answerability is reduced to approximately 10% (versus 35–49% in legacy datasets), yielding an evaluation set that rewards genuine temporal video reasoning and knowledge-intensive inference rather than single-frame shortcuts.
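The retention rule and the resulting counts can be checked with a short sketch. The interpretation that an item is kept only if no model answers correctly in any of its single-frame trials is an assumption about the filter's exact semantics, and the model keys are placeholders.

```python
from typing import Dict, List


def unsolved_by_all(trial_results: Dict[str, List[bool]]) -> bool:
    """True if no model answered correctly in any single-frame trial."""
    return not any(any(trials) for trials in trial_results.values())


# Toy trial outcomes for two items (three models, three trials each).
hard_item = {
    "model_a": [False, False, False],
    "model_b": [False, False, False],
    "model_c": [False, False, False],
}
leaky_item = {
    "model_a": [False, False, False],
    "model_b": [False, True, False],  # one lucky single-frame hit is enough to drop it
    "model_c": [False, False, False],
}

# Bookkeeping with the counts reported in the text.
initial = 2900
retained = 1254
filtered_out = initial - retained
re_annotated = 746
final_size = retained + re_annotated
```

The arithmetic matches the reported figures: 1,646 items fail the filter, and the 1,254 retained items plus 746 expert re-annotations give the final 2,000.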
5. Training and Evaluation Methodology
Training on VideoKR follows a two-stage pipeline:
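- Supervised Fine-Tuning (SFT): One epoch on VideoKR-SFT-201K (batch size 32, learning rate 1e-5) with the token-level negative log-likelihood loss, which in its standard form is
$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(v, q, y)}\left[\sum_{t=1}^{|y|} \log \pi_\theta\left(y_t \mid v, q, y_{<t}\right)\right]$$
- Group Relative Policy Optimization (GRPO): One epoch on VideoKR-RL-114K (batch size 32, learning rate 5e-6), maximizing the clipped, KL-regularized objective, which in its standard form is
$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{(v, q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\left(r_{i,t}\,\hat{A}_i,\; \mathrm{clip}\left(r_{i,t},\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i\right) - \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)\right]$$
where $v$ is the video, $q$ the question, $y$ the target response (CoT rationale and answer), $\{o_i\}_{i=1}^{G}$ a group of $G$ sampled responses with rewards $R_i$, $r_{i,t} = \pi_\theta(o_{i,t} \mid v, q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid v, q, o_{i,<t})$ the token-level probability ratio, $\hat{A}_i = (R_i - \mathrm{mean}(R_1, \dots, R_G)) / \mathrm{std}(R_1, \dots, R_G)$ the group-relative advantage, $\epsilon$ the clipping range, $\beta$ the KL penalty coefficient, and $\pi_{\mathrm{ref}}$ is the SFT checkpoint.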
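The distinguishing step of GRPO is that each sampled response is scored relative to the other responses in its group, with no learned value function. The sketch below illustrates that normalization and the clipped surrogate term for a single ratio. The clipping value 0.2, the use of the population standard deviation, and the small stabilizing constant are illustrative assumptions, not settings reported for VideoKR.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float) -> float:
    """PPO-style clipped surrogate for one probability ratio and advantage."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Four sampled responses to one question: two correct (reward 1), two wrong (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A ratio above 1 + clip_eps earns no extra credit for a positive advantage.
capped_gain = clipped_term(ratio=1.5, advantage=1.0, clip_eps=0.2)
# A ratio below 1 - clip_eps still takes the full penalty for a negative advantage.
kept_penalty = clipped_term(ratio=0.5, advantage=-1.0, clip_eps=0.2)
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason a corpus with mixed-difficulty questions is useful for this stage.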
Evaluation spans seven benchmarks—Video-MME, MVBench, LongVideoBench (general); VideoMMMU, MMVU, SciVideoBench, and VideoKR-Eval (knowledge-intensive)—with standardized prompting (temperature 0.1, three runs, mean reported) utilizing the LMMs-Eval framework.
6. Empirical Results and Data Characteristics
Post-training with VideoKR under the SFT+RL regimen yields the following for Qwen2.5-VL-7B-Instruct:
- Knowledge-intensive average: from 41.9% to 46.6% (+4.7), including +4.8 (MMVU) and +8.5 (VideoKR-Eval)
- General video reasoning remains competitive: from 64.1% to 65.5%
Performance increases monotonically with the number of input frames; on knowledge-intensive tasks, for example, accuracy rises from 44.2% (16 frames) to 46.6% (128 frames).
Ablation results indicate:
- Skill Composition: VidR only yields 41.4% (knowledge-intensive average), whereas VidR+KnowVid+KnowVidR yields 42.4%
- CoT Supervision: direct output supervision achieves 39.4%, while CoT yields 42.4% (+3.0 gain)
- Corpus Comparison: Of SFT-only runs, only VideoKR-SFT improves over the base (42.4% vs. 41.9%); prior corpora including Video-R1, OneThinker, and VideoRFT are ≤39%. RL-only: VideoKR-RL achieves 43.0%, slightly above VideoAuto-R1 at 42.7%
Zero-shot accuracy of Qwen2.5-VL-7B on 3,000 random QA examples is 45.8–57.1% for other corpora but 39.2% for VideoKR, indicating that VideoKR presents a more challenging distribution, which is consistent with its larger post-training improvements.
7. Significance and Implications
VideoKR’s combination of expert-guided, multi-stage example generation, progressive skill stratification, and stringent validation yields a challenging, diverse corpus for knowledge- and reasoning-intensive video understanding. Its integration of chain-of-thought rationales, reduced susceptibility to shortcut exploitation, and empirical gains over prior datasets indicate that data design (specifically, the alignment of annotation format, validation procedures, and stratified skill coverage) is a key driver of progress in video reasoning models. The reduction in single-frame answerability in VideoKR-Eval to approximately 10%, compared with 35–49% in legacy benchmarks, further underscores its contribution to robust evaluation of video reasoning and temporal understanding in advanced MLLMs (Fu et al., 3 Jun 2026).