
VideoKR Corpus: Enhancing Video Reasoning

Updated 9 June 2026
  • VideoKR is a large-scale training corpus for multimodal video reasoning, containing 315,537 QA examples derived from expert-curated videos across diverse academic disciplines.
  • It employs a human-in-the-loop, skill-oriented generation pipeline with progressive reasoning depth and integrated chain-of-thought rationales to enhance model performance.
  • The VideoKR-Eval benchmark minimizes shortcut guessing by reducing single-frame answerability, ensuring genuine multi-frame, knowledge-intensive video analysis.

VideoKR is a large-scale training corpus developed to strengthen knowledge- and reasoning-intensive video understanding in multimodal LLMs (MLLMs). It comprises 315,537 video reasoning examples curated from 145,000 newly collected, Creative Commons–licensed expert-domain videos. VideoKR introduces a human-in-the-loop, skill-oriented example generation pipeline, stratifies data for progressive reasoning depth, and integrates comprehensive chain-of-thought (CoT) rationales. VideoKR-Eval, a dedicated evaluation benchmark, is built so that test items require genuine video understanding and cannot be solved by shortcut guessing from the text or a single frame. Under a standard pipeline of supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO), models post-trained on VideoKR improve on knowledge-intensive video reasoning and remain competitive on general video reasoning benchmarks, which points to the dataset’s value for advancing video-centric AI (Fu et al., 3 Jun 2026).

1. Corpus Composition and Domain Coverage

VideoKR is constructed from 145,000 CC-licensed videos (average duration 344.1 seconds) sampled across 82 undergraduate subjects within four primary disciplines: Natural Sciences, Engineering, Healthcare, and Humanities & Social Sciences. Each video is associated with “knowledge points” curated into a four-layer hierarchical Domain Knowledge Bank (Subject → Course → Lecture → Knowledge Point), totaling 63,745 knowledge points.
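
The four-layer hierarchy described above can be pictured as a nested mapping. The following is a minimal sketch, not the authors' implementation: the class name `KnowledgeBank`, its methods, and the sample entries are all hypothetical and chosen only to illustrate the Subject → Course → Lecture → Knowledge Point layering.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBank:
    """Hypothetical four-layer hierarchy: Subject -> Course -> Lecture -> Knowledge Point."""

    # Nested dicts keyed by name; the innermost values are lists of knowledge points.
    tree: dict = field(default_factory=dict)

    def add(self, subject: str, course: str, lecture: str, point: str) -> None:
        """Insert one knowledge point, creating intermediate layers as needed."""
        lectures = self.tree.setdefault(subject, {}).setdefault(course, {})
        lectures.setdefault(lecture, []).append(point)

    def paths(self):
        """Yield every (subject, course, lecture, point) path in insertion order."""
        for subject, courses in self.tree.items():
            for course, lectures in courses.items():
                for lecture, points in lectures.items():
                    for point in points:
                        yield (subject, course, lecture, point)

    def count_points(self) -> int:
        """Total number of knowledge points across the whole bank."""
        return sum(1 for _ in self.paths())


# Illustrative entries only; these are not taken from the actual corpus.
bank = KnowledgeBank()
bank.add("Physics", "Classical Mechanics", "Rotational Dynamics", "Moment of inertia")
bank.add("Physics", "Classical Mechanics", "Rotational Dynamics", "Angular momentum")
bank.add("Physics", "Thermodynamics", "Entropy", "Second law of thermodynamics")
```

Under this layout, attaching a video to a knowledge point amounts to storing one of the four-element paths alongside the video record.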

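A total of 315,537 question–answer (QA) examples are generated, divided into two subsets: VideoKR-SFT (about 201K examples, which retain validated CoT rationales) and VideoKR-RL (about 114K examples, which keep only the Q/A pairs).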

Videos are retrieved through scenario-guided search on YouTube and filtered for domain relevance and licensing. Examples are stratified into three skill categories:

  • Basic Video Reasoning (VidR): perceptual and temporal reasoning
  • Knowledge-enhanced Video Perception (KnowVid): domain-grounded object/term recognition
  • Knowledge-Intensive Video Reasoning (KnowVidR): multi-hop inference integrating visual evidence with external knowledge

2. Human-in-the-Loop Skill-Oriented Example Generation

The annotation process employs a skill-oriented, human-in-the-loop pipeline to ensure increasing reasoning depth and high-quality CoT rationales:

  • Seed Construction: 1,800 expert-curated seed examples (150 per skill per discipline), each thoroughly reviewed and supplemented with detailed CoT traces.
  • Scaled Example Generation: For each video, frames are uniformly sampled at 0.2 fps. Frontier MLLMs such as GPT-5.2 and Claude-4.5-Sonnet are prompted with three in-domain, in-skill seed examples and a target knowledge point to generate two QA examples per skill.
  • Example Validation and Filtering: Three major steps eliminate flawed examples:
    • Self-Consistency Verification (model must reproduce original answer given generated question and frames)
    • Video Dependency Filtering (using InternVL3.5-38B and Qwen3-VL-32B-Instruct, examples solved with single-frame context are discarded)
    • CoT Rationale Validation (an independent verifier MLLM checks step-level evidence against the video and standard knowledge; unsupported reasoning is pruned)

Skill stratification (from VidR to KnowVidR) and human oversight in final model selection (≤3% error rate on a 100-instance pilot) mitigate over-reliance on specific model biases, fostering dataset diversity and reliability.
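
The sampling rate and the three filtering steps listed above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's actual code: the function names are hypothetical, the models are represented by plain callables, and the rule that an example is discarded when any single frame suffices is an assumption, since the source does not say which frames are tested.

```python
from typing import Callable, Sequence


def sample_timestamps(duration_s: float, fps: float = 0.2) -> list:
    """Uniform timestamps in seconds at the given rate, starting at t = 0."""
    step = 1.0 / fps  # 0.2 fps means one frame every 5 seconds
    count = int(duration_s // step) + 1
    return [i * step for i in range(count)]


def keep_example(
    question: str,
    answer: str,
    frames: Sequence[str],
    generator: Callable[[str, Sequence[str]], str],
    single_frame_solvers: Sequence[Callable[[str, str], str]],
    rationale_ok: Callable[[], bool],
) -> bool:
    """Return True only when the example passes all three checks."""
    # 1. Self-consistency: the generating model must reproduce its own answer.
    if generator(question, frames) != answer:
        return False
    # 2. Video dependency (assumed rule): discard the example if any solver
    #    gets the answer from any single frame.
    for solve in single_frame_solvers:
        if any(solve(question, frame) == answer for frame in frames):
            return False
    # 3. CoT validation: delegate to an independent verifier callable.
    return rationale_ok()
```

With this convention, a video of the average duration (344.1 seconds) yields 69 sampled timestamps, from 0 to 340 seconds.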

3. Annotation Format, CoT Rationales, and Quality Assurance

Annotation includes both multiple-choice and open-ended QA pairs. VideoKR-SFT examples retain validated CoT rationales, formatted as:

<think> … step-by-step reasoning grounded in frames and knowledge … </think>
<answer> … final answer … </answer>
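
A response in this format can be split into its rationale and final answer with a small parser. The sketch below is only one way to do it; the function name `parse_response` is hypothetical, and the assumption that a valid response contains nothing besides the two tagged blocks is ours, not the source's.

```python
import re

# Matches a <think> block followed by an <answer> block, allowing surrounding whitespace.
_TEMPLATE = re.compile(
    r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)


def parse_response(text: str):
    """Return (rationale, answer) if the text follows the template, else None."""
    match = _TEMPLATE.fullmatch(text)
    if match is None:
        return None
    rationale, answer = (part.strip() for part in match.groups())
    return rationale, answer
```

For example, `parse_response("<think>count the peaks</think>\n<answer>B</answer>")` returns `("count the peaks", "B")`, while a bare `"B"` returns `None`.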

VideoKR-RL examples retain only Q/A pairs for RL-style reward calculation. Quality control comprises:

  • Manual audit of 800 random SFT examples: 52 were flagged as solvable without visual input and 32 had CoT trace errors (17 affecting the answer, 15 containing unsupported claims), with error rates equivalent to those of the initial seed curation.
  • Decontamination to ensure no overlap between train/eval sets, involving the removal of 131 videos with duplicate evaluation YouTube IDs plus 877 near-duplicates detected with frame-level perceptual hashing.
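
Near-duplicate detection with frame-level perceptual hashing can be sketched as below. The source does not name the hash function or the thresholds, so everything here is an assumption: the sketch uses a simple average hash over a small grayscale grid, a Hamming-distance cutoff, and a hypothetical rule that a training video is a near-duplicate when most of its frames match some evaluation frame.

```python
def average_hash(pixels: list) -> int:
    """Average hash: one bit per pixel, set when the pixel exceeds the mean intensity."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def is_near_duplicate(
    train_hashes: list,
    eval_hashes: list,
    max_distance: int = 1,      # assumed per-frame tolerance
    min_fraction: float = 0.8,  # assumed share of frames that must match
) -> bool:
    """True when enough training frames have a close match among evaluation frames."""
    if not train_hashes:
        return False
    matched = sum(
        1
        for h in train_hashes
        if any(hamming(h, e) <= max_distance for e in eval_hashes)
    )
    return matched / len(train_hashes) >= min_fraction


# Tiny 2x2 grayscale "frames" for illustration only.
frame_a = [[10, 200], [220, 30]]
frame_b = [[12, 198], [210, 35]]   # slightly perturbed copy of frame_a
frame_c = [[200, 10], [30, 220]]   # inverted layout
```

In practice the frames would first be downscaled to a fixed grid (8×8 is common for average hashing) so that hashes from different videos are comparable.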

4. VideoKR-Eval: Knowledge- and Reasoning-Intensive Benchmark

VideoKR-Eval was designed to remedy the shortcut solvability present in prior benchmarks (VideoMMMU, MMVU, SciVideoBench), which exhibited single-frame answerability rates above 35%.

Construction proceeds as follows:

  1. Multi-Model Single-Frame Filtering: Of 2,900 initial examples, only those unsolved by three state-of-the-art models (Qwen3-VL-235B, Claude-4.5-Sonnet, GPT-5.2) after three single-frame trials are retained (1,254 examples).
  2. Expert Re-annotation: From the 1,646 items removed by this filter, domain experts annotate 746 new, visually grounded QA pairs that require continuous video analysis.

The final VideoKR-Eval consists of 2,000 examples (multiple-choice and open-ended). Single-frame answerability is reduced to approximately 10% (versus 35–49% in legacy datasets), so strong scores on the benchmark require true temporal video reasoning and knowledge-intensive inference rather than single-frame shortcuts.
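
The retention rule and the resulting counts can be checked with a short sketch. The function below is hypothetical: it assumes each trial shows a model one frame chosen by a caller-supplied `pick_frame` callable, and that a single correct answer in any trial by any model marks the item as single-frame solvable. The counts are the ones reported above.

```python
from typing import Callable, Sequence


def unsolved_by_all(
    question: str,
    answer: str,
    pick_frame: Callable[[], str],
    models: Sequence[Callable[[str, str], str]],
    trials: int = 3,
) -> bool:
    """True when no model answers correctly in any of its single-frame trials."""
    for model in models:
        for _ in range(trials):
            if model(question, pick_frame()) == answer:
                return False
    return True


# Bookkeeping with the figures reported for VideoKR-Eval.
INITIAL = 2_900
RETAINED = 1_254
REANNOTATED = 746

filtered_out = INITIAL - RETAINED       # items removed by single-frame filtering
final_size = RETAINED + REANNOTATED     # retained items plus expert re-annotations
```

The arithmetic confirms that the 1,646 filtered items and the 2,000-example final size follow directly from the reported counts.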

5. Training and Evaluation Methodology

Training on VideoKR follows a two-stage pipeline:

  1. Supervised Fine-Tuning (SFT): One epoch on VideoKR-SFT-201K (batch size 32, learning rate 1e-5) with loss

$$L_{\mathrm{SFT}} = -\mathbb{E}_{(v, q, a, r)}\left[\log \pi_\theta(a, r \mid v, q)\right]$$

  2. Group Relative Policy Optimization (GRPO): One epoch on VideoKR-RL-114K (batch size 32, learning rate 5e-6), maximizing

$$\max_\theta~\mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] - \beta\, \mathrm{KL}[\pi_\theta \,\|\, \pi_{\text{ref}}]$$

where $R = 0.1 \cdot R_\text{format} + 0.9 \cdot R_\text{accuracy}$, $\beta = 0.01$, and $\pi_{\text{ref}}$ is the SFT checkpoint.
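
The weighted reward can be illustrated with a short sketch. Only the 0.1 and 0.9 weights come from the text above; the rest is assumed for illustration: both components are treated as binary, the format check is taken to be the `<think>`/`<answer>` template, accuracy is a case-insensitive exact match, and a response that fails the format check earns no accuracy credit.

```python
import re

# Assumed format check: a <think> block followed by an <answer> block.
_TEMPLATE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)


def reward(
    response: str,
    gold: str,
    w_format: float = 0.1,
    w_accuracy: float = 0.9,
) -> float:
    """Weighted sum of a binary format reward and a binary accuracy reward."""
    match = _TEMPLATE.fullmatch(response)
    r_format = 1.0 if match else 0.0
    # Assumption: the answer is only scored when it can be extracted from the template.
    predicted = match.group(1).strip().lower() if match else None
    r_accuracy = 1.0 if predicted == gold.strip().lower() else 0.0
    return w_format * r_format + w_accuracy * r_accuracy
```

Under these assumptions a well-formatted but wrong response scores 0.1, so the format term acts as a small shaping signal while correctness dominates.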

Evaluation spans seven benchmarks—Video-MME, MVBench, LongVideoBench (general); VideoMMMU, MMVU, SciVideoBench, and VideoKR-Eval (knowledge-intensive)—with standardized prompting (temperature 0.1, three runs, mean reported) utilizing the LMMs-Eval framework.

6. Empirical Results and Data Characteristics

Post-training with VideoKR under the SFT+RL regimen yields the following for Qwen2.5-VL-7B-Instruct:

  • Knowledge-intensive average: from 41.9% to 46.6% (+4.7), including +4.8 (MMVU) and +8.5 (VideoKR-Eval)
  • General video reasoning remains competitive: from 64.1% to 65.5%

Performance increases monotonically with the number of input frames. On knowledge-intensive tasks, for example, accuracy rises from 44.2% with 16 frames to 46.6% with 128 frames.

Ablation results indicate:

  • Skill Composition: VidR only yields 41.4% (knowledge-intensive average), whereas VidR+KnowVid+KnowVidR yields 42.4%
  • CoT Supervision: direct output supervision achieves 39.4%, while CoT yields 42.4% (+3.0 gain)
  • Corpus Comparison: Of SFT-only runs, only VideoKR-SFT improves over the base (42.4% vs. 41.9%); prior corpora including Video-R1, OneThinker, and VideoRFT are ≤39%. RL-only: VideoKR-RL achieves 43.0%, slightly above VideoAuto-R1 at 42.7%
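
The gains implied by the figures in this section can be recomputed directly. The snippet below is a simple arithmetic check in percentage points; the dictionary keys are labels chosen here for readability and do not appear in the source.

```python
def gain(after: float, before: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


gains = {
    "knowledge_intensive_avg": gain(46.6, 41.9),     # SFT+RL vs. base
    "general_avg": gain(65.5, 64.1),                 # SFT+RL vs. base
    "cot_vs_direct": gain(42.4, 39.4),               # CoT supervision ablation
    "all_skills_vs_vidr_only": gain(42.4, 41.4),     # skill composition ablation
    "videokr_sft_vs_base": gain(42.4, 41.9),         # SFT-only comparison
    "videokr_rl_vs_videoauto_r1": gain(43.0, 42.7),  # RL-only comparison
}

for name, value in gains.items():
    print(f"{name}: +{value}")
```

The recomputed values match the stated +4.7 and +3.0 gains, and they show that the SFT-only and RL-only margins (0.5 and 0.3 points) are much smaller than the combined SFT+RL improvement.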

Zero-shot accuracy of Qwen2.5-VL-7B on 3,000 random QA examples is 45.8–57.1% for other corpora but 39.2% for VideoKR, indicating a more challenging distribution, which may explain why VideoKR drives larger post-training improvements.

7. Significance and Implications

VideoKR’s combination of expert-guided, multi-stage example generation, progressive skill stratification, and stringent validation yields a challenging, diverse corpus for knowledge- and reasoning-intensive video understanding. Its integration of chain-of-thought rationales, minimized susceptibility to shortcut exploitation, and effective empirical gains over prior datasets indicate that data design—specifically, alignment of annotation format, validation procedures, and stratified skill coverage—constitutes a key driver of progress in video reasoning architectures. The significant reduction in single-frame answerability in VideoKR-Eval compared to legacy benchmarks further underscores its contribution to robust evaluation of video reasoning and temporal understanding capabilities in advanced MLLMs (Fu et al., 3 Jun 2026).
