VideoKR: Benchmark for Video Knowledge & Reasoning
- VideoKR is a large-scale corpus designed for knowledge- and reasoning-intensive video understanding, integrating both perceptual and domain-specific reasoning.
- It utilizes expert-curated, CC-licensed videos and a structured, multi-skill question generation process to drive advanced, multi-hop inference across disciplines.
- The accompanying VideoKR-Eval benchmark enforces continuous video evidence and domain knowledge, reducing reliance on single-frame or textual shortcuts.
VideoKR is a large-scale corpus and benchmark for knowledge- and reasoning-intensive video understanding, introduced to address a specific failure mode of contemporary video-LLMs: strong performance on perception-heavy video tasks but weaker performance when a question requires combining visible evidence with external domain knowledge and multi-step reasoning. It comprises 315,537 video QA examples over 145K newly collected, CC-licensed, expert-domain videos, and is paired with VideoKR-Eval, a benchmark designed to require genuine continuous video understanding rather than textual or single-frame shortcuts. Its central thesis is data-centric: improvements in advanced video reasoning can be driven by carefully designed training data and evaluation, even under a standard SFT→GRPO post-training pipeline (Fu et al., 3 Jun 2026).
1. Problem definition and conceptual scope
VideoKR targets what the source paper calls knowledge- and reasoning-intensive video understanding. The target problem is not limited to recognizing actions or events in short clips. Rather, it includes questions that may require scientific principles, medical knowledge, engineering concepts, or other subject-matter expertise, together with multi-hop inference over temporally distributed evidence. The paper describes examples such as estimating product yield from observed chemical reactants, identifying medical diagnoses from visual symptoms and procedures, and recognizing domain-specific instruments and their function in a procedure (Fu et al., 3 Jun 2026).
The motivating diagnosis is twofold. First, existing post-training corpora for video understanding are described as being dominated by short, older, mostly perception-oriented videos and examples. Second, even recent “reasoning” datasets are said to remain vulnerable to shallow solution strategies, including textual shortcuts, single-frame answerability, and generator-specific bias. Against this background, VideoKR is presented as the first large-scale training corpus explicitly designed for knowledge- and reasoning-intensive video understanding (Fu et al., 3 Jun 2026).
Several design properties distinguish it from prior post-training corpora listed in the paper. It is 100% video; its videos are newly collected rather than inherited from existing benchmarks; they are all CC-licensed; they cover expert-domain content; and they are substantially longer on average, with mean length reported as 344.1 seconds, compared with approximately 24.7–90.9 seconds for the listed alternatives. The examples are also newly generated through expert-validated model selection from a pool of seven frontier models rather than through a single generator (Fu et al., 3 Jun 2026).
A common misconception is to treat VideoKR as simply another generic “video reasoning” dataset. The paper argues otherwise. Its examples are explicitly designed to couple three dimensions—perception, knowledge, and reasoning—and its evaluation protocol is explicitly constructed to suppress single-frame and text-only shortcuts. Another misconception would be to read it as a long-context lecture corpus; however, the collection pipeline excludes videos longer than 30 minutes, and the paper explicitly states that long-context video understanding beyond that regime is outside its scope (Fu et al., 3 Jun 2026).
2. Corpus construction and knowledge hierarchy
The dataset is built on a knowledge-driven collection process rather than on arbitrary web-scale video harvesting. The authors manually review undergraduate curricula from top universities and define 82 representative subjects across four major disciplines: Natural Sciences (20), Engineering (20), Healthcare (18), and Humanities and Social Sciences (24). These subjects are arranged in a four-layer hierarchy.
From this process, the authors produce 63,745 knowledge points, each represented as a term with a paragraph-length definition (Fu et al., 3 Jun 2026).
The video search process is scenario-based rather than keyword-based. For each knowledge point, the pipeline generates 1–3 realistic scenarios in which the concept is manifested in the world, then turns these scenarios into search keywords. Using the YouTube Data API, it retrieves the top 10 candidate videos for each query. Filtering then retains only Creative Commons licensed videos and excludes videos longer than 30 minutes. Metadata is screened for relevance; downloaded videos are checked for visual relevance by multimodal models; and safety filtering is performed by sampling four frames per video and applying Azure AI image moderation. This yields 146,567 CC-licensed videos before later decontamination (Fu et al., 3 Jun 2026).
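The license and length filters described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Candidate` record and the function names are hypothetical, and the sketch assumes candidate metadata in the shape the YouTube Data API reports it (a license value of `creativeCommon` or `youtube`, and an ISO 8601 duration string such as `PT5M44S`). Relevance screening and safety moderation are omitted.

```python
import re
from dataclasses import dataclass

MAX_SECONDS = 30 * 60   # videos longer than 30 minutes are excluded
TOP_K = 10              # top 10 candidates are considered per query

# Matches ISO 8601 durations such as "PT5M44S" or "PT1H2M".
# Day-level durations (e.g. "P1DT2H") are not handled in this sketch.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


@dataclass
class Candidate:
    """Hypothetical record holding the metadata fields used for filtering."""
    video_id: str
    license_type: str   # "creativeCommon" or "youtube"
    duration: str       # ISO 8601 duration string


def duration_seconds(iso_duration: str) -> int:
    """Convert an ISO 8601 duration such as 'PT1H2M3S' to seconds."""
    match = _DURATION.match(iso_duration)
    if match is None:
        raise ValueError(f"unsupported duration: {iso_duration!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def filter_candidates(candidates: list[Candidate]) -> list[Candidate]:
    """Keep CC-licensed videos of at most 30 minutes among the top-K results."""
    kept = []
    for cand in candidates[:TOP_K]:
        if cand.license_type != "creativeCommon":
            continue
        if duration_seconds(cand.duration) > MAX_SECONDS:
            continue
        kept.append(cand)
    return kept
```

In this sketch a video of exactly 30 minutes is kept, since the source only states that longer videos are excluded.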
Human supervision is integral throughout the pipeline. The work involves 34 domain experts, all with graduate-level backgrounds in the relevant disciplines. They contribute to knowledge-bank creation, seed-example curation, model validation, quality assessment, and benchmark annotation. For each core skill in each discipline, experts create 150 examples, totaling 1,800 expert-curated seed examples (150 × 3 skills × 4 disciplines). Every example is reviewed manually by the authors and then independently reviewed by a second annotator; 74 examples are revised during this secondary review stage (Fu et al., 3 Jun 2026).
The resulting corpus is not a simple aggregation of videos and questions. It is a knowledge-structured resource whose acquisition process is designed to bias collection toward real-world manifestations of expert concepts rather than lecture-style exposition. This suggests that VideoKR is intended to reduce the gap between academic benchmark video content and operational video reasoning demands in professional domains.
3. Skill-oriented example generation and validation
VideoKR decomposes advanced video understanding into three core skills. Basic Video Reasoning (VidR) covers direct comprehension of observable events, actions, spatial relations, and temporal order without requiring external domain knowledge. Knowledge-enhanced Video Perception (KnowVid) covers visual perception enriched by explicit domain knowledge, such as recognizing a burette or condenser and understanding its role in a chemistry procedure. Knowledge-Intensive Video Reasoning (KnowVidR) covers joint visual grounding, domain knowledge, and multi-hop inference, such as diagnosis from clinical video evidence or estimating chemical output from observed conditions (Fu et al., 3 Jun 2026).
The large-scale generation procedure is explicitly skill-oriented. For each video, the generation pipeline produces two examples per skill, for six examples per video in total, via six independent generation rounds. In each round, the model receives: video frames uniformly sampled at 0.2 fps with timestamps, three randomly sampled human-curated examples from the same discipline and skill category, and the associated knowledge point and subject information for KnowVid or KnowVidR (Fu et al., 3 Jun 2026).
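The round structure above can be sketched in a few lines. The function names, the dictionary layout, and the choice to start sampling at t = 0 are assumptions made for illustration; only the sampling rate, the number of rounds, and the three few-shot examples per round come from the description.

```python
import random

SAMPLE_FPS = 0.2                       # frames are sampled uniformly at 0.2 fps
SKILLS = ("VidR", "KnowVid", "KnowVidR")
EXAMPLES_PER_SKILL = 2                 # 2 examples x 3 skills = 6 rounds per video
FEW_SHOT_COUNT = 3                     # human-curated examples shown per round


def frame_timestamps(duration_s, fps=SAMPLE_FPS):
    """Timestamps in seconds of uniformly sampled frames (assumed to start at 0)."""
    step = 1.0 / fps                   # 5 seconds between frames at 0.2 fps
    count = int(duration_s // step) + 1
    return [i * step for i in range(count)]


def build_rounds(seed_pool, knowledge_point, rng):
    """Plan the six independent generation rounds for one video.

    seed_pool maps each skill to the expert-curated seed examples of the
    video's discipline; knowledge_point is attached only for the two
    knowledge-dependent skills.
    """
    rounds = []
    for skill in SKILLS:
        for _ in range(EXAMPLES_PER_SKILL):
            rounds.append({
                "skill": skill,
                "few_shot": rng.sample(seed_pool[skill], FEW_SHOT_COUNT),
                "knowledge_point": knowledge_point if skill != "VidR" else None,
            })
    return rounds


# A video of the reported mean length (344.1 s) yields 69 sampled frames.
print(len(frame_timestamps(344.1)))    # 69
```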
The paper emphasizes multiple validation stages. First, a self-consistency verification pass re-prompts the same model with the generated question and the video frames to produce a fresh step-by-step answer; an example is kept only if the re-derived answer matches the original answer. Second, video dependency filtering attempts to remove examples answerable from text and sparse visual cues: InternVL3.5-38B and Qwen3-VL-32B-Instruct are given only the text plus four randomly sampled video frames, and if both models answer correctly, the example is removed. Third, an independent strong multimodal model validates the chain-of-thought rationale, checking that each key reasoning step is supported by observable evidence or standard domain knowledge and that the reasoning distinguishes the correct answer from plausible alternatives (Fu et al., 3 Jun 2026).
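The first two validation stages reduce to simple predicates once the models are abstracted away. In the sketch below each model is a hypothetical callable taking a question and a list of frames and returning an answer string; the answer normalization, and the choice to show both probe models the same four frames, are assumptions rather than details confirmed by the paper.

```python
import random
from typing import Callable, Sequence

# Hypothetical model interface: (question, frames) -> answer string.
Answerer = Callable[[str, Sequence[int]], str]


def _norm(answer: str) -> str:
    """Loose normalization used for answer comparison in this sketch."""
    return answer.strip().lower()


def is_self_consistent(question, original_answer, frames, generator: Answerer) -> bool:
    """Stage 1: the generator re-derives the answer and must match the original."""
    return _norm(generator(question, frames)) == _norm(original_answer)


def is_video_dependent(question, answer, frames, probes: Sequence[Answerer],
                       rng: random.Random, n_frames: int = 4) -> bool:
    """Stage 2: drop the example only if every probe model answers correctly
    from the text plus a few randomly sampled frames."""
    sparse = rng.sample(list(frames), min(n_frames, len(frames)))
    solved = [_norm(probe(question, sparse)) == _norm(answer) for probe in probes]
    return not all(solved)
```

An example survives stage 2 when at least one probe model fails under sparse visual input, which matches the rule that removal requires both models to answer correctly.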
Model selection within the generation pipeline is also controlled. The paper uses a pool of seven frontier models—GPT-5.2, GPT-5-mini, Claude-4.5-Sonnet, Gemini-3-Flash, DeepSeek-V3.2, Qwen3-VL-235B-A22B, and GLM-4.6V—but treats them as stage-specific candidates rather than universally interchangeable tools. For each model and stage, experts label errors on 100 real inputs, and a model is eligible for that stage only if its total error rate is at most 3% (Fu et al., 3 Jun 2026).
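The eligibility rule is a threshold test per model and stage. The sketch below uses integer arithmetic to avoid floating-point edge cases at exactly 3%; the stage and model names in the usage example are placeholders, and the error counts are made up for illustration rather than taken from the paper.

```python
MAX_ERROR_PERCENT = 3   # a model is eligible if its total error rate is at most 3%
N_LABELED = 100         # experts label errors on 100 real inputs per model and stage


def eligible_models(error_counts, n_inputs=N_LABELED):
    """Map each stage to the sorted list of models that pass the threshold.

    error_counts: {stage: {model: number of expert-labeled errors}}
    """
    return {
        stage: sorted(
            model
            for model, errors in per_model.items()
            if errors * 100 <= MAX_ERROR_PERCENT * n_inputs
        )
        for stage, per_model in error_counts.items()
    }


# Placeholder names and invented counts, for illustration only.
example = {"generation": {"model_a": 2, "model_b": 3, "model_c": 4}}
print(eligible_models(example))   # {'generation': ['model_a', 'model_b']}
```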
Decontamination is performed after construction. YouTube-ID filtering removes 131 videos whose IDs match evaluation videos, and near-duplicate video filtering removes 877 videos using frame-level perceptual hashing and sequence matching. The final 315,537 examples are split by video into VideoKR-SFT-201K, which retains validated rationales, and VideoKR-RL-114K, which retains only question and verifiable answer (Fu et al., 3 Jun 2026).
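The two decontamination passes can be illustrated with a small sketch. It assumes each video has already been reduced to a sequence of 64-bit perceptual frame hashes, and it uses the longest run of consecutive near-identical frames as a stand-in for the paper's sequence matching. The hashing method, the bit tolerance, and the run length are hypothetical choices, not values reported by the authors.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer frame hashes."""
    return bin(a ^ b).count("1")


def longest_matching_run(seq_a, seq_b, max_bits=6):
    """Length of the longest run of consecutive frames in seq_a that matches
    consecutive frames in seq_b within max_bits differing bits."""
    best = 0
    prev = [0] * (len(seq_b) + 1)
    for a in seq_a:
        cur = [0] * (len(seq_b) + 1)
        for j, b in enumerate(seq_b, start=1):
            if hamming(a, b) <= max_bits:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def decontaminate(train, evaluation, min_run=3, max_bits=6):
    """Drop training videos that share an ID with, or nearly duplicate,
    an evaluation video. Both arguments map video ID -> list of frame hashes."""
    kept = {}
    for video_id, hashes in train.items():
        if video_id in evaluation:
            continue                       # pass 1: exact ID match
        if any(longest_matching_run(hashes, ev, max_bits) >= min_run
               for ev in evaluation.values()):
            continue                       # pass 2: near-duplicate content
        kept[video_id] = hashes
    return kept
```

The quadratic comparison is adequate for a sketch; at the scale of roughly 146K videos a production pipeline would need an index over the hashes first.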
The paper’s own audit recognizes residual noise. In a manual assessment of 800 random examples from VideoKR-SFT-201K, 52 questions are flagged as potentially non-visual-solvable, and 32 reasoning-trace errors are found, including 17 that change the final answer and 15 that preserve the final answer but use unsupported domain claims or insufficiently grounded reasoning. The dataset is therefore not presented as noise-free; rather, its error level is argued to be acceptable for large-scale construction (Fu et al., 3 Jun 2026).
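Expressed as rates over the 800 audited examples, the reported counts work out as follows. The variable names are illustrative; the counts are those quoted above.

```python
AUDIT_SIZE = 800

counts = {
    "potentially_non_visual_solvable": 52,
    "reasoning_trace_errors": 32,
    "errors_changing_final_answer": 17,
    "errors_preserving_final_answer": 15,
}

# The two error subtypes should add up to the reported total.
assert (counts["errors_changing_final_answer"]
        + counts["errors_preserving_final_answer"]
        == counts["reasoning_trace_errors"])

rates = {name: 100 * n / AUDIT_SIZE for name, n in counts.items()}
for name, pct in rates.items():
    print(f"{name}: {pct:.3f}%")
# potentially_non_visual_solvable: 6.500%
# reasoning_trace_errors: 4.000%
# errors_changing_final_answer: 2.125%
# errors_preserving_final_answer: 1.875%
```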
4. VideoKR-Eval and the rejection of shortcut benchmarks
VideoKR-Eval is designed as a benchmark for continuous video understanding plus knowledge-intensive reasoning. Its immediate motivation is an audit of existing benchmarks—VideoMMMU, MMVU, and SciVideoBench—through a single-frame answerability test. In that audit, models are given only the question, answer options, and one random video frame, repeated across three independent trials. The reported single-frame answerability rates are high for prior benchmarks and much lower for VideoKR-Eval: 35.3%, 39.3%, 38.3% on VideoMMMU; 41.3%, 45.2%, 49.7% on MMVU; 21.8%, 13.2%, 23.0% on SciVideoBench; and only 9.5%, 10.1%, 10.7% on VideoKR-Eval, depending on model (Fu et al., 3 Jun 2026).
VideoKR-Eval is built from the same three source benchmarks but restructured through multi-model filtering and expert reannotation. Each original example is tested with Qwen3-VL-235B-A22B, Claude-4.5-Sonnet, and GPT-5.2, each under three independent single-frame trials. An example is retained in original form only if all three models fail to solve it consistently from one random frame. This preserves 1,254 original examples. For examples outside that intersection, the original QA pair is discarded and domain experts create new ones from the same videos, yielding 746 expert-reannotated examples. The final benchmark therefore contains 2,000 examples in total (Fu et al., 3 Jun 2026).
The benchmark composition is summarized below.
| Source | Retained original | Reannotated | Final |
|---|---|---|---|
| MMVU | 361 | 398 | 759 |
| VideoMMMU | 340 | 241 | 581 |
| SciVideoBench | 553 | 107 | 660 |
The retained and rewritten questions are required to be grounded in clearly observable video evidence, to require relevant domain knowledge, and to have uniquely determined ground-truth answers. This makes VideoKR-Eval not merely a filtered subset of prior benchmarks but a partially reconstructed one whose validity criterion is failure under sparse visual probing (Fu et al., 3 Jun 2026).
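The retention rule and the composition table can be checked with a short sketch. One reading of "fail to solve it consistently" is assumed here: a model solves an example consistently when it is correct in all three single-frame trials. The paper may define consistency differently, and the function names are hypothetical.

```python
def solves_consistently(trial_correct):
    """Assumed reading: correct in every single-frame trial."""
    return all(trial_correct)


def keep_original(results):
    """Retain the original QA pair only if no probe model solves it consistently.

    results maps model name -> list of three booleans, one per trial.
    """
    return not any(solves_consistently(trials) for trials in results.values())


# Composition table: source -> (retained original, reannotated).
COMPOSITION = {
    "MMVU": (361, 398),
    "VideoMMMU": (340, 241),
    "SciVideoBench": (553, 107),
}

retained = sum(kept for kept, _ in COMPOSITION.values())
reannotated = sum(new for _, new in COMPOSITION.values())
print(retained, reannotated, retained + reannotated)   # 1254 746 2000
```

Examples for which `keep_original` returns `False` are the ones handed to domain experts for reannotation from the same video.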
This benchmark design clarifies the paper’s notion of “video-intensive” evaluation. It does not equate difficulty with open-endedness or expert subject matter alone. Instead, it operationalizes difficulty as the inability to solve the example from text and isolated static evidence. A plausible implication is that VideoKR-Eval is intended less as a broad popularity benchmark than as a stress test for temporally grounded, knowledge-conditioned reasoning.
5. Training recipe, empirical results, and ablations
A notable feature of the work is methodological restraint. Rather than proposing a new optimization algorithm, the authors use a standard SFT→GRPO post-training recipe on two open-source bases: Qwen2.5-VL-7B-Instruct and Qwen3-VL-8B-Instruct. Supervised fine-tuning is performed for one epoch on VideoKR-SFT-201K, followed by one epoch of GRPO on VideoKR-RL-114K. For Qwen3-VL-8B-Instruct, the paper also tests a Zero-RL condition in which GRPO is applied directly to the base model without prior SFT (Fu et al., 3 Jun 2026).
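The reinforcement signal is intentionally simple. The paper defines the aggregate reward as

$$R = R_{\text{format}} + R_{\text{acc}},$$

where $R_{\text{format}}$ is the format reward and $R_{\text{acc}}$ is the accuracy reward. The format reward is $R_{\text{format}} = 1$ if and only if the output strictly matches the required structure, in which the reasoning is followed by a final answer enclosed in <answer>...</answer> tags. For $R_{\text{acc}}$, ROUGE is used for open-ended QA and Exact Match for multiple-choice QA (Fu et al., 3 Jun 2026).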
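A reward of this shape can be sketched in pure Python. Several details below are assumptions made for illustration rather than confirmed by the source: the reasoning is taken to be wrapped in `<think>` tags, ROUGE-L F1 over whitespace tokens stands in for the unspecified ROUGE variant, the two terms are combined here as a plain unweighted sum, and all function names are hypothetical.

```python
import re

# Assumed structure: a <think> block followed by an <answer> block, nothing else.
_FORMAT = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(output: str) -> float:
    """1.0 if the whole output matches the assumed structure, else 0.0."""
    return 1.0 if _FORMAT.match(output) else 0.0


def extract_answer(output: str) -> str:
    """Text of the first <answer> block, or an empty string if there is none."""
    match = _ANSWER.search(output)
    return match.group(1).strip() if match else ""


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def accuracy_reward(output: str, reference: str, multiple_choice: bool) -> float:
    """Exact match for multiple-choice questions, ROUGE-L F1 for open-ended ones."""
    answer = extract_answer(output)
    if multiple_choice:
        return 1.0 if answer.upper() == reference.strip().upper() else 0.0
    return rouge_l_f1(answer, reference)


def total_reward(output: str, reference: str, multiple_choice: bool) -> float:
    """Unweighted sum of the format and accuracy terms (an assumption)."""
    return format_reward(output) + accuracy_reward(output, reference, multiple_choice)
```

A well-formed, correct multiple-choice response such as `<think>...</think><answer>B</answer>` with reference `B` scores 2.0 under this sketch, while a bare `B` scores 0.0 because neither the structure nor the answer tags are present.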
The central empirical result is that VideoKR improves knowledge-intensive performance more strongly than general video reasoning. On Qwen2.5-VL-7B-Instruct with 128 frames, the general average improves from 64.1 to 65.5, the knowledge-intensive average improves from 41.9 to 46.6, and VideoKR-Eval improves from 32.7 to 41.2. On Qwen3-VL-8B-Instruct with 128 frames, the knowledge-intensive average improves from 48.5 to 51.5, and VideoKR-Eval improves from 39.0 to 45.3 (Fu et al., 3 Jun 2026).
The headline 128-frame results are compared below.
| Base model | Knowledge avg. before | Knowledge avg. after | VideoKR-Eval before | VideoKR-Eval after |
|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 41.9 | 46.6 | 32.7 | 41.2 |
| Qwen3-VL-8B-Instruct | 48.5 | 51.5 | 39.0 | 45.3 |
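The absolute gains implied by the table can be computed directly. The dictionary layout and function name are illustrative; the scores are the ones reported above.

```python
# (before, after) scores at 128 frames, copied from the table.
RESULTS = {
    "Qwen2.5-VL-7B-Instruct": {
        "knowledge_avg": (41.9, 46.6),
        "videokr_eval": (32.7, 41.2),
    },
    "Qwen3-VL-8B-Instruct": {
        "knowledge_avg": (48.5, 51.5),
        "videokr_eval": (39.0, 45.3),
    },
}


def gains(results):
    """Absolute improvement in points per model and metric, rounded to 0.1."""
    return {
        model: {
            metric: round(after - before, 1)
            for metric, (before, after) in metrics.items()
        }
        for model, metrics in results.items()
    }


for model, deltas in gains(RESULTS).items():
    print(model, deltas)
# Qwen2.5-VL-7B-Instruct {'knowledge_avg': 4.7, 'videokr_eval': 8.5}
# Qwen3-VL-8B-Instruct {'knowledge_avg': 3.0, 'videokr_eval': 6.3}
```

For both bases the gain on VideoKR-Eval is larger than the gain on the knowledge-intensive average, and for Qwen2.5-VL-7B-Instruct both exceed the 1.4-point gain on the general average (64.1 to 65.5).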
The ablations reinforce the paper’s data-centric argument. In skill composition experiments with 80K-example SFT subsets, the full VidR + KnowVid + KnowVidR mixture produces the strongest knowledge-intensive result, scoring 36.8 on VideoKR-Eval with a knowledge average of 42.4, versus 35.3 and 41.4 for VidR only. In the CoT supervision ablation, Direct Output yields a general average of 61.4 and a knowledge average of 39.4, whereas Chain-of-Thought yields a general average of 58.3 and a knowledge average of 42.4. The paper therefore concludes that rationale supervision costs some general performance (3.1 points here) but improves harder knowledge-intensive reasoning (3.0 points) (Fu et al., 3 Jun 2026).
Comparisons with prior post-training corpora are particularly pointed. In a controlled 80K-example SFT comparison, only VideoKR-SFT-201K exceeds the base model’s knowledge-intensive average, reaching 42.4, whereas Video-R1-CoT-165k, OneThinker-SFT-340k, and VideoRFT-CoT-102K reach 36.2, 38.3, and 38.4, respectively. In the RL-only comparison with 50K QA examples, VideoKR-RL-114K again yields the strongest result, with VideoKR-Eval 34.5 and knowledge avg. 43.0 (Fu et al., 3 Jun 2026).
The paper also reports a difficulty analysis showing that modern base models solve VideoKR examples less easily than examples from prior corpora. For Qwen3-VL-8B-Instruct, accuracy on sampled training examples is 57.1 on Video-R1, 51.1 on VideoRFT, 49.1 on OneThinker, 54.5 on VideoAuto-R1, but only 42.3 on VideoKR. The same pattern holds for Qwen2.5-VL-7B-Instruct, for which VideoKR is again the hardest sampled corpus. This supports the claim that prior corpora are partially saturated for current frontier base models (Fu et al., 3 Jun 2026).
6. Relation to adjacent research areas
VideoKR is primarily a data and benchmark intervention, not an end-to-end retrieval system, a structured extraction framework, or a video knowledge graph. Its role becomes clearer when contrasted with adjacent lines of work.
Interactive retrieval systems such as diveXplore 6.0 focus on rapid user-driven search over large multimodal corpora through shot search and map search, with interfaces designed for Known Item Search and Ad-hoc Video Search under time pressure (Leibetseder et al., 28 Aug 2025). VCR: Video representation for Contextual Retrieval instead emphasizes a textualized fusion of multimodal evidence—ASR, OCR, and frame captions—embedded into a semantic vector space and explored through a Topics-Map (Nir et al., 2024). VKIE addresses a different layer of the stack again: it formulates Video Key Information Extraction as frame-wise extraction of hierarchical information from visual text through BTC, ER, and EL, thereby producing structured OCR-derived metadata for downstream indexing and retrieval (An et al., 2023).
A separate strand concerns symbolic representation. VHAKG constructs a queryable RDF-based multi-modal knowledge graph for synchronized multi-view daily activity videos, integrating event-centric structure, frame references, 2D bounding boxes, and embedded video media (Egami et al., 2024). Long-context reasoning architectures such as Kwai Keye-VL-2.0-30B-A3B attack another complementary problem: maintaining access to hour-level video context through GQA-compatible DeepSeek Sparse Attention, with strong results on long-video comprehension and temporal grounding (Team et al., 9 Jun 2026).
Taken together, these systems suggest a layered technical landscape. VideoKR supplies the post-training data regime and evaluation regime for models that must reason over expert-domain video; interactive systems such as diveXplore and VCR supply user-facing retrieval and exploration mechanisms; VKIE supplies structured visual-text extraction; VHAKG supplies graph-based symbolic grounding; and long-context models such as Keye-VL-2.0 supply scalable inference over lengthy multimodal evidence. This suggests that VideoKR is best understood not as a replacement for those components, but as a training and benchmarking substrate that could complement them in full video knowledge and reasoning systems.
Within that broader landscape, VideoKR’s distinctive contribution is its insistence that benchmark validity, example difficulty, domain richness, temporal dependence, and rationale quality are central variables in video reasoning progress. Its strongest empirical claim is therefore not architectural superiority, but the proposition that better-targeted data can materially change what post-trained video models are able to do (Fu et al., 3 Jun 2026).