
VIUBench: Video Intrinsic Understanding Benchmark

Updated 12 November 2025
  • VIUBench is a benchmark that rigorously evaluates video understanding in Multimodal LLMs through self-supervised, intrinsic tasks.
  • It employs three tasks—anomaly grounding, object counting, and temporal jigsaw—to probe spatial, temporal, and fine-grained reasoning capabilities.
  • The parameterizable task difficulty and automatic ground truth generation provide robust evaluations while exposing key diagnostic weaknesses in current models.

The Video Intrinsic Understanding Benchmark (VIUBench) is designed to rigorously evaluate core perceptual and reasoning abilities in Multimodal LLMs (MLLMs) by probing their understanding of the intrinsic structure and content of videos. By leveraging self-supervised pretext tasks that require no human annotation, VIUBench offers a systematic framework for quantifying models’ capabilities in fine-grained, spatial, and temporal video understanding. This approach directly addresses blind spots in existing video benchmarks, providing robust, verifiable, and parametric evaluation.

1. Core Pretext Tasks and Motivations

VIUBench consists of three self-supervised intrinsic tasks, each targeting a distinct competency in video understanding. These tasks generate verifiable question–answer pairs directly from raw, unannotated source videos, and their difficulty is tunable via explicit parameters.

  • Anomaly Grounding: A temporal segment of a video $V$ is perturbed using a transformation (e.g., channel swap, 180° rotation, zoom out, mirror flip, or intra-segment shuffle). The model must predict the anomalous interval’s start and end times $(t_s, t_e)$. This task evaluates the detection and localization of violations of natural video dynamics, spanning fine-grained (e.g., color-channel swaps), spatial (rotation, zoom, mirror), and temporal (frame shuffling) axes.
  • Object Counting: Synthetic primitive shapes (circle, rectangle, triangle) are procedurally generated with randomized visual features and overlaid on a random subset of frames. The model must report the exact global shape counts $(N_1, \ldots, N_k)$. Two difficulty levels are specified by the maximum number of modified frames (≤3 or ≤4) and the maximum instances per frame (≤3 or ≤4). This probes fine-grained object detection and counting under controlled conditions.
  • Temporal Jigsaw: The video is divided into $n$ equal-length segments, randomly permuted, and presented to the model, which must recover the original segment sequence. Difficulty is tuned by segment count ($n=6$ "easy", $n=8$ "hard"). This assesses temporal coherence, event progression, and causal inference.

All ground-truth answers are available by construction. The parameterizable task difficulty ensures VIUBench continues to challenge even the most advanced MLLMs.
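
The construction of a single benchmark item can be made concrete. The following is a minimal sketch of an anomaly-grounding generator, assuming the video is available as a list of per-frame arrays; the function name, the perturbation implementations, and the parameter choices are illustrative assumptions rather than code from the VIUBench release.

```python
import random
import numpy as np

def make_anomaly_grounding_item(frames, fps, perturbation="mirror", min_len_s=1.0):
    """Sample a temporal interval, perturb its frames, and return the modified
    video together with the ground-truth interval (t_s, t_e) in seconds.

    `frames` is a list of HxWx3 uint8 arrays. The perturbation names mirror
    those described above, but the implementations here are illustrative.
    """
    duration_s = len(frames) / fps
    # Uniformly sample an interval [t_s, t_e] within [0, D].
    t_s = random.uniform(0.0, duration_s - min_len_s)
    t_e = random.uniform(t_s + min_len_s, duration_s)
    i_s, i_e = int(t_s * fps), int(t_e * fps)

    def perturb(frame):
        if perturbation == "channel_swap":
            return frame[..., ::-1]      # reverse the color channels
        if perturbation == "rotate_180":
            return np.rot90(frame, k=2)  # 180° rotation
        if perturbation == "mirror":
            return frame[:, ::-1]        # horizontal flip
        raise ValueError(f"unknown perturbation: {perturbation}")

    out = list(frames)
    segment = out[i_s:i_e]
    if perturbation == "shuffle":
        random.shuffle(segment)          # intra-segment frame shuffle
        out[i_s:i_e] = segment
    else:
        out[i_s:i_e] = [perturb(f) for f in segment]
    return out, (t_s, t_e)
```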

2. Dataset Design and Composition

VIUBench sources its video content from the public LLaVA-Video collection, a broad mix of everyday and instructional footage. The benchmark comprises 2,700 items, distributed approximately equally across the three intrinsic tasks, with per-task subdivisions for different perturbations or difficulty levels. Videos cover household scenes, sports, instructional content, and general activities.

VIUBench is strictly intended as a held-out evaluation set. There are no train/validation/test splits, precluding direct model tuning on the benchmark. All items are generated through randomized sampling strategies:

  • For anomaly grounding, temporal intervals within videos are uniformly sampled for perturbation.
  • Object counting overlays shapes on random frame subsets.
  • Temporal jigsaw applies independently sampled permutations.

3. Self-Supervision and Ground-Truth Generation

Task-specific protocols ensure that every benchmark sample has an unambiguous, procedurally generated ground truth:

  • Anomaly Grounding: For a video of duration $D$, sample an interval $[t_s, t_e] \subset [0, D]$; apply a perturbation $\mathcal{P}$ to all frames in $[t_s, t_e]$; record the interval as ground truth.
  • Object Counting: Select a frame subset $F_{\text{sub}}$; on each frame $f_i$, overlay a set of shapes $O_i$. Ground-truth counts are computed as $N_k = \sum_{f_i \in F_{\text{sub}}} |\{o \in O_i : \text{type}(o) = c_k\}|$ for each shape type $c_k$.
  • Temporal Jigsaw: Partition the video into segments $S_1, \ldots, S_n$; permute the segments via $\pi$; the ground truth is the inverse permutation $\pi^{-1}$.

These procedures create large, diverse sets of automatically verifiable question–answer pairs without direct human annotation.
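
As a concrete illustration of these protocols, the sketch below derives the object-counting totals $N_k$ and the temporal-jigsaw answer $\pi^{-1}$; the data structures (an overlay record keyed by frame index, a permutation list) are assumed formats for the example, not the benchmark's actual internals.

```python
from collections import Counter

def counting_ground_truth(overlays, shape_types=("circle", "rectangle", "triangle")):
    """Compute global counts N_k from per-frame overlay records.

    `overlays` maps a frame index to the list of shape types overlaid on it,
    e.g. {12: ["circle", "circle"], 40: ["triangle"]} (illustrative format).
    """
    totals = Counter()
    for shapes in overlays.values():
        totals.update(shapes)
    return {k: totals.get(k, 0) for k in shape_types}

def jigsaw_ground_truth(permutation):
    """Return the inverse permutation pi^{-1}, i.e. where each original
    segment ended up in the shuffled presentation."""
    inverse = [0] * len(permutation)
    for shuffled_pos, original_idx in enumerate(permutation):
        inverse[original_idx] = shuffled_pos
    return inverse

# Example: segments shown in order [S_3, S_1, S_2], i.e. permutation = [2, 0, 1];
# jigsaw_ground_truth([2, 0, 1]) -> [1, 2, 0]: S_1 appears at shuffled position 1,
# S_2 at position 2, and S_3 at position 0.
```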

4. Evaluation Metrics and Formal Definitions

Each intrinsic task employs a precise metric:

| Task | Metric | Scoring Scheme |
|---|---|---|
| Anomaly Grounding | Mean Intersection over Union (mIoU) | $\text{IoU}(T_{\text{pred}}, T_{\text{gt}}) = \frac{\lvert T_{\text{pred}} \cap T_{\text{gt}} \rvert}{\lvert T_{\text{pred}} \cup T_{\text{gt}} \rvert}$ |
| Object Counting | Mean exact-match accuracy (per category) | $\text{Acc}_k = 1$ if $\hat{y}_k = y_k$, else $0$; mean over $k$ |
| Temporal Jigsaw | Exact sequence match | 1 if the predicted sequence equals the ground truth, 0 otherwise |

For these core tasks, auxiliary ranking metrics such as precision, recall, or mAP are not used. This design ensures diagnostic clarity for each evaluated competency.
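
A direct implementation of these metrics is straightforward. The sketch below assumes intervals are represented as (start, end) pairs in seconds and counts as per-category dictionaries; these formats are chosen for illustration.

```python
def interval_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def counting_accuracy(pred_counts, gt_counts):
    """Mean exact-match accuracy over shape categories."""
    keys = gt_counts.keys()
    return sum(pred_counts.get(k) == gt_counts[k] for k in keys) / len(keys)

def jigsaw_score(pred_order, gt_order):
    """Exact sequence match: 1 if the full predicted order is correct, else 0."""
    return int(list(pred_order) == list(gt_order))

# Illustrative values:
# interval_iou((3.0, 8.0), (4.0, 9.0)) -> 4/6 ≈ 0.667
# counting_accuracy({"circle": 2, "triangle": 1}, {"circle": 2, "triangle": 0}) -> 0.5
```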

5. Baseline Model Performance and Failure Analysis

Baseline evaluation involves leading closed-source and open-source MLLMs:

  • Closed-source (GPT-5): Average score ≈58.7%. Performance breakdown: counting (easy: 88.4%, hard: 70.3%); strong on channel swap and rotation (up to ≈82%), but weaker on zoom out (≈56.5%), mirror flip (≈48.9%), and shuffle (≈34.1%); temporal jigsaw (easy: 39.0%, hard: 27.0%).
  • Open-source (Qwen3-VL-8B): Average score ≈19.5%. Counting (hard): ≈7.7%; spatial anomalies: 13–53%; temporal jigsaw (hard): near zero.
  • Random guessing: ≈16% overall, confirming that model performance exceeds chance but remains well below human-level proficiency.

Observed failure modes include bimodal outcome distributions: for anomaly grounding, predictions are often either nearly perfect (IoU ≈ 1) or complete failures (IoU ≈ 0), suggesting reliance on superficial heuristics. Increasing task difficulty reliably worsens performance; for example, adding temporal segments to the jigsaw typically halves accuracy when moving from easy ($n=6$) to hard ($n=8$). Temporal tasks yield almost zero correct responses unless models are explicitly pre-trained on similar puzzles, in which case improvements are limited to the easier configuration.

6. Theoretical Foundations and Extensions via Temporal Certificate Sets

The broader theoretical framework underlying VIUBench is formalized in the “temporal certificate set” methodology applied in EgoSchema (Mangalam et al., 2023). A temporal certificate for a video–annotation pair is the minimal set of non-overlapping intervals whose union suffices for a human verifier to confirm the annotation’s correctness. The intrinsic temporal length $L(V, A)$ is the sum of these intervals’ durations.
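
Given such a set of disjoint certificate intervals, computing $L(V, A)$ reduces to summing their durations, as in the sketch below (the interval representation is an assumption for illustration).

```python
def certificate_length(intervals):
    """Intrinsic temporal length L(V, A): total duration of the certificate's
    non-overlapping intervals, given as (start, end) pairs in seconds."""
    return sum(end - start for start, end in intervals)

# e.g. certificate_length([(12.0, 45.5), (70.0, 130.0)]) -> 93.5 seconds
```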

Empirical studies demonstrate that certificate length measures task-intrinsic temporal difficulty more accurately than clip duration. For instance, EgoSchema’s median certificate is ≈100 s, much longer than the next-closest benchmark (≈17.5 s, LVU), while most action- and event-recognition datasets fall in the 1–10 s range. This taxonomy of short (≈1 s), long (≈10 s), and very long-form (≈100 s) video tasks disambiguates the intrinsic demands of each dataset, irrespective of superficial clip length.

A plausible implication is that future VIUBench iterations or related benchmarks could adopt certificate-based difficulty metrics to calibrate task selection or reporting granularity across datasets and domains.

7. Diagnostic Power, Limitations, and Comparative Advances

VIUBench exposes major blind spots in current MLLMs, notably in temporal coherence reasoning and fine-grained perceptual discrimination. Its major diagnostic strengths include:

  • Task granularity and parametric scaling: Difficulty is tunable, allowing VIUBench to remain relevant as models improve.
  • No human annotation bias: All tasks are synthetic, verifiable, and free of language or question-style biases present in conventional video QA.
  • Dense, stable rewards: Intrinsic task design provides reliable feedback for RL-based frameworks such as VideoSSR.
  • Complementarity: VIUBench isolates “video-intrinsic” understanding, decoupling model evaluation from external world knowledge.

Identified limitations include the relatively modest scale (2,700 QA pairs), the absence of standard splits (no train/val/test), the possible incompleteness of the synthetic perturbation set (e.g., no lighting changes or complex camera motion), and dataset restrictions such as a fixed 512×512 frame resolution and a fixed frame count per video.

Compared to prior work, VIUBench reframes video understanding from a focus on external fact recall or agent-centric Q&A to an intrinsic probe of a model’s capacity for perceptual and causal reasoning over raw video, independent of annotation bias or expensive human curation.


VIUBench, through its triad of self-supervised pretext tasks and systematic, verifiable protocols, establishes a new paradigm for evaluating and diagnosing fundamental competencies in multimodal LLMs. Its results indicate that even state-of-the-art models have substantial ground to cover in mastering the intrinsic structure and dynamics of video, with the benchmark serving as both an incisive diagnostic and a scalable generator of training signals for future model improvement (He et al., 9 Nov 2025, Mangalam et al., 2023).
