ViDiC-1K: Video Difference Captioning Dataset

Updated 4 December 2025
  • ViDiC-1K is a benchmark dataset for video difference captioning whose evaluation is automated through an LLM-as-a-Judge protocol and binary checklists.
  • It utilizes a dual-checklist framework to separately assess similarity and difference detection, yielding precise, task-specific accuracy metrics.
  • The dataset supports advancements in content forensics and multimodal evaluation by providing reproducible, empirical measurements for video-language tasks.

The LLM-as-a-Judge protocol defines an automated and standardized approach for evaluating the descriptive accuracy of Multimodal LLMs (MLLMs) in comparative video understanding tasks. The protocol operationalizes checklist-based assessment, using an LLM to simulate expert annotation and thereby enabling scalable, high-fidelity benchmarking of natural language outputs against fine-grained compositional criteria. Its formalism, as realized in the ViDiC-1K dataset within the ViDiC: Video Difference Captioning framework, represents a paradigm shift from ad hoc human rating to reproducible, empirical measurement for difference captioning and related video-language tasks (Wu et al., 3 Dec 2025).

1. Protocol Overview and Rationale

The LLM-as-a-Judge protocol establishes a pipeline in which model-generated descriptions of video-pair comparisons are quantitatively benchmarked by querying a dedicated judge LLM—specifically, GPT-5-Mini—with a set of binary questions about compositional, spatial, and temporal differences and similarities (organized as a “dual-checklist”). The judge LLM produces answers based solely on the candidate's textual output, not by direct access to the raw videos or ground truth. These answers are then matched to meticulously human-validated ground-truth responses, yielding interpretable, task-specific accuracy metrics (Wu et al., 3 Dec 2025).

This approach addresses the need for reproducibility, objectivity, and granularity in the evaluation of MLLMs, particularly for tasks where free-form output must be systematically mapped to structured semantic differences encompassing attributes such as subject identity, scene style, action, and cinematography.
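To make this mapping concrete, the following minimal sketch shows one plausible way a single checklist item could be represented; the field names and sample values are illustrative assumptions, not the released ViDiC-1K schema.

```python
# Hypothetical representation of one ViDiC-1K-style checklist item.
# All field names and sample values are assumptions for illustration only.
checklist_item = {
    "pair_id": "pair_0007",          # identifier for the video pair (A, B)
    "dimension": "camera work",      # one of the seven compositional dimensions
    "type": "difference",            # "similarity" or "difference" item
    "question": "Does video B contain a zoom-in shot that video A lacks?",
    "ground_truth": "yes",           # human-validated binary answer
}
```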

2. Dual-Checklist Design and Execution

Underpinning the protocol is the dual-checklist evaluation framework, which distinguishes between similarity and difference detection:

  • For each video pair (A, B), the system provides a checklist Q of over 4,000 items spanning seven major dimensions (subject, style, background, camera work, subject motion, positional relationship, and playback technique) that represent compositional aspects.
  • The candidate MLLM (the model under test, denoted M) generates a free-form natural language description D of the comparison.
  • The judge LLM J is then prompted with (D, Q) and generates a structured answer set A_J in JSON form with binary yes/no (plus rationale) responses, mirroring the granularity of the ground-truth annotation.

Separate scoring for similarity and difference questions reduces confounds between overlooked similarities and omitted differences. Similarity items (e.g., “Are these attributes different?” GT=no) penalize only hallucinated distinctions; omissions are not penalized. Difference items (e.g., “Does B have X that A lacks?” GT=yes) require explicit positive mention of all true differences (Wu et al., 3 Dec 2025).
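A minimal sketch of this execution flow, assuming a generic judge client and an illustrative prompt format (neither is the exact ViDiC implementation), might look as follows:

```python
import json

def build_judge_prompt(description: str, checklist: list[dict]) -> str:
    """Assemble a judge prompt asking for a binary yes/no answer (plus a short
    rationale) per checklist item, grounded only in the candidate description."""
    questions = "\n".join(
        f'{i}. [{item["type"]}] {item["question"]}' for i, item in enumerate(checklist)
    )
    return (
        "You are grading a model-written comparison of two videos.\n"
        "Answer each question using ONLY the description below; do not guess beyond "
        'what it states. Reply with a JSON list of objects with fields "id", '
        '"answer" ("yes" or "no"), and "rationale".\n\n'
        f"Description:\n{description}\n\nQuestions:\n{questions}"
    )

def judge_answers(description: str, checklist: list[dict], call_judge_llm) -> dict[int, str]:
    """call_judge_llm is a placeholder for whatever client wraps the judge model
    (GPT-5-Mini in the paper); it should return the judge's raw JSON text."""
    raw = call_judge_llm(build_judge_prompt(description, checklist))
    return {entry["id"]: entry["answer"] for entry in json.loads(raw)}
```

The key design property mirrored here is that the judge sees only the candidate description D and the checklist Q, never the raw videos or the ground truth.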

3. Formal Metrics

The protocol specifies precise mathematical notation for accuracy computation. Let Q_sim and Q_diff denote the ambiguity-suppressed sets of similarity and difference checklist questions, respectively. Accuracy metrics are defined as:

A_{\mathrm{sim}} = \frac{1}{|Q_{\mathrm{sim}}|} \sum_{i \in Q_{\mathrm{sim}}} \mathbf{1}\bigl(A_{J,i} = A_{GT,i}\bigr)

A_{\mathrm{diff}} = \frac{1}{|Q_{\mathrm{diff}}|} \sum_{j \in Q_{\mathrm{diff}}} \mathbf{1}\bigl(A_{J,j} = A_{GT,j}\bigr)

A_{\mathrm{all}} = \frac{1}{|Q|} \sum_{k \in Q} \mathbf{1}\bigl(A_{J,k} = A_{GT,k}\bigr)

Here, A_{J,i} is the answer given by the judge LLM and A_{GT,i} is the human ground truth for checklist item i. Agreement between GPT-5-Mini and human annotators reaches 95.2% overall (95.9% similarity, 94.97% difference), and repeated protocol runs are highly stable (variability below 0.6%) (Wu et al., 3 Dec 2025).
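In code, these metrics reduce to simple agreement rates. The sketch below assumes judge and ground-truth answers keyed by checklist item id, plus a map marking each item as a similarity or difference question; the function and field names are illustrative.

```python
def checklist_accuracy(judge: dict[int, str], gt: dict[int, str],
                       item_type: dict[int, str]) -> dict[str, float]:
    """Compute A_sim, A_diff, and A_all as the fraction of checklist items on
    which the judge LLM's answer matches the human ground truth."""
    def acc(ids):
        return sum(judge[i] == gt[i] for i in ids) / len(ids) if ids else float("nan")

    sim_ids = [i for i in gt if item_type[i] == "similarity"]
    diff_ids = [i for i in gt if item_type[i] == "difference"]
    return {"A_sim": acc(sim_ids), "A_diff": acc(diff_ids), "A_all": acc(list(gt))}
```

For instance, a pair where the judge matches the ground truth on 3 of 4 similarity items and 5 of 8 difference items yields A_sim = 0.75, A_diff = 0.625, and A_all ≈ 0.667.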

4. Empirical Findings and System Performance

The protocol's deployment in ViDiC-1K spans evaluation of 19 models—including both closed-source (e.g., Gemini-2.5-Pro, GPT-5, GPT-4o) and open-source (Qwen3-VL, InternVL-3.5, LLaVA-v1.6-Vicuna) MLLMs. The protocol reveals that:

  • Top-performing models achieve approximately 66–67% (Gemini-2.5-Pro) and 63% (GPT-5) overall average accuracy, with similarity detection consistently exceeding difference detection (e.g., Gemini-2.5-Pro Sim 75.33% vs. Diff 63.73%).
  • Style variations are reliably detected (>75%) by several models, while camera work and playback technique remain challenging for all models (<50% average accuracy).
  • “Thinking” modes in some models (e.g., InternVL-3.5 Thinking) incrementally improve difference detection at the expense of increased hallucinations, lowering similarity scores.
  • Chronic error modes identified include hallucinated differences, self-contradictions, and omitted subtle temporal or compositional changes, particularly in camera and spatial relationship dimensions (Wu et al., 3 Dec 2025).

Model                 Avg (%)   Diff (%)   Sim (%)
Gemini-2.5-Pro        66.72     63.73      75.33
GPT-5                 62.94     57.32      79.17
Qwen3-VL-32B          61.38     58.54      71.50
LLaVA-v1.6-Vicuna      8.96      5.11      20.07

This tabulation illustrates the discriminative power and calibration challenges in current MLLMs under the LLM-as-a-Judge paradigm.

5. Protocol Limitations and Evaluation Challenges

The current deployment of the LLM-as-a-Judge protocol exhibits several constraints. The evaluation set scale (1,000 video pairs) suffices for fine-grained assessment but does not support large-scale training. The annotation process—combining expert human review and automated LLM-based drafting—entails significant cost and remains sensitive to subtle temporal ambiguities. The protocol’s reliance on binary checklist answers enforces objectivity but may trade off flexibility and open-ended linguistic richness, highlighting an inherent tension between checklist rigor and naturalistic output diversity (Wu et al., 3 Dec 2025).

A plausible implication is that future expansion may demand broader, instruction-style corpora that retain compositional coverage while enabling richer, more context-sensitive scoring.

6. Applications and Future Directions

The LLM-as-a-Judge protocol, as instantiated in the ViDiC-1K framework, has several identified use cases:

  • Content forensics, including detection of forgeries or video manipulations.
  • Verification of video editing integrity and alignment with prompts.
  • Automated change-log and event summary generation for post-production workflows.
  • Fine-grained automated feedback in sports analytics or rehabilitation monitoring.
  • Intelligent surveillance with semantic change detection and alerting for relevant activity (Wu et al., 3 Dec 2025).

Anticipated future directions include scaling datasets for instruction tuning, developing architectures with inductive biases for comparative spatio-temporal reasoning, and integrating multi-modal reasoning mechanisms that jointly optimize for both difference and similarity detection.

Continued use of the LLM-as-a-Judge protocol is expected to support advances in fine-grained video understanding and benchmarking best practices for multimodal intelligence.

References

Wu et al. "ViDiC: Video Difference Captioning." 3 December 2025.