WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models

Published 5 May 2026 in cs.CV | (2605.03475v1)

Abstract: Evaluating generative video models remains an open problem. Reference-based metrics such as the Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR) reward pixel fidelity over semantic correctness, while Fréchet Video Distance (FVD) favors distributional textures over physical plausibility. Binary Visual Question Answering (VQA) benchmarks like VBench 2.0 are prone to yes-bias and rely on low-resolution auditors that miss temporal failures. Moreover, their prompts target a single dimension at a time, multiplying the number of videos required while still not guaranteeing reliable results. WorldJen addresses these limitations directly. Binary VQA is replaced with Likert-scale questionnaires graded by a VLM that receives frames at native video resolution. Video generation costs are addressed by using adversarially curated prompts designed to exercise up to 16 quality dimensions simultaneously. The framework is built around two interlocking contributions. First, a blind human preference study is conducted, accumulating 2,696 pairwise annotations from 7 annotators with 100% pair coverage over 50 of the curated prompts $\times$ 6 state-of-the-art video models. A mean inter-annotator agreement of 66.9% is achieved, and the study establishes a human ground-truth Bradley-Terry (BT) rating with a three-tier structure. Second, a VLM-as-a-judge evaluation engine using prompt-specific, dimension-specific Likert questionnaires (10 questions per dimension, 47,160 scored responses) judges the videos and independently reproduces the human-established three-tier BT rating structure. The VLM achieves a Spearman $\hat{\rho}=1.000$, $p=0.0014$, which is interpreted as tier agreement with the human results. Six focused ablation studies validate the robustness of the VLM evaluation framework.


Authors (3)

Summary

The paper presents a novel benchmark that employs prompt-specific Likert-scale evaluations and a VLM auditor to score videos across 16 dimensions.
It demonstrates high alignment with human ratings by replicating tier-ranking structures and highlighting gaps in physical and temporal consistency.
The framework offers a robust, scalable approach that enhances evaluation granularity and informs practical model selection in generative video research.

WorldJen: Multi-Dimensional Benchmarking of Generative Video Models

Problem Statement and Motivation

The evaluation of generative video models is a longstanding and unresolved challenge in computer vision and AI. Reference-based metrics such as SSIM and PSNR are restricted to pixel-level fidelity and fail to capture higher-level semantic and physical correctness. Distributional metrics like FVD emphasize texture statistics and overlook physical plausibility. Previous benchmarks, notably VBench and its successors, use binary VQA but suffer from yes-bias, resolution downsampling, and single-dimension prompts, resulting in poor granularity and substantial cost when evaluating new models. There is an industry-wide need for a benchmark that delivers human-aligned, granular, multi-dimensional evaluation with practical scalability.

WorldJen Benchmarking Framework

WorldJen introduces an end-to-end benchmarking pipeline that directly addresses the shortcomings of prior metrics and frameworks. The approach rests on two core pillars: prompt-specific Likert-scale questionnaires across 16 evaluation dimensions, and automated scoring by a vision-language model (VLM) at full video resolution. The framework is structured in two distinct phases:

Phase A (Prompt Curation):

Human-authored prompts are curated from the VidProM corpus (~1.7M prompts) using a filtering regime that emphasizes entropy, coverage, and complexity. After deduplication, safety review, and suitability/difficulty scoring, 3,754 prompts are retained; where necessary, they are enhanced using LLMs to ensure challenging multi-dimensional coverage.

Phase B (Evaluation Engine):

For each video generated by a model under a curated prompt, 10 Likert-scale VQA questions per applicable dimension are generated and scored by a VLM auditor. Scores are aggregated in two ways: the Bradley-Terry (BT) model for global ranking, and the Predicted Human Alignment Score (PHAS), which calibrates against human preferences using weighted dimension averages.
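As a rough illustration of the aggregation step, the sketch below averages 1-5 Likert answers within each dimension and then combines the dimension means with weights. The function names, dimension labels, example scores, weights, and the renormalization over applicable dimensions are all assumptions made for illustration; the actual PHAS calibration against human preferences is not specified in this summary and is not reproduced here.

```python
from statistics import mean


def dimension_means(responses):
    """Average the Likert answers (1-5) within each scored dimension.

    `responses` maps a dimension name to the list of VLM answers for one video.
    """
    return {dim: mean(scores) for dim, scores in responses.items() if scores}


def weighted_score(dim_means, weights):
    """Weighted average over the dimensions that were actually scored.

    Weights are renormalized over the applicable dimensions (an assumption),
    so a dimension that does not apply to a prompt does not lower the score.
    """
    used = [dim for dim in dim_means if dim in weights]
    total = sum(weights[dim] for dim in used)
    if total == 0:
        raise ValueError("no weighted dimensions were scored")
    return sum(weights[dim] * dim_means[dim] for dim in used) / total


# Hypothetical answers for one video: 10 questions per applicable dimension.
responses = {
    "physical_mechanics": [3, 4, 3, 3, 4, 3, 3, 4, 3, 3],
    "color_harmony": [5, 5, 4, 5, 5, 5, 4, 5, 5, 5],
}
# Hypothetical weights; "semantic_adherence" is not applicable to this video.
weights = {"physical_mechanics": 2.0, "color_harmony": 1.0, "semantic_adherence": 1.5}

means = dimension_means(responses)
score = weighted_score(means, weights)
print(means, round(score, 3))
```

In this toy case the dimension means are 3.3 and 4.8, and the weighted score is (2.0 × 3.3 + 1.0 × 4.8) / 3.0 = 3.8.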

Human Preference Study

A blind pairwise human preference study serves as the empirical ground truth. Seven annotators from diverse domains assessed 300 videos (six models × 50 prompts) in a forced-choice format with confidence weighting, yielding 2,696 pairwise comparisons. Model identities were anonymized and the interface was designed to mitigate bias. BT ratings extracted from the human votes revealed a consistent three-tier structure:

Top tier: Veo 3.1 Fast, Kling v2.6 Pro
Mid tier: Wan v2.2 A14B, LTX-2, Hunyuan v1.5
Bottom tier: Wan 2.1 1.3B

These clusters exhibited statistically significant separation, with inter-annotator agreement at 66.9% and perfect test-retest self-consistency from the lead annotator.
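With six models there are C(6,2) = 15 model pairs per prompt, or 750 pairs over 50 prompts, so 2,696 annotations correspond to roughly 3.6 votes per pair on average. To make the BT step concrete, the sketch below fits Bradley-Terry strengths from a matrix of pairwise win counts using the standard minorization-maximization update. It is a minimal illustration on invented counts for three hypothetical models; it ignores confidence weighting, and the study's actual fitting procedure is not described in this summary.

```python
def fit_bradley_terry(wins, iters=500, tol=1e-10):
    """Fit Bradley-Terry strengths with the MM update.

    wins[i][j] is the number of times item i was preferred over item j.
    Returns strengths normalized to sum to 1 (higher means preferred more often).
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        new_p = [x / norm for x in new_p]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


# Invented win counts for three hypothetical models (10 votes per pair).
#        vs A  vs B  vs C
wins = [
    [0, 7, 9],  # model A
    [3, 0, 6],  # model B
    [1, 4, 0],  # model C
]
strengths = fit_bradley_terry(wins)
print([round(s, 3) for s in strengths])
```

Under the BT model, the probability that model i is preferred over model j is p_i / (p_i + p_j). Confidence-weighted votes could be accommodated by entering fractional counts in `wins`, though whether the study does so is an assumption.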

VLM-Auditor Pipeline and Robustness

WorldJen's VLM evaluation, using Gemini 3 Flash, demonstrates perfect tier-level concordance with human BT rankings (Spearman ρ=1.0, p=0.0014, n=6 models), confirmed through dense scoring (16 dimensions × 10 questions × 50 prompts). Dimension-wise analysis identifies persistent gaps: physical mechanics and inertial consistency are consistently rated near the scale midpoint (≤3.45/5), underscoring a widespread deficit in physics modeling. Aesthetics (color harmony, semantic drift) are highly saturated across all models (≥4.6/5).
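The reported p-value can be sanity-checked by enumeration. One reading consistent with the numbers (an assumption, since the test used is not stated here) is an exact one-sided permutation test: with n=6 items and no ties, only one of the 6! = 720 possible rankings matches the reference ranking perfectly, giving p = 1/720 ≈ 0.0014. The sketch below verifies this with placeholder ranks 1 to 6 standing in for the six models.

```python
from itertools import permutations
from math import factorial


def spearman_rho(rank_a, rank_b):
    """Spearman correlation for two rankings without ties."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def exact_one_sided_p(rank_a, rank_b):
    """Fraction of all permutations with rho at least as large as observed."""
    n = len(rank_a)
    observed = spearman_rho(rank_a, rank_b)
    hits = sum(
        1
        for perm in permutations(range(1, n + 1))
        if spearman_rho(rank_a, perm) >= observed - 1e-12
    )
    return hits / factorial(n)


# Placeholder ranks: both judges order the six models identically.
human_ranks = [1, 2, 3, 4, 5, 6]
vlm_ranks = [1, 2, 3, 4, 5, 6]

rho = spearman_rho(human_ranks, vlm_ranks)
p_value = exact_one_sided_p(human_ranks, vlm_ranks)
print(rho, round(p_value, 4))
```

A two-sided version of the same test would give 2/720 ≈ 0.0028, so the reported value matches the one-sided reading.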

Cross-method validation using reference-free semantic similarity (Gemini Embedding 2) aligns strongly with VLM-derived semantic adherence scores (Spearman ρ=0.943).

Ablation and Auditor Diversity

Ablations demonstrate the framework's robustness:

VLM Diversity: Closed-source (Gemini 3 Flash, Claude Sonnet) and open-source (Gemma 4) auditors all replicate the three-tier structure, with minor scale offsets. Dimensional rank agreement is highest on the semantic and physics axes and weakest on the temporal and spatial dimensions.
Prompt Enhancement: LLM-based prompt enhancement increases difficulty and discrimination without altering rank order.
Scoring Stability: The ranking is stable with respect to the number of questions per dimension (Q≥3 suffices) and the benchmark scale (≥50 prompts ensures reliable rank recovery).
Variance and Reliability: Within-run and between-run variance is small, with negligible impact on global ranking. The main source of variance is partial prompt adherence, not VLM hallucination.
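A question-count ablation of this kind can be sketched as a subsampling experiment: keep only Q of the 10 answers per questionnaire, recompute each model's mean score, and check how often the rank order matches the full-data order. The example below runs on synthetic Likert data for three hypothetical models; the data generator, model names, and stability measure are assumptions for illustration and do not reproduce the paper's ablation protocol.

```python
import random
from statistics import mean


def rank_models(answers, q=None, rng=None):
    """Rank models by mean Likert score, best first.

    answers[model] is a list of questionnaires, each a list of 1-5 scores.
    If q is given, only q randomly chosen answers per questionnaire are used.
    """
    means = {}
    for model, rows in answers.items():
        picked = [rng.sample(row, q) if q is not None else row for row in rows]
        means[model] = mean(score for row in picked for score in row)
    return sorted(means, key=means.get, reverse=True)


def rank_stability(answers, q, trials=200, seed=0):
    """Fraction of subsampling trials that reproduce the full-data ranking."""
    rng = random.Random(seed)
    full_order = rank_models(answers)
    hits = sum(rank_models(answers, q, rng) == full_order for _ in range(trials))
    return hits / trials


def synthetic_answers(seed=0, rows=50, questions=10):
    """Synthetic data: rounded, clipped Gaussian scores around a latent quality."""
    rng = random.Random(seed)
    quality = {"model_a": 4.3, "model_b": 3.6, "model_c": 2.8}  # hypothetical
    return {
        model: [
            [min(5, max(1, round(rng.gauss(mu, 0.8)))) for _ in range(questions)]
            for _ in range(rows)
        ]
        for model, mu in quality.items()
    }


answers = synthetic_answers()
full_order = rank_models(answers)
stability_q3 = rank_stability(answers, q=3)
print(full_order, stability_q3)
```

When the gaps between models are large relative to the per-question noise, as in this synthetic setup, a small Q already recovers the ranking; closely matched models would need more questions or more prompts.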

Comparison with VBench and Other Metrics

Direct comparison with VBench indicates that WorldJen's granular VLM-based scoring delivers much greater discriminative power and closer human alignment. VBench's reference-free metrics saturate near ceiling, with minimal inter-model spread (<3% across most dimensions), and fail to reliably reproduce human tier boundaries. The resolution bottleneck (VBench downsamples to 224×224 for feature extraction) further impairs detection of fine-grained artifacts. WorldJen avoids these pitfalls by maintaining native video resolution and using dimension-aware prompt and question design.
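The size of the resolution bottleneck is easy to quantify. The short calculation below computes the fraction of pixels that survive a resize to 224×224; the 1280×720 and 1920×1080 native resolutions are assumed examples, since the resolutions of the evaluated videos are not given in this summary.

```python
def retained_pixel_fraction(native_w, native_h, target_w=224, target_h=224):
    """Fraction of the native pixel count left after resizing to the target."""
    return (target_w * target_h) / (native_w * native_h)


# Assumed native resolutions, for illustration only.
frac_720p = retained_pixel_fraction(1280, 720)
frac_1080p = retained_pixel_fraction(1920, 1080)
print(f"720p: {frac_720p:.1%}, 1080p: {frac_1080p:.1%}")
```

For a 720p frame, a 224×224 input keeps about 5.4% of the pixels (50,176 of 921,600), and for 1080p about 2.4%, which is why small artifacts can vanish before the metric ever sees them.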

Practical and Theoretical Implications

WorldJen's demonstrated reliability and discrimination capacity have immediate implications for both academic evaluation and production model selection. The framework's scalability reduces generation requirements and evaluation cost by testing multiple dimensions per prompt. It exposes critical compositional and physical limitations in current models, suggesting that further research is needed in incorporating explicit physical simulators, inductive biases for causality, and improved temporal representations.

The use of VLM-as-a-judge provides a scalable proxy for human evaluation, enabling continuous benchmarking and checkpoint selection during model training. The Predicted Human Alignment Score (PHAS) complements the BT rating as an interpretable aggregate metric. Future directions include extending to conditional modalities (image/audio-to-video), video editing, and world model/action recognition benchmarking.

Conclusion

WorldJen presents a comprehensive, reproducible, and human-aligned multi-dimensional benchmark for generative video models (2605.03475). The pipeline's Likert-scale VQA and high-resolution VLM grading replicate human tier structures and expose genuine quality gaps in current systems, notably in physics and temporal consistency. The framework's robustness to auditor swaps, prompt enhancement, and question count makes it suitable for large-scale benchmarking and model selection in both academic and industrial contexts. Released datasets and code facilitate community adoption and further refinement. The results indicate that granularity, prompt complexity, and dense scoring are essential for meaningful video generation benchmarking, and that vision-language models offer a viable path for scalable, near-human evaluation.
