- The paper presents a novel benchmark that employs prompt-specific Likert-scale evaluations and a VLM auditor to score videos across 16 dimensions.
- It demonstrates high alignment with human ratings by replicating tier-ranking structures and highlighting gaps in physical and temporal consistency.
- The framework offers a robust, scalable approach that enhances evaluation granularity and informs practical model selection in generative video research.
WorldJen: Multi-Dimensional Benchmarking of Generative Video Models
Problem Statement and Motivation
The evaluation of generative video models is a longstanding and unresolved challenge in computer vision and AI. Existing reference-based metrics such as SSIM and PSNR are restricted to pixel-level fidelity and fail to capture higher-level semantic and physical correctness. Distributional metrics like FVD emphasize style components and overlook physical plausibility. Previous benchmarks, notably VBench and its successors, utilize binary VQA, but suffer from biases, resolution downsampling, and single-dimension specialization, resulting in poor granularity and substantial cost for evaluating new models. There is an industry-wide need for a benchmark that delivers human-aligned, granular, multi-dimensional evaluation with practical scalability.
WorldJen Benchmarking Framework
WorldJen introduces an end-to-end benchmarking pipeline that directly addresses the shortcomings of prior metrics and frameworks. The approach rests on two pillars: prompt-specific Likert-scale questionnaires across 16 evaluation dimensions, and automated scoring by a vision-language model (VLM) at full video resolution. The framework is structured in two phases:
- Phase A (Prompt Curation):
Human-authored prompts are curated from the VidProM corpus (~1.7M prompts) using a filtering regime that emphasizes entropy, coverage, and complexity. After deduplication, safety review, and suitability/difficulty scoring, 3,754 prompts are retained. Where necessary, prompts are enhanced with LLMs to ensure challenging multi-dimensional coverage.
- Phase B (Evaluation Engine):
For each video generated by a model under a curated prompt, 10 Likert-scale VQA questions per applicable dimension are generated and scored by a VLM auditor. Scores are aggregated in two ways: the Bradley-Terry (BT) model provides a global ranking, and the Predicted Human Alignment Score (PHAS) calibrates against human preferences using weighted dimension averages.
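The weighted-average step can be illustrated with a minimal sketch. It assumes per-question Likert scores on a 1-5 scale and a set of per-dimension weights. The function names, dimension keys, scores, and weights below are hypothetical, and the way PHAS actually calibrates its weights against human preferences is not described here.

```python
from statistics import mean


def dimension_means(answers):
    """Map {dimension: [Likert scores]} to {dimension: mean score}."""
    return {dim: mean(scores) for dim, scores in answers.items() if scores}


def weighted_aggregate(dim_scores, weights):
    """Weighted average over the dimensions present in dim_scores.

    Weights are renormalised over the applicable dimensions, so a prompt
    that exercises only some dimensions is not penalised for the rest.
    """
    total = sum(weights[d] for d in dim_scores)
    if total <= 0:
        raise ValueError("weights of applicable dimensions must sum to > 0")
    return sum(weights[d] * s for d, s in dim_scores.items()) / total


# Hypothetical auditor answers: 10 questions for each applicable dimension.
answers = {
    "physical_mechanics": [3, 4, 3, 3, 4, 3, 3, 4, 3, 3],
    "color_harmony": [5, 5, 4, 5, 5, 5, 4, 5, 5, 5],
}
# Hypothetical weights; "semantic_adherence" is not applicable to this prompt.
weights = {"physical_mechanics": 0.75, "color_harmony": 0.25, "semantic_adherence": 0.5}

dim_scores = dimension_means(answers)
phas_like = weighted_aggregate(dim_scores, weights)
print(dim_scores, round(phas_like, 3))
```

Renormalising over applicable dimensions is one possible design choice; a fixed denominator over all 16 dimensions would instead penalise prompts that cover fewer dimensions.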
Human Preference Study
A blind pairwise human preference study serves as the empirical ground-truth anchor. Seven annotators from diverse domains assessed 300 videos (six models × 50 prompts), yielding 2,696 forced-choice pairwise comparisons with confidence weighting. Anonymization and interface design were used to mitigate bias. BT ratings extracted from human votes revealed a consistent three-tier structure:
- Top tier: Veo 3.1 Fast, Kling v2.6 Pro
- Mid tier: Wan v2.2 A14B, LTX-2, Hunyuan v1.5
- Bottom tier: Wan 2.1 1.3B
These clusters exhibited statistically significant separation, with inter-annotator agreement at 66.9% and perfect test-retest self-consistency from the lead annotator.
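As an illustration of how BT ratings can be extracted from pairwise votes, the following is a minimal sketch using the standard minorization-maximization (MM) update for the Bradley-Terry model. Treating each vote's confidence as a fractional win is an assumption, as are the function names and the toy votes; the study's exact weighting scheme is not specified here.

```python
def build_win_matrix(votes, n_items):
    """votes: iterable of (winner, loser, confidence) index triples."""
    wins = [[0.0] * n_items for _ in range(n_items)]
    for winner, loser, confidence in votes:
        wins[winner][loser] += confidence  # assumption: confidence = fractional win
    return wins


def fit_bradley_terry(wins, n_iter=500, tol=1e-10):
    """Fit BT strengths with the MM update; strengths are normalised to sum to 1."""
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(w_i / denom if denom > 0 else p[i])
        s = sum(new_p)
        new_p = [x / s for x in new_p]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


# Toy votes for three items (0, 1, 2); every vote has confidence 1.0.
votes = (
    [(0, 1, 1.0)] * 8 + [(1, 0, 1.0)] * 2
    + [(0, 2, 1.0)] * 9 + [(2, 0, 1.0)] * 1
    + [(1, 2, 1.0)] * 7 + [(2, 1, 1.0)] * 3
)
strengths = fit_bradley_terry(build_win_matrix(votes, 3))
print([round(x, 3) for x in strengths])
```

An item with no wins at all is driven to strength zero by this update, so real use typically needs a small prior or regularisation term.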
VLM-Auditor Pipeline and Robustness
WorldJen's VLM evaluation, run with Gemini 3 Flash, shows perfect tier-level concordance with human BT rankings (Spearman ρ=1.0, p=0.0014, n=6 models), confirmed through dense scoring (16 dimensions × 10 questions × 50 prompts). Dimension-wise analysis identifies persistent gaps: physical mechanics and inertial consistency are consistently rated near the scale midpoint (≤3.45/5), underscoring a widespread deficit in physics modeling. Aesthetics (color harmony, semantic drift) are highly saturated across all models (≥4.6/5).
Cross-method validation using reference-free semantic similarity (Gemini Embedding 2) aligns strongly with VLM-derived semantic adherence scores (Spearman ρ=0.943).
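For intuition about the reported correlation statistics, the sketch below computes Spearman's ρ and an exact one-sided permutation p-value for six items. With a perfect ranking, only 1 of the 6! = 720 orderings is at least as extreme, giving 1/720 ≈ 0.0014. This is one reading consistent with the reported value; whether it is the test actually used is an assumption, and the scores below are hypothetical.

```python
from itertools import permutations
from math import factorial


def ranks(values):
    """Ranks 1..n for values without ties (smallest value gets rank 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r


def spearman_rho(x, y):
    """Spearman correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def exact_one_sided_p(x, y):
    """Fraction of all rank permutations with rho >= the observed rho."""
    observed = spearman_rho(x, y)
    count = sum(
        1 for perm in permutations(ranks(y))
        if spearman_rho(x, perm) >= observed - 1e-12
    )
    return count / factorial(len(x))


# Hypothetical scores for six models; only their ordering matters.
human_bt = [2.1, 1.9, 1.2, 1.1, 1.0, 0.4]
vlm_score = [4.5, 4.4, 4.0, 3.9, 3.8, 3.1]

rho = spearman_rho(human_bt, vlm_score)
p_value = exact_one_sided_p(human_bt, vlm_score)
print(rho, round(p_value, 4))
```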
Ablation and Auditor Diversity
Ablations demonstrate the framework's robustness:
- VLM Diversity: Closed-source auditors (Gemini 3 Flash, Claude Sonnet) and an open-source auditor (Gemma 4) all replicate the three-tier structure, with minor scale offsets. Dimensional rank agreement is highest on the semantic and physics axes and weakest on the temporal/spatial dimensions.
- Prompt Enhancement: LLM-based prompt enhancement increases difficulty and discrimination without altering rank order.
- Scoring Stability: Rankings are stable with respect to the number of questions per dimension (Q≥3 suffices) and the benchmark scale (≥50 prompts ensure reliable rank recovery).
- Variance and Reliability: Within-run and between-run variance is small, with negligible impact on global ranking. The main source of variance is partial prompt adherence, not VLM hallucination.
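A stability check of the kind described in the scoring-stability ablation can be sketched as a subsampling experiment: draw k prompts at random, rank models by mean score, and count how often the full-benchmark ranking is recovered. The data below are synthetic and the function names are hypothetical; this shows the general idea, not the paper's ablation protocol.

```python
import random


def ranking(scores_by_model, prompt_ids):
    """Models sorted from best to worst by mean score over prompt_ids."""
    means = {
        m: sum(s[p] for p in prompt_ids) / len(prompt_ids)
        for m, s in scores_by_model.items()
    }
    return sorted(means, key=means.get, reverse=True)


def rank_recovery_rate(scores_by_model, k, n_trials=200, seed=0):
    """Fraction of random k-prompt subsets that reproduce the full ranking."""
    rng = random.Random(seed)
    n_prompts = len(next(iter(scores_by_model.values())))
    all_ids = list(range(n_prompts))
    full = ranking(scores_by_model, all_ids)
    hits = sum(
        ranking(scores_by_model, rng.sample(all_ids, k)) == full
        for _ in range(n_trials)
    )
    return hits / n_trials


# Synthetic per-prompt scores: three models with different true means.
data_rng = random.Random(42)
true_means = {"model_a": 4.2, "model_b": 3.6, "model_c": 3.0}
scores = {
    m: [mu + data_rng.gauss(0, 0.8) for _ in range(100)]
    for m, mu in true_means.items()
}

for k in (5, 20, 50, 100):
    print(k, rank_recovery_rate(scores, k))
```

Sampling is without replacement, so k equal to the full prompt count always recovers the full ranking; smaller k shows how quickly recovery degrades as per-prompt noise dominates.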
Comparison with VBench and Other Metrics
Direct comparison with VBench indicates that WorldJen's granular VLM-based scoring delivers much greater discriminative power and closer human alignment. VBench's reference-free metrics saturate near ceiling, with minimal inter-model spread (<3% across most dimensions), and fail to reliably reproduce human tier boundaries. The resolution bottleneck (VBench downsamples to 224×224 for feature extraction) further impairs detection of fine-grained artifacts. WorldJen avoids these pitfalls by maintaining native video resolution and using dimension-aware prompt and question design.
Practical and Theoretical Implications
WorldJen's demonstrated reliability and discrimination capacity have immediate implications for both academic evaluation and production model selection. The framework's scalability reduces generation requirements and evaluation cost by testing multiple dimensions per prompt. It exposes critical compositional and physical limitations in current models, suggesting that further research is needed in incorporating explicit physical simulators, inductive biases for causality, and improved temporal representations.
The use of VLM-as-a-judge provides a scalable proxy for human evaluation, enabling continuous benchmarking and checkpoint selection during model training. The Predicted Human Alignment Score (PHAS) complements the BT rating as an interpretable aggregate metric. Future directions include extending to conditional modalities (image/audio-to-video), video editing, and world model/action recognition benchmarking.
Conclusion
WorldJen presents a comprehensive, reproducible, and human-aligned multi-dimensional benchmark for generative video models (2605.03475). The pipeline's Likert-scale VQA and high-resolution VLM grading accurately replicate human tier structures and expose genuine quality gaps in current systems, notably in physics and temporal consistency. The framework's robustness to auditor swap, prompt enhancement, and question count makes it suitable for large-scale benchmarking and model selection in both academic and industrial contexts. Released datasets and code facilitate community adoption and further refinement. The results indicate that granularity, prompt complexity, and dense scoring are essential for meaningful video generation benchmarking, and that vision-language models offer a viable path to scalable, near-human evaluation.