Zero-Shot Benchmarking of Video Models
- Zero-shot evaluation protocols for generative video models probe their capacity for out-of-the-box generalization without task-specific fine-tuning.
- Methodologies include perturb-and-track, rollout-based prediction, and prompted video reasoning, scored with standardized metrics such as EPE, MAD, and DSC.
- Reported results indicate that architectures combining distributional prediction, factorized spatio-temporal latents, and random-access decoding perform best on perturbation-based tasks such as optical flow, while other architectures show characteristic failure modes.
A zero-shot benchmark for generative video models refers to an evaluation protocol in which models are assessed on tasks or domains unseen during training, with no additional fine-tuning or task-specific supervision. This paradigm probes the emergent capabilities, architectural constraints, and application limits of modern generative video models under challenging generalization settings. Recent work has formalized robust zero-shot benchmarks across domains such as motion estimation, predictive display, navigation, classification, foundation vision tasks, and education.
1. Definition and Scope of Zero-Shot Benchmarking
Zero-shot benchmarking assesses generative video models, typically large self-supervised or text-conditioned models, on their ability to solve tasks "out of the box": purely by prompt, perturbation, or input manipulation, with no task-specific fine-tuning, additional labeled data, or gradient updates. Evaluation is conducted on tasks for which the model was not directly optimized, revealing how far pretraining and model capacity alone enable generalization and composition (Kim et al., 11 Jul 2025, Khalil et al., 10 May 2026, Wiedemer et al., 24 Sep 2025, Huang et al., 10 Feb 2026, Lai et al., 11 Oct 2025, Lee et al., 26 May 2026, Zhang et al., 2018).
Benchmarked tasks include:
- Optical flow extraction from video (Kim et al., 11 Jul 2025)
- Short-horizon predictive display for teleoperation (Khalil et al., 10 May 2026)
- 3D navigation/planning from video (Huang et al., 10 Feb 2026)
- General video reasoning (perception, manipulation, multimodal reasoning) (Wiedemer et al., 24 Sep 2025)
- Medical imaging tasks (segmentation, denoising, motion prediction) (Lai et al., 11 Oct 2025)
- Educational adequacy and classroom safety (Lee et al., 26 May 2026)
- Zero-shot video classification (via GAN-based feature synthesis) (Zhang et al., 2018)
Zero-shot protocols are strictly enforced: models are frozen, no task-specific gradients are applied, and all evaluation occurs on tasks/splits not seen during training.
2. Benchmark Methodologies and Protocol Design
Benchmark design is anchored by strict task splits, input/output protocols, and standardized metrics:
- Perturb-and-Track Protocols: A small localized perturbation (e.g., a Gaussian "white bump") is injected into an input video frame, and the model predicts both the perturbed and the clean future. The difference between the two predictions, quantified either as an RGB difference or as the KL divergence between the predicted distributions, shows where the perturbation has moved. This yields per-pixel motion and enables zero-shot optical flow extraction with no fine-tuning or flow labels (Kim et al., 11 Jul 2025).
- Rollout-Based Future-Frame Prediction: In predictive display, models autoregressively roll out multiple future frames conditioned on preceding context (e.g., 9 past frames), and accuracy is measured across an 8-step horizon at multiple resolutions. Key metrics are Mean Absolute Difference (MAD), per-frame inference latency, rollout latency, GPU memory, and divergence or error drift over time (Khalil et al., 10 May 2026).
- Prompted Video Reasoning and Planning: For navigation or reasoning, a generative model is prompted with a single image and a text instruction and unrolls a "dreamed" look-ahead trajectory. Multiple sampled rollouts are scored by a pretrained Vision-Language Model (VLM) for coherence, instruction alignment, and safety, and downstream inverse dynamics models translate the selected video into control waypoints (Huang et al., 10 Feb 2026).
- Educational and Multimodal Safety Judgement: Video outputs are scored on composite pedagogical rubrics (e.g., Knowledge–Skills–Attitude), with explicit refusal gates on unsafe or inappropriate prompts. Multiple human raters and VLM-based auto-judging establish reliability and inter-model ranking (Lee et al., 26 May 2026).
- Zero-Shot Classification via Generative Feature Synthesis: GAN-driven frameworks synthesize video features for unseen classes from semantic priors (e.g., GloVe embeddings), converting zero-shot learning (ZSL) into a supervised task without an explicit mapping projection (Zhang et al., 2018).
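The perturb-and-track idea described above can be illustrated with a minimal NumPy sketch. It is not the implementation of Kim et al.: it assumes the model exposes per-pixel categorical logits of shape `(H, W, V)` for the clean and perturbed futures, it assumes the divergence direction KL(perturbed || clean), and the function names `kl_map` and `trace_flow` are hypothetical.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def kl_map(clean_logits, perturbed_logits, eps=1e-12):
    """Per-pixel KL(perturbed || clean) for logits shaped (H, W, V).

    The divergence direction is an assumption made for this sketch.
    """
    p = softmax(perturbed_logits)
    q = softmax(clean_logits)
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)


def trace_flow(clean_logits, perturbed_logits, source_yx):
    """Estimate the displacement of a perturbation injected at source_yx.

    The pixel where the two predicted futures disagree most is taken as
    the new location of the perturbed point.
    """
    kl = kl_map(clean_logits, perturbed_logits)
    target_y, target_x = np.unravel_index(np.argmax(kl), kl.shape)
    dx = int(target_x) - source_yx[1]
    dy = int(target_y) - source_yx[0]
    return (dx, dy), kl
```

Repeating the trace for a grid of source pixels would give a dense flow field. Taking the arg-max of the divergence map is the simplest possible read-out; a weighted centroid of the map is an alternative when the response is spread over several pixels.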
3. Architectural Requirements and Model Comparisons
Zero-shot performance is highly sensitive to generative model architecture. Empirical analysis across diverse architectures (latent regressors, diffusion, raster-order autoregressors, local random-access decoders) distills three critical properties for high-fidelity zero-shot reasoning (Kim et al., 11 Jul 2025):
- Distributional Future Prediction: Generating full predictive distributions (not single-point estimates), which reduces blurring and mean-regression collapse in outputs.
- Factorized Spatio-Temporal Latent Structure: Ensuring localized, patch-wise latents (avoiding global bottlenecks), which enables local input perturbations to propagate accurately.
- Random-Access Decoding: Supporting arbitrary, partial-frame conditioning—crucial for counterfactual tracking and fine-grained query answering.
Among the surveyed models, only the Local Random Access Sequence (LRAS) architecture natively satisfies all three requirements, enabling state-of-the-art zero-shot optical flow estimation via KL-tracing (Kim et al., 11 Jul 2025). Other leading architectures (e.g., the transformer-based latent diffusion model LTX, the diffusion-based SVD, and the autoregressive Cosmos) exhibit characteristic failure modes where they lack these properties (Khalil et al., 10 May 2026, Wiedemer et al., 24 Sep 2025).
Comparative Results for Select Benchmarks
| Model/Task | Best Zero-Shot Result | Baseline (Prior/Supervised) |
|---|---|---|
| LRAS+KL-tracing/Flow | 16.6% EPE reduction (DAVIS) | RAFT, SEA-RAFT (supervised) |
| LTX-2B/PredictiveDisplay | MAD=11.55 (512×320), slow | SVD faster, but error ≫ LTX; none real-time |
| NavDreamer Navigation | 87% task success (Wan 2.6) | 33–53% (open-source backbones) |
| Veo 3/Reasoning Tasks | pass@1 up to 0.76 | Veo 2 baseline 0.10 |
| LVM/Medical Motion | DSC=95.15% | 89.16% (RMSim), 88.14% (ConvLSTM) |
| EduVideoBench KSA | 0.45 (Wan 2.6), 0.38 (Sora) | None exceed 0.52 (classroom ready) |
4. Evaluation Metrics and Task Suites
Benchmarks apply discipline-appropriate quantitative metrics, strict dataset splits, and qualitative human judgment where relevant:
- Optical Flow/Tracking: Endpoint error (EPE), Average Jaccard, occlusion accuracy (OA via KL threshold) (Kim et al., 11 Jul 2025).
- Predictive Display: Framewise and average MAD, temporal error evolution, runtime, and memory usage (Khalil et al., 10 May 2026).
- Navigation: Success rate, SPL (success weighted by path length), visual consistency, dynamic feasibility (Huang et al., 10 Feb 2026).
- Vision Reasoning: Task-specific metrics (e.g., mean IoU, OIS for edges, pass@k for counting/symmetry/maze) (Wiedemer et al., 24 Sep 2025).
- Medical Imaging: DSC, IoU, symmetric surface distance, 95th-percentile Hausdorff distance, PSNR, SSIM (Lai et al., 11 Oct 2025).
- Education: Composite Knowledge–Skills–Attitude (KSA) score with rubric sub-components and mandatory safety gating (A-NE refusal rate) (Lee et al., 26 May 2026).
- Zero-Shot Classification: Top-1 accuracy, mAP, generalized ZSL splits (Zhang et al., 2018).
A hallmark is that models are explicitly evaluated on domains or tasks not encountered during pretraining.
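Several of the metrics listed above have compact standard definitions. The sketch below gives illustrative NumPy versions of EPE, MAD, DSC, and SPL; the function names are hypothetical, and individual benchmarks may differ in details such as intensity scaling for MAD or per-structure averaging for DSC.

```python
import numpy as np


def endpoint_error(flow_pred, flow_gt):
    """EPE: mean Euclidean distance between flow vectors, shape (..., 2)."""
    return float(np.mean(np.linalg.norm(flow_pred - flow_gt, axis=-1)))


def mean_absolute_difference(frames_pred, frames_gt):
    """MAD: mean absolute pixel difference (cast to float to avoid uint8 wraparound)."""
    diff = frames_pred.astype(np.float64) - frames_gt.astype(np.float64)
    return float(np.mean(np.abs(diff)))


def dice_coefficient(mask_pred, mask_gt):
    """DSC: 2 * |A and B| / (|A| + |B|) for binary masks; 1.0 if both are empty."""
    a = mask_pred.astype(bool)
    b = mask_gt.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / denom)


def spl(successes, shortest_lengths, path_lengths):
    """SPL: mean of S_i * l_i / max(p_i, l_i) over episodes.

    Assumes strictly positive shortest-path lengths l_i.
    """
    s = np.asarray(successes, dtype=np.float64)
    l = np.asarray(shortest_lengths, dtype=np.float64)
    p = np.asarray(path_lengths, dtype=np.float64)
    return float(np.mean(s * l / np.maximum(p, l)))
```

For example, three navigation episodes with successes `[1, 0, 1]`, shortest paths of length 10, and executed paths of length 10, 20, and 20 give an SPL of 0.5: the failed episode contributes 0 and the successful but inefficient one contributes 0.5.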
5. Analysis of Strengths, Pitfalls, and Generalization Gaps
Zero-shot frameworks reveal both the latent capabilities and the boundaries of contemporary generative video modeling:
- Successes:
  - High-fidelity motion tracking, optical flow, and navigation emerge in models with appropriate architectural priors and training scale, on both real-world and synthetic benchmarks (e.g., LRAS, Wan 2.6) (Kim et al., 11 Jul 2025, Huang et al., 10 Feb 2026).
  - Video models such as Veo 3 show emergent capacity for chain-of-frames reasoning, perception, manipulation, and certain visual logic tasks, suggesting a development trajectory similar to that of LLMs (Wiedemer et al., 24 Sep 2025).
  - Medical video models transfer zero-shot to segmentation, motion prediction, and low-level restoration with high anatomical and temporal coherence (Lai et al., 11 Oct 2025).
- Failure Modes:
  - Models lacking factorization or random-access decoding (e.g., SVD, raster-order decoders) suffer from drift, blurred outputs, or an inability to propagate local perturbations (Kim et al., 11 Jul 2025).
  - In predictive display, no model achieves low error and real-time inference simultaneously; scaling model size does not guarantee improved short-horizon fidelity or latency (Khalil et al., 10 May 2026).
  - In educational settings, absolute scores on pedagogical validity remain low, and models commonly violate principles from the Cognitive Theory of Multimedia Learning (CTML) and Cognitive Load Theory (CLT), such as Modality and Signaling; comprehension gains from video over text are negligible (Lee et al., 26 May 2026).
  - GAN-based zero-shot synthesis frameworks are sensitive to semantic embedding quality and are unstable across random splits (Zhang et al., 2018).
Aggregate metrics can mask divergent error patterns, especially when error drift or mode collapse develops over multiple frames (Khalil et al., 10 May 2026).
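The point about aggregate metrics can be made concrete with a small sketch. It assumes rollouts stored as arrays of shape `(T, H, W, C)` and uses the slope of a least-squares line through the per-step errors as one possible drift indicator; both the function names and the choice of indicator are illustrative, not taken from the cited benchmark.

```python
import numpy as np


def per_step_mad(rollout_pred, rollout_gt):
    """Mean absolute difference for each rollout step; inputs shaped (T, H, W, C)."""
    diff = np.abs(rollout_pred.astype(np.float64) - rollout_gt.astype(np.float64))
    return diff.reshape(diff.shape[0], -1).mean(axis=1)


def drift_slope(step_errors):
    """Slope of a least-squares line fitted to error versus step index."""
    steps = np.arange(len(step_errors))
    slope, _intercept = np.polyfit(steps, step_errors, deg=1)
    return float(slope)
```

Two rollouts can share the same horizon-averaged MAD while one holds a constant error and the other starts accurate and degrades steadily. Reporting the per-step curve, or a summary such as its slope, separates the two cases.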
6. Extensions, Limitations, and Future Benchmark Directions
Zero-shot benchmarks are expanding to include more diverse scenarios (beyond motion and prediction), and to set unified standards for foundation vision models:
- Beyond Flow/Prediction: Counterfactual prompting and tracing can generalize to extraction of depth, segmentation, and physical parameters by careful design of tracer probes and distributional queries (Kim et al., 11 Jul 2025).
- Towards Unified Foundation Models: Models like Veo 3 signal the shift toward unified, generalist video models that span perception, manipulation, and reasoning, analogous to the LLM revolution in NLP (Wiedemer et al., 24 Sep 2025).
- Pedagogical and Safety-Aware Benchmarks: EduVideoBench introduces a blueprint for multidimensional safety and validity in classroom-aligned video generation; recommendations include explicit dual-channel outputs, scene segmentation, and curriculum-aware prompt selection (Lee et al., 26 May 2026).
- Automated and Scalable Evaluation: VLM-based scoring enables automatic, scalable reward assignment; sampling-based policy evaluation increases robustness in stochastic planners (Huang et al., 10 Feb 2026).
- Technical Barriers:
  - Inference speed, memory, and computational cost remain substantial barriers to real-time deployment of generative video models. Aggressive inference optimization, distillation, quantization, and model pruning are active research directions (Khalil et al., 10 May 2026).
  - Standardization of datasets, public splits, robustness metrics, and semantic calibration methods is required for cross-benchmark comparability (Zhang et al., 2018).
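The automated, sampling-based evaluation mentioned in the list above can be sketched as best-of-N selection over sampled rollouts. Everything in this block is an assumption for illustration: `score_fn` stands in for a VLM judge returning scores in [0, 1] for the three criteria named earlier (coherence, instruction alignment, safety), and the weighted sum and the `safety_floor` gate are hypothetical design choices, not the scoring rule of NavDreamer.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Criteria names follow the text; the dictionary keys themselves are hypothetical.
CRITERIA = ("coherence", "alignment", "safety")


def select_rollout(
    rollouts: Sequence[object],
    score_fn: Callable[[object], Dict[str, float]],
    weights: Optional[Dict[str, float]] = None,
    safety_floor: float = 0.5,
) -> Optional[Tuple[int, float]]:
    """Return (index, total score) of the best rollout that passes the safety gate.

    Returns None when every sampled rollout falls below safety_floor.
    """
    weights = weights or {name: 1.0 for name in CRITERIA}
    best: Optional[Tuple[int, float]] = None
    for index, rollout in enumerate(rollouts):
        scores = score_fn(rollout)
        if scores["safety"] < safety_floor:
            continue  # hard gate: unsafe rollouts are never selected
        total = sum(weights[name] * scores[name] for name in CRITERIA)
        if best is None or total > best[1]:
            best = (index, total)
    return best
```

Treating safety as a hard gate rather than as one more weighted term keeps a highly coherent but unsafe rollout from winning on aggregate score. Returning `None` lets the caller resample or abort instead of executing a rejected plan.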
A plausible implication is that zero-shot protocols, when paired with architecture-driven inductive priors and large-scale self-supervised pretraining, will continue to unlock cross-domain generalization in generative video models. Limitations in interpretability, safety, and practical deployment persist, however, and addressing them will require both technical and domain-specific advances.
7. Representative Datasets, Models, and Metrics (Summary Table)
| Domain/Benchmark | Model Examples | Datasets | Key Metric(s) |
|---|---|---|---|
| Optical Flow | LRAS, SVD, Cosmos, CWM | TAP-Vid DAVIS/Kubric | EPE, KL, AJ, OA |
| Predictive Display | LTX (2B/13B), SVD, Wan VACE/I2V | CARLA simulator, MILE, Roach | MAD, runtime, VRAM |
| Navigation | Wan 2.6, Cosmos 2.5, Hunyuan 1.5 | Real/sim indoor/outdoor flights | Task Succ., SPL |
| Vision Reasoning | Veo 3, Nano Banana | BIPEDv2, LVIS, Emu-edit, Mazes | OIS, mIoU, pass@k |
| Medical Imaging | LVM | 4D/3D CT (multiple cohorts) | DSC, IoU, SD, HD95 |
| Education | Wan 2.6, Sora 2, Veo 3.1, Kling | EduVideoBench (Korean curriculum) | KSA score, A-NE |
| Classification (GAN) | cGAN (Zhang & Peng) | HMDB51, UCF101, CCV, Olympic | Top-1 Acc., mAP |
References
- Taming generative video models for zero-shot optical flow extraction (Kim et al., 11 Jul 2025)
- Towards Generative Predictive Display for Vision-Based Teleoperation: A Zero-Shot Benchmark of Off-the-Shelf Video Models (Khalil et al., 10 May 2026)
- NavDreamer: Video Models as Zero-Shot 3D Navigators (Huang et al., 10 Feb 2026)
- Video models are zero-shot learners and reasoners (Wiedemer et al., 24 Sep 2025)
- Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging? (Lai et al., 11 Oct 2025)
- Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench (Lee et al., 26 May 2026)
- Visual Data Synthesis via GAN for Zero-Shot Video Classification (Zhang et al., 2018)