UniVA-Bench: Agentic Video AI Benchmark
- UniVA-Bench is an open-source benchmark suite designed to evaluate holistic video AI agents that execute multi-step workflows involving planning, generation, editing, segmentation, and memory management.
- It features two evaluation tracks, Functional Modules and Agentic Probing, which assess task-specific output quality alongside agentic planning behavior via metrics such as wPED, DepCov, and ReplanQ.
- UniVA-Bench addresses the evaluation gap in single-task benchmarks by simulating real-world, iterative video production workflows with traceable goal cards and curated datasets.
UniVA-Bench is an open-source benchmark suite specifically designed to evaluate agentic, multi-step video AI systems concerned with end-to-end workflows rather than isolated, single-task capabilities. Developed in conjunction with the UniVA framework, UniVA-Bench emphasizes the assessment of holistic agent behavior, including planning, orchestration, adaptability, and memory utilization across heterogeneous and compositional video tasks. Its mission is to fill the evaluation gap left by existing single-task benchmarks by simulating real-world, iterative video production workflows that require agents to chain understanding, generation, editing, segmentation, and composition steps with full traceability and context preservation (Liang et al., 11 Nov 2025).
1. Benchmark Scope and Motivation
UniVA-Bench targets comprehensive evaluation of video AI agents by focusing on the integration of diverse functional capacities in multi-step, agentic workflows. Key goals are:
- Holistic Evaluation beyond per-task accuracy, probing the ability to coordinate and recover across interdependent video tasks.
- Multi-modal, Multi-step Workflows that mirror real video production, employing structured "goal cards" containing gold-standard artifacts such as storyboards, per-frame masks, and QA pairs to define complex, chained objectives.
- Agentic and Functional Probing through metrics that cover not only output quality but also the quality of action planning (weighted Plan Edit Distance, wPED), logical dependency resolution (DepCov), and robustness to re-planning under simulated failures (ReplanQ), alongside analyses of different memory layers (global, task, user).

This approach contrasts with traditional datasets (e.g., DAVIS, UCF101, text-to-video benchmarks), which measure isolated, per-task objectives and do not capture planning, adaptation, or continuity in agentic systems.
2. Task Taxonomy and Functional Tracks
UniVA-Bench encompasses two orthogonal evaluation tracks: (1) Functional Modules, targeting core video tasks; and (2) Agentic Probing, evaluating the agent's cognitive and procedural capacities.
Functional Modules
- Understanding (Long-Video QA): Systems answer 10 interdependent semantic and aesthetic questions about a single long-form video (60–120 s) in one inference pass.
- Generation: Includes:
- LongText2Video: Generation from a long, noisy textual prompt (100–200 words), requiring storyboard-level planning.
- Entities2Video: Narrative video generation from 1–3 reference images and a rewritten prompt, with strict entity preservation.
- Video2Video: Modification of a source video according to textual instruction, covering story alignment (style change), style alignment (story change), and semantic alignment (both).
- Editing (Long Video): Multi-step edits across 30–60 s clips, demanding cross-shot modifications while maintaining narrative integration.
- Segmentation (Long Video): Temporally consistent object segmentation across concatenated, occlusion-prone clips from DAVIS-2017.
Agentic Probing
- Storyboard to Execution Planning: Evaluates whether agents accurately translate storyboards into executable action sequences.
- Pipeline Failure Recovery: Measures planning quality (wPED), logical coverage (DepCov), and recoverability (ReplanQ) under controlled tool outages; a sketch of such a probe follows this list.
- Memory Analysis: Probes the impact of global (expert trajectories), user (reference retrieval), and task (storyboard) memories on planning robustness.
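As a concrete illustration of the failure-recovery probe, the sketch below disables one tool mid-workflow and asks the agent to re-plan. All names here (`ToolRegistry`, `agent.plan`, `agent.replan`) are hypothetical stand-ins rather than the actual UniVA-Bench harness API, and plans are assumed to be simple lists of tool names.

```python
# Hypothetical sketch of a controlled tool-outage probe. ToolRegistry and the
# agent.plan/agent.replan interface are illustrative assumptions, not the
# actual UniVA-Bench harness; plans are treated as lists of tool names.
from dataclasses import dataclass, field


@dataclass
class ToolRegistry:
    """Tools the agent may call; one is disabled to simulate an outage."""
    available: set[str] = field(default_factory=lambda: {
        "understand", "generate", "edit", "segment", "compose"})

    def disable(self, tool: str) -> None:
        self.available.discard(tool)


def probe_failure_recovery(agent, goal_card: dict, registry: ToolRegistry,
                           broken_tool: str) -> dict:
    """Disable one tool mid-workflow, request a re-plan, and report whether
    the revised plan routes around the outage. The returned plans are then
    scored with wPED, DepCov, and ReplanQ (Section 4)."""
    initial_plan = agent.plan(goal_card, tools=registry.available)
    registry.disable(broken_tool)                     # controlled tool outage
    revised_plan = agent.replan(initial_plan, tools=registry.available)
    return {
        "initial_plan": initial_plan,
        "revised_plan": revised_plan,
        "avoids_broken_tool": broken_tool not in revised_plan,
    }
```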
3. Dataset Composition and Structure
UniVA-Bench's datasets are curated for rigorous qualitative assessment rather than scale. Each functional module uses a small, well-defined evaluation set without train/validation splits:
- Understanding: 10 long-form videos from Video-MME, each with 10 QA pairs.
- Generation: 10 long/noisy prompts for LongText2Video derived from manually created storyboards; 10 Entities2Video data points from OpenS2V-Nexus, with rewritten prompts; 10 Video2Video instances per subtype from SF20k, with hand-authored instructions.
- Editing: 10 long videos from SF20k, each with expert-constructed multi-step prompts.
- Segmentation: 10 composite clips drawn from DAVIS-2017, highlighting entity occlusions.
- Probing sets: 50 planning tasks and approximately 20 pipeline-failure tasks per metric.
All video assets are standardized to 480p resolution at 24 fps, delivered as referenced artifacts in JSON-formatted goal cards. Evaluation operates exclusively on these held-out sets.
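For orientation, a goal card might look roughly like the following sketch; the field names and paths are hypothetical, since the authoritative schema is defined by the released JSON files themselves.

```python
import json
from pathlib import Path

# Hypothetical goal-card layout; the authoritative schema is defined by the
# released JSON files and may use different field names.
example_goal_card = {
    "task": "LongText2Video",
    "goal": "Generate a coherent narrative video from the prompt below.",
    "prompt": "A long, noisy 100-200 word description of the target story ...",
    "artifacts": {
        "storyboard": "artifacts/longtext2video/003/storyboard.json",
        "reference_video": "artifacts/longtext2video/003/reference_480p_24fps.mp4",
    },
    "constraints": {"resolution": "480p", "fps": 24},
}


def load_goal_cards(directory: str) -> list[dict]:
    """Load every JSON goal card found in a directory."""
    return [json.loads(p.read_text()) for p in sorted(Path(directory).glob("*.json"))]
```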
4. Evaluation Protocols and Metrics
UniVA-Bench employs a multi-tier evaluation apparatus, distinguishing between task-specific outputs and agentic procedures.
Task-Specific Quality Metrics
- Understanding: Normalized QA accuracy, computed as the fraction of correct answers per 10-question vector.
- Generation & Editing:
- CLIP Score: The average CLIP-ViT similarity between prompt/storyboard captions and randomly sampled video frames.
- DINO Score: Average cosine similarity between reference entity DINO features and frames.
- MLLM Preference: Pairwise output evaluation via large vision-LLMs (e.g., InternVL-3-78B, Gemini-2.5-Pro).
- Segmentation:
- J-mean (region similarity, mean Jaccard index $\mathcal{J}$), F-mean (boundary F-score $\mathcal{F}$), and J&F-mean (their average), defined per frame as $\mathcal{J} = \frac{|M \cap G|}{|M \cup G|}$ and $\mathcal{F} = \frac{2 P_c R_c}{P_c + R_c}$, where $M$ and $G$ denote the predicted and ground-truth masks and $P_c$, $R_c$ the contour-based precision and recall; a minimal sketch of $\mathcal{J}$ and the CLIP score follows this list.
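To ground these definitions, the following minimal sketch computes the per-frame region similarity $\mathcal{J}$ and a CLIP-style frame-caption similarity; the specific CLIP checkpoint and frame-sampling strategy are assumptions rather than the benchmark's exact configuration.

```python
# Minimal sketches of two task metrics. The CLIP checkpoint and frame sampling
# are assumptions, not the benchmark's exact configuration.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


def region_similarity(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Per-frame Jaccard index J = |M ∩ G| / |M ∪ G| for binary masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter / union) if union else 1.0


_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_score(caption: str, frames: list[Image.Image]) -> float:
    """Average cosine similarity between a caption and sampled video frames."""
    inputs = _proc(text=[caption], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = _model(**inputs)
    text = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return float((img @ text.T).mean())
```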
MLLM-as-a-Judge Evaluation
- Structured scoring across six axes: semantic accuracy, spatial relations, behavior, attributes, style, and overall consistency (each 1–5), with aggregated preference rates.
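A minimal sketch of such a structured judge protocol is shown below; the rubric wording and JSON output format are assumptions, as the released LLM-judge scripts define the exact prompts.

```python
# Illustrative judge prompt for the six scoring axes; the exact rubric wording
# and output format used by UniVA-Bench's LLM-judge scripts may differ.
import json

AXES = ["semantic accuracy", "spatial relations", "behavior",
        "attributes", "style", "overall consistency"]

JUDGE_PROMPT = (
    "You are rating a generated video against its instruction.\n"
    "Score each axis from 1 (poor) to 5 (excellent) and reply as JSON:\n"
    + json.dumps({axis: "<1-5>" for axis in AXES}, indent=2)
)


def parse_judge_scores(response_text: str) -> dict[str, int]:
    """Parse the judge model's JSON reply into per-axis integer scores."""
    scores = json.loads(response_text)
    return {axis: int(scores[axis]) for axis in AXES}
```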
Agentic Planning Metrics
- wPED: Weighted Plan Edit Distance between the predicted and reference tool-call sequences, computed from the Levenshtein edit distance $d_{\mathrm{Lev}}$ over tool-name sequences (see the sketch after this list).
- DepCov: Dependency coverage, measuring how completely a plan resolves the logical dependencies between steps.
- ReplanQ: Re-planning quality, measuring how well the agent recovers when a tool fails mid-pipeline.
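The edit-distance core behind wPED can be sketched as follows; the normalization into a [0, 1] similarity is an assumption here, and the paper's per-step weighting is not reproduced.

```python
# Levenshtein edit distance over tool-name sequences. The normalization into a
# [0, 1] similarity is an assumption; UniVA-Bench's per-step weighting for
# wPED is not reproduced here.
def levenshtein(a: list[str], b: list[str]) -> int:
    """Edit distance between two tool-name sequences."""
    dp = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ta != tb))  # substitution
    return dp[len(b)]


def plan_similarity(pred: list[str], gold: list[str]) -> float:
    """Normalized plan similarity in [0, 1]; 1.0 means identical tool sequences."""
    if not pred and not gold:
        return 1.0
    return 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))
```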
5. Baseline Comparisons and Results
UniVA-Bench's evaluation protocol permits direct comparison between generalist, agentic architectures and single-purpose task models. For each core task, UniVA and baselines such as LTX-Video, Wan, Seedance, GPT-4o, Gemini-2.5 Pro, InternVL3-38B, Qwen2.5-VL-72B, VACE, and SA2VA were benchmarked. Highlights are as follows:
| Task | Metric | Best Baseline | UniVA Result | Qualitative Note |
|---|---|---|---|---|
| LongText2Video | CLIP/MLLM Pref | 0.2161/2.650 | 0.2814/3.333 | UniVA leads in coherence and MLLM preference |
| Entities2Video | CLIP | Slightly higher | Lower | UniVA prioritizes narrative coherence over pixel similarity |
| Video2Video | MLLM Pref | 2.621 | 4.068 | UniVA favored for holistic intent fulfillment, even at modest cost to frame similarity |
| Understanding | QA Accuracy | 0.75 (InternVL3-38B) | 0.76 | Plan-based decomposition advantageous |
| Editing | CLIP/DINO/MLLM | 0.2258/0.6808/3.484 | 0.2280/0.7488/3.635 | Integrated reasoning-editing synergy |
| Segmentation | J & F-mean | 0.1524 (SA2VA) | 0.2467 | Improved occlusion handling via dynamic semantic queries |
In agentic probing, the Plan-Act architecture more than doubles the planning success rate (45% vs. 20% for a single-agent baseline) and the wPED score (0.117 vs. 0.050). Memory ablations demonstrate that global memory prevents catastrophic failures, user memory aids identity alignment, and storyboard (task) memory boosts semantic coherence.
6. Usage and Artifacts
UniVA-Bench is fully open-sourced (http://univa.online/) with the codebase at https://github.com/UniVA-video/UniVA-Bench. Artifacts include:
- JSON goal cards specifying task definitions and references to standardized .mp4 files.
- Python evaluation scripts for CLIP, DINO, J/F measures, and LLM-judge evaluation.
- APIs for metric computation and plan-tool simulation, documented in the repository.
The procedure for local benchmarking:
- Clone UniVA-Bench.
- Install dependencies (Python 3.10, PyTorch 2.0, HuggingFace Transformers, FFmpeg).
- Download video artifacts with the provided script.
- Launch evaluation with:

```bash
python bench.py \
  --tasks all \
  --models univa,ltx,wan,seedance \
  --output results.json
```

No training or validation splits are used; evaluation exclusively targets the provided curated test sets.
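Once a run completes, the aggregated output can be inspected directly; the snippet below assumes only that results.json is valid JSON, not any particular result schema.

```python
# Inspect the aggregated benchmark output; assumes only valid JSON,
# not a particular result schema.
import json
from pprint import pprint

with open("results.json", encoding="utf-8") as f:
    results = json.load(f)
pprint(results)
```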
7. Limitations and Prospective Directions
Current limitations of UniVA-Bench include its limited dataset scale (10–20 curated instances per module), which constrains the statistical power of reported comparisons; incomplete coverage of modalities such as audio composition and subtitle generation; reliance on LLM-generated artifacts (storyboards and prompts) rather than fully human-annotated references; and omission of low-level fidelity metrics (e.g., FVD, PSNR, SSIM). Planned extensions include expanding the functional and probing modules to hundreds or thousands of workflows, deepening modality coverage, incorporating human-authored references as ground truth, and broadening support for evaluation metrics and emerging video-LLMs.
UniVA-Bench sets a foundation for systematic evaluation of generalist, agentic video AI. By integrating traditional output quality measures with sophisticated planning and memory metrics, it provides a comprehensive platform for assessing next-generation, context-aware, compositional video systems (Liang et al., 11 Nov 2025).