EditVerseBench: A Video Editing Benchmark
- EditVerseBench is a systematically built benchmark offering 20 diverse instruction-guided video editing tasks with balanced resolution splits.
- It integrates both automated and VLM-based evaluation methods to comprehensively assess edit quality, text-video alignment, and temporal consistency.
- Its balanced composition and broad categorical coverage support systematic evaluation of model robustness for real-world, instruction-driven video editing applications.
EditVerseBench is a systematically constructed, instruction-based benchmark for video editing. It enables rigorous, multi-dimensional evaluation of a model's ability to perform a wide spectrum of real-world editing operations guided by natural-language instructions. Developed in conjunction with the EditVerse unified framework, EditVerseBench is notable for its categorical breadth, aspect-ratio diversity, and integration of both automated and VLM-based assessment, positioning it as a reference benchmark for research and development in multimodal generative AI, especially instruction-guided video editing and generation (Ju et al., 24 Sep 2025).
1. Objectives and Motivation
EditVerseBench was conceived to address the limitations of earlier evaluation datasets for video editing, which were restricted in resolution (typically supporting only square videos), scope of editing tasks, and granularity of evaluation. The benchmark is expressly designed for instruction-guided video editing across a comprehensive set of real-world tasks, explicitly covering 20 video editing categories. These span object addition/removal, style transfer, camera movement, mask-based operations, and other practical edit types routinely encountered in both consumer and professional settings.
By structuring tasks as pairs of input video and textual instruction, EditVerseBench focuses on the capability of models to interpret and execute complex, multi-step or compositional edits via open-ended instructions, thereby measuring both generative and understanding capacities.
2. Dataset Composition and Structure
EditVerseBench comprises 100 curated video samples sourced from free stock video libraries, balancing diversity and editability. The dataset is evenly split between two resolutions: 50 horizontal (landscape) and 50 vertical (portrait) videos, ensuring models are evaluated for robustness across aspect ratios relevant to contemporary user content (such as social vs. cinematic formats).
Each video is paired with two unique editing instructions drawn from the 20 edit categories, resulting in a total of 200 evaluation instances. Sampling ensures that each edit type is represented by five horizontal and five vertical examples, mitigating sampling bias and covering a broad range of content, scene types, and motion dynamics.
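To make the composition concrete, the following minimal sketch shows how one evaluation instance might be represented; the field names and values are hypothetical illustrations, not the benchmark's released data format.

```python
# Hypothetical schema for a single EditVerseBench evaluation instance.
# Field names and values are illustrative assumptions, not the released format.
example_instance = {
    "video_id": "stock_0042",
    "orientation": "horizontal",          # 50 horizontal and 50 vertical source videos
    "edit_category": "object_removal",    # one of the 20 edit categories
    "instruction": "Remove the red umbrella from the beach scene.",
    "source_video": "videos/stock_0042.mp4",
}

# 100 videos x 2 instructions each = 200 instances in total, with every category
# covered by 5 horizontal and 5 vertical examples.
```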
The categorical coverage includes, but is not limited to:
- Object removal/addition
- Style transfer and aesthetic modification
- Camera movement editing
- Mask detection and propagation
- Content replacement
- Scene and attribute manipulations
3. Evaluation Metrics and Methodologies
EditVerseBench deploys a multi-faceted evaluation suite, integrating both automated signal-based metrics and vision-language model (VLM) scorers:
a. Vision-Language Model (VLM) Evaluation
A state-of-the-art VLM, such as GPT‑4o, serves as an automated judge. For each edit instance, three frames are sampled from the source and edited videos. The VLM receives the instruction, source frame, and edited frame, and is prompted to score adherence to the instruction, quality of the edit, and background or scene consistency. This setup emulates human evaluative reasoning while providing scalability.
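A minimal sketch of this judging loop is shown below, assuming access to the OpenAI Python client with GPT‑4o as the judge; the prompt wording, 0-10 scale, and frame-file handling are illustrative assumptions rather than the benchmark's exact protocol.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def to_data_url(path: str) -> str:
    """Encode a saved frame as a base64 data URL for the vision API."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def vlm_judge(instruction: str, source_frame: str, edited_frame: str) -> str:
    """Ask the VLM judge to score instruction adherence, edit quality, and scene consistency.

    The prompt and scoring scale are illustrative, not the benchmark's exact wording.
    """
    prompt = (
        f"Editing instruction: {instruction}\n"
        "The first image is a frame from the source video; the second is the "
        "corresponding frame from the edited video.\n"
        "Rate each from 0 to 10: (1) adherence to the instruction, (2) quality of the edit, "
        "(3) consistency of unedited background/scene content. Reply as JSON."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": to_data_url(source_frame)}},
                {"type": "image_url", "image_url": {"url": to_data_url(edited_frame)}},
            ],
        }],
    )
    return resp.choices[0].message.content
```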
b. Video Quality Metrics
Frame-wise PickScore is used as a proxy for perceptual quality and prompt relevance; PickScore is trained on human preference data, and prior evidence indicates that it correlates with human judgments.
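The following sketch computes frame-wise PickScore with the publicly released yuvalkirstain/PickScore_v1 checkpoint on Hugging Face, averaged over edited frames; the exact prompt text and frame sampling used by the benchmark are assumptions.

```python
import torch
from transformers import AutoProcessor, AutoModel

# Public PickScore checkpoint and its CLIP-H processor (standard usage from the PickScore release).
processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
model = AutoModel.from_pretrained("yuvalkirstain/PickScore_v1").eval()

@torch.no_grad()
def framewise_pickscore(frames, prompt: str) -> float:
    """Mean PickScore of the edited frames against the editing prompt (higher is better)."""
    image_inputs = processor(images=frames, return_tensors="pt")
    text_inputs = processor(text=[prompt], padding=True, truncation=True, return_tensors="pt")
    image_embs = model.get_image_features(**image_inputs)
    text_embs = model.get_text_features(**text_inputs)
    image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
    scores = model.logit_scale.exp() * (text_embs @ image_embs.T)  # shape (1, num_frames)
    return scores.mean().item()
```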
c. Text–Video Alignment
Alignment is quantified at two levels (a sketch of the frame-wise computation follows the list):
- Frame-wise: Cosine similarity between CLIP text and image embeddings for each frame.
- Video-level: Global video-text alignment leveraging ViCLIP, encoding the entire edited clip against the instruction embedding to capture higher-order compositional fidelity.
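A minimal sketch of the frame-wise score is given below using the Hugging Face CLIP implementation; the specific CLIP checkpoint is an assumption, and the video-level score follows the same cosine-similarity pattern with ViCLIP's video encoder in place of the per-frame image encoder.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Checkpoint choice is an assumption; any CLIP variant with a joint text-image space works here.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def framewise_text_video_alignment(frames, instruction: str) -> float:
    """Mean cosine similarity between the instruction embedding and each edited-frame embedding."""
    inputs = clip_proc(text=[instruction], images=frames,
                       return_tensors="pt", padding=True, truncation=True)
    out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)  # (num_frames, dim)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)    # (1, dim)
    return (img @ txt.T).mean().item()

# Video-level variant (conceptually): encode the whole edited clip with ViCLIP's video encoder,
# encode the instruction with its text encoder, and take the cosine similarity of the two.
```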
d. Temporal Consistency
Smoothness and coherence between adjacent frames are scored by:
- CLIP feature-based cosine similarity across frames.
- DINO feature-based temporal/structural coherence checks.
These metrics are computed for each evaluation instance and aggregated (e.g., averaged) across the benchmark for overall performance rankings. Although explicit formulas for these aggregations are not stipulated, each score reduces to a reproducible mean over frame-wise or video-level embedding similarities, as illustrated in the sketch below.
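The sketch below expresses both temporal-consistency scores as means of cosine similarities between consecutive-frame embeddings; the checkpoint choices (openai/clip-vit-large-patch14 and facebook/dino-vits16) are assumptions rather than the benchmark's stated configuration.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor, ViTImageProcessor, ViTModel

def adjacent_frame_consistency(frame_embeds: torch.Tensor) -> float:
    """Mean cosine similarity between consecutive frame embeddings of shape (T, D)."""
    e = F.normalize(frame_embeds, dim=-1)
    return (e[:-1] * e[1:]).sum(dim=-1).mean().item()

# CLIP-based consistency: embeddings from the CLIP image encoder (checkpoint is an assumption).
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# DINO-based consistency: CLS-token features from a DINO ViT (checkpoint is an assumption).
dino = ViTModel.from_pretrained("facebook/dino-vits16").eval()
dino_proc = ViTImageProcessor.from_pretrained("facebook/dino-vits16")

@torch.no_grad()
def clip_consistency(frames) -> float:
    inputs = clip_proc(images=frames, return_tensors="pt")
    return adjacent_frame_consistency(clip.get_image_features(**inputs))

@torch.no_grad()
def dino_consistency(frames) -> float:
    inputs = dino_proc(images=frames, return_tensors="pt")
    return adjacent_frame_consistency(dino(**inputs).last_hidden_state[:, 0])
```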
4. Contributions and Distinguishing Features
EditVerseBench introduces several key advances over prior benchmarks:
- Breadth of Tasks: By spanning 20 categories of editing operations, EditVerseBench supports comprehensive assessment of generative model versatility and instruction comprehension.
- Aspect Ratio Generalization: Deliberate inclusion of both horizontal and vertical resolutions reflects the diversity of modern video content, avoiding the square-video constraint of earlier datasets.
- Task Sampling Balance: Fixed sampling per category and orientation ensures equal representation and statistical reliability across the benchmark.
- Evaluation Paradigm: Integration of VLM-based scoring with objective signal-driven metrics allows for robust measurement along dimensions relevant to end-user satisfaction, model fidelity, and practical usability.
| Property | EditVerseBench | Prior Benchmarks |
|---|---|---|
| Categories | 20 editing types | <10 (often <5) |
| Aspect ratio diversity | Horizontal + vertical | Square only |
| Evaluation | VLM-based + CLIP/ViCLIP + PickScore + temporal consistency | FVD, LPIPS, PSNR, etc. |
| Coverage per category | Balanced (5 horizontal + 5 vertical per category) | Unbalanced |
5. Significance in Unified Video Editing Research
EditVerseBench plays a foundational role in supporting the EditVerse framework’s generalization, serving as a real-world testbed for instruction-driven video editing models capable of multimodal operation. Its structure facilitates:
- Cross-modal transfer assessment, as EditVerse unifies text, images, and videos as a token sequence processed by self-attention.
- Benchmarking of emergent editing capabilities across both generative and compositional editing tasks.
- Fair and systematic comparison of models, both open-source and commercial, in scenarios reflective of practical content creation and manipulation challenges.
The benchmark's balanced, category-controlled design makes it more likely that improvements observed on EditVerseBench reflect genuine advances in understanding and executing diverse editing instructions rather than overfitting to narrow, oversampled edit types or resolution settings.
6. Practical Implications and Future Directions
By integrating real-world edit types, balanced resolution formats, and both objective and instruction-aligned evaluation metrics, EditVerseBench provides a reliable benchmark for:
- Comparative benchmarking of instruction-guided video editing and generation models.
- Training and validation in multi-format, multi-task video editing environments.
- Research on transfer learning across image and video modalities using unified attention architectures.
Future expansions may plausibly include more granular edit types, higher-resolution or multi-aspect-ratio videos, progressive difficulty gradations, and integration with other multimodal benchmarks (such as VE-Bench (Sun et al., 21 Aug 2024) and VEU-Bench (Li et al., 24 Apr 2025)), thereby strengthening the ecosystem of unified generative model evaluation.
7. Context within the Ecosystem of Editing Benchmarks
EditVerseBench stands in contrast to earlier benchmarks such as EditBench (Wang et al., 2022), which target text-guided image inpainting, and emerging benchmarks like VE-Bench (Sun et al., 21 Aug 2024) and VEU-Bench (Li et al., 24 Apr 2025), which focus on subjective-aligned video editing quality and deep video editing understanding, respectively. EditVerseBench’s unique combination of instruction diversity, resolution coverage, and VLM-based, multifaceted automatic evaluation distinguishes it as a reference for instruction-based video editing evaluation, underpinning unified model architectures that can robustly handle a gamut of video and image generation/editing tasks.