CoVEBench: Can Video Editing Models Handle Complex Instructions?

Published 7 Jun 2026 in cs.CV and cs.AI | (2606.08415v2)

Abstract: While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, heavily constrained by isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.


Authors (10)

Summary

The paper introduces CoVEBench, a novel benchmark that dissects compositional video editing into atomic operations to reveal model limitations.
It employs a detailed evaluation protocol with seven editing axes and multi-metric scoring to assess instruction compliance, video quality, and fidelity.
The results show closed-source models outperform open-source ones while highlighting trade-offs between executing edits and preserving non-target content.

CoVEBench: Diagnosing the Compositional Limits of Instruction-Guided Video Editing Models

Motivation and Problem Formulation

Instruction-driven video editing models have achieved strong outcomes in isolated tasks such as style transfer, object addition/removal, and background substitution. However, real-world user requests are overwhelmingly compositional, involving multiple simultaneous and interdependent edits—such as subject swap, camera motion alteration, and object insertion—while strictly enforcing the preservation of unrelated regions. Prevailing benchmarks rely on overly simplified, single-point prompts and holistic similarity metrics (e.g., CLIP or reward models) that lack the granularity to diagnose compositional capabilities or localized failure modes. This misalignment prevents the identification of key obstacles in current models and hinders progress toward practical, user-aligned video editing workflows.

Benchmark Construction: Design, Taxonomy, and Evaluation

CoVEBench addresses these limitations with a benchmark and evaluation protocol built around compositional instruction-guided video editing. The dataset contains 416 filtered and reviewed source videos, 626 compositional editing instructions (each averaging approximately three atomic edit operations), and 9,990 fine-grained checklist items for diagnostic evaluation, or roughly 16 checklist items per instruction. Editing dimensions are organized into seven axes: subject, background, camera, style, motion, position, and special effects. Combining these axes yields a taxonomy intended to reflect realistic video editing tasks.

Instructions are generated via advanced MLLMs (e.g., GPT-5, Gemini-3.1-Pro, Qwen3-VL-plus, Doubao-Seed-1.8) with dynamic pooling and human curation. Fine-grained checklists, constructed using LLMs and subject to rigorous manual refinement, decompose compositional instructions into targeted queries that probe not only execution of atomic edits but also physical logic and strict semantic preservation. The resulting instruction pool has extremely low pairwise TF-IDF similarity, ensuring broad semantic coverage and reduced redundancy.
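The diversity check mentioned above can be approximated with a few lines of code. The following is a minimal sketch, not the authors' procedure: it assumes whitespace tokenization, a smoothed IDF weighting, and cosine similarity averaged over all instruction pairs. The function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(docs):
    """Average cosine similarity over all unordered document pairs."""
    pairs = list(combinations(tfidf_vectors(docs), 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

A value near 0 indicates that instructions share little vocabulary, and a value near 1 indicates near-duplicates. What threshold counts as "extremely low" is not stated in the summary.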

Evaluation Protocol: Metrics and Rationale

CoVEBench evaluates models along three axes:

Instruction Compliance: Quantified via MLLM-judged checklists, comprising Instruction Following Score (IFS), Video Realism Score (VRS), and a union metric (UAS), which only credits an edit if both execution and realism requirements are met.
Video Quality: Assessed using VisualQuality-R1 (VQR), Aesthetic Predictor (AES), DOVER++ (TQ), and temporal stability metrics (MSM).
Video Fidelity: Measured through SSIM, CoTracker-derived motion fidelity (MF), Static Region Consistency (SRC), and MLLM-based semantic preservation (SEM).
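To make the relationship between the three compliance scores concrete, here is a minimal sketch. It assumes each atomic edit receives two boolean verdicts from the MLLM judge (one for execution, one for realism) and that each score is a simple fraction over edits. The `EditJudgment` class and the aggregation rule are illustrative assumptions; the paper's exact formulas may differ.

```python
from dataclasses import dataclass


@dataclass
class EditJudgment:
    """Hypothetical per-edit verdicts produced by an MLLM judge."""
    followed: bool   # the requested edit was executed
    realistic: bool  # the result is physically and visually plausible


def compliance_scores(judgments):
    """Return IFS, VRS and UAS as fractions over all judged edits."""
    n = len(judgments)
    if n == 0:
        raise ValueError("at least one judgment is required")
    ifs = sum(j.followed for j in judgments) / n
    vrs = sum(j.realistic for j in judgments) / n
    # UAS credits an edit only when both requirements hold.
    uas = sum(j.followed and j.realistic for j in judgments) / n
    return {"IFS": ifs, "VRS": vrs, "UAS": uas}


example = [
    EditJudgment(followed=True, realistic=True),
    EditJudgment(followed=True, realistic=False),
    EditJudgment(followed=False, realistic=True),
]
scores = compliance_scores(example)
```

Under this definition UAS can never exceed the smaller of IFS and VRS, because an edit that passes both checks is counted in each of the individual scores as well.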

Automated and MLLM-based scoring is aligned with human judgments (93% agreement, Cohen's κ = 0.84). Metrics are complementary with low cross-dimension redundancy, probing orthogonal aspects of model performance.
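Raw agreement and Cohen's kappa answer slightly different questions: kappa discounts the agreement two raters would reach by chance given how often each uses every label. The sketch below shows the standard two-rater computation on made-up labels. The inputs are illustrative and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative pass/fail verdicts (1 = pass, 0 = fail) from a human and a judge model.
human = [1, 1, 0, 0]
judge = [1, 0, 0, 0]
kappa = cohens_kappa(human, judge)
```

In this toy case the raters agree on three of four items (75%), chance agreement is 50%, and kappa is therefore 0.5.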

Experimental Insights: Performance Analysis and Model Limitations

Ten contemporary models (spanning leading open- and closed-source architectures) are benchmarked. Notable findings include:

Closed-source models consistently outperform open-source ones in instruction compliance, setting a performance ceiling that open research has yet to reach.
Union accuracy (UAS) lags behind instruction-following or realism scores individually, highlighting the interplay and mutual interference among compositional edits and the challenge of strict physical realism maintenance.
Editing complexity directly degrades performance: increasing the number or compositionality of edit points and instruction length yields a pronounced accuracy decline. Even the strongest models exhibit a dramatic drop in robustness for temporally extended or semantically dense instructions.
There exists a trade-off between edit execution and preservation: aggressive editing tends to degrade the integrity of preserved regions. Conversely, models that favor content preservation may undershoot instruction execution, omitting required operations.
Aggregate scores obscure fine-grained weaknesses: Camera, motion, and subject edits (especially replacements) have systematically lower accuracy compared to style or background operations. Typical failures include missed edits, spatial entanglement, physical violation (e.g., implausible kinematics), and visual artifacts or hallucinations.

Sequentially decomposing multi-point instructions into atomic calls underperforms joint compositional inference both in edit success and preservation, due to error accumulation and insufficient prior-edit preservation.
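One way to see why sequential decomposition can suffer from error accumulation is a toy probability model. The sketch below assumes each atomic call succeeds independently with the same probability and that each later pass preserves an earlier edit with a fixed probability. Both assumptions and all numbers are illustrative, not measurements from the paper.

```python
def chain_success_rate(step_success, step_preserve, num_edits):
    """Probability that every edit is applied and survives all later passes.

    step_success:  chance that a single atomic call performs its edit.
    step_preserve: chance that one later pass leaves an earlier edit intact.
    """
    rate = 1.0
    for i in range(num_edits):
        # Edit i must succeed, then survive each of the remaining passes.
        later_passes = num_edits - 1 - i
        rate *= step_success * step_preserve ** later_passes
    return rate


# Hypothetical values: 90% per-call success, 95% per-pass preservation.
for k in (1, 2, 3, 4):
    print(k, round(chain_success_rate(0.9, 0.95, k), 3))
```

Under these assumptions the full-chain success rate shrinks multiplicatively with every added edit, and imperfect preservation compounds the loss because early edits are exposed to more subsequent passes.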

Efficiency, Robustness, and Scalability

Resource profiling reveals substantial variance in inference time and VRAM footprint. Some models (e.g., Kiwi, Lucy) approach latency levels practical for deployment, whereas others (e.g., ICVE, VACE) have resource requirements high enough to undermine practical use. All models perform worse as generated frame count and source video length increase, so temporal scalability remains an open problem.

Implications, Limitations, and Future Outlook

CoVEBench exposes fundamental weaknesses in current video editing approaches: failure to independently compose spatially and temporally entangled operations, poor physical reasoning, and limited robustness as instruction complexity increases. These deficiencies underscore gaps in multi-modal representation, spatiotemporal compositionality, and physical understanding that are not apparent under non-compositional, single-edit evaluation.

Critically, CoVEBench’s exhaustive, atomic evaluation framework—unlike holistic or reward-based benchmarks—enables targeted diagnosis of model failures and progress tracking as compositional skills improve.

However, the benchmark’s scope remains limited to text-guided prompts; real-world creation often leverages multi-modal control inputs (reference images, masks, audio cues). Furthermore, CoVEBench is an evaluation resource; it does not supply a paired large-scale training set for compositional editing and does not propose a new modeling architecture or agent capable of tackling the compositional regime identified.

Future advancements in video editing models will require new architectures and training paradigms emphasizing modular compositionality, causal disentanglement, robust spatiotemporal reasoning, and interpretable edit traceability. The diagnostic paradigm established by CoVEBench provides a precise research target and will likely guide model selection, ablation, and progress tracking for the next generation of instruction-aligned video generation systems.

Conclusion

CoVEBench provides the first rigorous, large-scale, fine-grained benchmark for compositional video editing. It diagnoses the core limitations in current video editing models—ranging from edit omission and physical implausibility to poor preservation of non-target content—through an interpretable, targeted checklist protocol. This resource is poised to become essential for driving progress in the design and evaluation of AI systems capable of complex, user-aligned video editing, and sets a high bar for genuine agentic video understanding and modification.
