InsertBench MVI Benchmark
- InsertBench is a benchmark that defines mask-free video insertion with detailed scene and subject curation for precise model evaluation.
- It leverages a multifaceted data pipeline—RealCapture, SynthGen, and SimInteract—to generate diverse and challenging test scenarios.
- The framework employs metrics such as CLIP-I and DINO-I for subject consistency, ViCLIP-T for text-video alignment, and additional measures of overall video quality.
InsertBench is a comprehensive benchmark introduced specifically to facilitate the evaluation of mask-free video insertion (MVI) models. Designed in response to the lack of quantitative, task-specific assessment protocols for video insertion, InsertBench covers a diverse array of scenes and reference subjects and is paired with a rigorous suite of evaluation metrics. Its introduction enables comparative assessment between emerging MVI methods (including diffusion-based and transformer architectures) and state-of-the-art commercial solutions, addressing the technical challenges unique to integrating arbitrary reference subjects into unconstrained video environments (Chen et al., 22 Sep 2025).
1. Benchmark Composition
InsertBench comprises 120 five-second videos (121 frames at 24 fps) that systematically span a broad spectrum of real-world and synthetic scenes, including indoor spaces, outdoor landscapes, vehicular environments, high-dynamic human interaction scenarios, wearable-camera sequences, animated contexts, and other niche sources. For each environment, reference subjects are carefully selected for compatibility with the scene in terms of visual coherence, robustness to harmonization, and suitability for natural human- or object-level interactions. The benchmark's diversity is amplified by InsertPipe, a data pipeline employing three data curation modalities:
- RealCapture: Sampling from actual video footage;
- SynthGen: Leveraging synthetic data and LLM-guided prompt engineering with text-to-image (T2I) or image-to-video (I2V) generative techniques;
- SimInteract: Rendering interactions from digital asset libraries for fine-grained control and variety.
This multifaceted approach ensures InsertBench exposes underlying model capabilities across a wide variety of content and insertion difficulty levels.
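To make the composition concrete, the following is a minimal sketch of how a single InsertBench test case could be represented as a data record; the field names (`scene_category`, `source_pipeline`, etc.) and the example values are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class InsertBenchEntry:
    """Hypothetical record for one InsertBench test case (field names are illustrative)."""
    video_path: str              # 5-second source clip: 121 frames at 24 fps
    scene_category: str          # e.g. "indoor", "outdoor", "vehicular", "wearable-camera", "animated"
    source_pipeline: str         # one of "RealCapture", "SynthGen", "SimInteract"
    reference_images: List[str]  # curated reference subject image(s) to be inserted
    prompt: str                  # insertion instruction used for text-video alignment scoring

# Fictional example entry:
entry = InsertBenchEntry(
    video_path="videos/outdoor_park_017.mp4",
    scene_category="outdoor",
    source_pipeline="RealCapture",
    reference_images=["subjects/corgi_01.png"],
    prompt="A corgi trots along the park path beside the jogger.",
)
```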
2. Evaluation Metrics
Quantitative evaluation of MVI models on InsertBench employs multiple, complementary metrics:
| Metric Category | Metric Name | Measurement Goal |
|---|---|---|
| Subject Consistency | CLIP-I, DINO-I | Similarity of generated subject to reference (per frame/region) |
| Subject Consistency | FaceSim | Facial/identity preservation (person-centric cases) |
| Text-Video Alignment | ViCLIP-T | Cosine similarity between text prompt and video features |
| Video Quality | Dynamic/Image/Aesthetic/Consistency | Sharpness, realism, and motion coherence |
For subject consistency, sampled frames from each video are evaluated by comparing the synthesized subject regions against the original references using feature-space similarity. Text-video alignment is measured as the cosine similarity between ViCLIP features of the textual prompt and of the generated frames, quantifying adherence to the instruction. Video quality assessment considers spatial and temporal properties: sharpness, realism, motion continuity, and overall harmonization.
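As an illustration of the subject-consistency protocol, the sketch below computes a CLIP-I-style score as the mean cosine similarity between a reference subject image and subject crops from sampled generated frames, using the open_clip library; the backbone choice, file paths, and crop extraction are assumptions, not the benchmark's exact configuration.

```python
import torch
import open_clip
from PIL import Image

# Illustrative CLIP-I computation: similarity between the reference subject and subject
# regions cropped from sampled generated frames. Model and paths are placeholders.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()

def clip_image_features(path: str) -> torch.Tensor:
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feats = model.encode_image(image)
    return feats / feats.norm(dim=-1, keepdim=True)

ref = clip_image_features("subjects/corgi_01.png")
frame_crops = ["gen/frame_000_crop.png", "gen/frame_030_crop.png", "gen/frame_060_crop.png"]
scores = [float(clip_image_features(p) @ ref.T) for p in frame_crops]
clip_i = sum(scores) / len(scores)   # per-video CLIP-I: mean frame-level similarity
print(f"CLIP-I = {clip_i:.3f}")
```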
3. Comparison with Prior Benchmarks
InsertBench introduces several technical advances over prior benchmarks used for video modification and generation tasks:
- Task Specificity: It is the first benchmark tailored to mask-free video insertion, not merely generic video editing or synthesis. Previous protocols lack dedicated mechanisms to penalize or reward precise subject-scene integration and subject consistency.
- Diversity and Comprehensiveness: By systematically varying backgrounds, subject typologies, and prompt complexity, InsertBench offers an expanded spectrum of insertion contexts unmatched by existing collections, which generally lack fine-grained subject-prompt curation or scene-subject compatibility standards.
- Balanced Evaluation: The assessment framework considers subject fidelity, background preservation, text prompt adherence, and multi-aspect video quality, as opposed to prior works that typically prioritize one or two axes of performance. This provides a more holistic indicator of model robustness in practical applications.
4. Performance Results on InsertBench
Evaluation of OmniInsert, an MVI model built on a diffusion transformer architecture, on InsertBench demonstrates consistent improvements over state-of-the-art commercial solutions (including Pika-Pro and Kling):
- Subject Consistency Scores: OmniInsert achieves higher fidelity, e.g., CLIP-I = 0.745 and DINO-I = 0.639, indicating closer visual correspondence between generated and reference subjects.
- Text-Video Alignment: ViCLIP-T score of 25.945, higher than competing approaches, indicating improved adherence and controllability with respect to user-provided prompts.
- Video Quality: Superior scores across sharpness, motion consistency, and aesthetic categories, accompanied by empirical evidence of fewer artifacts and more natural transitions.
- User Studies: Human raters confirm quantitative findings, consistently preferring OmniInsert outputs for subject fidelity, rationality of insertion, and subjective visual quality.
This performance is primarily attributed to innovations in feature injection, progressive training, and specialized loss functions (described below).
5. Technical Underpinnings
The architectural and training pipeline innovations found in OmniInsert are reflected in how InsertBench is leveraged as a benchmarking standard:
A. Condition-Specific Feature Injection (CFI):
- For the background video, the noisy latent and the reference video latent are concatenated along the channel dimension,
  $$z_{\text{in}} = \mathrm{Concat}_{c}\big(z_t,\, z_{\text{ref}}\big),$$
  with $z_t$ the noisy latent, capturing stochastic variation and scene context.
- For the subject reference, temporal concatenation is performed,
  $$z_{\text{cond}} = \mathrm{Concat}_{t}\big(z_{\text{in}},\, \hat{z}_{\text{subj}} \odot M\big),$$
  where $\hat{z}_{\text{subj}}$ is a temporally noisy version of the subject features, and $M$ is a region-of-interest mask.
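A minimal PyTorch sketch of the two concatenation patterns described above follows; the tensor dimensions and variable names are assumed for illustration and do not reproduce OmniInsert's actual latent shapes.

```python
import torch

# Assumed latent dimensions (batch, channels, frames, height, width); values are illustrative.
B, C, T, H, W = 1, 16, 31, 32, 32

z_t    = torch.randn(B, C, T, H, W)   # noisy video latent at the current flow/diffusion step
z_ref  = torch.randn(B, C, T, H, W)   # encoded reference (source) video latent
z_subj = torch.randn(B, C, 1, H, W)   # temporally noised subject-feature latent
mask   = torch.zeros(B, 1, 1, H, W)   # region-of-interest mask marking the subject location
mask[..., 8:24, 8:24] = 1.0

# Background condition: channel-wise concatenation of noisy and reference latents.
z_in = torch.cat([z_t, z_ref], dim=1)            # -> (B, 2C, T, H, W)

# Subject condition: temporal concatenation of the masked subject latent with the noisy latent.
z_seq = torch.cat([z_t, z_subj * mask], dim=2)   # -> (B, C, T+1, H, W)

print(z_in.shape, z_seq.shape)
```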
B. Progressive Training Strategy:
- Phase 1: Focused on isolated subject insertion.
- Phase 2: Full MVI task with background video integration.
- Phase 3: High-fidelity data refinement (e.g., for faces and identity preservation).
- Phase 4: Insertion Preference Optimization (IPO), using human preference data for reinforcement of realism and harmonization.
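A hypothetical configuration sketch of this schedule is shown below; the step counts and data-source labels are placeholders chosen for illustration.

```python
# Hypothetical phase schedule mirroring the progressive strategy summarized above;
# step counts and data-source names are illustrative placeholders.
TRAINING_PHASES = [
    {"phase": 1, "objective": "subject-only insertion",                   "data": ["SynthGen"],                   "steps": 20_000},
    {"phase": 2, "objective": "full MVI with background video",           "data": ["RealCapture", "SimInteract"], "steps": 40_000},
    {"phase": 3, "objective": "high-fidelity refinement (faces/identity)","data": ["curated-HQ"],                 "steps": 10_000},
    {"phase": 4, "objective": "Insertion Preference Optimization (IPO)",  "data": ["human-preference-pairs"],     "steps": 5_000},
]

for cfg in TRAINING_PHASES:
    print(f"Phase {cfg['phase']}: {cfg['objective']} ({cfg['steps']} steps)")
```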
C. Subject-Focused Loss:
A spatially-masked loss accentuates subject regions:
$$\mathcal{L}_{\text{subj}} = \big\| M_s \odot \big(v_\theta - v\big) \big\|_2^2,$$
where $M_s$ is a mask emphasizing subject locations, $v_\theta$ is the model prediction, and $v$ is the flow-matching target. The total loss is:
$$\mathcal{L} = \mathcal{L}_{\text{FM}} + \lambda\, \mathcal{L}_{\text{subj}},$$
with $\mathcal{L}_{\text{FM}}$ as the standard flow matching loss.
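A minimal PyTorch sketch of this objective, assuming a mean-squared flow-matching term and a single weighting coefficient `lambda_subj` (the exact weighting and normalization in OmniInsert may differ):

```python
import torch
import torch.nn.functional as F

def subject_focused_loss(pred: torch.Tensor,
                         target: torch.Tensor,
                         subject_mask: torch.Tensor,
                         lambda_subj: float = 1.0) -> torch.Tensor:
    """Flow-matching loss plus a spatially-masked term that up-weights subject regions.

    pred, target : (B, C, T, H, W) model prediction and flow-matching target
    subject_mask : (B, 1, T, H, W) binary mask marking inserted-subject locations
    """
    loss_fm = F.mse_loss(pred, target)                                   # standard flow-matching term
    loss_subj = F.mse_loss(subject_mask * pred, subject_mask * target)   # masked, subject-focused term
    return loss_fm + lambda_subj * loss_subj
```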
6. Implications and Applications
InsertBench establishes a rigorous, reproducible environment for the evaluation of MVI models across both academic and commercial research. Its depth and diversity facilitate:
- Model Stress Testing: By curating challenging subject-scene combinations and insertion scenarios, InsertBench exposes model weaknesses and guides iterative model development.
- Comparative Advancement: The multi-metric, task-specific assessment framework sets unified standards for community benchmarks, enabling fair competition and direct comparability between architectural advances.
- Generalization Studies: Diverse data sources (real, synthetic, interactive) allow detailed analysis of generalization, robustness to distribution shift, and adaptation to unseen environments and subjects.
A plausible implication is that InsertBench may stimulate the creation of more challenging, task-specific benchmarks for related generative modeling and video editing subfields. Its systematic approach and technical rigor raise the standard for empirical evaluation in video editing, generative modeling, and multimodal AI systems.
7. Conclusion
InsertBench constitutes a significant advancement in the quantitative evaluation of mask-free video insertion methods. Its multidimensional coverage—with well-defined scenes, curated reference subjects, and comprehensive evaluation criteria—serves both the model development and comparative assessment needs of the research community. The adoption of InsertBench has enabled the identification of clear strengths and limitations in both academic and commercial MVI systems, establishing a new baseline for future methodological progress (Chen et al., 22 Sep 2025).