
OmniInsert: Mask-Free Video Insertion

Updated 23 September 2025
  • OmniInsert is a unified, mask-free video insertion framework that leverages diffusion transformer models to achieve consistent subject integration in videos.
  • It introduces innovative data pipelines—RealCapture, SynthGen, and SimInteract—to overcome data scarcity and synthesize diverse subject–scene pairs.
  • Its condition-specific feature injection and progressive training protocols ensure subject–scene equilibrium and high-quality, harmonized video editing, as validated on the InsertBench benchmark.

OmniInsert is a unified mask-free video insertion framework based on diffusion transformer models, designed to robustly insert one or more reference subjects into arbitrary source videos without relying on segmentation masks. The system achieves high fidelity and coherence in video editing through new data pipelines, differentiated feature injection, progressive optimization, and advanced prompt engineering. Key contributions include solving data scarcity with automated paired training data, maintaining subject–scene equilibrium through condition-specific feature injection, harmonizing insertion with targeted losses and preference optimization, and introducing a standardized benchmark for method evaluation.

1. Problem Definition and Motivations

OmniInsert addresses the Mask-free Video Insertion (MVI) task: placing a provided subject (image or multi-frame reference) into a source video such that the resultant sequence is visually coherent, preserves the detailed appearance and identity of the subject, and integrates naturally with the background. Previous approaches were limited by reliance on segmentation masks, scripted control signals, or failure to maintain consistent subject appearance across frames. Data scarcity and inconsistency in subject-background blending were further limiting factors in practical deployment. OmniInsert pursues three principal goals: overcoming the shortage of paired training data, maintaining subject–scene equilibrium, and producing harmonized insertions free of copy–paste artifacts (Chen et al., 22 Sep 2025).

2. InsertPipe: Multi-Modal Training Data Construction

To address the lack of suitable paired data, OmniInsert introduces InsertPipe—a compound data pipeline synthesizing diverse cross-pair video–subject datasets:

  • RealCapture Pipe segments long source clips into single-scene units via detection and tracking, erases subjects using video inpainting, and produces source-target pairs by cross-video subject swapping.
  • SynthGen Pipe utilizes LLMs for text prompt creation, text-to-image (T2I) and image-to-video (I2V) models to generate synthetic subject–scene combinations, and vision–LLM filtering for high-detail consistency.
  • SimInteract Pipe produces interaction-heavy scenes using Houdini rendering, asset libraries, and spatial layout priors from models such as SpatialLM, with animation from mocap trajectories.

This three-pronged approach yields a comprehensive training corpus with real, synthetic, and simulated interactive content, enabling both wide domain coverage and high subject–scene diversity.

| Pipeline | Source Type | Purpose |
|---|---|---|
| RealCapture | Long real videos | Paired before/after data via subject erasure and cross-video pairing |
| SynthGen | Synthetic (LLM → T2I → I2V) | Diversity and control over subject–scene pairs |
| SimInteract | Simulated (Houdini) | Realistic subject–scene interactions |
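
As a concrete illustration of the RealCapture pipe, the following is a minimal sketch of one pairing step. The helper functions (`split_scenes`, `track_subject`, `inpaint_video`) are hypothetical placeholders, not OmniInsert's released implementation.

```python
def build_realcapture_pair(long_video, other_video):
    """Sketch: build one (source, reference, target) triple for MVI training.

    Assumes hypothetical helpers:
      split_scenes(video)          -> list of single-scene clips
      track_subject(video, ...)    -> per-frame subject masks (or crops)
      inpaint_video(video, masks)  -> video with the subject erased
    """
    # 1. Cut the long clip into single-scene units via detection/tracking.
    target = split_scenes(long_video)[0]          # ground truth containing the subject

    # 2. Erase the subject with video inpainting to obtain the source video.
    masks = track_subject(target)
    source = inpaint_video(target, masks)

    # 3. Cross-video pairing: take reference subject crops from a different clip,
    #    so the reference appearance cannot simply be copy-pasted from the target.
    reference = track_subject(other_video, return_crops=True)

    return {"source": source, "reference": reference, "target": target}
```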

3. Condition-Specific Feature Injection

OmniInsert defines a differentiated approach to injecting multi-source conditions into the diffusion transformer:

  • Background (video) condition leverages channel-wise concatenation for spatial alignment. The noisy target video latent is computed as $z_t^{(T)} = (1-t)\, z^{(T)} + t\, \epsilon$ (where $t$ is the noise strength), and the injection $z_t^{(Vid)} = \mathrm{Concat}([z_t^{(T)}, z^{(S)}, f^{(S)}],\ \mathrm{dim}=1)$ combines target and source latent features with a flag map $f^{(S)}$ set to zeros.
  • Subject condition uses temporal-wise concatenation to encode time-varying subject features: $z_t^{(Sub)} = \mathrm{Concat}([z_t^{(I)}, z^{(I)}, f^{(I)}],\ \mathrm{dim}=1)$, with $f^{(I)}$ an all-one flag highlighting subject regions.

By concatenating these condition-specific latents along the frame dimension, the model maintains separation, appropriately blending static backgrounds and dynamic subject features in the denoising process. This mechanism preserves spatial-temporal alignment and avoids background disruption when inserting new content.
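
To make the two injection paths concrete, here is a minimal PyTorch-style sketch. It assumes latents of shape (B, C, F, H, W); the concatenation axes follow the channel-wise / temporal-wise descriptions above, and the exact tensor layout used by OmniInsert may differ.

```python
import torch

def noise_latent(z, t):
    """z_t = (1 - t) * z + t * eps, with t the noise strength."""
    return (1.0 - t) * z + t * torch.randn_like(z)

def video_condition(z_target, z_source, t):
    """Background path: Concat([z_t^(T), z^(S), f^(S)]) along the channel axis."""
    z_t = noise_latent(z_target, t)
    f_s = torch.zeros_like(z_source[:, :1])            # all-zero flag map f^(S)
    return torch.cat([z_t, z_source, f_s], dim=1)      # channel-wise

def subject_condition(z_subject, t):
    """Subject path: Concat([z_t^(I), z^(I), f^(I)]) along the frame axis."""
    z_t = noise_latent(z_subject, t)
    f_i = torch.ones_like(z_subject[:, :, :1])         # all-one flag f^(I)
    return torch.cat([z_t, z_subject, f_i], dim=2)     # temporal (frame)-wise
```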

4. Progressive Training and Optimization Protocols

To ensure robust subject and scene integration, OmniInsert introduces a cascaded, four-phase progressive training protocol:

  1. Subject-to-Video Training: The network learns subject representation and motion patterns independent of the source video, using only reference images and text prompts.
  2. Full MVI Pretraining: Introduction of the source video as an additional condition allows the model to learn joint subject-background blending, though initial identity consistency may be suboptimal.
  3. Refinement: Fine-tuning on curated high-fidelity portraits and synthetic renderings sharpens the subject identity and robustness in complex backgrounds.
  4. Insertive Preference Optimization (IPO): Preference-based fine-tuning employs human-annotated pairs (preferred vs. dispreferred outcomes) to optimize a trainable LoRA module $\pi_\theta$ against a reference model $\pi_{\mathrm{ref}}$, using the loss:

$$L_{IPO} = L_{DPO} + \lambda \cdot \mathbb{E}\left[ \left( -\log \pi_\theta(y_l \mid x) - \gamma \right)^2 \right]$$

$$L_{DPO} = -\mathbb{E}\left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

where $y_w$ and $y_l$ denote human-preferred and non-preferred outcomes, respectively, and $\sigma$ is the sigmoid function.
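
A minimal PyTorch sketch of this objective, operating on per-sample log-likelihoods; the hyperparameter values shown are placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
             beta=0.1, lam=1.0, gamma=0.0):
    """L_IPO = L_DPO + lam * E[(-log pi_theta(y_l|x) - gamma)^2].

    logp_w / logp_l:          log pi_theta(y_w|x) and log pi_theta(y_l|x)
    ref_logp_w / ref_logp_l:  the same quantities under the frozen reference model
    """
    # DPO term: preferred (y_w) vs. dispreferred (y_l) margin in log-ratio space.
    margin = beta * (logp_w - ref_logp_w) - beta * (logp_l - ref_logp_l)
    l_dpo = -F.logsigmoid(margin).mean()

    # Extra penalty anchoring the negative log-likelihood of y_l near gamma.
    l_anchor = ((-logp_l - gamma) ** 2).mean()

    return l_dpo + lam * l_anchor
```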

This protocol enables the model to first master subject synthesis, then gradually balance and harmonize background integration and subject fidelity.

5. Loss Functions and Harmonization Modules

The standard flow matching loss $L_{FM}$ treats all regions equally and is therefore insufficient to enforce subject detail. The Subject-Focused Loss $L_{SL}$ emphasizes subject regions via a spatial mask $M$:

$$L_{SL} = \mathbb{E}\left[ \left\| M \cdot \left( (z_0 - \epsilon) - V_\theta(z_t, t, y) \right) \right\|^2 \right]$$

with $z_0$ the clean video latent and $V_\theta(\cdot)$ the model prediction. This term ensures fine-grained subject detail and appearance consistency throughout the video sequence.
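
A minimal sketch of this masked flow-matching term, following the sign convention of the formula above; the model call and mask shape are illustrative.

```python
import torch

def subject_focused_loss(model, z0, eps, z_t, t, y, mask):
    """L_SL = E[ || M * ((z_0 - eps) - V_theta(z_t, t, y)) ||^2 ].

    z0: clean video latent, eps: sampled noise, z_t: noised latent,
    y: conditioning features, mask: subject mask M broadcastable to the latent.
    """
    v_pred = model(z_t, t, y)              # V_theta(z_t, t, y)
    target = z0 - eps                      # velocity target as written above
    residual = mask * (target - v_pred)    # weight errors inside subject regions
    return residual.pow(2).mean()
```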

At inference, the Context-Aware Rephraser (CAR) module further refines semantic harmonization. By utilizing a vision–LLM (VLM), CAR expands and rephrases text prompts to offer the diffusion model more detailed, context-specific insertion cues (object texture, spatial relations, etc.), improving semantic alignment between inserted subject and source scene.
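
For illustration only, a sketch of a CAR-style rephrasing step; `vlm_chat` is a hypothetical vision-LLM wrapper, not part of OmniInsert's released code.

```python
def context_aware_rephrase(user_prompt, subject_image, scene_frame, vlm_chat):
    """Expand a terse insertion prompt with subject- and scene-specific cues."""
    instruction = (
        "Rewrite the following video-insertion prompt. Describe the subject's "
        "texture and appearance, where it should be placed in the scene, and "
        "its spatial relation to nearby objects.\n"
        f"Prompt: {user_prompt}"
    )
    # The VLM sees both the reference subject and a scene frame for context.
    return vlm_chat(images=[subject_image, scene_frame], text=instruction)
```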

6. InsertBench Benchmark and Quantitative Evaluation

OmniInsert introduces InsertBench, the first comprehensive benchmark for mask-free video insertion, containing 120 five-second clips (indoor, outdoor, animated, wearable scenarios) with paired reference subjects and text prompts. Evaluation employs metrics for:

  • Subject consistency: CLIP-I, DINO-I, FaceSim
  • Text–video alignment: ViCLIP-T
  • Video quality: dynamic quality, image fidelity, aesthetics, intra-sequence consistency

On InsertBench, OmniInsert demonstrates improved performance over closed-source commercial solutions such as Pika-Pro and Kling. Higher scores on subject identity preservation and text–video alignment, enhanced dynamic quality, and positive results from human preference studies indicate substantial gains in both quantitative and qualitative integration.
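
As an illustration of the subject-consistency metrics listed above, here is a minimal sketch of a CLIP-I style score using Hugging Face Transformers: the mean cosine similarity between CLIP image embeddings of the reference subject and each generated frame. InsertBench's exact protocol (checkpoint, cropping, aggregation) may differ.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_i_score(reference_image, generated_frames,
                 model_name="openai/clip-vit-base-patch32"):
    """Average CLIP-space cosine similarity between reference and frames."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)

    inputs = processor(images=[reference_image] + list(generated_frames),
                       return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = torch.nn.functional.normalize(emb, dim=-1)

    ref, frames = emb[0], emb[1:]
    return (frames @ ref).mean().item()    # mean cosine similarity
```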

7. Comparative Context and Broader Implications

OmniInsert departs from prior methods requiring explicit mask generation, scripted control signals, or segmented subject proxies. The condition-specific feature injection architecture, progressive training schedule, and harmonization modules collectively solve the core subject–scene equilibrium and data limitations of previous approaches.

A plausible implication is that the InsertPipe pipeline and IPO module conceptually generalize to any domain where paired data scarcity and component blending pose challenges. The efficacy demonstrated on InsertBench suggests potential extension to broader multimodal synthesis problems, although further study is needed to delineate transferability beyond video insertion.

In summary, OmniInsert’s mask-free, reference-driven architecture represents a substantive advance in subject-preserving, scene-consistent video editing, supported by carefully constructed data pipelines, articulated condition management, and standardized benchmarking (Chen et al., 22 Sep 2025).
