
O-DisCo-Edit: Unified Video Editing

Updated 4 July 2026
  • The paper introduces object distortion control (O-DisCo) as a unified editing cue, replacing task-specific signals with a distorted RGB reference.
  • It employs a copy-form preservation mechanism that copies latent features from non-edited regions to maintain background fidelity and temporal stability.
  • O-DisCo-Edit achieves competitive results across tasks like removal, outpainting, and motion transfer using efficient LoRA tuning on a frozen diffusion backbone.

O-DisCo-Edit, short for “Object Distortion Control for Unified Realistic Video Editing,” is a first-frame-guided video editing framework built on the CogVideoX I2V backbone for unified realistic video editing across object removal, outpainting, motion transfer, lighting transfer, color change, swap, addition, and style transfer (Chen et al., 1 Sep 2025). Its central premise is that diverse editing cues can be represented through a single control type, an RGB-space “distorted video” of the reference object region, rather than through task-specific signals such as masks, optical flow, keypoints, or multi-branch control modules. The framework couples this unified control representation, termed Object Distortion Control (O-DisCo), with a “copy-form” preservation mechanism that directly copies non-edited latent regions into the denoising stream, with the stated goal of achieving precise local editing, preservation of non-edited regions, temporal stability, and lower training complexity within a single model (Chen et al., 1 Sep 2025).

1. Unified formulation and problem setting

Diffusion-based video editors commonly take a reference video and an editing condition, then synthesize an edited video. The problem, as defined for O-DisCo-Edit, has two core requirements: precisely control edited regions and preserve non-edited regions such as backgrounds and untouched objects (Chen et al., 1 Sep 2025). The paper situates existing approaches as relying on task-specific signals—masks for inpainting or removal, optical flow for motion transfer, keypoints for pose control—and often requiring separate models or heavy multi-branch architectures. It further characterizes these systems as inflexible at inference, typically bound to exactly one control signal and poorly suited to combining coarse and fine cues (Chen et al., 1 Sep 2025).

Within this formulation, “unified” has three explicit meanings. First, O-DisCo-Edit adopts a unified control representation: all editing cues are encoded into the same data type, namely a distorted RGB video processed by the same VAE and DiT as other inputs. Second, it adopts a unified architecture: one CogVideoX-based backbone with a condition LoRA for O-DisCo, a copy-form preservation module, and an optional identity preservation module. Third, it adopts unified training: a single training dataset of approximately 180K video-mask pairs, without explicit multi-task pseudo-label construction or multiple control heads (Chen et al., 1 Sep 2025). A plausible implication is that the system treats multi-task editing as a single conditional generation problem rather than as a collection of separate task-conditioned pipelines.

A concise summary of the framework’s principal components is given below.

| Component | Mechanism | Role |
| --- | --- | --- |
| O-DisCo | Distorted RGB video in the masked region | Unified editing control |
| CFP | Direct copy of non-edited latent regions | Preservation of non-edited areas |
| IDP | ID tokens and “ID-Resample” | Identity consistency |

This organization matters because the paper’s claim to unification does not mean that explicit user control disappears. The inference interface still requires a reference video, an edited first frame, a mask, and optionally text; what is unified is the model-side control pathway, not the absence of editing inputs (Chen et al., 1 Sep 2025).

2. Object Distortion Control

O-DisCo defines editing cues as controlled distortions of the object region in the reference video (Chen et al., 1 Sep 2025). The paper lists color changes, low-resolution or mosaic degradation, blur, intensity changes, and complete zeroing-out as representative distortions. It explicitly interprets several conventional signals through this lens: a mask as an extreme distortion, optical flow or trajectory as a form of warping or distortion, and style or color change as appearance distortion. The key claim is that the model learns a single mapping from such distorted reference videos to the desired edits rather than separate mappings for each signal structure (Chen et al., 1 Sep 2025).

The method distinguishes between Random Object Distortion Control (R-O-DisCo), used during training, and Adaptive Object Distortion Control (A-O-DisCo), used during inference. R-O-DisCo is constructed from a reference video $V_\text{ref}$ and binary spatio-temporal mask $M$ by applying random channel distortion, then mosaicking via average pooling and upsampling, and finally compositing the distorted object region back into the original video. Its final form is

$$V_\text{RODC} = V_\text{cdm} \odot M + V_\text{ref} \odot (\mathbf{1} - M).$$

The paper specifies an algorithmic version in which the scaling factor is sampled as $\theta \sim U(1.5, 3.0)$, the target channel $c^* \in \{0,1,2\}$, the color offset $\delta \in \{-100,-50,50,100\}$, the block size $b \in \{8,10,12,15,16,20,24\}$, and the scaling mode $\mu \in \{0,1\}$ (Chen et al., 1 Sep 2025). The stated purpose of the mosaic-like degradation is to destroy object detail so that the model cannot simply copy colors from O-DisCo and must instead rely on the first frame for fine appearance while still using O-DisCo as coarse guidance (Chen et al., 1 Sep 2025).
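The R-O-DisCo construction above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the way the scaling factor, scaling mode, and color offset are combined (multiply or divide the chosen channel by the factor, then add the offset) is an assumption, since the text only lists the sampled parameters. Only the sampling ranges, the pooling-based mosaic, and the final compositing formula come from the description.

```python
import numpy as np

OFFSETS = (-100, -50, 50, 100)         # candidate color offsets (delta)
BLOCKS = (8, 10, 12, 15, 16, 20, 24)   # candidate mosaic block sizes (b)


def mosaic(video, b):
    """Average-pool every frame over b x b blocks, then upsample by repetition."""
    t, h, w, c = video.shape
    pad_h, pad_w = -h % b, -w % b      # pad so H and W are divisible by b
    padded = np.pad(video, ((0, 0), (0, pad_h), (0, pad_w), (0, 0)), mode="edge")
    hb, wb = padded.shape[1] // b, padded.shape[2] // b
    pooled = padded.reshape(t, hb, b, wb, b, c).mean(axis=(2, 4))
    upsampled = pooled.repeat(b, axis=1).repeat(b, axis=2)
    return upsampled[:, :h, :w]        # crop the padding away again


def random_object_distortion(v_ref, mask, rng):
    """v_ref: float (T, H, W, 3) in [0, 255]; mask: (T, H, W, 1), 1 = object."""
    theta = rng.uniform(1.5, 3.0)      # scaling factor
    c_star = rng.integers(0, 3)        # target channel
    delta = rng.choice(OFFSETS)        # color offset
    b = int(rng.choice(BLOCKS))        # mosaic block size
    mu = rng.integers(0, 2)            # scaling mode

    v_cd = v_ref.astype(np.float64).copy()
    # ASSUMPTION: mu selects between multiplying and dividing by theta,
    # and delta is then added to the same channel.
    v_cd[..., c_star] = v_cd[..., c_star] * theta if mu == 1 else v_cd[..., c_star] / theta
    v_cd[..., c_star] += delta
    v_cd = np.clip(v_cd, 0.0, 255.0)

    v_cdm = mosaic(v_cd, b)            # channel-distorted, mosaicked video
    # Composite: distorted content inside the mask, reference outside.
    return v_cdm * mask + v_ref * (1.0 - mask)
```

Because the composite multiplies the distorted video by the mask, pixels outside the object region are returned unchanged, which matches the role of the second term in the formula.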

A-O-DisCo replaces training-time randomness with task-adaptive distortion at inference. It first applies contrast scaling with adaptive parameter $\alpha$, then Gaussian blur with kernel size $k = 2b + 1$ and an adaptive standard deviation, and then composites the result into the masked region.

The adaptive parameters are derived from two similarity measures: an inter-image similarity between the edge map of the first frame of the reference video and that of the edited first frame, and an intra-video similarity between consecutive frames in the edited region (Chen et al., 1 Sep 2025). The paper defines both using masked SSIM. The two similarities feed three quadratic functions, from which the contrast and blur parameters are computed; the explicit expressions are given in the paper.

For object removal and outpainting, the framework uses a special case in which the model relies purely on first-frame editing and global conditioning (Chen et al., 1 Sep 2025).
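The inference-time distortion (contrast scaling, Gaussian blur with kernel size 2b + 1, then compositing into the masked region) can be illustrated as follows. This is a hedged sketch: the contrast pivot of 127.5, the edge padding, and the compositing step mirroring the R-O-DisCo formula are assumptions, and the three parameters are taken as plain inputs because their adaptive computation is not reproduced here. All function names are hypothetical.

```python
import numpy as np


def gaussian_kernel_1d(b, sigma):
    """1-D Gaussian kernel of size k = 2b + 1, normalized to sum to 1."""
    x = np.arange(-b, b + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def blur_frames(video, b, sigma):
    """Separable Gaussian blur over H and W for a (T, H, W, C) video."""
    kernel = gaussian_kernel_1d(b, sigma)
    h, w = video.shape[1], video.shape[2]
    padded = np.pad(video, ((0, 0), (b, b), (b, b), (0, 0)), mode="edge")
    # Vertical pass, then horizontal pass.
    tmp = sum(kernel[i] * padded[:, i:i + h] for i in range(2 * b + 1))
    return sum(kernel[j] * tmp[:, :, j:j + w] for j in range(2 * b + 1))


def adaptive_object_distortion(v_ref, mask, alpha, b, sigma):
    """v_ref: float (T, H, W, 3) in [0, 255]; mask: (T, H, W, 1), 1 = edited."""
    # ASSUMPTION: contrast scaling is applied around mid-gray (127.5).
    scaled = np.clip(127.5 + alpha * (v_ref - 127.5), 0.0, 255.0)
    blurred = blur_frames(scaled, b, sigma)
    # ASSUMPTION: compositing mirrors the R-O-DisCo formula.
    return blurred * mask + v_ref * (1.0 - mask)
```

Smaller values of `alpha` and larger values of `b` and `sigma` remove more detail from the object region, which would leave more of the appearance to be determined by the edited first frame.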

The paper also states where O-DisCo enters the network. It is encoded by the same VAE as the reference video and first frame, and its latent is consumed as an additional condition in the conditional DiT via a condition LoRA. No separate O-DisCo branch is introduced; the existing network is reused (Chen et al., 1 Sep 2025).

3. Copy-form preservation and identity consistency

The “copy-form” preservation (CFP) module is designed to preserve non-edited regions by directly copying latent features of those regions from the reference video into the main latent input rather than routing them through a separate condition branch (Chen et al., 1 Sep 2025). The paper contrasts this with standard multi-task designs that maintain separate branches for preserved areas and control signals, then fuse them later, which can cause interference between control and preservation. CFP instead encodes “just keep these as is” at the latent-input level (Chen et al., 1 Sep 2025).

CFP operates on three quantities: the latent of the reference video, the latent of the reference image or edited first frame, and the downsampled binary mask in latent space. It first forms a preserved-video latent that retains the reference-video latent only in non-edited regions, then slices off the first frame of that latent and concatenates the remainder with the first-frame latent. Because the slicing operator removes the first frame, preservation is applied only to later frames. The paper emphasizes that this replaces the usual zero-padding for later frames in the denoising input (Chen et al., 1 Sep 2025). During diffusion, the network therefore receives an initialization that already encodes copied background content for non-edited regions.
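The copy-form construction can be sketched on toy arrays. This is an illustration under stated assumptions rather than the paper's code: the names `z_ref`, `z_img`, and `m_lat` are hypothetical stand-ins for the three latents, the tensor layout (T, C, H, W) is assumed, and the concatenation is taken to run along the time axis because the result is described as replacing the zero-padding of later frames.

```python
import numpy as np


def copy_form_latent(z_ref, z_img, m_lat):
    """Build the image-condition latent used in place of zero-padding.

    z_ref: (T, C, H, W) latent of the reference video
    z_img: (1, C, H, W) latent of the edited first frame
    m_lat: (T, 1, H, W) binary latent-space mask, 1 = edited region
    """
    # Keep reference latents only where the mask marks "not edited".
    preserved = z_ref * (1.0 - m_lat)
    # Drop the first frame of the preserved latent and put the edited
    # first-frame latent in its place (ASSUMPTION: time-axis concatenation).
    return np.concatenate([z_img, preserved[1:]], axis=0)
```

In this sketch the edited region of every later frame is zero, as it would be under ordinary zero-padding, while non-edited regions carry the reference content directly.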

CFP has no explicit loss term; its effect is induced through the overall diffusion objective. In ablations, the paper reports substantial improvement in the PSNR and SSIM variants that measure fidelity in non-edited regions, together with improvement in FVD and overall quality, without harming editability (Chen et al., 1 Sep 2025). During training, the latent-space mask is randomly dilated by max-pooling with a randomly sampled kernel size; at inference, the dilation size can be chosen to adjust how much area is treated as editable versus preserved (Chen et al., 1 Sep 2025).
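Mask dilation by max-pooling can be written as a stride-1 sliding maximum. The sketch below assumes an odd kernel size and a (T, H, W) binary mask; the function name and the zero padding at the borders are illustrative choices, and the kernel sizes actually sampled during training are not reproduced here.

```python
import numpy as np


def dilate_mask(mask, k):
    """Dilate a binary (T, H, W) mask with stride-1 max pooling, odd kernel k."""
    if k % 2 != 1:
        raise ValueError("kernel size must be odd")
    r = k // 2
    h, w = mask.shape[1], mask.shape[2]
    padded = np.pad(mask, ((0, 0), (r, r), (r, r)), mode="constant")
    out = np.zeros_like(mask)
    # Take the maximum over every offset inside the k x k window.
    for dy in range(k):
        for dx in range(k):
            out = np.maximum(out, padded[:, dy:dy + h, dx:dx + w])
    return out
```

A larger `k` grows the editable area and shrinks the region whose latents are copied, which is the trade-off the inference-time dilation setting controls.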

The architecture also includes an optional identity preservation (IDP) module. According to the paper, ID tokens are extracted from the edited region of the reference image and concatenated with text tokens as an extra global condition, while “ID-Resample” re-samples keys and values in attention from generated-video edited regions to enforce identity consistency (Chen et al., 1 Sep 2025). In the reported ablations, IDP improves the CLIP-I-based score in swap and the removal-specific SSIM and PSNR scores in removal (Chen et al., 1 Sep 2025). This suggests that O-DisCo-Edit separates three concerns: local control through distortion, preservation through latent copying, and identity continuity through ID-conditioned attention.

4. Backbone, training paradigm, and inference interface

O-DisCo-Edit is built on a pretrained CogVideoX I2V-based Diffusion-as-Shader backbone that remains frozen during fine-tuning (Chen et al., 1 Sep 2025). The paper describes three add-on components: a condition LoRA on the DiT to ingest O-DisCo latent and text tokens, the CFP module that modifies the initial latent, and the IDP module for identity preservation. All core DiT weights remain frozen; only LoRAs are trained (Chen et al., 1 Sep 2025).
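The frozen-backbone, LoRA-only setup can be illustrated with a generic low-rank adapter on a single linear layer. This is the standard LoRA formulation, not the paper's configuration: the class name, rank, scaling value, and initialization scale are all hypothetical, and the sketch says nothing about which DiT layers the condition LoRA is attached to.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha, rng):
        out_dim, in_dim = weight.shape
        self.weight = weight                      # frozen, never updated
        self.scale = alpha / rank
        self.A = rng.normal(0.0, 0.02, size=(rank, in_dim))  # trainable
        self.B = np.zeros((out_dim, rank))        # trainable, zero-initialized

    def __call__(self, x):
        frozen = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return frozen + self.scale * update

    def trainable_parameters(self):
        return self.A.size + self.B.size
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `A` and `B` would receive gradients during fine-tuning.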

Training uses approximately 180K video-mask pairs from the Senorita-2M “grounding” subset, center-cropped and resized to a fixed resolution with 49 frames, and text prompts for masked regions generated by Qwen2.5-VL-7B (Chen et al., 1 Sep 2025). Each training sample contains a reference video, a mask video, and a text description of the object or region. The training process is explicitly two-stage. In stage 1, the base Diffusion-as-Shader is frozen, R-O-DisCo is used as O-DisCo, CFP is used to build the initial latent, and the condition LoRA is trained for 2400 steps using AdamW on 8 A800 GPUs with batch size 32. In stage 2, the base model and condition LoRA are fixed and a separate ID LoRA is trained for 5150 steps (Chen et al., 1 Sep 2025).

The paper stresses that no explicit multi-task labels or separate losses for different tasks are introduced. Multi-task behavior is instead attributed to the robustness induced by R-O-DisCo together with first-frame conditioning (Chen et al., 1 Sep 2025). A plausible implication is that the framework’s unification comes less from semantic task taxonomy than from a shared perturbation-to-edit mapping learned over a broad edit distribution.

At inference, the interface consists of four inputs: the reference video $V_\text{ref}$, an edited first frame, a mask $M$, and an optional textual instruction (Chen et al., 1 Sep 2025). The edited first frame may be produced by an external image editor such as HiDream-E1 or commercial tools. The inference pipeline then computes A-O-DisCo from the reference video, the edited first frame, and the mask; encodes inputs with the VAE; builds the initial latent via CFP; and runs diffusion sampling to produce the edited video (Chen et al., 1 Sep 2025). The paper explicitly contrasts this with prior systems such as VACE, Senorita, and VideoPainter, which are described as requiring multiple signals or task-specific modules, whereas O-DisCo-Edit uses a single generic O-DisCo signal and mask for all tasks within a single unified LoRA-augmented DiT (Chen et al., 1 Sep 2025).

5. Editing capabilities and empirical performance

The framework supports object removal, outpainting, object internal motion transfer, lighting transfer, color change, object swap, object addition, and style transfer (Chen et al., 1 Sep 2025). The paper describes these tasks as arising from different configurations of the edited first frame, mask, and adaptive O-DisCo parameters rather than from explicit task labels. For example, removal and outpainting use the special-case O-DisCo setting in which the model relies on first-frame editing and global conditioning; style transfer and color change use moderate blur and contrast parameters; motion and lighting transfer use O-DisCo to preserve structure while changing internal dynamics or lighting response; and swap or addition use the edited first frame to define a new identity while O-DisCo preserves underlying motion patterns (Chen et al., 1 Sep 2025).

Evaluation is carried out on two benchmarks. A custom multi-task benchmark contains 134 triplets of video, edited first frame, and mask from DAVIS and VPData, spanning outpainting, internal motion transfer, lighting transfer, color change, swap, addition, and style transfer, with 49 frames per video. Object removal is evaluated on OmnimatteRF, using first-frame edits generated by a commercial editor for fairness (Chen et al., 1 Sep 2025). Baselines include the multi-task or unified editors VACE (1.3B and 14B), Senorita, and VideoPainter, and the specialized object removal systems DiffuEraser, MiniMax-Remover, and ProPainter (Chen et al., 1 Sep 2025).

The metrics cover multiple dimensions: FVD, PSNR, SSIM, TC, and ArtFID for video quality; CLIP-T and a CLIP-I-based score for alignment; PSNR and SSIM variants computed for preservation of non-edited regions; separate SSIM and PSNR variants for removal quality; CFSD for style transfer; a min-max normalized average score; and MOS-based user studies for editing completeness and video quality (Chen et al., 1 Sep 2025).
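A min-max normalized average score of the kind listed above can be sketched as follows. The exact aggregation used in the paper is not specified in this summary, so the direction handling (flipping lower-is-better metrics such as FVD) and the treatment of ties are assumptions, and the function name is hypothetical.

```python
def normalized_average(scores, higher_is_better):
    """Average of per-metric min-max normalized scores for each method.

    scores: {method: {metric: value}}
    higher_is_better: {metric: bool}; False for metrics such as FVD.
    """
    methods = list(scores)
    totals = {m: 0.0 for m in methods}
    for metric, better_high in higher_is_better.items():
        values = [scores[m][metric] for m in methods]
        lo, hi = min(values), max(values)
        for m in methods:
            # ASSUMPTION: if all methods tie on a metric, each gets 0.5.
            norm = 0.5 if hi == lo else (scores[m][metric] - lo) / (hi - lo)
            totals[m] += norm if better_high else 1.0 - norm
    return {m: totals[m] / len(higher_is_better) for m in methods}
```

Under this scheme each metric contributes a value in [0, 1] per method, so the average rewards methods that are consistently near the best across metrics rather than dominant on one.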

Across eight tasks, the paper reports that O-DisCo-Edit typically achieves best or second-best scores and often dominates both specialized and multi-task baselines (Chen et al., 1 Sep 2025). On OmnimatteRF object removal, it is reported to outperform the specialized MiniMax-Remover, ProPainter, and DiffuEraser, including on the normalized average score for Remove(49). For outpainting, it reports the best FVD, ahead of VACE 1.3B, together with its PSNR and its non-edited-region PSNR and SSIM results (Chen et al., 1 Sep 2025). For internal motion transfer, lighting transfer, and color change, it is described as top or near-top in ArtFID, the CLIP-I-based score, and preservation metrics, with best or tied TC. For swap and addition, automatic metrics are described as comparable to VACE, often with a better CLIP-I-based score, while user studies rank it highly (Chen et al., 1 Sep 2025).

The user study reports preference for O-DisCo-Edit in editing completeness and video quality for most tasks, including object removal, outpainting, internal motion transfer, lighting transfer, swap, addition, and style transfer. The paper notes one exception: for color change, Senorita slightly edges it out in style satisfaction, but with worse non-edited preservation (Chen et al., 1 Sep 2025).

Ablation studies isolate the contribution of CFP, adaptive A-O-DisCo, and IDP. The reported findings are that CFP is crucial for background preservation; adaptive O-DisCo improves FVD together with PSNR- and SSIM-based scores and reduces outpainting boundary artifacts; IDP improves identity consistency in swap and fidelity in removal; and the combination of all three yields the best overall metrics (Chen et al., 1 Sep 2025).

6. Efficiency, limitations, and relation to prior work

The framework’s efficiency claim rests on LoRA-only adaptation and a short two-stage training schedule. Training uses approximately 7550 steps in total—2400 for the condition LoRA and 5150 for the ID LoRA—on 8 A800 GPUs with batch size 32 (Chen et al., 1 Sep 2025). The paper contrasts this with VACE, described as using 8 DiT blocks and 200K steps on 128 A100s; Senorita, described as using 102 blocks and multi-stage training; and VideoPainter, described as using 2 blocks plus 1 LoRA on 390K videos with 82K steps on 64 V100s (Chen et al., 1 Sep 2025). At inference, the paper states that the model uses a single DiT pass per sampling step like baseline CogVideoX, with overhead limited to additional inputs and LoRA multiplications (Chen et al., 1 Sep 2025).

The framework’s main limitations are also stated explicitly. In swap tasks involving complex four-limbed animal motions, O-DisCo-Edit sometimes misaligns legs; the paper adds that VACE 14B, with larger capacity, does slightly better in such cases (Chen et al., 1 Sep 2025). The authors attribute this to base-model limitations, limited training examples with complex animal motion, and a small trainable parameter budget under LoRA-only tuning (Chen et al., 1 Sep 2025). The method is also described as depending strongly on the quality of the first-frame edit: poor or inconsistent first-frame edits degrade video quality (Chen et al., 1 Sep 2025).

In relation to prior work, the paper positions single-task editors as relying on specific signals and separate networks, and multi-task editors such as VACE, Senorita, OmniV2V, and UNIC as integrating multiple signals via task-specific modules, adapters, and multi-stage training (Chen et al., 1 Sep 2025). O-DisCo-Edit’s contribution is defined more narrowly: a unified control representation based on distortion, a direct latent-copy preservation mechanism, and an empirical demonstration that LoRA fine-tuning of a strong frozen backbone can match or surpass more heavily engineered multi-task systems (Chen et al., 1 Sep 2025). This suggests that the framework’s novelty lies less in inventing a new diffusion backbone than in reframing control as object distortion and preservation as direct latent copying.

The broader-impact discussion is similarly conventional but explicit. The paper identifies positive uses in content creation, film, advertising, and educational content, and notes that lower training cost can encourage broader research and more sustainable experimentation. It also identifies the risks typical of realistic video editing tools: deepfake creation, misleading or malicious content, and reinforcement of stereotypes in manipulated human imagery. The authors state that the model will be released under responsible-use constraints and emphasize the need for licensing and guidelines (Chen et al., 1 Sep 2025).
