O-DisCo-Edit: Unified Video Editing
- The paper introduces object distortion control (O-DisCo) as a unified editing cue, replacing task-specific signals with a distorted RGB reference.
- It employs a copy-form preservation mechanism that copies latent features from non-edited regions to maintain background fidelity and temporal stability.
- O-DisCo-Edit achieves competitive results across tasks like removal, outpainting, and motion transfer using efficient LoRA tuning on a frozen diffusion backbone.
O-DisCo-Edit, short for "Object Distortion Control for Unified Realistic Video Editing," is a first-frame-guided video editing framework built on the CogVideoX I2V backbone for unified realistic video editing across object removal, outpainting, motion transfer, lighting transfer, color change, swap, addition, and style transfer (Chen et al., 1 Sep 2025). Its central premise is that diverse editing cues can be represented through a single control type, an RGB-space "distorted video" of the reference object region, rather than through task-specific signals such as masks, optical flow, keypoints, or multi-branch control modules. The framework couples this unified control representation, termed Object Distortion Control (O-DisCo), with a "copy-form" preservation mechanism that directly copies non-edited latent regions into the denoising stream, with the stated goal of achieving precise local editing, preservation of non-edited regions, temporal stability, and lower training complexity within a single model (Chen et al., 1 Sep 2025).
1. Unified formulation and problem setting
Diffusion-based video editors commonly take a reference video and an editing condition, then synthesize an edited video. The problem, as defined for O-DisCo-Edit, has two core requirements: precise control of edited regions and preservation of non-edited regions such as backgrounds and untouched objects (Chen et al., 1 Sep 2025). The paper situates existing approaches as relying on task-specific signals (masks for inpainting or removal, optical flow for motion transfer, keypoints for pose control) and often requiring separate models or heavy multi-branch architectures. It further characterizes these systems as inflexible at inference, typically bound to exactly one control signal and poorly suited to combining coarse and fine cues (Chen et al., 1 Sep 2025).
Within this formulation, "unified" has three explicit meanings. First, O-DisCo-Edit adopts a unified control representation: all editing cues are encoded into the same data type, namely a distorted RGB video processed by the same VAE and DiT as other inputs. Second, it adopts a unified architecture: one CogVideoX-based backbone with a condition LoRA for O-DisCo, a copy-form preservation module, and an optional identity preservation module. Third, it adopts unified training: a single training dataset of approximately 180K video-mask pairs, without explicit multi-task pseudo-label construction or multiple control heads (Chen et al., 1 Sep 2025). A plausible implication is that the system treats multi-task editing as a single conditional generation problem rather than as a collection of separate task-conditioned pipelines.
A concise summary of the framework's principal components is given below.
| Component | Mechanism | Role |
|---|---|---|
| O-DisCo | Distorted RGB video in the masked region | Unified editing control |
| CFP | Direct copy of non-edited latent regions | Preservation of non-edited areas |
| IDP | ID tokens and "ID-Resample" | Identity consistency |
This organization matters because the paper's claim to unification does not mean that explicit user control disappears. The inference interface still requires a reference video, an edited first frame, a mask, and optionally text; what is unified is the model-side control pathway, not the absence of editing inputs (Chen et al., 1 Sep 2025).
2. Object Distortion Control
O-DisCo defines editing cues as controlled distortions of the object region in the reference video (Chen et al., 1 Sep 2025). The paper lists color changes, low-resolution or mosaic degradation, blur, intensity changes, and complete zeroing-out as representative distortions. It explicitly interprets several conventional signals through this lens: a mask as an extreme distortion, optical flow or trajectory as a form of warping or distortion, and style or color change as appearance distortion. The key claim is that the model learns a single mapping from such distorted reference videos to the desired edits rather than separate mappings for each signal structure (Chen et al., 1 Sep 2025).
The method distinguishes between Random Object Distortion Control (R-O-DisCo), used during training, and Adaptive Object Distortion Control (A-O-DisCo), used during inference. R-O-DisCo is constructed from a reference video and a binary spatio-temporal mask by applying random channel distortion, then mosaicking via average pooling and upsampling, and finally compositing the distorted object region back into the original video.
The paper specifies an algorithmic version in which the scaling factor, the target channel, the color offset, the block size, and the scaling mode are each sampled at random (Chen et al., 1 Sep 2025). The stated purpose of the mosaic-like degradation is to destroy object detail so that the model cannot simply copy colors from O-DisCo and must instead rely on the first frame for fine appearance while still using O-DisCo as coarse guidance (Chen et al., 1 Sep 2025).
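The three R-O-DisCo steps described above (channel distortion, mosaic, composite) can be sketched in NumPy. This is an illustrative sketch only: the function name, the sampling ranges, and the choice of a scale-plus-offset channel distortion are assumptions, and the paper's "scaling mode" is not modeled.

```python
import numpy as np

def random_object_distortion(video, mask, rng,
                             scale_range=(0.5, 1.5),    # placeholder range
                             offset_range=(-0.3, 0.3),  # placeholder range
                             block_sizes=(4, 8, 16)):   # placeholder choices
    """Hypothetical R-O-DisCo-style distortion (not the paper's exact recipe).

    video: float array (T, H, W, 3) with values in [0, 1]
    mask:  array (T, H, W), 1 inside the object region
    """
    t, h, w, c = video.shape
    distorted = video.copy()

    # 1) Random channel distortion: scale and offset one colour channel.
    channel = rng.integers(0, c)
    scale = rng.uniform(*scale_range)
    offset = rng.uniform(*offset_range)
    distorted[..., channel] = np.clip(
        distorted[..., channel] * scale + offset, 0.0, 1.0)

    # 2) Mosaic: average-pool over b x b blocks, then nearest upsample.
    b = int(rng.choice(block_sizes))
    hp, wp = -(-h // b) * b, -(-w // b) * b  # round up to a multiple of b
    padded = np.pad(distorted, ((0, 0), (0, hp - h), (0, wp - w), (0, 0)),
                    mode="edge")
    pooled = padded.reshape(t, hp // b, b, wp // b, b, c).mean(axis=(2, 4))
    mosaic = np.repeat(np.repeat(pooled, b, axis=1), b, axis=2)[:, :h, :w]

    # 3) Composite: distorted pixels inside the mask, original outside.
    m = mask.astype(video.dtype)[..., None]
    return m * mosaic + (1.0 - m) * video
```

Because the mosaic averages away fine detail inside the mask, only coarse color and layout survive there, which matches the stated intent of forcing the model to take fine appearance from the first frame.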
A-O-DisCo replaces training-time randomness with task-adaptive distortion at inference. It first applies contrast scaling with an adaptive parameter, then Gaussian blur with a given kernel size and standard deviation, and then composites the result into the masked region.
The adaptive parameters are derived from two similarity measures: an inter-image similarity between the edge map of the first frame of the reference video and that of the edited first frame, and an intra-video similarity between consecutive frames in the edited region (Chen et al., 1 Sep 2025). The paper defines both in terms of masked SSIM and feeds them into three quadratic functions, from which the distortion parameters are computed.
For object removal and outpainting, the framework uses a special-case setting, so the model relies purely on first-frame editing and global conditioning (Chen et al., 1 Sep 2025).
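The inference-time distortion pipeline (contrast scaling, Gaussian blur, masked composite) can be sketched as follows. The parameter names `alpha`, `ksize`, and `sigma` are hypothetical, contrast is scaled about the per-frame mean as an assumption, and the parameters are passed in directly because the similarity-to-parameter mapping is not reproduced here.

```python
import numpy as np

def gaussian_kernel(ksize, sigma):
    """Normalized 1-D Gaussian kernel of odd length ksize."""
    ax = np.arange(ksize) - (ksize - 1) / 2.0
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    return k / k.sum()

def blur_frame(frame, ksize, sigma):
    """Separable Gaussian blur of one (H, W, C) frame with edge padding."""
    if ksize % 2 == 0:
        raise ValueError("ksize must be odd")
    k = gaussian_kernel(ksize, sigma)
    pad = ksize // 2
    out = np.pad(frame, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    conv = lambda v: np.convolve(v, k, mode="valid")
    out = np.apply_along_axis(conv, 0, out)  # blur along height
    out = np.apply_along_axis(conv, 1, out)  # blur along width
    return out

def adaptive_object_distortion(video, mask, alpha, ksize, sigma):
    """Hypothetical A-O-DisCo-style distortion.

    video: float array (T, H, W, 3) in [0, 1]; mask: (T, H, W), 1 = edited.
    """
    m = mask.astype(video.dtype)[..., None]
    # Contrast scaling about the per-frame mean (an assumption).
    mean = video.mean(axis=(1, 2), keepdims=True)
    scaled = np.clip((video - mean) * alpha + mean, 0.0, 1.0)
    blurred = np.stack([blur_frame(f, ksize, sigma) for f in scaled])
    # Composite the distorted content into the masked region only.
    return m * blurred + (1.0 - m) * video
```

A smaller `alpha` or a larger blur weakens the structural cue carried by the reference object, so the edited first frame dominates; milder settings keep more of the original structure.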
The paper also states where O-DisCo enters the network. It is encoded by the same VAE as the reference video and first frame, and its latent is consumed as an additional condition in the conditional DiT via a condition LoRA. No separate O-DisCo branch is introduced; the existing network is reused (Chen et al., 1 Sep 2025).
3. Copy-form preservation and identity consistency
The "copy-form" preservation (CFP) module is designed to preserve non-edited regions by directly copying latent features of those regions from the reference video into the main latent input rather than routing them through a separate condition branch (Chen et al., 1 Sep 2025). The paper contrasts this with standard multi-task designs that maintain separate branches for preserved areas and control signals, then fuse them later, which can cause interference between control and preservation. CFP instead encodes "just keep these as is" at the latent-input level (Chen et al., 1 Sep 2025).
CFP operates on three quantities: the latent of the reference video, the latent of the reference image (the edited first frame), and the binary mask downsampled to latent resolution. It forms a preserved-video latent that retains the reference latent in non-edited regions, then concatenates the first-frame latent with that preserved latent. A slicing operator removes the first frame of the preserved latent before concatenation, so preservation is applied only to later frames. The paper emphasizes that this replaces the usual zero-padding for later frames in the denoising input (Chen et al., 1 Sep 2025). During diffusion, the network therefore receives an initialization that already encodes copied background content for non-edited regions.
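The latent composition described above can be illustrated with a small NumPy sketch. The names `z_ref`, `z_first`, and `m_latent`, the tensor layout, and the choice to zero out edited regions of the preserved latent are all assumptions made for illustration; the source only states that non-edited regions are copied and that the first frame is sliced off before concatenation.

```python
import numpy as np

def copy_form_preservation(z_ref, z_first, m_latent):
    """Hypothetical sketch of a copy-form latent input.

    z_ref:    (T, C, H, W) latent of the reference video
    z_first:  (1, C, H, W) latent of the edited first frame
    m_latent: (T, 1, H, W) binary mask at latent resolution, 1 = edited
    """
    # Keep reference latents where the mask is 0; zero them where it is 1
    # (zeroing the edited region is an assumption of this sketch).
    z_pres = z_ref * (1.0 - m_latent)
    # Drop the first frame of the preserved latent and prepend the
    # first-frame latent, in place of zero-padding the later frames.
    return np.concatenate([z_first, z_pres[1:]], axis=0)
```

The result has the same temporal length as the reference latent: frame 0 comes from the edited first frame and frames 1 onward carry copied background content.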
CFP has no explicit loss term; its effect is induced through the overall diffusion objective. In ablations, the paper reports substantial improvement in the PSNR and SSIM variants that measure fidelity in non-edited regions, together with improvement in FVD and overall quality, without harming editability (Chen et al., 1 Sep 2025). During training, the latent-space mask is randomly dilated by max-pooling with a randomly sampled kernel size; at inference, the dilation size can be chosen to adjust how much area is treated as editable versus preserved (Chen et al., 1 Sep 2025).
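Dilating a binary mask by max-pooling can be sketched as a stride-1 pooling with "same" padding. The function name and the restriction to odd kernel sizes are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def dilate_mask(mask, k):
    """Dilate a binary mask (T, H, W) by stride-1 max pooling.

    k is an odd kernel size; zero padding keeps the output the same
    shape as the input.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd")
    pad = k // 2
    padded = np.pad(mask, ((0, 0), (pad, pad), (pad, pad)), mode="constant")
    windows = np.lib.stride_tricks.sliding_window_view(
        padded, (k, k), axis=(1, 2))        # shape (T, H, W, k, k)
    return windows.max(axis=(-2, -1))

# A single marked cell grows into a k x k editable block.
m = np.zeros((1, 7, 7))
m[0, 3, 3] = 1
print(int(dilate_mask(m, 3).sum()))  # 9
```

A larger kernel treats more of the surrounding area as editable, while a kernel of 1 leaves the mask unchanged.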
The architecture also includes an optional identity preservation (IDP) module. According to the paper, ID tokens are extracted from the edited region of the reference image and concatenated with text tokens as an extra global condition, while "ID-Resample" re-samples keys and values in attention from generated-video edited regions to enforce identity consistency (Chen et al., 1 Sep 2025). In the reported ablations, IDP improves CLIP-I in swap and improves the SSIM and PSNR variants used for removal (Chen et al., 1 Sep 2025). This suggests that O-DisCo-Edit separates three concerns: local control through distortion, preservation through latent copying, and identity continuity through ID-conditioned attention.
4. Backbone, training paradigm, and inference interface
O-DisCo-Edit is built on a pretrained CogVideoX I2V-based Diffusion-as-Shader backbone that remains frozen during fine-tuning (Chen et al., 1 Sep 2025). The paper describes three add-on components: a condition LoRA on the DiT to ingest the O-DisCo latent and text tokens, the CFP module that modifies the initial latent input, and the IDP module for identity preservation. All core DiT weights remain frozen; only LoRAs are trained (Chen et al., 1 Sep 2025).
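For readers unfamiliar with LoRA, the general idea of training a low-rank update on top of a frozen weight can be sketched as below. This is the generic LoRA formulation, not the paper's configuration: the class name, rank, scaling, and initialization are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Generic LoRA sketch: y = x W^T + (alpha / r) * x A^T B^T."""

    def __init__(self, weight, rank, alpha, rng):
        self.weight = weight                           # frozen, (out, in)
        out_dim, in_dim = weight.shape
        self.A = rng.normal(0.0, 0.01, (rank, in_dim))  # trainable
        self.B = np.zeros((out_dim, rank))              # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T                        # frozen path
        update = (x @ self.A.T) @ self.B.T              # low-rank path
        return base + self.scale * update

    def trainable_fraction(self):
        lora = self.A.size + self.B.size
        return lora / (lora + self.weight.size)

rng = np.random.default_rng(0)
layer = LoRALinear(rng.normal(size=(64, 64)), rank=4, alpha=8.0, rng=rng)
print(round(layer.trainable_fraction(), 3))  # 0.111
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the small `A` and `B` matrices receive gradient updates.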
Training uses approximately 180K video-mask pairs from the Senorita-2M "grounding" subset, center-cropped and resized to a fixed resolution with 49 frames, and text prompts for masked regions generated by Qwen2.5-VL-7B (Chen et al., 1 Sep 2025). Each training sample contains a reference video, a mask video, and a text description of the object or region. The training process is explicitly two-stage. In stage 1, the base Diffusion-as-Shader is frozen, R-O-DisCo is used as O-DisCo, CFP is used to build the initial latent input, and the condition LoRA is trained for 2400 steps using AdamW on 8 A800 GPUs with batch size 32. In stage 2, the base model and condition LoRA are fixed and a separate ID LoRA is trained for 5150 steps (Chen et al., 1 Sep 2025).
The paper stresses that no explicit multi-task labels or separate losses for different tasks are introduced. Multi-task behavior is instead attributed to the robustness induced by R-O-DisCo together with first-frame conditioning (Chen et al., 1 Sep 2025). A plausible implication is that the framework's unification comes less from semantic task taxonomy than from a shared perturbation-to-edit mapping learned over a broad edit distribution.
At inference, the interface consists of four inputs: the reference video, an edited first frame, a mask, and an optional textual instruction (Chen et al., 1 Sep 2025). The edited first frame may be produced by an external image editor such as HiDream-E1 or commercial tools. The inference pipeline then computes A-O-DisCo from the reference video, the edited first frame, and the mask; encodes inputs with the VAE; builds the initial latent input via CFP; and runs diffusion sampling to produce the edited video (Chen et al., 1 Sep 2025). The paper explicitly contrasts this with prior systems such as VACE, Senorita, and VideoPainter, which are described as requiring multiple signals or task-specific modules, whereas O-DisCo-Edit uses a single generic O-DisCo signal and mask for all tasks within a single unified LoRA-augmented DiT (Chen et al., 1 Sep 2025).
5. Editing capabilities and empirical performance
The framework supports object removal, outpainting, object internal motion transfer, lighting transfer, color change, object swap, object addition, and style transfer (Chen et al., 1 Sep 2025). The paper describes these tasks as arising from different configurations of the edited first frame, mask, and adaptive O-DisCo parameters rather than from explicit task labels. For example, removal and outpainting use the special-case O-DisCo setting; style transfer and color change use moderate blur and contrast parameters; motion and lighting transfer use O-DisCo to preserve structure while changing internal dynamics or lighting response; and swap or addition use the edited first frame to define a new identity while O-DisCo preserves underlying motion patterns (Chen et al., 1 Sep 2025).
Evaluation is carried out on two benchmarks. A custom multi-task benchmark contains 134 triplets of video, edited first frame, and mask from DAVIS and VPData, spanning outpainting, internal motion transfer, lighting transfer, color change, swap, addition, and style transfer at a fixed resolution with 49 frames. Object removal is evaluated on OmnimatteRF, using first-frame edits generated by a commercial editor for fairness (Chen et al., 1 Sep 2025). Baselines include multi-task or unified editors (VACE at 1.3B and 14B, Senorita, and VideoPainter) and the specialized object removal systems DiffuEraser, MiniMax-Remover, and ProPainter (Chen et al., 1 Sep 2025).
The metrics cover multiple dimensions: FVD, PSNR, SSIM, TC, and ArtFID for video quality; CLIP-T and CLIP-I for alignment; PSNR and SSIM variants restricted to non-edited regions for preservation; separate SSIM and PSNR variants for removal quality; CFSD for style transfer; a min-max normalized average score; and MOS-based user studies for editing completeness and video quality (Chen et al., 1 Sep 2025).
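A min-max normalized average score can be computed along the following lines. The source does not spell out the exact procedure, so the per-metric normalization across methods, the flipping of lower-is-better metrics, and the unweighted mean are assumptions; the numbers in the usage example are made-up toy values, not results from the paper.

```python
import numpy as np

def normalized_average(scores, higher_is_better):
    """Hypothetical min-max normalized average.

    scores: (n_methods, n_metrics) raw metric values
    higher_is_better: one boolean per metric
    Returns one aggregate score per method in [0, 1].
    """
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(axis=0), scores.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    norm = (scores - lo) / span
    flip = ~np.asarray(higher_is_better, dtype=bool)
    norm[:, flip] = 1.0 - norm[:, flip]      # lower-is-better metrics
    return norm.mean(axis=1)

# Toy values: column 0 is an FVD-like metric (lower is better),
# column 1 is an SSIM-like metric (higher is better).
toy = [[10.0, 0.9], [20.0, 0.8], [30.0, 0.7]]
print(normalized_average(toy, [False, True]))
```

Normalizing per metric before averaging keeps a metric with a large numeric range, such as FVD, from dominating metrics bounded in [0, 1].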
Across eight tasks, the paper reports that O-DisCo-Edit typically achieves best or second-best scores and often dominates both specialized and multi-task baselines (Chen et al., 1 Sep 2025). On OmnimatteRF object removal, it is reported to outperform the specialized MiniMax-Remover, ProPainter, and DiffuEraser, including on the normalized average score for Remove(49). For outpainting, it reports the best FVD, ahead of VACE 1.3B, together with the reported PSNR and the non-edited-region PSNR and SSIM values (Chen et al., 1 Sep 2025). For internal motion transfer, lighting transfer, and color change, it is described as top or near-top in ArtFID, CLIP-I, and preservation metrics, with best or tied TC. For swap and addition, automatic metrics are described as comparable to VACE, often with better CLIP-I, while user studies rank it highly (Chen et al., 1 Sep 2025).
The user study reports preference for O-DisCo-Edit in editing completeness and video quality for most tasks, including object removal, outpainting, internal motion transfer, lighting transfer, swap, addition, and style transfer. The paper notes one exception: for color change, Senorita slightly edges it out in style satisfaction, but with worse non-edited preservation (Chen et al., 1 Sep 2025).
Ablation studies isolate the contribution of CFP, adaptive A-O-DisCo, and IDP. The reported findings are that CFP is crucial for background preservation; adaptive O-DisCo improves FVD and the reported PSNR and SSIM variants and reduces outpainting boundary artifacts; IDP improves identity consistency in swap and fidelity in removal; and the combination of all three yields the best overall metrics (Chen et al., 1 Sep 2025).
6. Efficiency, limitations, and relation to prior work
The framework's efficiency claim rests on LoRA-only adaptation and a short two-stage training schedule. Training uses approximately 7550 steps in total (2400 for the condition LoRA and 5150 for the ID LoRA) on 8 A800 GPUs with batch size 32 (Chen et al., 1 Sep 2025). The paper contrasts this with VACE, described as using 8 DiT blocks and 200K steps on 128 A100s; Senorita, described as using 102 blocks and multi-stage training; and VideoPainter, described as using 2 blocks plus 1 LoRA on 390K videos with 82K steps on 64 V100s (Chen et al., 1 Sep 2025). At inference, the paper states that the model uses a single DiT pass per sampling step like baseline CogVideoX, with overhead limited to additional inputs and LoRA multiplications (Chen et al., 1 Sep 2025).
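The training budget quoted above can be checked with simple arithmetic. The sketch assumes that the batch size of 32 is the global batch size and that it applies to both stages, neither of which the source states explicitly; the variable names are illustrative.

```python
# Step counts and dataset size as quoted in the text.
stage1_steps = 2400       # condition LoRA
stage2_steps = 5150       # ID LoRA
batch_size = 32           # assumed global, assumed same for both stages
dataset_size = 180_000    # approximate number of video-mask pairs

total_steps = stage1_steps + stage2_steps
stage1_samples = stage1_steps * batch_size
total_samples = total_steps * batch_size

print(total_steps)                              # 7550
print(stage1_samples)                           # 76800
print(round(stage1_samples / dataset_size, 2))  # 0.43
print(total_samples)                            # 241600
```

Under these assumptions, stage 1 processes fewer samples than the dataset contains, that is, less than one full pass over the roughly 180K pairs.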
The framework's main limitations are also stated explicitly. In swap tasks involving complex four-limbed animal motions, O-DisCo-Edit sometimes misaligns legs; the paper adds that VACE 14B, with larger capacity, does slightly better in such cases (Chen et al., 1 Sep 2025). The authors attribute this to base-model limitations, limited training examples with complex animal motion, and a small trainable parameter budget under LoRA-only tuning (Chen et al., 1 Sep 2025). The method is also described as depending strongly on the quality of the first-frame edit: poor or inconsistent first-frame edits degrade video quality (Chen et al., 1 Sep 2025).
In relation to prior work, the paper positions single-task editors as relying on specific signals and separate networks, and multi-task editors such as VACE, Senorita, OmniV2V, and UNIC as integrating multiple signals via task-specific modules, adapters, and multi-stage training (Chen et al., 1 Sep 2025). O-DisCo-Edit's contribution is defined more narrowly: a unified control representation based on distortion, a direct latent-copy preservation mechanism, and an empirical demonstration that LoRA fine-tuning of a strong frozen backbone can match or surpass more heavily engineered multi-task systems (Chen et al., 1 Sep 2025). This suggests that the framework's novelty lies less in inventing a new diffusion backbone than in reframing control as object distortion and preservation as direct latent copying.
The broader-impact discussion is similarly conventional but explicit. The paper identifies positive uses in content creation, film, advertising, and educational content, and notes that lower training cost can encourage broader research and more sustainable experimentation. It also identifies the risks typical of realistic video editing tools: deepfake creation, misleading or malicious content, and reinforcement of stereotypes in manipulated human imagery. The authors state that the model will be released under responsible-use constraints and emphasize the need for licensing and guidelines (Chen et al., 1 Sep 2025).