
EasyOmnimatte: Efficient Video Decomposition

Updated 2 January 2026
  • EasyOmnimatte is a video layered decomposition framework that separates foreground, scene effects, and background in one streamlined process.
  • It utilizes a pretrained video inpainting diffusion transformer with LoRA finetuning and a dual-expert mechanism to improve effect extraction and matte quality.
  • Experimental results demonstrate competitive PSNR, SSIM, and FVD metrics while significantly reducing computational cost for various video editing applications.

EasyOmnimatte is an end-to-end video layered decomposition framework designed to extract foreground objects, their associated scene effects (such as shadows and reflections), and the background from videos in a single, unified pass. Unlike previous omnimatte approaches, which employ slow, multi-stage, or inference-time optimization pipelines, EasyOmnimatte leverages finetuned video inpainting diffusion models and a novel dual-expert mechanism to achieve high-quality omnimatte decomposition with reduced computational cost and minimal user effort (Hu et al., 26 Dec 2025).

1. Pretrained Diffusion Backbone and Data Representation

EasyOmnimatte is built upon a pretrained video inpainting diffusion transformer (DiT) backbone. The network operates on sequences of $N$ video frames, $V \in \mathbb{R}^{N \times H \times W \times 3}$, accompanied by coarse per-frame foreground masks $M \in \{0,1\}^{N \times H \times W}$ and optional text/background conditioning $c$. The DDPM framework forms the basis for generative modeling, with pixel- or latent-space formulations,

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

and denoising performed via stacked DiT blocks $L_1, \dots, L_B$ containing multi-head self-attention and feed-forward networks. Pretraining targets object and effect removal, optimizing

$$L_\text{pretrain} = \mathbb{E}_{x, M, c, t, \epsilon}\left[\, \|\epsilon - \epsilon_\theta(x_t, M, c, t)\|^2 \,\right]$$

to capture generative priors pertinent to object-effect separation (Hu et al., 26 Dec 2025).
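As a concrete illustration of the forward process and pretraining objective above, the following minimal NumPy sketch samples $x_t$ from the standard closed-form marginal of the linear-$\beta$ DDPM chain and evaluates the noise-prediction MSE. Function names and shapes are hypothetical, and the denoiser is stubbed with a zero predictor purely for illustration:

```python
import numpy as np

# Toy sketch of the DDPM forward process and the pretraining objective above.
# Function names and shapes are illustrative, not the paper's implementation.

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t from the closed-form marginal q(x_t | x_0) of the forward chain."""
    alpha_bar = np.cumprod(1.0 - betas)[t]          # cumulative product of (1 - beta_s)
    eps = rng.standard_normal(x0.shape)             # the noise the denoiser must predict
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

def pretrain_loss(eps_pred, eps):
    """MSE between predicted and true noise, mirroring L_pretrain."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 200)                # linear schedule, T = 200
x0 = rng.standard_normal((4, 8, 8, 3))              # toy clip: 4 frames of 8x8 RGB
x_t, eps = forward_diffuse(x0, t=100, betas=betas, rng=rng)
loss = pretrain_loss(np.zeros_like(eps), eps)       # zero predictor, for illustration
```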

2. LoRA Finetuning and Dual-Expert Architecture

Central to EasyOmnimatte is the application of Low-Rank Adaptation (LoRA) to the DiT blocks. LoRA introduces a trainable, rank-$r$ update $\Delta W_b = A_b B_b$ that supplements the frozen block weights $W_b$ with minimal overhead. The dual-expert strategy is motivated by the empirical observation that effect cues are encoded only in selected late-stage blocks, and that indiscriminate LoRA finetuning suppresses effect extraction.

  • Effect Expert ($\theta^E$): LoRA adapters are applied exclusively to the last $K$ effect-sensitive blocks ($B_\text{effect} = \{b \mid b > B - K\}$), focusing on coarse foreground and effect decomposition.
  • Quality Expert ($\theta^Q$): LoRA adapters are applied to all blocks ($B_\text{quality} = \{1, \dots, B\}$), refining alpha matte details.

Weight updates for each expert are:

$$W_b^{(E)} = \begin{cases} W_b + A_b^E B_b^E, & b \in B_\text{effect} \\ W_b, & \text{otherwise} \end{cases}$$

$$W_b^{(Q)} = W_b + A_b^Q B_b^Q, \qquad \forall b$$

This division enables selective adaptation during inference, yielding higher-quality layered decompositions than prior approaches (Hu et al., 26 Dec 2025).
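The block-selective weight updates can be sketched in a small NumPy toy. Dimensions, rank, and helper names here are invented for illustration, not the paper's implementation; the point is that the Effect Expert adds $A_b B_b$ only on the last $K$ blocks while the Quality Expert adapts all of them:

```python
import numpy as np

# Toy NumPy sketch of the dual-expert LoRA weight updates above. Dimensions,
# rank, and helper names are invented for illustration only.

def lora_delta(d, rank, rng):
    """Low-rank update Delta W = A @ B, with A in R^{d x r} and B in R^{r x d}."""
    A = 0.1 * rng.standard_normal((d, rank))
    B = 0.1 * rng.standard_normal((rank, d))
    return A @ B

def expert_weights(W, adapters, blocks):
    """Effective per-block weights: W_b + Delta W_b on the chosen block set, W_b elsewhere."""
    return [W[b] + adapters[b] if b in blocks else W[b] for b in range(len(W))]

rng = np.random.default_rng(1)
B_total, d, K = 8, 16, 4
W = [rng.standard_normal((d, d)) for _ in range(B_total)]
adapters = {b: lora_delta(d, rank=4, rng=rng) for b in range(B_total)}

effect_blocks = set(range(B_total - K, B_total))    # last K effect-sensitive blocks
quality_blocks = set(range(B_total))                # all blocks

W_effect = expert_weights(W, adapters, effect_blocks)     # Effect Expert weights
W_quality = expert_weights(W, adapters, quality_blocks)   # Quality Expert weights
```

Because the frozen base weights are shared, switching experts only requires swapping which adapters are active, not reloading the network.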

3. Training Objectives and Inference Procedure

Both experts are trained concurrently on synthetic matting datasets, using three loss components:

  • Background inpainting loss: $L_\text{bg} = \mathbb{E}\,\|\hat{H} - H_\text{gt}\|_1$
  • Alpha-matting loss:

$$L_\alpha = \mathbb{E}\left[\, \|\hat{\alpha} - \alpha_\text{gt}\|_1 + \lambda_\text{perc}\, L_\text{perc}(\hat{\alpha}, \alpha_\text{gt}) \,\right]$$

  • Compositing consistency regularizer:

$$L_\text{comp} = \mathbb{E}\,\| V - (\hat{\alpha} \odot \hat{F} + (1 - \hat{\alpha}) \odot \hat{H}) \|_1$$

Total expert loss:

$$L_\text{total} = L_\text{bg} + \lambda_\alpha L_\alpha + \lambda_c L_\text{comp}$$

With hyperparameters $\lambda_\alpha = 1.0$, $\lambda_\text{perc} = 0.1$, $\lambda_c = 0.1$.
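A minimal sketch of the combined objective, assuming pixel-space tensors in $[0, 1]$ and omitting the perceptual term of $L_\alpha$; all helper names are hypothetical:

```python
import numpy as np

# Toy sketch of the three training losses above, assuming pixel-space tensors
# in [0, 1] and omitting the perceptual term of L_alpha. Names are hypothetical.

def composite(alpha, F, H):
    """Alpha-composite foreground F over background H."""
    return alpha * F + (1.0 - alpha) * H

def total_loss(V, alpha_hat, F_hat, H_hat, alpha_gt, H_gt, lam_alpha=1.0, lam_c=0.1):
    l_bg = np.mean(np.abs(H_hat - H_gt))                              # background inpainting
    l_alpha = np.mean(np.abs(alpha_hat - alpha_gt))                   # L1 matting term only
    l_comp = np.mean(np.abs(V - composite(alpha_hat, F_hat, H_hat)))  # compositing consistency
    return l_bg + lam_alpha * l_alpha + lam_c * l_comp

rng = np.random.default_rng(2)
alpha_gt = rng.random((8, 8, 1))
F, H = rng.random((8, 8, 3)), rng.random((8, 8, 3))
V = composite(alpha_gt, F, H)                          # observed frame, by construction
perfect = total_loss(V, alpha_gt, F, H, alpha_gt, H)   # zero for exact predictions
```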

The inference schedule employs a linear $\beta_t$ schedule across $T = 200$ timesteps. For $t = T, \dots, 1$, expert switching occurs at $T_\text{switch} = \lfloor \tau T \rfloor$ with $\tau = 0.5$:

$$\theta_t = \begin{cases} \theta^E, & t > T_\text{switch} \\ \theta^Q, & t \leq T_\text{switch} \end{cases}$$

Each denoising step invokes the corresponding expert, producing $\hat{\epsilon}_t$ and updating $x_{t-1}$ with negligible computational cost, since only adapters are swapped rather than the full network reloaded. This enables the Effect Expert to capture structure and effects in early, noisy steps, followed by the Quality Expert's detailed refinement (Hu et al., 26 Dec 2025).
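The expert-switched schedule can be sketched as follows (the function name is hypothetical). With $T = 200$ and $\tau = 0.5$, steps $t = 200, \dots, 101$ use the Effect Expert and $t = 100, \dots, 1$ use the Quality Expert:

```python
# Toy sketch of the expert-switched denoising schedule; the function name
# is hypothetical.

def expert_schedule(T=200, tau=0.5):
    t_switch = int(tau * T)  # T_switch = floor(tau * T)
    return [(t, "effect" if t > t_switch else "quality") for t in range(T, 0, -1)]

sched = expert_schedule()
```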

4. Experimental Results and Ablation Analysis

Comprehensive ablations validate the dual-expert model. Performance metrics include PSNR, SSIM, intersection-over-union for alpha mattes ($\text{IoU}_\alpha$), and Fréchet Video Distance (FVD).

Table: LoRA Block Selection and Rank (Synthetic Set)

Method        PSNR ↑   SSIM ↑   IoU_α ↑   FVD ↓
Quality-only  25.8     0.77     0.68      120
Effect-only   24.5     0.72     0.64      135
Dual-Expert   26.2     0.79     0.71      105

Varying τ (the expert-switch threshold) reveals optimal trade-offs on foreground/effect MSE.

Table: Effect of τ on MSE

τ     MSE_fg    MSE_eff
0.2   0.0045    0.0123
0.5   0.0040    0.0101
0.8   0.0043    0.0092

Full quantitative comparisons to SOTA methods (e.g., BGMv2+shadow, MatAnyone+SAM, Gen-Omnimatte) show that EasyOmnimatte achieves competitive PSNR (26.23) and SSIM (0.7883), low warp error (100.94), and the lowest FVD (105.48), with a total wall-clock inference time of 9 seconds (versus minutes for optimization-based pipelines such as Gen-Omnimatte).

Table: SOTA Comparison

Method         PSNR ↑   SSIM ↑   WE ↓     FVD ↓    Time (s)
BGMv2+shadow   26.61    0.7878   101.04   168.31   0.2/frame
MatAnyone+SAM  26.12    0.7868   100.46   146.44   0.3/frame
Gen-Omnimatte  24.35    0.6936   101.33   116.32   300
EasyOmnimatte  26.23    0.7883   100.94   105.48   9 (total)

5. Downstream Applications and Practical Use

EasyOmnimatte decomposition yields high-quality $\alpha$ mattes, separated foreground ($F$), and clean background ($B$), facilitating a variety of downstream tasks:

  • Recoloring: foreground hue adjustment followed by $\alpha$ compositing.
  • Relighting/shadows: manipulation of the effect region $E = F \odot \alpha$ (e.g., shading, blurring, scaling).
  • Object removal: the background $B$ serves as a hallucinated scene without the subject.
  • Background replacement: compositing $F$ over alternate backgrounds.
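Two of the edits above (background replacement and recoloring) can be sketched with simple alpha compositing, assuming pixel values in $[0, 1]$ and an $(H, W, 1)$ alpha matte; helper names are hypothetical:

```python
import numpy as np

# Toy sketch of background replacement and recoloring via alpha compositing.
# Assumes pixel values in [0, 1]; helper names are hypothetical.

def replace_background(F, alpha, new_bg):
    """Composite the extracted foreground over an alternate background."""
    return alpha * F + (1.0 - alpha) * new_bg

def recolor_foreground(F, gain):
    """Per-channel gain on the foreground before recompositing (toy recoloring)."""
    return np.clip(F * np.asarray(gain), 0.0, 1.0)

F = np.ones((4, 4, 3))                  # toy white foreground
bg = np.full((4, 4, 3), 0.5)            # toy gray replacement background
alpha = np.zeros((4, 4, 1))             # fully transparent matte
out = replace_background(F, alpha, bg)  # equals bg everywhere
tinted = recolor_foreground(F, (1.0, 0.5, 0.5))
```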

The method demonstrates robust performance overall but struggles with very thin hair under motion blur and with severe rolling-shutter artifacts, as late-stage DiT blocks lose boundary detail and effect sensitivity (Hu et al., 26 Dec 2025).

6. Implementation Details and Reproducibility

For implementation:

  • DiT pretraining: $T = 200$, linear $\beta$ schedule, as in Lee et al.
  • LoRA ranks: $r_E = 128$ (Effect Expert), $r_Q = 64$ (Quality Expert).
  • Effect Expert: last $K = 4$ blocks.
  • Expert switching: $\tau = 0.5$.
  • Training: AdamW ($\text{lr} = 1 \times 10^{-3}$), 8k steps, batch size 1, 2× H100 GPUs.
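For convenience, the reported hyperparameters can be collected into a single configuration dictionary. The key names below are invented and do not correspond to any released codebase:

```python
# Hypothetical config dict collecting the hyperparameters reported above;
# key names are invented and do not correspond to any released codebase.
CONFIG = {
    "timesteps": 200,         # T, linear beta schedule
    "lora_rank_effect": 128,  # r_E
    "lora_rank_quality": 64,  # r_Q
    "effect_blocks": 4,       # last K blocks for the Effect Expert
    "tau_switch": 0.5,        # expert-switch threshold
    "optimizer": "AdamW",
    "lr": 1e-3,
    "train_steps": 8000,
    "batch_size": 1,
    "gpus": "2x H100",
}
```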

A plausible implication is that the adapter-swapping architecture further generalizes to other diffusion-based layered decomposition problems where selective expert specialization is beneficial. EasyOmnimatte reframes omnimatte inference as a one-pass, expert-switched denoising task that is computationally efficient, high fidelity, and immediately applicable to diverse video editing workflows (Hu et al., 26 Dec 2025).
