
EasyOmnimatte: Efficient Video Decomposition

Updated 2 January 2026
  • EasyOmnimatte is a video layered decomposition framework that separates foreground, scene effects, and background in one streamlined process.
  • It utilizes a pretrained video inpainting diffusion transformer with LoRA finetuning and a dual-expert mechanism to improve effect extraction and matte quality.
  • Experimental results demonstrate competitive PSNR, SSIM, and FVD metrics while significantly reducing computational cost for various video editing applications.

EasyOmnimatte is an end-to-end video layered decomposition framework designed to extract foreground objects, their associated scene effects (such as shadows and reflections), and the background from videos in a single, unified pass. Unlike previous omnimatte approaches, which employ slow, multi-stage, or inference-time optimization pipelines, EasyOmnimatte leverages finetuned video inpainting diffusion models and a novel dual-expert mechanism to achieve high-quality omnimatte decomposition with reduced computational cost and minimal user effort (Hu et al., 26 Dec 2025).

1. Pretrained Diffusion Backbone and Data Representation

EasyOmnimatte is built upon a pretrained video inpainting diffusion transformer (DiT) backbone. The network operates on sequences of $N$ video frames, $V \in \mathbb{R}^{N \times H \times W \times 3}$, accompanied by coarse per-frame foreground masks $M \in \{0,1\}^{N \times H \times W}$ and optional text/background conditioning $c$. The DDPM framework forms the basis for generative modeling, with pixel- or latent-space formulations,

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

and denoising performed via stacked DiT blocks $L_1, \dots, L_B$ containing multi-head self-attention and feed-forward networks. Pretraining targets object and effect removal, optimizing

$$L_\text{pretrain} = \mathbb{E}_{x, M, c, t, \epsilon}\left[\, \|\epsilon - \epsilon_\theta(x_t, M, c, t)\|^2 \,\right]$$

to capture generative priors pertinent to object-effect separation (Hu et al., 26 Dec 2025).
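As a concrete illustration of the forward process and pretraining objective above, the following minimal NumPy sketch samples $x_t$ from the standard closed-form marginal of the linear-$\beta$ DDPM chain and evaluates the noise-prediction MSE. Function names and shapes are hypothetical, and the denoiser is stubbed with a zero predictor purely for illustration:

```python
import numpy as np

# Toy sketch of the DDPM forward process and the pretraining objective above.
# Function names and shapes are illustrative, not the paper's implementation.

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t from the closed-form marginal q(x_t | x_0) of the forward chain."""
    alpha_bar = np.cumprod(1.0 - betas)[t]          # cumulative product of (1 - beta_s)
    eps = rng.standard_normal(x0.shape)             # the noise the denoiser must predict
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

def pretrain_loss(eps_pred, eps):
    """MSE between predicted and true noise, mirroring L_pretrain."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 200)                # linear schedule, T = 200
x0 = rng.standard_normal((4, 8, 8, 3))              # toy clip: 4 frames of 8x8 RGB
x_t, eps = forward_diffuse(x0, t=100, betas=betas, rng=rng)
loss = pretrain_loss(np.zeros_like(eps), eps)       # zero predictor, for illustration
```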

2. LoRA Finetuning and Dual-Expert Architecture

Central to EasyOmnimatte is the application of Low-Rank Adaptation (LoRA) to the DiT blocks. LoRA introduces a trainable, rank-$r$ update $\Delta W_b = A_b B_b$ that supplements the frozen block weights $W_b$ with minimal overhead. The dual-expert strategy is motivated by the empirical observation that effect cues are encoded only in selected late-stage blocks, and that indiscriminate LoRA finetuning suppresses effect extraction.

  • Effect Expert ($\theta^E$): LoRA adapters are applied exclusively to the last $K$ effect-sensitive blocks ($B_\text{effect} = \{b \mid b > B - K\}$), focusing on coarse foreground and effect decomposition.
  • Quality Expert ($\theta^Q$): LoRA adapters are applied to all blocks ($B_\text{quality} = \{1, \dots, B\}$), refining alpha matte details.

Weight updates for each expert are:

$$W_b^{(E)} = \begin{cases} W_b + A_b^E B_b^E, & b \in B_\text{effect} \\ W_b, & \text{otherwise} \end{cases}$$

$$W_b^{(Q)} = W_b + A_b^Q B_b^Q, \qquad \forall b$$

This division enables selective adaptation during inference, yielding higher-quality layered decompositions than prior approaches (Hu et al., 26 Dec 2025).
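The block-selective weight updates can be sketched in a small NumPy toy. Dimensions, rank, and helper names here are invented for illustration, not the paper's implementation; the point is that the Effect Expert adds $A_b B_b$ only on the last $K$ blocks while the Quality Expert adapts all of them:

```python
import numpy as np

# Toy NumPy sketch of the dual-expert LoRA weight updates above. Dimensions,
# rank, and helper names are invented for illustration only.

def lora_delta(d, rank, rng):
    """Low-rank update Delta W = A @ B, with A in R^{d x r} and B in R^{r x d}."""
    A = 0.1 * rng.standard_normal((d, rank))
    B = 0.1 * rng.standard_normal((rank, d))
    return A @ B

def expert_weights(W, adapters, blocks):
    """Effective per-block weights: W_b + Delta W_b on the chosen block set, W_b elsewhere."""
    return [W[b] + adapters[b] if b in blocks else W[b] for b in range(len(W))]

rng = np.random.default_rng(1)
B_total, d, K = 8, 16, 4
W = [rng.standard_normal((d, d)) for _ in range(B_total)]
adapters = {b: lora_delta(d, rank=4, rng=rng) for b in range(B_total)}

effect_blocks = set(range(B_total - K, B_total))    # last K effect-sensitive blocks
quality_blocks = set(range(B_total))                # all blocks

W_effect = expert_weights(W, adapters, effect_blocks)     # Effect Expert weights
W_quality = expert_weights(W, adapters, quality_blocks)   # Quality Expert weights
```

Because the frozen base weights are shared, switching experts only requires swapping which adapters are active, not reloading the network.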

3. Training Objectives and Inference Procedure

Both experts are trained concurrently on synthetic matting datasets, using three loss components:

  • Background inpainting loss: $L_\text{bg} = \mathbb{E}\,\|\hat{H} - H_\text{gt}\|_1$
  • Alpha-matting loss:

$$L_\alpha = \mathbb{E}\left[\, \|\hat{\alpha} - \alpha_\text{gt}\|_1 + \lambda_\text{perc}\, L_\text{perc}(\hat{\alpha}, \alpha_\text{gt}) \,\right]$$

  • Compositing consistency regularizer:

$$L_\text{comp} = \mathbb{E}\,\| V - (\hat{\alpha} \odot \hat{F} + (1 - \hat{\alpha}) \odot \hat{H}) \|_1$$

Total expert loss:

$$L_\text{total} = L_\text{bg} + \lambda_\alpha L_\alpha + \lambda_c L_\text{comp}$$

With hyperparameters $\lambda_\alpha = 1.0$, $\lambda_\text{perc} = 0.1$, $\lambda_c = 0.1$.
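A minimal sketch of the combined objective, assuming pixel-space tensors in $[0, 1]$ and omitting the perceptual term of $L_\alpha$; all helper names are hypothetical:

```python
import numpy as np

# Toy sketch of the three training losses above, assuming pixel-space tensors
# in [0, 1] and omitting the perceptual term of L_alpha. Names are hypothetical.

def composite(alpha, F, H):
    """Alpha-composite foreground F over background H."""
    return alpha * F + (1.0 - alpha) * H

def total_loss(V, alpha_hat, F_hat, H_hat, alpha_gt, H_gt, lam_alpha=1.0, lam_c=0.1):
    l_bg = np.mean(np.abs(H_hat - H_gt))                              # background inpainting
    l_alpha = np.mean(np.abs(alpha_hat - alpha_gt))                   # L1 matting term only
    l_comp = np.mean(np.abs(V - composite(alpha_hat, F_hat, H_hat)))  # compositing consistency
    return l_bg + lam_alpha * l_alpha + lam_c * l_comp

rng = np.random.default_rng(2)
alpha_gt = rng.random((8, 8, 1))
F, H = rng.random((8, 8, 3)), rng.random((8, 8, 3))
V = composite(alpha_gt, F, H)                          # observed frame, by construction
perfect = total_loss(V, alpha_gt, F, H, alpha_gt, H)   # zero for exact predictions
```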

The inference schedule employs a linear $\beta_t$ schedule across $T = 200$ timesteps. For $t = T, \dots, 1$, expert switching occurs at $T_\text{switch} = \lfloor \tau T \rfloor$ with $\tau = 0.5$:

$$\theta_t = \begin{cases} \theta^E, & t > T_\text{switch} \\ \theta^Q, & t \leq T_\text{switch} \end{cases}$$

Each denoising step invokes the corresponding expert, producing $\hat{\epsilon}_t$ and updating $x_{t-1}$ with negligible computational cost, since only adapters are swapped rather than the full network reloaded. This enables the Effect Expert to capture structure and effects in early, noisy steps, followed by the Quality Expert's detailed refinement (Hu et al., 26 Dec 2025).
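The expert-switched schedule can be sketched as follows (the function name is hypothetical). With $T = 200$ and $\tau = 0.5$, steps $t = 200, \dots, 101$ use the Effect Expert and $t = 100, \dots, 1$ use the Quality Expert:

```python
# Toy sketch of the expert-switched denoising schedule; the function name
# is hypothetical.

def expert_schedule(T=200, tau=0.5):
    t_switch = int(tau * T)  # T_switch = floor(tau * T)
    return [(t, "effect" if t > t_switch else "quality") for t in range(T, 0, -1)]

sched = expert_schedule()
```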

4. Experimental Results and Ablation Analysis

Comprehensive ablations validate the dual-expert model. Performance metrics include PSNR, SSIM, intersection-over-union for alpha mattes ($\text{IoU}_\alpha$), and Fréchet Video Distance (FVD).

Table: LoRA Block Selection and Rank (Synthetic Set)

Method        PSNR ↑   SSIM ↑   IoU_α ↑   FVD ↓
Quality-only  25.8     0.77     0.68      120
Effect-only   24.5     0.72     0.64      135
Dual-Expert   26.2     0.79     0.71      105

Varying τ (the expert-switch threshold) reveals optimal trade-offs on foreground/effect MSE.

Table: Effect of τ on MSE

τ     MSE_fg    MSE_eff
0.2   0.0045    0.0123
0.5   0.0040    0.0101
0.8   0.0043    0.0092

Full quantitative comparisons to SOTA methods (e.g., BGMv2+shadow, MatAnyone+SAM, Gen-Omnimatte) show that EasyOmnimatte achieves competitive PSNR (26.23) and SSIM (0.7883), low warp error (100.94), and the lowest FVD (105.48), with a total wall-clock inference time of 9 seconds (versus minutes for optimization-based pipelines such as Gen-Omnimatte).

Table: SOTA Comparison

Method         PSNR ↑   SSIM ↑   WE ↓     FVD ↓    Time (s)
BGMv2+shadow   26.61    0.7878   101.04   168.31   0.2/frame
MatAnyone+SAM  26.12    0.7868   100.46   146.44   0.3/frame
Gen-Omnimatte  24.35    0.6936   101.33   116.32   300
EasyOmnimatte  26.23    0.7883   100.94   105.48   9 (total)

5. Downstream Applications and Practical Use

EasyOmnimatte decomposition yields high-quality $\alpha$ mattes, separated foreground ($F$), and clean background ($B$), facilitating a variety of downstream tasks:

  • Recoloring: foreground hue adjustment followed by $\alpha$ compositing.
  • Relighting/shadows: manipulation of the effect region $E = F \odot \alpha$ (e.g., shading, blurring, scaling).
  • Object removal: the background $B$ serves as a hallucinated scene without the subject.
  • Background replacement: compositing $F$ over alternate backgrounds.
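Two of the edits above (background replacement and recoloring) can be sketched with simple alpha compositing, assuming pixel values in $[0, 1]$ and an $(H, W, 1)$ alpha matte; helper names are hypothetical:

```python
import numpy as np

# Toy sketch of background replacement and recoloring via alpha compositing.
# Assumes pixel values in [0, 1]; helper names are hypothetical.

def replace_background(F, alpha, new_bg):
    """Composite the extracted foreground over an alternate background."""
    return alpha * F + (1.0 - alpha) * new_bg

def recolor_foreground(F, gain):
    """Per-channel gain on the foreground before recompositing (toy recoloring)."""
    return np.clip(F * np.asarray(gain), 0.0, 1.0)

F = np.ones((4, 4, 3))                  # toy white foreground
bg = np.full((4, 4, 3), 0.5)            # toy gray replacement background
alpha = np.zeros((4, 4, 1))             # fully transparent matte
out = replace_background(F, alpha, bg)  # equals bg everywhere
tinted = recolor_foreground(F, (1.0, 0.5, 0.5))
```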

The method demonstrates robust performance overall but struggles with very thin hair under motion blur and with severe rolling-shutter artifacts, as late-stage DiT blocks lose boundary detail and effect sensitivity (Hu et al., 26 Dec 2025).

6. Implementation Details and Reproducibility

For implementation:

  • DiT pretraining: $T = 200$, linear $\beta$ schedule, as in Lee et al.
  • LoRA ranks: $r_E = 128$ (Effect Expert), $r_Q = 64$ (Quality Expert).
  • Effect Expert: last $K = 4$ blocks.
  • Expert switching: $\tau = 0.5$.
  • Training: AdamW ($\text{lr} = 1 \times 10^{-3}$), 8k steps, batch size 1, 2× H100 GPUs.
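For convenience, the reported hyperparameters can be collected into a single configuration dictionary. The key names below are invented and do not correspond to any released codebase:

```python
# Hypothetical config dict collecting the hyperparameters reported above;
# key names are invented and do not correspond to any released codebase.
CONFIG = {
    "timesteps": 200,         # T, linear beta schedule
    "lora_rank_effect": 128,  # r_E
    "lora_rank_quality": 64,  # r_Q
    "effect_blocks": 4,       # last K blocks for the Effect Expert
    "tau_switch": 0.5,        # expert-switch threshold
    "optimizer": "AdamW",
    "lr": 1e-3,
    "train_steps": 8000,
    "batch_size": 1,
    "gpus": "2x H100",
}
```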

A plausible implication is that the adapter-swapping architecture further generalizes to other diffusion-based layered decomposition problems where selective expert specialization is beneficial. EasyOmnimatte reframes omnimatte inference as a one-pass, expert-switched denoising task that is computationally efficient, high fidelity, and immediately applicable to diverse video editing workflows (Hu et al., 26 Dec 2025).
