EasyOmnimatte: Efficient Video Decomposition
- EasyOmnimatte is a video layered decomposition framework that separates foreground, scene effects, and background in one streamlined process.
- It utilizes a pretrained video inpainting diffusion transformer with LoRA finetuning and a dual-expert mechanism to improve effect extraction and matte quality.
- Experimental results demonstrate competitive PSNR, SSIM, and FVD metrics while significantly reducing computational cost for various video editing applications.
EasyOmnimatte is an end-to-end video layered decomposition framework designed to extract foreground objects, their associated scene effects (such as shadows and reflections), and the background from videos in a single, unified pass. Unlike previous omnimatte approaches, which employ slow, multi-stage, or inference-time optimization pipelines, EasyOmnimatte leverages finetuned video inpainting diffusion models and a novel dual-expert mechanism to achieve high-quality omnimatte decomposition with reduced computational cost and minimal user effort (Hu et al., 26 Dec 2025).
1. Pretrained Diffusion Backbone and Data Representation
EasyOmnimatte is built upon a pretrained video inpainting diffusion transformer (DiT) backbone. The network operates on sequences of video frames $x_{1:N}$, accompanied by coarse per-frame foreground masks $m_{1:N}$ and optional text/background conditioning $c$. The DDPM framework forms the basis for generative modeling, with pixel- or latent-space formulations of the standard forward noising process,
$$q(z_t \mid z_0) = \mathcal{N}\!\left(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right),$$
and denoising performed via stacked DiT blocks containing multi-head self-attention and feed-forward networks. Pretraining targets object and effect removal, optimizing the noise-prediction objective
$$\mathcal{L}_{\text{pre}} = \mathbb{E}_{z_0,\,\epsilon,\,t}\left[\left\lVert \epsilon - \epsilon_\theta(z_t, t, m, c) \right\rVert_2^2\right]$$
to capture generative priors pertinent to object-effect separation (Hu et al., 26 Dec 2025).
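To make the pretraining objective concrete, below is a minimal sketch of the ε-prediction loss under a linear noise schedule, assuming PyTorch. The `dit` call signature, tensor shapes, and schedule endpoints are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def linear_beta_schedule(T: int, beta_start: float = 1e-4, beta_end: float = 2e-2):
    # Linear noise schedule; the endpoints are common defaults, not values from the paper.
    return torch.linspace(beta_start, beta_end, T)

def pretraining_loss(dit, x0, masks, cond, T: int = 1000):
    """x0: (B, C, F, H, W) clean video latents; masks: coarse per-frame foreground masks; cond: text/background embedding."""
    betas = linear_beta_schedule(T).to(x0.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)                # cumulative \bar{alpha}_t
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)    # per-sample timestep
    a = alpha_bar[t].view(-1, 1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps                 # forward noising q(x_t | x_0)
    eps_hat = dit(x_t, t, masks, cond)                           # DiT predicts the added noise
    return F.mse_loss(eps_hat, eps)                              # E || eps - eps_theta ||^2
```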
2. LoRA Finetuning and Dual-Expert Architecture
Central to EasyOmnimatte is the application of Low-Rank Adaptation (LoRA) to the DiT blocks. LoRA introduces a trainable rank-$r$ update that supplements the frozen block weights with minimal overhead. The dual-expert strategy is motivated by the empirical observation that effect cues are encoded only in selected late-stage blocks, and that indiscriminate LoRA finetuning suppresses effect extraction.
- Effect Expert: LoRA adapters are applied exclusively to the last $K$ effect-sensitive blocks, focusing on coarse foreground and effect decomposition.
- Quality Expert: LoRA adapters are applied to all DiT blocks, refining alpha-matte details.
Weight updates for each expert take the standard LoRA form, maintained separately per expert:
$$W'_{\text{eff}} = W + B_{\text{eff}} A_{\text{eff}}, \qquad W'_{\text{qual}} = W + B_{\text{qual}} A_{\text{qual}},$$
with the effect-expert update applied only to the last $K$ blocks.
This division enables selective adaptation during inference, yielding higher-quality layered decompositions than prior approaches (Hu et al., 26 Dec 2025).
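A minimal sketch of this dual-expert LoRA wiring is given below, assuming PyTorch and a per-block linear projection named `proj`; the ranks, the number of effect-sensitive blocks `k_effect`, and the attribute names are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base Linear plus switchable low-rank updates, one per expert."""
    def __init__(self, base: nn.Linear, rank: int, experts=("effect", "quality")):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.adapters = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(base.in_features, rank, bias=False),
                                nn.Linear(rank, base.out_features, bias=False))
            for name in experts
        })
        for adapter in self.adapters.values():
            nn.init.zeros_(adapter[1].weight)          # start as an identity update
        self.active = "quality"

    def forward(self, x):
        out = self.base(x)
        if self.active in self.adapters:               # blocks without an effect adapter fall back to W x
            out = out + self.adapters[self.active](x)  # W x + B_e A_e x
        return out

def attach_dual_experts(blocks, k_effect: int, rank: int):
    """Quality adapters on every block; effect adapters only on the last k_effect blocks."""
    for i, blk in enumerate(blocks):
        experts = ("effect", "quality") if i >= len(blocks) - k_effect else ("quality",)
        blk.proj = LoRALinear(blk.proj, rank, experts)

def set_active_expert(blocks, name: str):
    """Expert switching at inference amounts to toggling which adapters are added."""
    for blk in blocks:
        blk.proj.active = name
```

Because switching experts only flips which adapters are added, it carries negligible cost compared with reloading full network weights.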
3. Training Objectives and Inference Procedure
Both experts are trained concurrently on synthetic matting data using three loss components:
- Background inpainting loss $\mathcal{L}_{\text{bg}}$: penalizes deviation of the predicted clean background $\hat{B}$ from the ground-truth plate.
- Alpha-matting loss $\mathcal{L}_{\alpha}$: supervises the predicted alpha matte $\hat{\alpha}$ against the synthetic ground truth.
- Compositing consistency regularizer $\mathcal{L}_{\text{comp}}$: enforces that the recomposited frame $\hat{\alpha}\hat{F} + (1-\hat{\alpha})\hat{B}$ reconstructs the input video.
The total expert loss is a weighted sum,
$$\mathcal{L} = \lambda_{\text{bg}}\,\mathcal{L}_{\text{bg}} + \lambda_{\alpha}\,\mathcal{L}_{\alpha} + \lambda_{\text{comp}}\,\mathcal{L}_{\text{comp}},$$
with the weights $\lambda_{\text{bg}}$, $\lambda_{\alpha}$, and $\lambda_{\text{comp}}$ balancing inpainting fidelity, matte accuracy, and compositing consistency.
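The sketch below illustrates one plausible instantiation of the three loss terms, assuming simple L1 penalties and symbolic weights `w_bg`, `w_alpha`, `w_comp`; the paper's exact norms and weight values are not reproduced here.

```python
import torch.nn.functional as F

def omnimatte_loss(pred_bg, gt_bg, pred_alpha, gt_alpha, pred_fg, frames,
                   w_bg=1.0, w_alpha=1.0, w_comp=1.0):
    l_bg = F.l1_loss(pred_bg, gt_bg)                                  # background inpainting
    l_alpha = F.l1_loss(pred_alpha, gt_alpha)                         # alpha-matting supervision
    recomposed = pred_alpha * pred_fg + (1.0 - pred_alpha) * pred_bg  # alpha-composite the layers
    l_comp = F.l1_loss(recomposed, frames)                            # compositing consistency
    return w_bg * l_bg + w_alpha * l_alpha + w_comp * l_comp
```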
The inference sampler employs a linear schedule across the denoising timesteps, with expert switching governed by a threshold $\tau$ on the schedule: the Effect Expert handles the early, high-noise steps and the Quality Expert the remainder.
Each denoising step invokes the corresponding expert, producing the noise prediction and updating the latent at negligible extra cost, since only the LoRA adapters are swapped rather than the full network being reloaded. This lets the Effect Expert capture structure and effects in the early, noisy steps, followed by the Quality Expert's detailed refinement (Hu et al., 26 Dec 2025).
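The expert-switched sampling loop can be sketched as follows. The `scheduler`/`model` interfaces follow a diffusers-style convention, `set_active_expert` is a hypothetical hook mirroring the adapter toggle in the earlier sketch, and interpreting $\tau$ as the fraction of steps assigned to the Effect Expert is an assumption.

```python
import torch

@torch.no_grad()
def expert_switched_sampling(model, scheduler, latents, masks, cond, tau=0.5):
    """One-pass sampling: Effect Expert on early noisy steps, Quality Expert afterwards."""
    timesteps = scheduler.timesteps                    # descending noise levels
    switch_at = int(tau * len(timesteps))              # assumed meaning of tau
    for i, t in enumerate(timesteps):
        expert = "effect" if i < switch_at else "quality"
        model.set_active_expert(expert)                # LoRA adapter swap, no weight reload
        noise_pred = model(latents, t, masks, cond)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```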
4. Experimental Results and Ablation Analysis
Comprehensive ablations validate the dual-expert design. Performance metrics include PSNR, SSIM, intersection-over-union (IoU) of the alpha mattes, and Fréchet Video Distance (FVD).
Table: LoRA Block Selection and Rank (Synthetic Set)
| Method | PSNR ↑ | SSIM ↑ | IoU α ↑ | FVD ↓ |
|---|---|---|---|---|
| Quality-only | 25.8 | 0.77 | 0.68 | 120 |
| Effect-only | 24.5 | 0.72 | 0.64 | 135 |
| Dual-Expert | 26.2 | 0.79 | 0.71 | 105 |
Varying the expert-switch threshold τ reveals a trade-off between foreground and effect MSE.
Table: Effect of the switch threshold τ on MSE
| τ | MSE_fg | MSE_eff |
|---|---|---|
| 0.2 | 0.0045 | 0.0123 |
| 0.5 | 0.0040 | 0.0101 |
| 0.8 | 0.0043 | 0.0092 |
Full quantitative comparisons to SOTA methods (e.g., BGMv2+shadow, MatAnyone+SAM, Gen-Omnimatte) show that EasyOmnimatte achieves competitive PSNR (26.23), the highest SSIM (0.7883), comparable warp error (100.94), and the lowest FVD (105.48), with a wall-clock inference time of 9 seconds in total (versus minutes per frame for earlier optimization-based omnimatte pipelines).
Table: SOTA Comparison
| Method | PSNR ↑ | SSIM ↑ | Warp error ↓ | FVD ↓ | Time ↓ |
|---|---|---|---|---|---|
| BGMv2+shadow | 26.61 | 0.7878 | 101.04 | 168.31 | 0.2 s/frame |
| MatAnyone+SAM | 26.12 | 0.7868 | 100.46 | 146.44 | 0.3 s/frame |
| Gen-Omnimatte | 24.35 | 0.6936 | 101.33 | 116.32 | 300 s |
| EasyOmnimatte | 26.23 | 0.7883 | 100.94 | 105.48 | 9 s (total) |
5. Downstream Applications and Practical Use
EasyOmnimatte decomposition yields high-quality alpha mattes, a separated foreground layer, and a clean background plate, facilitating a variety of downstream tasks (see the compositing sketch after this list):
- Recoloring: Foreground hue adjustment followed by compositing.
- Relighting/Shadows: Manipulate effect region (e.g., shading, blurring, scaling).
- Object removal: Background serves as hallucinated scene without the subject.
- Background replacement: Compositing over alternate backgrounds.
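As referenced above, each of these edits reduces to recombining the predicted layers via the standard alpha-compositing identity; `recolor` and `new_bg` below are user-supplied and purely illustrative.

```python
def composite(alpha, fg, bg):
    """Standard alpha compositing: frame = alpha * F + (1 - alpha) * B."""
    return alpha * fg + (1.0 - alpha) * bg

def replace_background(alpha, fg, new_bg):
    return composite(alpha, fg, new_bg)           # background replacement

def recolor_foreground(alpha, fg, bg, recolor):
    return composite(alpha, recolor(fg), bg)      # e.g. hue shift applied before compositing
```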
The method demonstrates robust performance overall but struggles with very thin hair under motion blur and with severe rolling-shutter artifacts, as the late-stage DiT blocks lose boundary detail and effect sensitivity (Hu et al., 26 Dec 2025).
6. Implementation Details and Reproducibility
For implementation:
- DiT pretraining: linear noise schedule, as in Lee et al.
- LoRA ranks: set separately for the Effect Expert and the Quality Expert.
- Effect Expert: LoRA applied only to the last $K$ effect-sensitive blocks.
- Expert switching: fixed threshold $\tau$ on the linear denoising schedule.
- Training: AdamW optimizer, 8k steps, batch size 1, on 2× H100 GPUs (a minimal setup sketch follows this list).
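A minimal sketch of the finetuning setup implied by the bullets above: only the LoRA adapter parameters are optimized with AdamW for 8k steps at batch size 1. The learning rate and the parameter-name filter are assumptions, not values from the paper.

```python
import torch

def build_lora_optimizer(model, lr=1e-4):
    """AdamW over LoRA adapter parameters only; the frozen DiT backbone is excluded."""
    lora_params = [p for n, p in model.named_parameters()
                   if "adapters" in n and p.requires_grad]
    return torch.optim.AdamW(lora_params, lr=lr)   # lr is a placeholder value

# Training-loop skeleton: 8k steps at batch size 1, per the reported setup.
# for step, batch in zip(range(8000), loader):
#     loss = compute_losses(model, batch)          # e.g. the loss sketch in Section 3
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```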
A plausible implication is that the adapter-swapping architecture further generalizes to other diffusion-based layered decomposition problems where selective expert specialization is beneficial. EasyOmnimatte reframes omnimatte inference as a one-pass, expert-switched denoising task that is computationally efficient, high fidelity, and immediately applicable to diverse video editing workflows (Hu et al., 26 Dec 2025).