VLD-MMDiT: Variable Layers Decomposition
- The paper presents VLD-MMDiT, which decomposes an RGB image into a variable number of semantically disentangled RGBA layers using a flow-matching diffusion process.
- Its architecture integrates multi-modal attention and a novel 3D rotary positional encoding to achieve efficient, end-to-end layer decomposition without recursive peeling.
- Empirical results demonstrate state-of-the-art performance, with significant improvements in both quantitative metrics (e.g., soft-IoU) and qualitative layer separability.
Variable Layers Decomposition MMDiT (VLD-MMDiT) is a key architectural component within Qwen-Image-Layered, a diffusion-based image generation system designed for decomposing a single RGB image into a stack of multiple, semantically disentangled RGBA layers. Unlike traditional raster representations, which entangle all visual content into a single canvas, VLD-MMDiT leverages a conditional diffusion process with a shared RGBA-VAE latent space, providing variable-length layer decomposition in an efficient, end-to-end manner. This architecture supports inherent editability, with each output layer independently manipulable, and achieves state-of-the-art results in quantitative and qualitative image decomposition metrics (Yin et al., 17 Dec 2025).
1. Objectives and Problem Formulation
VLD-MMDiT addresses the task of converting an input image $x \in \mathbb{R}^{H \times W \times 3}$ into a set of RGBA layers $\{L_i\}_{i=1}^{N}$, where each $L_i \in \mathbb{R}^{H \times W \times 4}$. The decomposition goal is semantic disentanglement: edits to one layer must not compromise the integrity of the others. The core challenge is modeling this as a single-shot process that supports a variable number of layers $N$, without relying on recursive foreground–background peeling or error-prone stepwise methods.
The flow-matching diffusion approach adopted in VLD-MMDiT operates in the unified latent space of an RGBA-VAE with encoder $\mathcal{E}$ and decoder $\mathcal{D}$. The architecture conditions on both an image latent $c_I = \mathcal{E}(x)$ and an optional text descriptor $c_T$; these inputs guide the denoising trajectory from isotropic Gaussian noise towards the true layer latents $z_1 = \mathcal{E}(L_{1:N})$.
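To make the target representation concrete, the following minimal sketch (NumPy; names are illustrative, not from the paper) shows the back-to-front alpha-over compositing that a faithful decomposition must invert:

```python
import numpy as np

def composite_layers(layers: list[np.ndarray]) -> np.ndarray:
    """Recompose a variable-length stack of RGBA layers (back to front)
    into a single RGB image via the standard alpha-over operator.

    Each layer has shape (H, W, 4) with channel values in [0, 1].
    """
    H, W, _ = layers[0].shape
    out = np.zeros((H, W, 3))            # start from an empty canvas
    for layer in layers:                 # layers[0] is the backmost layer
        rgb, a = layer[..., :3], layer[..., 3:4]
        out = a * rgb + (1.0 - a) * out  # alpha-over: place layer on top
    return out

# A faithful decomposition of x satisfies composite_layers(L_hat) ≈ x.
```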
2. Architecture and Design Modifications
The VLD-MMDiT extends the standard Multi-Modal Diffusion Transformer (MMDiT) by incorporating innovations tailored for multi-layer decomposition:
A. Multi-Modal Attention over Layer Stacks:
The input image latent $c_I$ and the current noisy layer latents $z_t$ are patchified by a factor of 2× along each spatial dimension, converting them into tractable token sequences. These, along with projected text-prompt features ($c_T$), are concatenated into a multimodal sequence that is processed by a single MultiHeadAttention block:

$$\mathrm{MMAttn}\big([c_T;\, c_I;\, z_t]\big) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$
This mechanism jointly fuses intra-layer, inter-layer, and cross-modal interactions, supporting dense conditioning and semantic alignment across layers.
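A minimal PyTorch sketch of this fused pass, using a plain nn.MultiheadAttention as a stand-in for the full MMDiT block (dimensions and names are assumptions):

```python
import torch
import torch.nn as nn

def patchify(z: torch.Tensor, p: int = 2) -> torch.Tensor:
    """(B, C, H, W) -> (B, H//p * W//p, C*p*p): 2x2 patches become tokens."""
    B, C, H, W = z.shape
    z = z.unfold(2, p, p).unfold(3, p, p)  # (B, C, H/p, W/p, p, p)
    return z.permute(0, 2, 3, 1, 4, 5).reshape(B, (H // p) * (W // p), C * p * p)

class MultiModalAttention(nn.Module):
    """Joint self-attention over [text ; image latent ; N noisy layer latents]."""
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, c_T, c_I_tokens, layer_tokens):
        # One fused sequence, so intra-layer, inter-layer and cross-modal
        # interactions are all handled in a single attention pass.
        seq = torch.cat([c_T, c_I_tokens, layer_tokens], dim=1)
        out, _ = self.attn(seq, seq, seq)
        return out
```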
B. Layer3D Rotary Positional Encoding (RoPE):
A third positional axis $\ell$ is introduced, supplementing the standard spatial encodings, where $\ell = 0$ corresponds to the conditioning image latent and $\ell = 1, \dots, N$ index each output layer. The 3D RoPE equips the attention module with the capability to attend dynamically over arbitrary layer counts, enabling sequence-length-agnostic processing and robust generalization to variable $N$.
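A sketch of one plausible construction (PyTorch; the split of the head dimension across the three axes is an assumption, not taken from the paper):

```python
import torch

def rope_angles(pos: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotary angles for one positional axis; dim must be even."""
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return pos[:, None].float() * freqs[None, :]              # (T, dim/2)

def layer3d_rope(h, w, layer, head_dim: int = 64):
    """Concatenate per-axis rotary angles for (height, width, layer).

    The head dimension is split into three blocks, one per axis; layer
    index 0 is reserved for the conditioning image latent and 1..N for
    the output layers, so attention generalizes to any layer count N.
    """
    d = head_dim // 3 // 2 * 2                                # even sub-dim per axis
    ang = torch.cat([rope_angles(h, d), rope_angles(w, d),
                     rope_angles(layer, head_dim - 2 * d)], dim=-1)
    return torch.cos(ang), torch.sin(ang)                     # applied to Q and K

def apply_rope(x, cos, sin):
    """Rotate even/odd channel pairs of x by the precomputed angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(-2)
```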
3. Mathematical Foundations
The layer decomposition process in VLD-MMDiT utilizes the flow-matching paradigm, parametrized as follows:
- Latent Initialization: $z_1 = \mathcal{E}(L_{1:N})$ (ground-truth layer latents)
- Noise Sampling: $z_0 \sim \mathcal{N}(0, I)$, $t \sim \mathcal{U}[0, 1]$
- Time-$t$ Interpolation and Velocity: $z_t = (1 - t)\,z_0 + t\,z_1$, with target velocity $v = z_1 - z_0$
- Prediction: The network predicts $\hat{v} = v_\theta(z_t, t, c_I, c_T)$ given $(z_t, t, c_I, c_T)$.
Loss Function: $\mathcal{L} = \mathbb{E}_{z_0,\, t}\big[\lVert v_\theta(z_t, t, c_I, c_T) - (z_1 - z_0) \rVert_2^2\big]$
No adversarial or perceptual losses are used within VLD-MMDiT; the training signal is exclusively driven by flow-matching reconstruction.
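A minimal training-step sketch of this objective (PyTorch; `model`, `vae_encode`, the shapes, and the uniform time distribution are assumptions):

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, vae_encode, x, layers, c_T):
    """One VLD-MMDiT training step under the flow-matching objective.

    x:      (B, 3, H, W) input RGB image
    layers: (B, N, 4, H, W) ground-truth RGBA layer stack
    c_T:    text-prompt features
    """
    c_I = vae_encode(x)                                   # condition latent
    z1 = torch.stack([vae_encode(l) for l in layers.unbind(1)], dim=1)
    z0 = torch.randn_like(z1)                             # z0 ~ N(0, I)
    t = torch.rand(z1.shape[0], device=z1.device)         # t ~ U[0, 1]
    tb = t.view(-1, *([1] * (z1.dim() - 1)))              # broadcastable t
    z_t = (1 - tb) * z0 + tb * z1                         # straight-line interpolation
    v_target = z1 - z0                                    # constant target velocity
    v_pred = model(z_t, t, c_I, c_T)
    return F.mse_loss(v_pred, v_target)                   # pure reconstruction signal
```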
Attention Update Example:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{\big(R_{(h,w,\ell)} Q\big)\big(R_{(h,w,\ell)} K\big)^{\top}}{\sqrt{d}}\right) V$$

Here, $Q$ and $K$ carry the 3D RoPE embeddings $R_{(h,w,\ell)}$ for positions $(h, w, \ell)$.
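Reusing the hypothetical layer3d_rope and apply_rope helpers from the Section 2 sketch, queries and keys are rotated by their $(h, w, \ell)$ positions before the attention product:

```python
import torch

# Hypothetical toy shapes: T tokens, 16 heads of dimension 64 (cf. Section 6).
T, heads, hd = 6, 16, 64
q, k = torch.randn(T, heads, hd), torch.randn(T, heads, hd)

# Per-token positions: layer 0 is the conditioning image, 1..N the outputs.
h_pos = torch.tensor([0, 1, 0, 1, 0, 1])
w_pos = torch.tensor([0, 0, 1, 1, 0, 0])
l_pos = torch.tensor([0, 0, 1, 1, 2, 2])

cos, sin = layer3d_rope(h_pos, w_pos, l_pos, head_dim=hd)
q = apply_rope(q, cos[:, None, :], sin[:, None, :])   # rotate each head
k = apply_rope(k, cos[:, None, :], sin[:, None, :])
scores = q.transpose(0, 1) @ k.transpose(0, 1).transpose(-1, -2) / hd ** 0.5
attn = torch.softmax(scores, dim=-1)                  # (heads, T, T)
```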
4. Integration with RGBA-VAE
The RGBA-VAE encoder–decoder pair $(\mathcal{E}, \mathcal{D})$ defines the latent space for both input conditioning and output prediction:
- The encoder $\mathcal{E}$ (first convolution expanded from 3 to 4 input channels) processes both RGB and RGBA images, ensuring a unified latent embedding.
- The decoder $\mathcal{D}$ (last convolution expanded analogously to 4 output channels) reconstructs RGBA layers from their latent representations.
- The expanded convolutions are initialized from the pretrained RGB weights, with the newly added alpha-channel parameters zero-initialized so that RGB behavior is preserved at the start of fine-tuning.
- The input image is encoded as $c_I = \mathcal{E}(x)$, and the output stack as $z_1 = [\mathcal{E}(L_1), \dots, \mathcal{E}(L_N)]$.
- All operations within VLD-MMDiT are performed in this shared manifold, simplifying conditioning and reconstruction.
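A sketch of the channel-expansion step described above (PyTorch; the zero-initialization of the new alpha-channel weights is an assumption, chosen to preserve the pretrained RGB behavior):

```python
import torch
import torch.nn as nn

def expand_in_channels(conv: nn.Conv2d, new_in: int = 4) -> nn.Conv2d:
    """Expand a pretrained first conv from 3 to 4 input channels (RGB -> RGBA).

    Pretrained RGB weights are copied; the added alpha-channel weights are
    zero-initialized so the expanded VAE initially reproduces the RGB model.
    """
    new = nn.Conv2d(new_in, conv.out_channels, conv.kernel_size,
                    conv.stride, conv.padding, bias=conv.bias is not None)
    with torch.no_grad():
        new.weight.zero_()
        new.weight[:, :conv.in_channels] = conv.weight  # keep RGB behavior
        if conv.bias is not None:
            new.bias.copy_(conv.bias)
    return new

# The decoder's last convolution is expanded analogously along its
# output-channel dimension to emit RGBA instead of RGB.
```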
5. Multi-Stage Training Framework and Data Pipeline
The training of VLD-MMDiT involves three distinct phases:
| Stage | Objective | Data Source |
|---|---|---|
| 1 | Text-to-RGB and Text-to-RGBA generation | Large text–image corpus (no layers) |
| 2 | Text-to-Multi-RGBA (T2L): composite and multi-layer outputs (variable $N$) | PSD-extracted multilayer sets + generated captions |
| 3 | Image-to-Multi-RGBA (I2L): layer decomposition conditioned on the input image $x$ | PSD-extracted multilayer sets |
The PSD extraction pipeline involves parsing raw Photoshop documents, discarding irrelevant or low-quality layers, merging non-overlapping layers to manage the layer count $N$, and generating semantic captions using Qwen2.5-VL.
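An illustrative extraction pass using the third-party psd-tools package (the visibility and area filters below stand in for the paper's unspecified quality heuristics; merging and captioning are separate steps):

```python
from psd_tools import PSDImage  # third-party: pip install psd-tools

def extract_layers(psd_path: str, min_area: float = 0.001):
    """Parse a PSD file into candidate RGBA layers, dropping hidden/tiny ones."""
    psd = PSDImage.open(psd_path)
    canvas_area = psd.width * psd.height
    layers = []
    for layer in psd.descendants():
        if layer.is_group() or not layer.is_visible():
            continue
        img = layer.composite()                  # render the layer to a PIL image
        if img is None:
            continue
        w, h = img.size
        if (w * h) / canvas_area < min_area:     # discard negligible layers
            continue
        layers.append(img.convert("RGBA"))
    return layers
```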
6. Training and Inference Details
The core protocol for VLD-MMDiT training includes:
- Batch sampling from PSD-derived datasets of triplets $(x, L_{1:N}, c_T)$.
- Encoding $x$ and $L_{1:N}$ into $c_I$ and $z_1$, respectively, via the RGBA-VAE encoder $\mathcal{E}$.
- Noise and time parameter sampling ($z_0 \sim \mathcal{N}(0, I)$, $t \sim \mathcal{U}[0, 1]$), followed by layer-latent interpolation.
- Forwarding through VLD-MMDiT (with $z_t$, $c_I$, $c_T$ as inputs).
- Computing and applying the flow-matching loss.
- Optimizer: Adam.
- Batch size: 128.
- Max layer count: $N_{\max} = 20$.
- Training steps: 500 K for Stage 1, 400 K for Stage 2 and Stage 3.
- RGBA-VAE latent dimensions: unchanged from the underlying pretrained VAE (the 3→4 channel expansion affects only the first and last convolutions).
- Patchification: 2× per spatial dimension.
- Attention heads: 16, with head dimension 64.
- Layer3D RoPE: rotary embeddings over the $(h, w, \ell)$ axes, with $\ell = 0$ reserved for the conditioning image latent.
Sampling (Inference) Algorithm:
- Encode the input image $x$ to $c_I = \mathcal{E}(x)$
- Initialize $z \sim \mathcal{N}(0, I)$
- Iteratively update from $t = 0$ to $t = 1$:
- Predict velocity $\hat{v} = v_\theta(z_t, t, c_I, c_T)$
- Update the latent using the rectified-flow integrator: $z_{t+\Delta t} = z_t + \Delta t\,\hat{v}$
- Decode the final layer latents via $\mathcal{D}$ to obtain $\hat{L}_{1:N}$
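A compact sketch of this sampler as a plain Euler integrator (PyTorch; the step count, shapes, and function names are assumptions):

```python
import torch

@torch.no_grad()
def sample_layers(model, vae_encode, vae_decode, x, c_T, N, steps: int = 50):
    """Decompose image x into N RGBA layers by integrating the learned flow.

    Euler integration of dz/dt = v_theta from t=0 (noise) to t=1 (data).
    """
    c_I = vae_encode(x)                                      # condition on the input
    z = torch.randn(1, N, *c_I.shape[1:], device=x.device)   # z ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt, device=x.device)
        v = model(z, t, c_I, c_T)                            # predicted velocity
        z = z + dt * v                                       # Euler step toward data
    return [vae_decode(z[:, n]) for n in range(N)]           # N decoded RGBA layers
```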
7. Empirical Performance and Ablation
VLD-MMDiT demonstrates superior quantitative performance on the Crello dataset relative to leading baselines, as shown:
| Method | RGB error ($\alpha$-weighted) ↓ | soft-IoU ↑ |
|---|---|---|
| LayerD | 0.0709 | 0.7520 |
| Qwen-Image-Layered-I2L | 0.0594 | 0.8705 |
Ablation studies confirm the necessity of Layer3D RoPE, RGBA-VAE, and multi-stage training:
- Removing Layer3D RoPE causes a failure to distinguish layers, with RGB error rising sharply.
- Omitting the RGBA-VAE leaves a persistent gap in both RGB error and soft-IoU.
- Skipping multi-stage training yields suboptimal adaptation on both metrics.
Qualitative gains include elimination of recursive peeling errors, improved content separation, and efficient, robust support for decompositions ranging from one to twenty layers. The architecture is end-to-end trainable and computationally efficient, establishing a new paradigm for image editing via inherent layerwise disentanglement (Yin et al., 17 Dec 2025).