VLD-MMDiT: Variable Layers Decomposition
- The paper presents VLD-MMDiT, which decomposes an RGB image into a variable number of semantically disentangled RGBA layers using a flow-matching diffusion process.
- Its architecture integrates multi-modal attention and a novel 3D rotary positional encoding to achieve efficient, end-to-end layer decomposition without recursive peeling.
- Empirical results demonstrate state-of-the-art performance, with significant improvements in both quantitative metrics (e.g., soft-IoU) and qualitative layer separability.
Variable Layers Decomposition MMDiT (VLD-MMDiT) is a key architectural component within Qwen-Image-Layered, a diffusion-based image generation system designed for decomposing a single RGB image into a stack of multiple, semantically disentangled RGBA layers. Unlike traditional raster representations, which entangle all visual content into a single canvas, VLD-MMDiT leverages a conditional diffusion process with a shared RGBA-VAE latent space, providing variable-length layer decomposition in an efficient, end-to-end manner. This architecture supports inherent editability, with each output layer independently manipulable, and achieves state-of-the-art results in quantitative and qualitative image decomposition metrics (Yin et al., 17 Dec 2025).
1. Objectives and Problem Formulation
VLD-MMDiT addresses the task of converting an input image $x \in \mathbb{R}^{H \times W \times 3}$ into a set of RGBA layers $\{L_i\}_{i=1}^{N}$, where each $L_i \in \mathbb{R}^{H \times W \times 4}$. The decomposition goal is semantic disentanglement: edits to one layer must not compromise the integrity of the others. The core challenge is modeling this as a single-shot process that supports a variable number of layers $N$, without relying on recursive foreground–background peeling or error-prone stepwise methods.
The flow-matching diffusion approach adopted in VLD-MMDiT operates in the unified latent space of an RGBA-VAE with encoder $\mathcal{E}$ and decoder $\mathcal{D}$. The architecture conditions on both an image latent $c_I = \mathcal{E}(x)$ and an optional text descriptor $c_T$; these inputs guide the denoising trajectory from isotropic Gaussian noise towards the true layer latents $z_1 = \mathcal{E}(L_{1:N})$.
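To make the target representation concrete, the following minimal sketch (NumPy; names are illustrative, not from the paper) shows the back-to-front alpha-over compositing that a faithful decomposition must invert:

```python
import numpy as np

def composite_layers(layers: list[np.ndarray]) -> np.ndarray:
    """Recompose a variable-length stack of RGBA layers (back to front)
    into a single RGB image via the standard alpha-over operator.

    Each layer has shape (H, W, 4) with channel values in [0, 1].
    """
    H, W, _ = layers[0].shape
    out = np.zeros((H, W, 3))            # start from an empty canvas
    for layer in layers:                 # layers[0] is the backmost layer
        rgb, a = layer[..., :3], layer[..., 3:4]
        out = a * rgb + (1.0 - a) * out  # alpha-over: place layer on top
    return out

# A faithful decomposition of x satisfies composite_layers(L_hat) ≈ x.
```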
2. Architecture and Design Modifications
The VLD-MMDiT extends the standard Multi-Modal Diffusion Transformer (MMDiT) by incorporating innovations tailored for multi-layer decomposition:
A. Multi-Modal Attention over Layer Stacks:
The input image latent $c_I$ and the current noisy layer latents $z_t$ are patchified by a factor of 2× along each spatial dimension, converting them into tractable token sequences. These, along with projected text-prompt features ($c_T$), are concatenated into a multimodal sequence that is processed by a single MultiHeadAttention block:

$$\mathrm{MMAttn}\big([c_T;\, c_I;\, z_t]\big) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$
This mechanism jointly fuses intra-layer, inter-layer, and cross-modal interactions, supporting dense conditioning and semantic alignment across layers.
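A minimal PyTorch sketch of this fused pass, using a plain nn.MultiheadAttention as a stand-in for the full MMDiT block (dimensions and names are assumptions):

```python
import torch
import torch.nn as nn

def patchify(z: torch.Tensor, p: int = 2) -> torch.Tensor:
    """(B, C, H, W) -> (B, H//p * W//p, C*p*p): 2x2 patches become tokens."""
    B, C, H, W = z.shape
    z = z.unfold(2, p, p).unfold(3, p, p)  # (B, C, H/p, W/p, p, p)
    return z.permute(0, 2, 3, 1, 4, 5).reshape(B, (H // p) * (W // p), C * p * p)

class MultiModalAttention(nn.Module):
    """Joint self-attention over [text ; image latent ; N noisy layer latents]."""
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, c_T, c_I_tokens, layer_tokens):
        # One fused sequence, so intra-layer, inter-layer and cross-modal
        # interactions are all handled in a single attention pass.
        seq = torch.cat([c_T, c_I_tokens, layer_tokens], dim=1)
        out, _ = self.attn(seq, seq, seq)
        return out
```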
B. Layer3D Rotary Positional Encoding (RoPE):
A third positional axis $\ell$ is introduced, supplementing the standard spatial encodings, where $\ell = 0$ corresponds to the conditioning image latent and $\ell = 1, \dots, N$ index each output layer. The 3D RoPE equips the attention module with the capability to attend dynamically over arbitrary layer counts, enabling sequence-length-agnostic processing and robust generalization to variable $N$.
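A sketch of one plausible construction (PyTorch; the split of the head dimension across the three axes is an assumption, not taken from the paper):

```python
import torch

def rope_angles(pos: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotary angles for one positional axis; dim must be even."""
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return pos[:, None].float() * freqs[None, :]              # (T, dim/2)

def layer3d_rope(h, w, layer, head_dim: int = 64):
    """Concatenate per-axis rotary angles for (height, width, layer).

    The head dimension is split into three blocks, one per axis; layer
    index 0 is reserved for the conditioning image latent and 1..N for
    the output layers, so attention generalizes to any layer count N.
    """
    d = head_dim // 3 // 2 * 2                                # even sub-dim per axis
    ang = torch.cat([rope_angles(h, d), rope_angles(w, d),
                     rope_angles(layer, head_dim - 2 * d)], dim=-1)
    return torch.cos(ang), torch.sin(ang)                     # applied to Q and K

def apply_rope(x, cos, sin):
    """Rotate even/odd channel pairs of x by the precomputed angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(-2)
```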
3. Mathematical Foundations
The layer decomposition process in VLD-MMDiT utilizes the flow-matching paradigm, parametrized as follows:
- Latent Initialization: $z_1 = \mathcal{E}(L_{1:N})$ (ground-truth layer latents)
- Noise Sampling: $z_0 \sim \mathcal{N}(0, I)$, $t \sim \mathcal{U}[0, 1]$
- Time-$t$ Interpolation and Velocity: $z_t = (1 - t)\,z_0 + t\,z_1$, with target velocity $v = z_1 - z_0$
- Prediction: The network predicts $\hat{v} = v_\theta(z_t, t, c_I, c_T)$ given $(z_t, t, c_I, c_T)$.
Loss Function: $\mathcal{L} = \mathbb{E}_{z_0,\, t}\big[\lVert v_\theta(z_t, t, c_I, c_T) - (z_1 - z_0) \rVert_2^2\big]$
No adversarial or perceptual losses are used within VLD-MMDiT; the training signal is exclusively driven by flow-matching reconstruction.
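A minimal training-step sketch of this objective (PyTorch; `model`, `vae_encode`, the shapes, and the uniform time distribution are assumptions):

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, vae_encode, x, layers, c_T):
    """One VLD-MMDiT training step under the flow-matching objective.

    x:      (B, 3, H, W) input RGB image
    layers: (B, N, 4, H, W) ground-truth RGBA layer stack
    c_T:    text-prompt features
    """
    c_I = vae_encode(x)                                   # condition latent
    z1 = torch.stack([vae_encode(l) for l in layers.unbind(1)], dim=1)
    z0 = torch.randn_like(z1)                             # z0 ~ N(0, I)
    t = torch.rand(z1.shape[0], device=z1.device)         # t ~ U[0, 1]
    tb = t.view(-1, *([1] * (z1.dim() - 1)))              # broadcastable t
    z_t = (1 - tb) * z0 + tb * z1                         # straight-line interpolation
    v_target = z1 - z0                                    # constant target velocity
    v_pred = model(z_t, t, c_I, c_T)
    return F.mse_loss(v_pred, v_target)                   # pure reconstruction signal
```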
Attention Update Example:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{\big(R_{(h,w,\ell)} Q\big)\big(R_{(h,w,\ell)} K\big)^{\top}}{\sqrt{d}}\right) V$$

Here, $Q$ and $K$ carry the 3D RoPE embeddings $R_{(h,w,\ell)}$ for positions $(h, w, \ell)$.
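Reusing the hypothetical layer3d_rope and apply_rope helpers from the Section 2 sketch, queries and keys are rotated by their $(h, w, \ell)$ positions before the attention product:

```python
import torch

# Hypothetical toy shapes: T tokens, 16 heads of dimension 64 (cf. Section 6).
T, heads, hd = 6, 16, 64
q, k = torch.randn(T, heads, hd), torch.randn(T, heads, hd)

# Per-token positions: layer 0 is the conditioning image, 1..N the outputs.
h_pos = torch.tensor([0, 1, 0, 1, 0, 1])
w_pos = torch.tensor([0, 0, 1, 1, 0, 0])
l_pos = torch.tensor([0, 0, 1, 1, 2, 2])

cos, sin = layer3d_rope(h_pos, w_pos, l_pos, head_dim=hd)
q = apply_rope(q, cos[:, None, :], sin[:, None, :])   # rotate each head
k = apply_rope(k, cos[:, None, :], sin[:, None, :])
scores = q.transpose(0, 1) @ k.transpose(0, 1).transpose(-1, -2) / hd ** 0.5
attn = torch.softmax(scores, dim=-1)                  # (heads, T, T)
```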
4. Integration with RGBA-VAE
The RGBA-VAE encoder–decoder pair $(\mathcal{E}, \mathcal{D})$ defines the latent space for both input conditioning and output prediction:
- The encoder $\mathcal{E}$ (first convolution expanded from 3 to 4 input channels) processes both RGB and RGBA images, ensuring a unified latent embedding.
- The decoder $\mathcal{D}$ (last convolution expanded analogously to 4 output channels) reconstructs RGBA layers from their latent representations.
- The expanded convolutions are initialized from the pretrained RGB weights, with the newly added alpha-channel parameters zero-initialized so that RGB behavior is preserved at the start of fine-tuning.
- The input image is encoded as $c_I = \mathcal{E}(x)$, and the output stack as $z_1 = [\mathcal{E}(L_1), \dots, \mathcal{E}(L_N)]$.
- All operations within VLD-MMDiT are performed in this shared manifold, simplifying conditioning and reconstruction.
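A sketch of the channel-expansion step described above (PyTorch; the zero-initialization of the new alpha-channel weights is an assumption, chosen to preserve the pretrained RGB behavior):

```python
import torch
import torch.nn as nn

def expand_in_channels(conv: nn.Conv2d, new_in: int = 4) -> nn.Conv2d:
    """Expand a pretrained first conv from 3 to 4 input channels (RGB -> RGBA).

    Pretrained RGB weights are copied; the added alpha-channel weights are
    zero-initialized so the expanded VAE initially reproduces the RGB model.
    """
    new = nn.Conv2d(new_in, conv.out_channels, conv.kernel_size,
                    conv.stride, conv.padding, bias=conv.bias is not None)
    with torch.no_grad():
        new.weight.zero_()
        new.weight[:, :conv.in_channels] = conv.weight  # keep RGB behavior
        if conv.bias is not None:
            new.bias.copy_(conv.bias)
    return new

# The decoder's last convolution is expanded analogously along its
# output-channel dimension to emit RGBA instead of RGB.
```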
5. Multi-Stage Training Framework and Data Pipeline
The training of VLD-MMDiT involves three distinct phases:
| Stage | Objective | Data Source |
|---|---|---|
| 1 | Text-to-RGB and Text-to-RGBA generation | Large text–image corpus (no layers) |
| 2 | Text-to-Multi-RGBA (T2L): composite and multi-layer outputs (variable $N$) | PSD-extracted multilayer sets + generated captions |
| 3 | Image-to-Multi-RGBA (I2L): layer decomposition conditioned on the input image $x$ | PSD-extracted multilayer sets |
The PSD extraction pipeline involves parsing raw Photoshop documents, discarding irrelevant or low-quality layers, merging non-overlapping layers to manage the layer count $N$, and generating semantic captions using Qwen2.5-VL.
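An illustrative extraction pass using the third-party psd-tools package (the visibility and area filters below stand in for the paper's unspecified quality heuristics; merging and captioning are separate steps):

```python
from psd_tools import PSDImage  # third-party: pip install psd-tools

def extract_layers(psd_path: str, min_area: float = 0.001):
    """Parse a PSD file into candidate RGBA layers, dropping hidden/tiny ones."""
    psd = PSDImage.open(psd_path)
    canvas_area = psd.width * psd.height
    layers = []
    for layer in psd.descendants():
        if layer.is_group() or not layer.is_visible():
            continue
        img = layer.composite()                  # render the layer to a PIL image
        if img is None:
            continue
        w, h = img.size
        if (w * h) / canvas_area < min_area:     # discard negligible layers
            continue
        layers.append(img.convert("RGBA"))
    return layers
```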
6. Training and Inference Details
The core protocol for VLD-MMDiT training includes:
- Batch sampling from PSD-derived datasets of triplets $(x, L_{1:N}, c_T)$.
- Encoding $x$ and $L_{1:N}$ into $c_I$ and $z_1$, respectively, via the RGBA-VAE encoder $\mathcal{E}$.
- Noise and time parameter sampling ($z_0 \sim \mathcal{N}(0, I)$, $t \sim \mathcal{U}[0, 1]$), followed by layer-latent interpolation.
- Forwarding through VLD-MMDiT (with $z_t$, $c_I$, $c_T$ as inputs).
- Computing and applying the flow-matching loss.
- Optimizer: Adam.
- Batch size: 128.
- Max layer count: $N_{\max} = 20$.
- Training steps: 500 K for Stage 1, 400 K for Stage 2 and Stage 3.
- RGBA-VAE latent dimensions: unchanged from the underlying pretrained VAE (the 3→4 channel expansion affects only the first and last convolutions).
- Patchification: 2× per spatial dimension.
- Attention heads: 16, with head dimension 64.
- Layer3D RoPE: rotary embeddings over the $(h, w, \ell)$ axes, with $\ell = 0$ reserved for the conditioning image latent.
Sampling (Inference) Algorithm:
- Encode the input image $x$ to $c_I = \mathcal{E}(x)$
- Initialize $z \sim \mathcal{N}(0, I)$
- Iteratively update from $t = 0$ to $t = 1$:
- Predict velocity $\hat{v} = v_\theta(z_t, t, c_I, c_T)$
- Update the latent using the rectified-flow integrator: $z_{t+\Delta t} = z_t + \Delta t\,\hat{v}$
- Decode the final layer latents via $\mathcal{D}$ to obtain $\hat{L}_{1:N}$
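A compact sketch of this sampler as a plain Euler integrator (PyTorch; the step count, shapes, and function names are assumptions):

```python
import torch

@torch.no_grad()
def sample_layers(model, vae_encode, vae_decode, x, c_T, N, steps: int = 50):
    """Decompose image x into N RGBA layers by integrating the learned flow.

    Euler integration of dz/dt = v_theta from t=0 (noise) to t=1 (data).
    """
    c_I = vae_encode(x)                                      # condition on the input
    z = torch.randn(1, N, *c_I.shape[1:], device=x.device)   # z ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt, device=x.device)
        v = model(z, t, c_I, c_T)                            # predicted velocity
        z = z + dt * v                                       # Euler step toward data
    return [vae_decode(z[:, n]) for n in range(N)]           # N decoded RGBA layers
```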
7. Empirical Performance and Ablation
VLD-MMDiT demonstrates superior quantitative performance on the Crello dataset relative to leading baselines, as shown:
| Method | RGB error ($\alpha$-weighted) ↓ | soft-IoU ↑ |
|---|---|---|
| LayerD | 0.0709 | 0.7520 |
| Qwen-Image-Layered-I2L | 0.0594 | 0.8705 |
Ablation studies confirm the necessity of Layer3D RoPE, RGBA-VAE, and multi-stage training:
- Removing Layer3D RoPE causes a failure to distinguish layers, with RGB error rising sharply.
- Omitting the RGBA-VAE leaves a persistent gap in both RGB error and soft-IoU.
- Skipping multi-stage training yields suboptimal adaptation on both metrics.
Qualitative gains include elimination of recursive peeling errors, improved content separation, and efficient, robust support for decompositions ranging from one to twenty layers. The architecture is end-to-end trainable and computationally efficient, establishing a new paradigm for image editing via inherent layerwise disentanglement (Yin et al., 17 Dec 2025).