
VLD-MMDiT: Variable Layers Decomposition

Updated 19 December 2025
  • The paper presents VLD-MMDiT, which decomposes an RGB image into a variable number of semantically disentangled RGBA layers using a flow-matching diffusion process.
  • Its architecture integrates multi-modal attention and a novel 3D rotary positional encoding to achieve efficient, end-to-end layer decomposition without recursive peeling.
  • Empirical results demonstrate state-of-the-art performance, with significant improvements in both quantitative metrics (e.g., soft-IoU) and qualitative layer separability.

Variable Layers Decomposition MMDiT (VLD-MMDiT) is a key architectural component within Qwen-Image-Layered, a diffusion-based image generation system designed for decomposing a single RGB image into a stack of multiple, semantically disentangled RGBA layers. Unlike traditional raster representations, which entangle all visual content into a single canvas, VLD-MMDiT leverages a conditional diffusion process with a shared RGBA-VAE latent space, providing variable-length layer decomposition in an efficient, end-to-end manner. This architecture supports inherent editability, with each output layer independently manipulable, and achieves state-of-the-art results in quantitative and qualitative image decomposition metrics (Yin et al., 17 Dec 2025).

1. Objectives and Problem Formulation

VLD-MMDiT addresses the task of converting an input image $I \in \mathbb{R}^{H \times W \times 3}$ into a set of $N$ RGBA layers, $L = [L_1, \ldots, L_N]$, where each $L_i \in \mathbb{R}^{H \times W \times 4}$. The decomposition goal is to achieve semantic disentanglement such that edits to one layer do not compromise the integrity of the others. The core challenge is modeling this as a single-shot process, capable of supporting a variable number $N$ of layers, without relying on recursive foreground–background peeling or error-prone stepwise methods.

The flow-matching diffusion approach adopted in VLD-MMDiT operates in the unified latent space of an RGBA-VAE encoder $\mathcal{E}$ and decoder $\mathcal{D}$. The architecture conditions on both a latent vector $z_I = \mathcal{E}(I)$ and an optional text descriptor $h$; these inputs inform the decomposition by guiding the denoising trajectory from isotropic Gaussian noise $x_1 \sim \mathcal{N}(0, I)$ towards the true layer latents $x_0 = \mathcal{E}(L)$.

2. Architecture and Design Modifications

The VLD-MMDiT extends the standard Multi-Modal Diffusion Transformer (MMDiT) by incorporating innovations tailored for multi-layer decomposition:

A. Multi-Modal Attention over Layer Stacks:

The input image latent $z_I$ and the current noisy layer latent $x_t$ are patchified by a factor of 2× along each spatial dimension, converting them into tractable sequences. These, along with projected text prompt features ($Q^{\text{txt}} = \text{project}(h)$), are concatenated to form a multimodal sequence which is processed via a single MultiHeadAttention block:

$$\text{Attn}([Q^{\text{img}};\ Q^{\text{lay}};\ Q^{\text{txt}}])$$

This mechanism jointly fuses intra-layer, inter-layer, and cross-modal interactions, supporting dense conditioning and semantic alignment across layers.
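A minimal PyTorch sketch of this joint attention pass is given below; the token counts and sequence lengths are illustrative assumptions, not the released configuration (apart from the 16 heads of dimension 64 reported in Section 6):

```python
# Sketch: one self-attention pass over the concatenated image/layer/text tokens.
# Shapes are hypothetical; only the head count/width follow the paper.
import torch
import torch.nn as nn

B, d = 2, 1024               # batch size, model width (16 heads x 64)
n_img, n_txt = 256, 77       # patch tokens of z_I, projected text tokens
N, n_patch = 5, 256          # layers in the stack, patch tokens per layer

q_img = torch.randn(B, n_img, d)         # patchified condition latent z_I
q_lay = torch.randn(B, N * n_patch, d)   # patchified noisy layer latents x_t
q_txt = torch.randn(B, n_txt, d)         # projected text features Q^txt

# One joint sequence -> a single attention pass fuses intra-layer,
# inter-layer, and cross-modal interactions at once.
seq = torch.cat([q_img, q_lay, q_txt], dim=1)
attn = nn.MultiheadAttention(embed_dim=d, num_heads=16, batch_first=True)
out, _ = attn(seq, seq, seq)             # self-attention over all tokens
print(out.shape)                         # torch.Size([2, 1613, 1024])
```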

B. Layer3D Rotary Positional Encoding (RoPE):

A third positional axis $\ell \in \{-1, 0, \ldots, N-1\}$ is introduced, supplementing the standard $(x, y)$ positional encodings, where $\ell = -1$ corresponds to the conditioning image latent and $\ell = 0, \ldots, N-1$ indexes the output layers. The 3D RoPE equips the attention module with the capability to attend dynamically over arbitrary layer counts, enabling sequence-length-agnostic processing and robust generalization to variable $N$.
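A short sketch of how these three positional axes could be constructed is shown below; the tensor layout and the helper name `layer3d_positions` are our own assumptions, not the paper's code:

```python
# Sketch of the Layer3D position grid: each token gets (x, y, l), with
# l = -1 for the conditioning image latent and l = 0..N-1 for output layers.
import torch

def layer3d_positions(h: int, w: int, num_layers: int) -> torch.Tensor:
    """Return (num_tokens, 3) integer positions for [condition; N layers]."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grids = []
    for ell in range(-1, num_layers):   # -1 = condition image, 0..N-1 = layers
        ell_axis = torch.full((h, w), ell)
        grids.append(torch.stack([xs, ys, ell_axis], dim=-1).reshape(-1, 3))
    return torch.cat(grids, dim=0)

pos = layer3d_positions(h=16, w=16, num_layers=5)
print(pos.shape)                        # torch.Size([1536, 3]): (1 + 5) * 16 * 16
```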

3. Mathematical Foundations

The layer decomposition process in VLD-MMDiT utilizes the flow-matching paradigm, parametrized as follows:

  • Latent Initialization: $x_0 = \mathcal{E}(L) \in \mathbb{R}^{N \times h \times w \times c}$
  • Noise Sampling: $x_1 \sim \mathcal{N}(0, I)$, $t \sim \text{LogitNormal}$
  • Time-t Interpolation and Velocity:

$$x_t = t \cdot x_0 + (1-t) \cdot x_1, \qquad v_t = \frac{dx_t}{dt} = x_0 - x_1$$

  • Prediction: The network $v_\theta$ predicts $v_t$ given $(x_t, t, z_I, h)$.

Loss Function:

$$\mathcal{L}_{FM} = \mathbb{E}_{x_0, x_1, t, z_I, h} \big\| v_\theta(x_t, t, z_I, h) - v_t \big\|_2^2$$

No adversarial or perceptual losses are used within VLD-MMDiT; the training signal is exclusively driven by flow-matching reconstruction.
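Written out as code, this objective amounts to a few lines. In the sketch below, `v_theta` stands in for the network and the latent shapes are illustrative; the distributions follow the definitions above:

```python
# Minimal flow-matching loss, mirroring the formulas in this section.
import torch

def flow_matching_loss(v_theta, x0, z_I, h_txt):
    """x0: clean layer latents E(L), shape (B, N, h, w, c)."""
    B = x0.shape[0]
    x1 = torch.randn_like(x0)                   # noise endpoint x_1 ~ N(0, I)
    # LogitNormal time sampling: sigmoid of a standard normal draw.
    t = torch.sigmoid(torch.randn(B, 1, 1, 1, 1))
    xt = t * x0 + (1 - t) * x1                  # linear interpolation x_t
    vt = x0 - x1                                # target velocity dx_t/dt
    pred = v_theta(xt, t, z_I, h_txt)
    return ((pred - vt) ** 2).mean()            # squared-error flow loss
```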

Attention Update Example:

$$\text{Attn}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^T}{\sqrt{d}}\right) V$$

Here, $Q$ and $K$ carry the 3D RoPE embeddings for $(x, y, \ell)$.
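A generic single-axis rotary rotation, of the kind repeated per $(x, y, \ell)$ axis on separate channel groups, might look like the following; the frequency schedule here is the standard RoPE one and is an assumption, not the paper's exact choice:

```python
# Sketch of a rotary rotation applied to query/key features for one axis.
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(q: torch.Tensor, pos: torch.Tensor, dim: int) -> torch.Tensor:
    """q: (..., tokens, dim); pos: (tokens,) integer positions on one axis."""
    freqs = 1.0 / (10000 ** (torch.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = pos[:, None].float() * freqs[None, :]            # (tokens, dim/2)
    cos = torch.cat([angles.cos(), angles.cos()], dim=-1)
    sin = torch.cat([angles.sin(), angles.sin()], dim=-1)
    return q * cos + rotate_half(q) * sin                     # rotated features
```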

4. Integration with RGBA-VAE

The RGBA-VAE encoder-decoder pair defines the latent space for both input conditioning and output prediction:

  • The encoder $\mathcal{E}$ (first convolution expanded from 3 to 4 channels) processes both RGB and RGBA images, ensuring a unified latent embedding.
  • The decoder $\mathcal{D}$ (last convolution expanded similarly) reconstructs RGBA layers from their latent representations.
  • Initial weights are set to $W_{\mathcal{E}}[:,3,:,:,:] = 0$ and $W_{\mathcal{D}}[3,:,:,:,:] = 0$, with $b_{\mathcal{D}}[3] = 1$ (sketched in code after this list).
  • The input image $I$ is encoded as $z_I = \mathcal{E}(I)$, and the output stack $L$ is encoded as $x_0 = \mathcal{E}(L)$.
  • All operations within VLD-MMDiT are performed in this shared manifold, simplifying conditioning and reconstruction.
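A hedged sketch of the zero/one channel-expansion initialization described above; the function names and module handling are placeholders, not the actual Qwen-Image-Layered code:

```python
# Expand pretrained 3-channel VAE stems to 4 channels with the described init.
import torch
import torch.nn as nn

def expand_encoder_conv(conv3: nn.Conv2d) -> nn.Conv2d:
    """Copy RGB input weights; zero the new alpha input channel."""
    conv4 = nn.Conv2d(4, conv3.out_channels, conv3.kernel_size,
                      conv3.stride, conv3.padding)
    with torch.no_grad():
        conv4.weight.zero_()
        conv4.weight[:, :3] = conv3.weight   # reuse RGB filters; alpha stays 0
        conv4.bias.copy_(conv3.bias)
    return conv4

def expand_decoder_conv(conv3: nn.Conv2d) -> nn.Conv2d:
    """Copy RGB output weights; new alpha output starts at bias 1 (opaque)."""
    conv4 = nn.Conv2d(conv3.in_channels, 4, conv3.kernel_size,
                      conv3.stride, conv3.padding)
    with torch.no_grad():
        conv4.weight.zero_()
        conv4.weight[:3] = conv3.weight      # alpha output weights stay 0
        conv4.bias[:3] = conv3.bias
        conv4.bias[3] = 1.0                  # b_D[3] = 1
    return conv4
```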

5. Multi-Stage Training Framework and Data Pipeline

The training of VLD-MMDiT involves three distinct phases:

| Stage | Objective | Data Source |
|-------|-----------|-------------|
| 1 | Text-to-RGB and Text-to-RGBA generation | Large text–image corpus (no layers) |
| 2 | Text-to-Multi-RGBA (T2L): composite and multi-layer outputs ($N \leq 20$) | PSD-extracted multilayer sets + generated captions |
| 3 | Image-to-Multi-RGBA (I2L): layer decomposition conditioned on $I$ | PSD-extracted multilayer sets |

The PSD extraction pipeline involves parsing raw Photoshop documents, discarding irrelevant or low-quality layers, merging non-overlapping layers to manage $N$, and generating semantic captions using Qwen2.5-VL.
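For illustration, a rough sketch of this kind of layer extraction using the third-party psd-tools package; the filtering rules, the area threshold, and the function name are placeholders rather than the paper's actual pipeline:

```python
# Hypothetical PSD layer extraction with psd-tools; filtering is illustrative.
from psd_tools import PSDImage

def extract_layers(path: str, min_area: int = 64 * 64):
    psd = PSDImage.open(path)
    layers = []
    for layer in psd.descendants():
        if not layer.is_visible() or layer.is_group():
            continue
        image = layer.composite()            # render the layer as an RGBA PIL image
        if image is None or image.width * image.height < min_area:
            continue                         # drop tiny / low-quality layers
        layers.append(image)
    return layers                            # caption each with Qwen2.5-VL downstream
```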

6. Training and Inference Details

The core protocol for VLD-MMDiT training includes the following; the hyperparameters are collected into a config sketch after the list:

  • Batch sampling from PSD-derived datasets $(I, L)$.
  • Encoding $I$ and $L$ into $z_I$ and $x_0$, respectively, via the RGBA-VAE $\mathcal{E}$.
  • Noise and time parameter sampling ($x_1$, $t$), followed by layer latent interpolation.
  • Forwarding $x_t$ through VLD-MMDiT (with $z_I$, $h$, $t$ inputs).
  • Computing and applying the flow-matching loss.
  • Optimizer: Adam with learning rate $1\mathrm{e}{-5}$.
  • Batch size: 128.
  • Max layer count $N_{\text{max}} = 20$.
  • Training steps: 500 K for Stage 1, 400 K each for Stages 2 and 3.
  • RGBA-VAE latent dimensions: $c = 4$ channels, $h, w = \frac{1}{8}(H, W)$.
  • Patchification: 2× per spatial dimension.
  • Attention heads: 16, with head dimension 64.
  • Layer3D RoPE: $\sin(\frac{x}{2\pi}),\ \cos(\frac{y}{2\pi}),\ \sin(\frac{\ell}{\pi}),\ \cos(\frac{\ell}{\pi})$.
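For reference, the hyperparameters listed above gathered into a single config object; the field names are our own, the values are those reported here:

```python
# Reported hyperparameters as a config dataclass (field names are assumptions).
from dataclasses import dataclass

@dataclass
class VLDMMDiTConfig:
    learning_rate: float = 1e-5    # Adam
    batch_size: int = 128
    max_layers: int = 20           # N_max
    latent_channels: int = 4       # RGBA-VAE c
    latent_downscale: int = 8      # h, w = H/8, W/8
    patch_size: int = 2            # 2x per spatial dimension
    num_heads: int = 16
    head_dim: int = 64             # model width = 16 * 64 = 1024
    steps_stage1: int = 500_000
    steps_stage2: int = 400_000
    steps_stage3: int = 400_000
```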

Sampling (Inference) Algorithm (see the code sketch after the list):

  • Encode input $I$ to $z_I$
  • Initialize $x_N \sim \mathcal{N}(0, I)$
  • Iteratively update from $t = N$ to $t = 1$:
    • Predict velocity $v_\theta(x_t, t, z_I, h)$
    • Update latent $x_{t-1} = x_t + \Delta t \cdot v_\theta$ using the rectified-flow integrator
  • Decode the final layer latent $x_0$ via $\mathcal{D}$ to obtain $L$
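A minimal Euler-style loop implementing these steps; `v_theta`, `encode`, and `decode` stand in for the trained network and the RGBA-VAE $\mathcal{E}$/$\mathcal{D}$, and the latent shape is an illustrative assumption:

```python
# Rectified-flow sampling sketch following the countdown convention above.
import torch

@torch.no_grad()
def sample_layers(v_theta, encode, decode, image, h_txt,
                  num_layers: int, steps: int = 50,
                  h: int = 128, w: int = 128, c: int = 4):
    z_I = encode(image)                          # z_I = E(I)
    x = torch.randn(1, num_layers, h, w, c)      # start from pure Gaussian noise
    dt = 1.0 / steps
    for k in range(steps, 0, -1):                # countdown, as in the list above
        t = torch.full((1,), k * dt)
        x = x + dt * v_theta(x, t, z_I, h_txt)   # Euler rectified-flow update
    return decode(x)                             # L = D(x_0): RGBA layer stack
```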

7. Empirical Performance and Ablation

VLD-MMDiT demonstrates superior quantitative performance on the Crello dataset relative to leading baselines, as shown:

| Method | RGB $L_1$ (weighted by $\alpha$) | $\alpha$ soft-IoU |
|--------|----------------------------------|-------------------|
| LayerD | 0.0709 | 0.7520 |
| Qwen-Image-Layered-I2L | 0.0594 ($-16\%$) | 0.8705 ($+16\%$) |

Ablation studies confirm the necessity of Layer3D RoPE, RGBA-VAE, and multi-stage training:

  • Removal of Layer3D RoPE causes failure to distinguish layers (RGB $L_1 \approx 0.28$).
  • Omitting the RGBA-VAE manifests a persistent gap (RGB $L_1 \approx 0.19$, $\alpha$ IoU $\approx 0.58$).
  • Excluding multi-stage training yields suboptimal adaptation (RGB $L_1 \approx 0.16$, $\alpha$ IoU $\approx 0.65$).

Qualitative gains include elimination of recursive peeling errors, improved content separation, and efficient, robust support for decompositions ranging from one to twenty layers. The architecture is end-to-end trainable and computationally efficient, establishing a new paradigm for image editing via inherent layerwise disentanglement (Yin et al., 17 Dec 2025).
