Controllable Layer Decomposition (CLD)

Updated 4 May 2026

CLD is a neural network technique that decomposes composite visual inputs into semantically meaningful layers with user-directed control.
It integrates convolutional, transformer, and diffusion models to perform granular, prompt-conditioned separation for tasks like restoration and editing.
Applications range from image restoration and design workflows to generative modeling, enabling selective content manipulation and improved creative workflows.

Controllable Layer Decomposition (CLD) refers to a family of neural network techniques and architectures that enable the decomposition of images, designs, or videos into distinct, semantically meaningful layers with direct user or prompt-driven control over the separation process. This paradigm extends traditional blind decomposition and matting by introducing mechanisms to determine not only which components are extracted, but also how many, which types, and with what fidelity, supporting downstream tasks such as selective restoration, editing, or content creation. The methodologies span convolutional, transformer, and diffusion-based models, with applications across image restoration, design workflows, generative modeling, and compositional editing.

1. Problem Statement, Motivation, and Scope

CLD aims to infer a set of layers from an observed, typically composite, visual input such that each layer is editable or addressable according to user-specified criteria. The central goal is to recover per-layer representations $(L_i)$, frequently in RGBA or a multi-channel feature space, whose recomposition exactly or closely reproduces the input, while exposing control over the granularity, nature, and semantic meaning of each layer (Zhang et al., 2024, Suzuki et al., 29 Sep 2025, Liu et al., 20 Nov 2025, Chen et al., 22 Feb 2026). The settings addressed include:

Image and video restoration from multi-degraded or blended inputs, where degradations (e.g., haze, watermark, or reflection) may be selectively retained or removed.
Extraction of graphic-design or illustration workflow layers (e.g., line art, flat color, shadow, highlight).
Instance-wise and amodal scene decomposition for generative modeling, inpainting, and editing tasks.
Prompt-conditioned extraction, where layers are selected or generated in response to user clicks, bounding boxes, masks, or language instructions.

Contrasted with classical matting or segmentation, CLD emphasizes controllability: explicit user or prompt-based steering of what and how decomposition occurs, frequently via binary or learned control vectors, spatial prompt images, or natural language queries.

2. Core Architectural Mechanisms

Although CLD approaches vary in their specifics, several canonical architectural elements appear recurrently:

Feature Split and Recombination: In the CBDNet framework, a U-Net or Restormer encoder produces a high-dimensional feature tensor $F_d$. The CLD block decomposes $F_d$ into $N$ channel slices $F_i$, each associated with a presumed component or degradation. A user- or prompt-configured control vector $c\in\{0,1\}^N$ determines which slices are recombined to form fused features for decoding. Both split and recombination are parameter-free operations (Zhang et al., 2024).

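# Simplified CLD as in CBDNet (parameter-free split and recombination)
def CLD_Decompose_Recombine(F, c):
    # F: feature tensor, assumed here to be laid out as (batch, Cd, H, W)
    # c: control vector of length N (1 keeps a slice, 0 drops it)
    N = len(c)
    Cd = F.shape[1]
    B = Cd // N  # channels per branch
    F_i = [F[:, i*B:(i+1)*B] for i in range(N)]  # split along the channel axis
    Fr = sum(ci * Fi for ci, Fi in zip(c, F_i))  # sum only the selected slices
    return Fr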

Controllability Interface: User instructions, either by checklist or textual prompt, are translated into the control vector $c$. This step may use a small classifier CNN to suggest which degradations are present and a prompt converter (e.g., BERT+FC) for free-form language commands (Zhang et al., 2024).
Layered Generation/Decomposition Transformers: Solutions such as LayerDecompose-DiT (Liu et al., 20 Nov 2025), LayeringDiff (Kang et al., 2 Jan 2025), and RefLayer (Chen et al., 22 Feb 2026) employ transformer backbones to jointly process latent tokens representing the composite image and layer proposals, augmented with hierarchical position embeddings or per-layer semantic embeddings to disambiguate attention across layers. Conditional adapters (e.g., MLCA) project portions of the composite into per-layer guidance streams.
Diffusion-Based Decoupling: CLD frequently uses conditional latent diffusion models to learn to separate composite content into editable RGBA (or RGB plus alpha) layers, supporting joint or independent denoising of background and foreground streams with prompt-based conditioning (Yang et al., 2024, Chen et al., 26 Nov 2025, Chen et al., 22 Feb 2026, Liu et al., 20 Nov 2025).
Palette/Uniform-Region Refinement: For designs or illustrations, prior knowledge about uniformity is leveraged post-hoc to palette-snap layer colors or refine alpha masks, stabilizing layer edges and removing speckle (Suzuki et al., 29 Sep 2025, Zhang et al., 16 Mar 2026).
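
The palette-snapping idea in the last item can be illustrated with a minimal NumPy sketch. This is not the refinement procedure of either cited paper: the function name `snap_to_palette`, the fixed two-color palette, and nearest-color assignment by Euclidean distance in RGB are all assumptions made for illustration.

```python
import numpy as np

def snap_to_palette(layer_rgb, palette):
    """Replace each pixel with its nearest palette color (Euclidean distance in RGB).

    layer_rgb: float array of shape (H, W, 3)
    palette:   float array of shape (K, 3); hypothetical, e.g. obtained by clustering
    """
    pixels = layer_rgb.reshape(-1, 1, 3)                      # (H*W, 1, 3)
    dists = np.linalg.norm(pixels - palette[None], axis=2)    # (H*W, K)
    nearest = dists.argmin(axis=1)                            # closest color index per pixel
    return palette[nearest].reshape(layer_rgb.shape)

# Usage: a noisy, nearly uniform red/blue layer becomes exactly two-colored.
palette = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
noisy = np.array([[[0.9, 0.1, 0.0], [0.1, 0.0, 0.8]]])       # shape (1, 2, 3)
snapped = snap_to_palette(noisy, palette)
```

Snapping removes small color variations inside regions that are expected to be uniform. A real pipeline would also need to choose the palette and handle anti-aliased edges, which this sketch ignores.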

3. Training Objectives, Losses, and Supervision

CLD is supervised with multi-part objectives tailored to the layered composition pipeline:

Per-Branch Reconstruction: Each hypothesized component or decomposition stream is tasked to reconstruct its clean or degraded ground-truth target using regression losses (e.g., Smooth L1 loss over reconstructed outputs), where $y'_i$ denotes the output of branch $i$ and $y_i$ its target.

$\mathcal{L}_{\rm smoothL1} = \sum_{i=1}^{N+1} \| y'_i - y_i \|_{\rm smoothL1}$
Perceptual/Adversarial/VGG/LPIPS Losses: Perceptual similarity metrics encourage photorealistic quality, often targeting the most human-salient branch (e.g., the “clean” restored image or a partial recombination) (Zhang et al., 2024, Yang et al., 2024, Kang et al., 2 Jan 2025).
Alpha, Mask, or Completion Losses: In instance- and matting-oriented pipelines, cross-entropy, IoU, and SSIM losses supervise the alpha prediction and amodal completion of occluded regions (Suzuki et al., 29 Sep 2025, Chen et al., 22 Feb 2026).
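Consistency and Composite Losses: Some CLD models enforce that the recomposed output through alpha blending recovers the input image, with L1 or MSE penalties (Yang et al., 2024). In the compositing equation, $\alpha_i(x)$ and $F_i(x)$ are the opacity and color of layer $i$ at pixel $x$ (not the feature slices of Section 2), and layer 0 is the front-most layer.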

$I(x) = \sum_{i=0}^{N-1} \alpha_i(x) F_i(x) \prod_{j<i} (1 - \alpha_j(x))$
Prompt/Source Classification Losses: To support prompt-driven selection, multi-label BCE losses train a branch classifier to anticipate the presence or absence of each degradation/component (Zhang et al., 2024).
Layer-Specific Loss Weights: In structured decompositions (e.g., illustration), the loss function weights are layer-adaptive to reflect domain knowledge; e.g., strong L1 on line art sharpness, MSE on color layers, sparsity for highlight/shadow (Zhang et al., 16 Mar 2026).
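
The compositing equation and the consistency penalty above can be made concrete with a short NumPy sketch. It assumes layers are ordered front to back (index 0 on top), matching the product over $j<i$, and uses the common Smooth L1 definition with a threshold `beta`. The function names and toy values are illustrative and not taken from any cited implementation.

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing: I = sum_i alpha_i * F_i * prod_{j<i} (1 - alpha_j).

    colors: (N, H, W, 3) layer colors F_i, index 0 is the front-most layer
    alphas: (N, H, W, 1) layer opacities alpha_i in [0, 1]
    """
    out = np.zeros_like(colors[0])
    transmittance = np.ones_like(alphas[0])      # running prod_{j<i} (1 - alpha_j)
    for color, alpha in zip(colors, alphas):
        out = out + transmittance * alpha * color
        transmittance = transmittance * (1.0 - alpha)
    return out

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: quadratic for |d| < beta, linear otherwise; averaged over elements."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d**2 / beta, d - 0.5 * beta).mean()

# Toy example: a half-transparent white layer over an opaque black background.
colors = np.stack([np.ones((2, 2, 3)), np.zeros((2, 2, 3))])
alphas = np.stack([np.full((2, 2, 1), 0.5), np.ones((2, 2, 1))])
image = composite(colors, alphas)                # every pixel is mid-gray
consistency = np.abs(image - 0.5).mean()         # L1 recomposition error against the input
```

In training, the per-branch term would be summed over branches as in the Smooth L1 equation, and the consistency term would compare `composite(...)` of the predicted layers with the observed input.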

4. User Controllability and Prompt Conditioning

A distinguishing feature of CLD models is the ability to steer decomposition via user-defined or learned prompts:

Binary or Soft Control Vectors: Supported by source classifiers and prompt converters, direct user control of which degradations/effects are retained or removed is implemented as a binary or continuous vector feeding the recombination process (Zhang et al., 2024).
Direct Prompt Conditioning: Recent CLD variants accept flexible spatial (masks, boxes, points) or language prompts at inference time, unified as RGB prompt images in the latent-diffusion encoder (Chen et al., 22 Feb 2026). Multimodal fusion modules (including efficient linear-complexity attention) allow for high-dimensional control.
Semantic Layer Embedding: In illustration pipelines, learnable per-layer semantic embeddings, injected into the transformer per token, bias the network to treat contiguous token blocks as belonging to a specific semantic layer, reducing cross-layer interference and promoting factorized decompositions (Zhang et al., 16 Mar 2026).
Interactive Editing: Once layers are extracted, downstream editing covers moving, recoloring, reblending, or entirely removing specified layers without re-running the model (Zhang et al., 2024, Suzuki et al., 29 Sep 2025, Liu et al., 20 Nov 2025, Yang et al., 2024).
Prompt Quality and Influence: Empirically, spatial prompts (masks, boxes) yield tighter foreground extraction and completion, while language commands can improve performance on occluded or amodal instances (Chen et al., 22 Feb 2026).
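
As a hedged illustration of the binary and soft control vectors described above, the sketch below maps a user checklist to a vector $c$ and applies it to channel slices. The component names, the `COMPONENTS` ordering, and the (batch, channels, height, width) layout are assumptions for the example. The prompt converter and source classifier from the cited work are not modeled.

```python
import numpy as np

# Hypothetical component order; one feature slice per entry.
COMPONENTS = ["clean", "haze", "rain", "watermark"]

def control_vector(keep, soft=None):
    """Build c from a checklist of components to keep.

    keep: iterable of names from COMPONENTS to retain (binary control)
    soft: optional dict name -> weight in [0, 1] overriding the binary value
    """
    keep = set(keep)
    unknown = keep - set(COMPONENTS)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    c = np.array([1.0 if name in keep else 0.0 for name in COMPONENTS])
    for name, weight in (soft or {}).items():
        c[COMPONENTS.index(name)] = weight
    return c

def recombine(features, c):
    """Weight N equal channel slices of a (B, C, H, W) array by c and sum them."""
    slices = np.split(features, len(c), axis=1)   # N arrays of shape (B, C/N, H, W)
    return sum(ci * s for ci, s in zip(c, slices))

features = np.ones((1, 8, 4, 4))                  # 4 slices of 2 channels each
c = control_vector({"clean", "rain"}, soft={"haze": 0.25})
fused = recombine(features, c)                    # shape (1, 2, 4, 4)
```

A binary entry keeps or drops a component outright, while a soft weight such as `0.25` for haze retains it partially. Whether a given model was trained to handle fractional weights is a separate question that this sketch does not address.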

5. Benchmarks, Datasets, and Evaluation Metrics

Robust CLD evaluation demands appropriate datasets and metrics:

Synthetic, Semi-Synthetic, and Real Multilayer Datasets: Prominent data sources include Crello (raster designs; Suzuki et al., 29 Sep 2025), PrismLayersPro (graphic design; Liu et al., 20 Nov 2025), RefLade (1.1M curated layered instances; Chen et al., 22 Feb 2026), and MuLAn (multi-layer annotated natural images; Tudosiu et al., 2024).
Quality and Consistency Metrics: Standard metrics comprise PSNR, SSIM, FID, LPIPS, IoU on masks, and user-preference studies (Zhang et al., 2024, Yang et al., 2024, Chen et al., 22 Feb 2026, Suzuki et al., 29 Sep 2025).
Layer Edit Distance: The DTW+layer-edit protocol corrects for over- or under-splitting of layers by allowing merge operations and aligning predicted and ground-truth layer sequences via dynamic programming (Suzuki et al., 29 Sep 2025).
Prompt-Conditioned Scores: Human-Preference-Aligned (HPA) metrics aggregate per-sample LPIPS (visible region preservation), CLIP-based completion scores (occlusion infilling), and FID (compositional fidelity), correlating strongly with human judgments (Chen et al., 22 Feb 2026).
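
To illustrate the dynamic-programming alignment mentioned for the layer-edit protocol, here is a generic dynamic time warping (DTW) sketch over two ordered layer sequences. It is not the evaluation code of the cited work: the per-layer distance (mean absolute RGBA difference) and the function name `dtw_align` are assumptions, and the merge/edit step of the protocol is not reproduced.

```python
import numpy as np

def dtw_align(pred_layers, gt_layers):
    """Align two ordered layer sequences with classic DTW.

    Each layer is an (H, W, 4) RGBA array. Returns (total_cost, path), where
    path is a list of (pred_index, gt_index) pairs. One layer may match several
    layers of the other sequence, which absorbs over- or under-splitting.
    """
    n, m = len(pred_layers), len(gt_layers)
    dist = np.array([[np.abs(p - g).mean() for g in gt_layers] for p in pred_layers])
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the end to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1, j - 1], i - 1, j - 1),
                 (cost[i - 1, j], i - 1, j),
                 (cost[i, j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda s: s[0])
    return cost[n, m], path[::-1]

# Toy example: the prediction splits the first ground-truth layer in two.
a, b = np.zeros((2, 2, 4)), np.ones((2, 2, 4))
total, path = dtw_align([a, a, b], [a, b])
```

In the toy example both copies of `a` in the prediction align to the single ground-truth `a`, so an over-split prediction is not penalized merely for its layer count.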

6. Variants, Ablation Insights, and Failure Modes

CLD research has yielded the following empirical and architectural findings:

Parameter-Free vs. Learnable Decomposition: Channel-split CLD modules introduce essentially zero parameters, while alternatives (e.g., per-branch CNN or transformer, or mixing networks) offer modest improvements at nonzero parameter cost; best observed gains are within 0.2 dB PSNR on standard restoration tasks (Zhang et al., 2024).
Effect of Losses and Condition Guidance: Removing perceptual losses, condition guidance, or layer-composite auxiliary targets generally harms layer separation and structural fidelity (Liu et al., 20 Nov 2025).
Layer Interference and Layer Embedding: Lightweight layer semantic embeddings (LSE) in the transformer backbone are essential to prevent feature bleeding across layers. Removing them introduces cross-talk that degrades both semantic clarity and visual sharpness (Zhang et al., 16 Mar 2026).
Typical Failure Modes: Persistent limitations include the handling of extremely fine or highly occluded components (e.g., tiny icons, intricate text), which is constrained by token resolution, training data diversity, and strong dependence on the generative prior or matting model. Severe occlusion may preclude accurate hallucination even with prompt cues (Liu et al., 20 Nov 2025, Yang et al., 2024).
Domain Specificity: Generic matting or inpainting models adapt poorly to highly stylized, synthetic, or structured domains such as anime illustration; workflow-aware or palette-informed methods (e.g., LayerD (Suzuki et al., 29 Sep 2025), Workflow-Aware SLD (Zhang et al., 16 Mar 2026)) outperform object-based strategies.

7. Applications and Future Directions

CLD unlocks a range of practical and research applications:

Editable Design Layer Recovery: Real-world integration into tools like PowerPoint and Photoshop allows non-destructive, layer-level editing of previously flattened graphics (Liu et al., 20 Nov 2025, Suzuki et al., 29 Sep 2025).
Generative and Layer-wise Editing: Layered outputs support prompt-based addition, removal, or replacement of scene instances and effects, facilitating semantic manipulation for creator and AI-augmented workflows (Tudosiu et al., 2024, Kang et al., 2 Jan 2025).
Structured Illustration and Synthesis: In stylized domains, CLD matches human production layers (e.g., line, color, highlight, shadow), supporting recoloring, relighting, and animation with consistent lighting (Zhang et al., 16 Mar 2026).
Benchmarks for Composable Diffusion/Editing Models: Large-scale datasets such as RefLade (Chen et al., 22 Feb 2026) provide diverse prompt-conditioned layered ground truth for training next-generation compositional models.

Possible extensions include video layer decomposition using temporal consistency, interactive or region-local iterative editing, explicit modeling of complex effects (smoke, interreflections), and integration with multimodal understanding for richer text/scene control (Yang et al., 2024, Chen et al., 22 Feb 2026, Liu et al., 20 Nov 2025).

References

"Strong and Controllable Blind Image Decomposition" (Zhang et al., 2024)
"LayerD: Decomposing Raster Graphic Designs into Layers" (Suzuki et al., 29 Sep 2025)
"MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation" (Tudosiu et al., 2024)
"LayeringDiff: Layered Image Synthesis via Generation, then Disassembly with Generative Knowledge" (Kang et al., 2 Jan 2025)
"Controllable Attention for Structured Layered Video Decomposition" (Alayrac et al., 2019)
"Authoring image decompositions with generative models" (Rock et al., 2016)
"Controllable Layer Decomposition for Reversible Multi-Layer Image Generation" (Liu et al., 20 Nov 2025)
"From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition" (Chen et al., 26 Nov 2025)
"Workflow-Aware Structured Layer Decomposition for Illustration Production" (Zhang et al., 16 Mar 2026)
"Generative Image Layer Decomposition with Visual Effects" (Yang et al., 2024)
"Referring Layer Decomposition" (Chen et al., 22 Feb 2026)