- The paper introduces a layered canvas and locking mechanism enabling interactive spatial control and scalable multi-subject T2I generation.
- It employs position-aware latent encoding with transparent latent pruning to preserve identity and resolve occlusion ambiguities in personalized scenes.
- Experimental results demonstrate superior performance in identity preservation, text adherence, and user satisfaction over baseline methods.
LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas
Introduction and Motivation
LayerComposer introduces a new paradigm for personalized text-to-image (T2I) generation, specifically targeting the limitations of existing diffusion-based personalization methods in spatial control and multi-subject scalability. Prior approaches, such as DreamBooth, IP-Adapter, and ControlNet, either require auxiliary control maps or incur memory and compute costs that grow linearly with the number of personalized subjects. LayerComposer addresses these bottlenecks with a layered canvas representation and a locking mechanism, enabling users to interactively place, resize, and lock subjects on a canvas, akin to professional image-editing workflows.
Figure 1: LayerComposer enables interactive multi-subject T2I generation with spatial placement, resizing, and locking, providing a Photoshop-like user experience.
Layered Canvas Representation and Locking Mechanism
The core innovation is the layered canvas, where each subject (and optionally the background) is encoded as a distinct RGBA layer. The alpha channel provides a spatial mask, and a binary locking flag determines whether the subject should be strictly preserved or allowed to adapt to the scene context. This design resolves occlusion ambiguities and decouples the conditioning sequence length from the number of subjects, enabling efficient scaling.
The locking mechanism leverages positional embeddings: locked layers share positional indices with the noisy latent tokens, ensuring high-fidelity preservation, while unlocked layers receive unique indices to avoid appearance mixing in overlapping regions. This is achieved without architectural changes, relying on a model-data co-design and a complementary data sampling strategy during training.
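To make the index assignment concrete, here is a minimal PyTorch sketch, assuming each layer contributes latent tokens on an h×w grid; the `Layer` container and `build_position_ids` helper are illustrative names, not the authors' code.

```python
import torch
from dataclasses import dataclass

@dataclass
class Layer:
    tokens: torch.Tensor  # (h*w, d) VAE latent tokens for one RGBA layer
    alpha: torch.Tensor   # (H, W) alpha mask in [0, 1]
    locked: bool          # True: strictly preserve this subject

def build_position_ids(layers, h, w):
    """Assign a (layer_index, x, y) triplet to every latent token.

    Locked layers reuse index 0 -- the same plane as the noisy latent
    tokens -- so they are treated as pixel-aligned content to preserve.
    Each unlocked layer gets a fresh index j >= 1, which keeps
    overlapping subjects from mixing appearance.
    """
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs.flatten(), ys.flatten()], dim=-1)  # (h*w, 2)
    pos_ids, next_free = [], 1
    for layer in layers:
        if layer.locked:
            j = 0
        else:
            j, next_free = next_free, next_free + 1
        idx = torch.full((h * w, 1), j, dtype=grid.dtype)
        pos_ids.append(torch.cat([idx, grid], dim=-1))        # (h*w, 3)
    return pos_ids
```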
Figure 2: Locking-aware data sampling: locked layers are sampled from the target image for pixel alignment, while unlocked layers are sampled from other images of the same identity, enabling controlled variation.
Model Architecture and Training Pipeline
LayerComposer builds on a pretrained latent-based diffusion transformer (DiT). The pipeline consists of:
- Layer Latent Extraction: Each RGBA layer is encoded via a VAE to obtain latent tokens.
- Positional Embedding: Locked layers use [0, x, y] positional embeddings; unlocked layers use [j, x, y], where j ≥ 1 is a unique per-layer index.
- Transparent Latent Pruning: Only tokens corresponding to non-transparent regions are retained, improving scalability.
- Layer Conditioning Integration: Pruned latents from all layers are concatenated with the noisy image latents to condition the DiT (see the sketch after this list).
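A rough sketch of the pruning and conditioning steps, reusing the `Layer` container from the sketch above; the downsampling choice and threshold are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn.functional as F

def prune_transparent(tokens, pos_ids, alpha, h, w, thresh=0.0):
    """Drop tokens from fully transparent regions of a layer.

    The full-resolution alpha mask is downsampled to the latent grid;
    only tokens whose alpha exceeds the threshold survive, so the
    conditioning length scales with visible content, not layer count.
    """
    alpha_lat = F.interpolate(alpha[None, None], size=(h, w), mode="area")
    keep = alpha_lat.flatten() > thresh
    return tokens[keep], pos_ids[keep]

def build_conditioning(layers, pos_ids_per_layer, noisy_latents, h, w):
    """Concatenate pruned tokens from every layer after the noisy latents."""
    toks, ids = [], []
    for layer, pos_ids in zip(layers, pos_ids_per_layer):
        t, p = prune_transparent(layer.tokens, pos_ids, layer.alpha, h, w)
        toks.append(t)
        ids.append(p)
    return torch.cat([noisy_latents] + toks, dim=0), torch.cat(ids, dim=0)
```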
Training employs a locking-aware data sampling strategy, with LoRA adapters finetuned on the DiT backbone using a flow matching loss. The dataset comprises 32M images across 6M scenes, with instance segmentation to extract human subjects and backgrounds.
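The flow matching objective itself is standard; below is a generic rectified-flow-style sketch rather than the authors' exact formulation, with an illustrative `model` signature standing in for the LoRA-adapted DiT.

```python
import torch

def flow_matching_loss(model, x1, cond_tokens, cond_pos_ids):
    """Generic flow matching loss: regress the velocity that moves a
    Gaussian sample x0 toward clean latents x1 along a straight path."""
    x0 = torch.randn_like(x1)                          # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)      # timestep per sample
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    x_t = (1.0 - t_) * x0 + t_ * x1                    # interpolated point
    v_target = x1 - x0                                 # straight-line velocity
    v_pred = model(x_t, t, cond_tokens, cond_pos_ids)  # LoRA-adapted DiT
    return ((v_pred - v_target) ** 2).mean()
```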
Figure 3: LayerComposer pipeline: text prompt and layered canvas are encoded, positional embeddings are assigned based on locking status, and transparent latent pruning enables scalable multi-subject conditioning.
Experimental Evaluation
LayerComposer is evaluated on 1P, 2P, and 4P personalization benchmarks, using FFHQ-in-the-wild identities and diverse prompts. Metrics include ArcFace for identity preservation, VQAScore for text alignment, HPSv3 for image quality, and user studies for overall preference.
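For orientation, identity preservation with ArcFace is conventionally scored as cosine similarity between face embeddings of the reference and the generated subject; a minimal sketch, assuming embeddings have already been extracted:

```python
import torch
import torch.nn.functional as F

def identity_score(ref_embeds: torch.Tensor, gen_embeds: torch.Tensor) -> float:
    """Mean cosine similarity between reference and generated face
    embeddings (one row per subject); higher means identity is better
    preserved under an ArcFace-style metric."""
    sims = F.cosine_similarity(ref_embeds, gen_embeds, dim=-1)
    return sims.mean().item()
```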
4P Personalization: Competing methods (FLUX Kontext, Overlay Kontext, Qwen-Image-Edit, Gemini 2.5 Flash Image) frequently fail to preserve all subjects or introduce artifacts, especially under occlusion. LayerComposer consistently generates coherent, high-fidelity images, robustly handling occlusions and spatial arrangements.
Figure 4: In 4P personalization, LayerComposer preserves identities and spatial layout, outperforming baselines that distort or omit subjects, especially under occlusion.
2P Personalization: Baselines (UniPortrait, StoryMaker, UNO, OmniGen2) often miss, duplicate, or distort subjects. LayerComposer produces natural interactions and distinct identities, with superior user preference and identity scores.
Figure 5: In 2P personalization, LayerComposer generates coherent scenes with natural subject interaction and identity preservation, unlike baselines with missing or duplicated subjects.
1P Personalization: Existing methods tend to copy-paste the reference face with limited flexibility. LayerComposer achieves realistic outputs faithful to both identity and prompt, supporting diverse expressions and activities.
Figure 6: In 1P personalization, LayerComposer captures diverse expressions and activities, avoiding copy-paste artifacts seen in other methods.
Ablation Studies and Analysis
Ablation studies demonstrate the effectiveness of the locking mechanism and layered canvas. Progressive locking preserves subject pose and appearance with only necessary lighting adjustments, while unlocked layers adapt flexibly. Without the layered canvas, occlusion leads to missing details; the layered approach resolves these issues.
Figure 7: Ablation: locking preserves selected subjects with minimal changes; layered canvas resolves occlusion, preventing loss of overlapping details.
LayerComposer also supports seamless integration of backgrounds as additional layers, enabling natural human-background interactions under coherent lighting.
Figure 8: Layered canvas with background: humans interact naturally with the background, maintaining coherent lighting and spatial relationships.
Quantitative Results
LayerComposer achieves top-2 image quality (HPSv3) across all benchmarks and substantially outperforms other methods in multi-subject identity preservation (ArcFace). Notably, its 1P ArcFace scores are lower than those of copy-paste-style baselines because it generates greater pose and expression diversity, yet it leads in prompt adherence (VQAScore) and user preference.
Limitations and Future Directions
LayerComposer's performance degrades for scenes with more than four subjects, primarily due to limited training data and base-model robustness in such settings. It also struggles with complex spatial reasoning, such as naturally placing subjects in challenging backgrounds. Future work should integrate vision-language models for enhanced reasoning and expand high-quality multi-subject datasets.
Conclusion
LayerComposer establishes a scalable, interactive framework for personalized T2I generation, introducing a layered canvas and locking mechanism for fine-grained spatial and identity control. It demonstrates superior compositional fidelity and robustness across multi-subject scenarios, setting a new standard for interactive generative workflows. The layered canvas paradigm is extensible to broader compositional and editing tasks, with future research needed to address reasoning and scalability beyond four subjects.