CreatiLayout-AM: Occlusion-Aware Layout Generation
- CreatiLayout-AM is a training-based, amodal-mask supervised variant that refines the CreatiLayout-SD3 backbone to resolve dense overlap challenges in layout-to-image generation.
- It employs explicit mask-level supervision on instance attention to mitigate object blending, spatial ambiguity, and visual distortion in occluded scenes.
- Empirical results on OverLayBench show improved overlap fidelity, most clearly in O-mIoU on the simple split (32.52% to 37.69%), with gains shrinking as overlap complexity increases.
CreatiLayout-AM is a training-based, amodal-mask–supervised variant of the CreatiLayout family for layout-to-image generation, introduced alongside OverLayScore and OverLayBench to address dense-overlap failure modes such as object blending, spatial ambiguity, and visual distortion. It is designed for layouts in which bounding boxes have large spatial overlaps and the overlapping instances are semantically similar, and it uses explicit mask-level supervision to make a diffusion Transformer more robust to occlusion, large overlapping regions, and minimal semantic distinction among overlapping instances (Li et al., 23 Sep 2025).
1. Definition, provenance, and nomenclature
CreatiLayout-AM was introduced in "OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps" as an initial, simple baseline within a broader framework for evaluating and improving layout-to-image generation under realistic, occlusion-heavy settings (Li et al., 23 Sep 2025). In that framework, the model is not presented as a separate family of generative architectures, but as a supervised variant of CreatiLayout-SD3 that incorporates amodal masks during training.
A common source of confusion is terminological. The earlier paper "CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation" introduces CreatiLayout, SiamLayout, Layout Adapter, and M3-Attention, but it does not define a module or variant named “AM” (Zhang et al., 2024). In the OverLayBench setting, “AM” denotes the amodal-mask–supervised extension rather than a component from the original CreatiLayout paper. This clarification matters because the original CreatiLayout contribution is architectural—layout as an explicit modality in an MM-DiT—whereas CreatiLayout-AM is specifically an occlusion-aware training refinement built on that base.
The underlying backbone remains CreatiLayout-SD3, described as a Siamese multimodal Diffusion Transformer. Layouts are encoded through explicit layout tokens for each instance, combining bounding box geometry and instance caption tokens, while the DiT conditions generation on both global and instance-specific descriptions. Cross-attention maps between image tokens and instance-specific layout tokens provide interpretable, instance-wise attention signals, which become the locus of the amodal supervision (Li et al., 23 Sep 2025).
2. Problem setting and overlap complexity
The model is motivated by a specific weakness in layout-to-image generation: performance degrades when bounding boxes overlap substantially and when the overlapping instances are minimally distinguishable at the semantic level, as in cases such as “two men” or “two cars” (Li et al., 23 Sep 2025). Existing methods often treat layout constraints as soft, guidance-only signals or lack occlusion-aware mechanisms, and existing benchmarks derived from COCO-style subsets, HiCo-7k, and LayoutSAM are described as biased toward low-overlap scenes. In consequence, they do not reliably diagnose or stress-test dense-overlap failure modes.
OverLayScore is proposed to quantify the difficulty of overlapping layouts by jointly aggregating spatial overlap and semantic similarity across all overlapping pairs:
$$\mathrm{OverLayScore} = \sum_{(i,j):\ \mathrm{IoU}(B_i, B_j) > 0} \mathrm{IoU}(B_i, B_j) \cdot \mathrm{Sim}(p_i, p_j),$$
where a layout contains $N$ objects, object $i$ has instance caption $p_i$ and normalized bounding box $B_i$, and $\mathrm{Sim}(p_i, p_j)$ is the CLIP-based cosine similarity between the embeddings of instance captions $p_i$ and $p_j$ (Li et al., 23 Sep 2025). Because boxes are normalized to image coordinates and IoU normalizes overlap by union area, the metric captures what the paper calls spatial-semantic entanglement: large IoU and high semantic similarity both increase difficulty.
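As a rough illustration, the score can be sketched in Python under the assumption that it sums, over every pair of boxes with nonzero IoU, the product of the pair's IoU and its caption similarity. The names `iou` and `overlay_score` are illustrative, and the similarity matrix is passed in precomputed here instead of being obtained from a CLIP text encoder.

```python
from itertools import combinations

def iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1) in normalized coordinates."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def overlay_score(boxes, sim):
    """Sum IoU * similarity over all unordered pairs with nonzero IoU.

    sim[i][j] is assumed to hold a precomputed caption similarity
    (a stand-in for the CLIP-based cosine similarity).
    """
    total = 0.0
    for i, j in combinations(range(len(boxes)), 2):
        overlap = iou(boxes[i], boxes[j])
        if overlap > 0:
            total += overlap * sim[i][j]
    return total

# Two heavily overlapping, semantically similar boxes plus one isolated box.
boxes = [(0.0, 0.0, 0.5, 0.5), (0.25, 0.0, 0.75, 0.5), (0.8, 0.8, 1.0, 1.0)]
sim = [[1.0, 0.9, 0.2], [0.9, 1.0, 0.3], [0.2, 0.3, 1.0]]
score = overlay_score(boxes, sim)
```

Only the first pair contributes in this toy layout: its IoU is 1/3 and its similarity is 0.9, so the isolated third box leaves the score unchanged.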
Empirically, higher OverLayScore correlates strongly with lower generation fidelity. The reported degradation appears in both spatial metrics, such as mIoU and O-mIoU, and in qualitative artifacts across representative layout-to-image models including GLIGEN, InstanceDiffusion, and CreatiLayout (Li et al., 23 Sep 2025). This motivates stratification into simple, regular, and complex overlap regimes and provides the evaluation context in which CreatiLayout-AM is positioned.
3. Architectural basis and amodal-mask supervision
CreatiLayout-AM retains the CreatiLayout-SD3 conditioning interface: a global caption plus a set of instance tuples formed by bounding boxes and instance captions. The distinctive addition is amodal supervision during training only. An amodal mask represents the complete object extent, including occluded parts. The rationale given in the paper is that, under dense overlaps, forcing attention to align to the full object shape mitigates blending and encourages instance separation and correct layering (Li et al., 23 Sep 2025).
The supervision acts directly on instance-wise attention maps. Let $A_i(u)$ denote the attention map between image tokens and the layout token for instance $i$, where $u$ indexes spatial token positions, and let $M_i$ be the corresponding amodal mask. Two alignment objectives are added. The first is a token-level alignment loss, described as a normalized overlap of attention and mask that encourages the attention mass to lie inside the amodal extent. The second is a pixel-level alignment loss, defined as binary cross-entropy on normalized attention values treated as probabilities. These losses are combined with the standard denoising objective of the latent diffusion or DiT backbone, weighted by coefficients $\lambda_{\text{token}}$ and $\lambda_{\text{pixel}}$ in CreatiLayout-AM (Li et al., 23 Sep 2025).
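The two alignment terms can be sketched as follows. This is a minimal illustration and not the paper's implementation: the exact functional forms (one minus the fraction of attention mass inside the mask, and binary cross-entropy on min-max normalized attention), the helper names, and the weights `w_token` and `w_pixel` are all assumptions made for the example. Attention maps and masks are flattened to plain lists.

```python
import math

def token_alignment_loss(attn, mask):
    """Assumed form: 1 - (attention mass inside the mask) / (total mass)."""
    total = sum(attn)
    if total == 0:
        return 1.0  # no attention mass at all, so none lies inside the mask
    inside = sum(a * m for a, m in zip(attn, mask))
    return 1.0 - inside / total

def pixel_alignment_loss(attn, mask, eps=1e-6):
    """Assumed form: mean BCE between min-max normalized attention and mask."""
    lo, hi = min(attn), max(attn)
    span = hi - lo
    probs = [(a - lo) / span if span > 0 else 0.0 for a in attn]
    loss = 0.0
    for p, m in zip(probs, mask):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        loss -= m * math.log(p) + (1 - m) * math.log(1.0 - p)
    return loss / len(probs)

def am_objective(denoise_loss, attn_maps, masks, w_token, w_pixel):
    """Denoising loss plus weighted alignment terms averaged over instances."""
    n = len(attn_maps)
    tok = sum(token_alignment_loss(a, m) for a, m in zip(attn_maps, masks)) / n
    pix = sum(pixel_alignment_loss(a, m) for a, m in zip(attn_maps, masks)) / n
    return denoise_loss + w_token * tok + w_pixel * pix

mask = [0, 1, 1, 0]           # amodal extent of one instance
good = [0.0, 0.4, 0.6, 0.0]   # attention concentrated inside the mask
bad = [0.5, 0.5, 0.0, 0.0]    # half of the attention leaks outside
# The weights below are placeholders, not values taken from the paper.
total = am_objective(0.1, [good], [mask], w_token=1.0, w_pixel=1.0)
```

In this toy case the token-level term is zero for `good` and 0.5 for `bad`, and the pixel-level term is likewise smaller for `good`, which is the behaviour the supervision is meant to reward.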
The design choice is operationally significant because no amodal masks are required at inference time. The supervision shapes the training dynamics of instance-wise attention maps, but deployment uses the same inputs as the base CreatiLayout model: the global caption, per-instance captions, and bounding boxes. This preserves the original inference interface while introducing overlap awareness during optimization (Li et al., 23 Sep 2025).
The same amodal-attention alignment idea is also applied to EliGen as an auxiliary baseline, denoted EliGen-AM. Because EliGen lacks explicit per-instance layout tokens, instance-level attention is approximated by averaging attention across the local description tokens for each instance before applying the same AM losses (Li et al., 23 Sep 2025). This indicates that the method is not tied to a single architecture.
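The averaging step can be pictured with a short sketch. The function name `instance_attention` and the flattened-list representation are assumptions for illustration; how EliGen exposes its attention maps is not described here.

```python
def instance_attention(token_maps):
    """Average several per-token attention maps into one instance-level map.

    token_maps: list of equal-length lists, one per description token.
    """
    n = len(token_maps)
    return [sum(values) / n for values in zip(*token_maps)]

# Attention of two description tokens (e.g. "red", "car") over two positions.
averaged = instance_attention([[0.2, 0.8], [0.4, 0.6]])
```

The averaged map can then be fed to the same alignment losses that a model with explicit layout tokens would apply to its per-instance attention.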
4. Curated amodal-mask data and optimization procedure
The AM training set is constructed through synthetic occlusion generation. For each Flux-generated image, SAM 2 is used to extract amodal object masks, and objects are cropped into an object-mask pool. Random objects are then pasted into target images at locations chosen to create overlaps, thereby producing controlled occlusions. RealVisXL_V5.0_Lightning is used for object removal when needed during dataset construction, and Qwen2.5-VL-32B generates global and instance captions for original and pasted objects so that textual conditioning reflects the overlap configuration (Li et al., 23 Sep 2025).
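The reason pasting yields amodal supervision can be shown with a toy mask example: the occluded object's full mask is known before the paste, so it remains available as the amodal target even though part of it is no longer visible. The sketch below works on small binary grids; the helper names are hypothetical and it does not reproduce the SAM 2 or image-compositing steps.

```python
def paste_mask(shape, obj_mask, top, left):
    """Place a small binary mask on an empty canvas of size (height, width)."""
    height, width = shape
    canvas = [[0] * width for _ in range(height)]
    for r, row in enumerate(obj_mask):
        for c, value in enumerate(row):
            rr, cc = top + r, left + c
            if value and 0 <= rr < height and 0 <= cc < width:
                canvas[rr][cc] = 1
    return canvas

def visible_part(amodal, occluder):
    """Pixels of `amodal` that stay visible once `occluder` is drawn on top."""
    return [[a & (1 - o) for a, o in zip(row_a, row_o)]
            for row_a, row_o in zip(amodal, occluder)]

def area(mask):
    """Number of foreground pixels in a binary mask."""
    return sum(sum(row) for row in mask)

shape = (4, 6)
# Existing object: a 2x3 block whose full (amodal) extent is known up front.
amodal = paste_mask(shape, [[1, 1, 1], [1, 1, 1]], top=1, left=0)
# Pasted object: a 2x2 block positioned so that it covers one column of it.
occluder = paste_mask(shape, [[1, 1], [1, 1]], top=1, left=2)
visible = visible_part(amodal, occluder)
```

Here the amodal mask keeps all 6 pixels while the visible mask shrinks to 4, and the 2-pixel difference is exactly the occluded region that attention is still encouraged to cover.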
The resulting curated AM dataset contains approximately 67.8k images. CreatiLayout-SD3 is fine-tuned for 3,500 steps on 8× NVIDIA RTX A6000 (48GB) GPUs with batch size 16, AdamW, bf16 precision, a linear learning-rate scheduler with 500 warm-up steps, LoRA rank 32, and DDP. The hyperparameter table in the detailed implementation notes also lists a per-GPU batch size of 1 with 2 gradient-accumulation steps, which across 8 GPUs matches the overall batch size of 16 (Li et al., 23 Sep 2025).
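The batch-size arithmetic and the warm-up schedule can be checked with a small sketch. The function names are illustrative, and the decay behaviour after warm-up (linear decay to zero at the final step) is an assumption based on the common convention for linear schedulers; the source states only that the scheduler is linear with 500 warm-up steps. The schedule is expressed as a multiplier on the peak learning rate, which is not specified here.

```python
def effective_batch_size(num_gpus, per_gpu_batch, accum_steps):
    """Samples contributing to one optimizer step under DDP + accumulation."""
    return num_gpus * per_gpu_batch * accum_steps

def lr_multiplier(step, warmup_steps=500, total_steps=3500):
    """Linear warm-up to 1.0, then (assumed) linear decay to 0.0."""
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

batch = effective_batch_size(num_gpus=8, per_gpu_batch=1, accum_steps=2)
schedule = [lr_multiplier(s) for s in (0, 250, 500, 2000, 3500)]
```

With the reported settings the effective batch size comes out at 16, and the multiplier peaks at step 500 before falling back to zero at step 3,500 under the assumed decay.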
The benchmark with which CreatiLayout-AM is evaluated, OverLayBench, is itself curated to balance overlap difficulty. It begins from real-image captions from the COCO training set extracted with Qwen2.5-VL-7B, generates candidate images with Flux.1-dev using 28 steps, refines global captions using Qwen2.5-VL-7B, and grounds foreground instances with Qwen2.5-VL-32B. Images are kept when they contain 1–10 valid overlapping bounding-box pairs, with validity defined by an IoU threshold and a minimum intersection area expressed as a fraction of the image. Pairwise relationships between overlapping instances are extracted with Qwen and then subjected to human curation for box accuracy, caption alignment, and relationship validity (Li et al., 23 Sep 2025).
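The pair-count filter can be sketched as below. The threshold arguments `min_iou` and `min_inter_area` are hypothetical parameters, and the values used in the example are placeholders, not the benchmark's actual settings. Boxes are assumed to be normalized to the unit square, so an intersection area is already a fraction of the image.

```python
from itertools import combinations

def overlap_stats(a, b):
    """Return (IoU, intersection area) for boxes (x0, y0, x1, y1) in [0, 1]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return (inter / union if union > 0 else 0.0), inter

def count_valid_pairs(boxes, min_iou, min_inter_area):
    """Count pairs exceeding both the IoU and intersection-area thresholds."""
    count = 0
    for a, b in combinations(boxes, 2):
        pair_iou, inter = overlap_stats(a, b)
        if pair_iou > min_iou and inter > min_inter_area:
            count += 1
    return count

def keep_layout(boxes, min_iou, min_inter_area, lo=1, hi=10):
    """Keep a layout when it has between `lo` and `hi` valid pairs."""
    return lo <= count_valid_pairs(boxes, min_iou, min_inter_area) <= hi

overlapping = [(0.0, 0.0, 0.5, 0.5), (0.25, 0.0, 0.75, 0.5), (0.8, 0.8, 1.0, 1.0)]
disjoint = [(0.0, 0.0, 0.3, 0.3), (0.6, 0.6, 0.9, 0.9)]
# Placeholder thresholds chosen only to make the example concrete.
kept = keep_layout(overlapping, min_iou=0.05, min_inter_area=0.01)
dropped = keep_layout(disjoint, min_iou=0.05, min_inter_area=0.01)
```

The first layout has exactly one qualifying pair and is kept, while the second has none and is discarded.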
The final benchmark contains 4,052 layouts: 2,052 simple, 1,000 regular, and 1,000 complex. Each sample includes a global caption, instance-level captions, bounding boxes, and relationship phrases, with OverLayScore computed per example for split stratification (Li et al., 23 Sep 2025).
5. Evaluation protocol and empirical findings
OverLayBench evaluates spatial alignment, semantic alignment, relationship correctness, and distributional quality. Spatial alignment uses mIoU, with predicted boxes matched to ground truth via the Hungarian algorithm, and O-mIoU, defined as mIoU computed only over the intersection regions of overlapping instances, making it particularly sensitive to occluded areas. Semantic and relational evaluation uses CLIPScore in global and local forms with CLIP ViT-B/32, SR_E for entity correctness, SR_R for relationship correctness, and FID. For consistency, three images per layout are generated with fixed seeds 20251202, 20251203, and 20251204, and Qwen2.5-VL-32B is used for detection and QA in the computation of SR_E and SR_R (Li et al., 23 Sep 2025).
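The two spatial metrics can be illustrated with a small sketch. Several simplifications are assumed: an exhaustive search over permutations stands in for the Hungarian algorithm (in practice a routine such as SciPy's `linear_sum_assignment` would be used), at least as many boxes are detected as there are ground-truth boxes, and O-mIoU is read as the IoU between the ground-truth and predicted intersection regions of each overlapping ground-truth pair. The benchmark's exact formulation may differ.

```python
from itertools import combinations, permutations

def intersect(a, b):
    """Intersection box of a and b, or None when they do not overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x1 > x0 and y1 > y0 else None

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def iou(a, b):
    inter = intersect(a, b)
    if inter is None:
        return 0.0
    overlap = area(inter)
    return overlap / (area(a) + area(b) - overlap)

def match(gt, pred):
    """Optimal one-to-one matching by brute force (fine for a few boxes)."""
    best = max(permutations(range(len(pred)), len(gt)),
               key=lambda p: sum(iou(g, pred[k]) for g, k in zip(gt, p)))
    return list(best)

def miou(gt, pred):
    order = match(gt, pred)
    return sum(iou(g, pred[k]) for g, k in zip(gt, order)) / len(gt)

def o_miou(gt, pred):
    """Mean IoU restricted to intersection regions of overlapping GT pairs."""
    order = match(gt, pred)
    scores = []
    for i, j in combinations(range(len(gt)), 2):
        gt_region = intersect(gt[i], gt[j])
        if gt_region is None:
            continue  # only overlapping ground-truth pairs are scored
        pred_region = intersect(pred[order[i]], pred[order[j]])
        scores.append(iou(gt_region, pred_region) if pred_region else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

gt = [(0, 0, 4, 4), (2, 0, 6, 4)]
pred = [(0, 0, 4, 4), (3, 0, 7, 4)]  # second box shifted right by one unit
box_score = miou(gt, pred)
overlap_score = o_miou(gt, pred)
```

In this example a one-unit shift of a single box lowers mIoU to 0.8 but O-mIoU to 0.5, which illustrates why the overlap-restricted metric reacts more strongly to errors near occluded areas.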
Across training-based methods, spatial metrics deteriorate from the simple split to the complex split. The paper notes that DiT-based models retain stronger semantic alignment and visual quality, while CreatiLayout-FLUX and DreamRender with depth score highly overall but do not directly address amodal occlusion (Li et al., 23 Sep 2025). Within this setting, CreatiLayout-AM is evaluated directly against the CreatiLayout model using the SD3 backbone, averaged over three runs.
On the simple split, CreatiLayout-AM improves mIoU from 58.78% to 61.16% and O-mIoU from 32.52% to 37.69%, corresponding to relative gains of +4.05% and +15.90%, respectively. SR_E rises from 72.34% to 73.33%, and SR_R rises from 84.45% to 84.84%, while CLIP_Global, CLIP_Local, and FID show slight regressions: 37.29 to 37.17, 27.49 to 27.44, and 27.51 to 27.76 (Li et al., 23 Sep 2025).
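The quoted gains are relative changes with respect to the baseline value, not differences in percentage points. A short check using the simple-split numbers above (the helper name `relative_gain` is illustrative):

```python
def relative_gain(base, new):
    """Relative change from `base` to `new`, in percent."""
    return (new - base) / base * 100.0

simple_split = {"mIoU": (58.78, 61.16), "O-mIoU": (32.52, 37.69)}
gains = {name: round(relative_gain(base, new), 2)
         for name, (base, new) in simple_split.items()}
```

The absolute differences are 2.38 and 5.17 percentage points; dividing by the baselines gives the reported +4.05% and +15.90%.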
On the regular split, the same pattern persists but with smaller effect sizes. mIoU increases from 47.04% to 47.38%, O-mIoU from 20.67% to 21.79%, SR_E from 62.60% to 63.13%, and SR_R from 78.31% to 78.71%, while CLIP_Global and CLIP_Local undergo minor drops and FID changes from 45.57 to 46.34 (Li et al., 23 Sep 2025).
On the complex split, gains largely plateau. mIoU decreases slightly from 44.24% to 43.97%, O-mIoU rises marginally from 18.05% to 18.07%, SR_E increases from 52.10% to 52.49%, and SR_R decreases from 79.98% to 79.77%. CLIP_Global and CLIP_Local show small declines, and FID changes from 53.29 to 53.48 (Li et al., 23 Sep 2025). The strongest quantitative benefit therefore appears in O-mIoU for the simple and regular regimes, which the paper interprets as improved fidelity in overlap regions.
Qualitative comparisons report more coherent object separations, cleaner boundaries in overlap regions, and fewer fusion artifacts than the base CreatiLayout model. The paper attributes this to attention-driven mask alignment that facilitates rendering of occluded extents and improves perceptual realism at high OverLayScore (Li et al., 23 Sep 2025).
Ablations with EliGen-AM support the claim that amodal attention alignment is model-agnostic. Applying the same losses to EliGen yields improvements across splits: in the simple split, mIoU +2.24%, O-mIoU +6.20%, SR_E +0.38%, SR_R +0.45%, CLIP_Local +1.03%, and FID −8.45%; in the regular split, mIoU +1.50%, O-mIoU +3.74%, SR_E +0.41%, SR_R +2.43%, CLIP_Local +1.18%, and FID −4.67%; in the complex split, mIoU +1.43%, O-mIoU +1.91%, SR_E +3.09%, CLIP_Local +0.89%, and FID −2.00% (Li et al., 23 Sep 2025).
A user study over 60 image pairs, excluding “No Preference,” reports preferences for CreatiLayout-AM of 55.2% on the simple split, 51.9% on the regular split, and 46.8% on the complex split. These preferences align with the stronger quantitative gains at lower OverLayScore and the more moderate effect at the hardest difficulty level (Li et al., 23 Sep 2025).
6. Interpretation, limitations, and practical significance
The paper’s explanation for the model’s behavior is that amodal supervision regularizes instance-wise attention so that it covers full object extents rather than collapsing into shared overlap regions. On this account, the main benefit is reduced object fusion and distortion, especially at moderate OverLayScore, with O-mIoU acting as the most sensitive quantitative indicator (Li et al., 23 Sep 2025). This suggests that the method is less about improving general text-image fidelity and more about stabilizing instance separation under occlusion.
Several limitations are explicitly identified. First, the curated AM training set does not fully match the hardest OverLayBench layouts, creating a distribution shift in the complex split and leading to smaller gains or occasional regressions in mIoU and CLIP-based measures. Second, the amodal masks are synthesized through SAM 2 and compositing, so inaccuracies or biases in the synthesized occlusions may limit generalization to real occlusions. Third, fine-tuning DiT backbones with per-instance attention supervision requires multi-GPU resources and adds training overhead, even though inference remains unchanged. Fourth, the scope is limited to natural scenes and categories present in Flux-generated content, so generalization beyond benchmark categories or extreme multi-scale overlaps may be constrained (Li et al., 23 Sep 2025).
From a deployment perspective, CreatiLayout-AM preserves the same interface as CreatiLayout—bounding boxes plus captions—with minimal inference-time overhead. The paper identifies design and scene-planning tools as settings in which overlaps are common and reliable instance separation is critical (Li et al., 23 Sep 2025). A plausible implication is that the model is most useful where structural faithfulness in crowded compositions matters more than marginal changes in global CLIPScore or FID.
The broader significance of CreatiLayout-AM lies in its role as a principled baseline rather than as a complete solution to overlap-heavy layout-to-image generation. Combined with OverLayScore and OverLayBench, it operationalizes dense-overlap difficulty, provides a stratified evaluation protocol, and demonstrates that overlap-aware attention alignment via amodal masks is a practical ingredient for handling densely overlapping layouts (Li et al., 23 Sep 2025). The associated project page is https://mlpc-ucsd.github.io/OverLayBench.