OverLayBench: Dense Overlap Benchmark

Updated 4 July 2026

OverLayBench is a benchmark that systematically evaluates layout-to-image generation under dense overlapping conditions by quantifying spatial-semantic entanglement using OverLayScore.
It employs a curated COCO-based dataset with high-quality annotations to address failures like object fusion, misalignment, and blended categories.
The benchmark reveals that existing models degrade in performance on overlap-specific metrics, highlighting the need for dedicated approaches to tackle dense overlap challenges.

OverLayBench is a benchmark for layout-to-image generation focused specifically on dense, difficult overlapping layouts. It was introduced to evaluate a failure mode that prior layout-guided generation benchmarks under-test: the degradation of object distinctness, category fidelity, and spatial coherence when multiple bounding boxes overlap heavily, especially when the overlapping instances are semantically similar (Li et al., 23 Sep 2025). The benchmark is organized around OverLayScore, a spatial-semantic difficulty measure defined over overlapping object pairs, and a curated dataset with high-quality annotations and a balanced distribution across different levels of overlap complexity. In the paper’s framing, OverLayBench functions both as a diagnostic benchmark and as an empirical argument that overlap-heavy layout control remains substantially unsolved in contemporary layout-to-image systems (Li et al., 23 Sep 2025).

1. Motivation and problem setting

OverLayBench was created from the observation that existing layout-to-image methods can often satisfy coarse spatial control, yet break down when layouts contain significant overlap between bounding boxes (Li et al., 23 Sep 2025). The paper identifies two primary challenges: large overlapping regions and overlapping instances with minimal semantic distinction. Those conditions make it difficult for a generator to preserve object identity while also respecting the prescribed layout. The observed failure modes include object blending or fusion, spatial ambiguity, visual distortion, missing or duplicated objects, wrong categories, and bounding-box misalignment (Li et al., 23 Sep 2025).

The appendix groups common errors into five classes: Incorrect Object Number, Object Fusion, Object Distortion, Incorrect Category, and BBox Misalignment (Li et al., 23 Sep 2025). This error taxonomy is central to the benchmark’s purpose. OverLayBench is not intended as a general benchmark for all forms of layout guidance; it is a targeted testbed for overlap-heavy layouts in which object disentanglement becomes the dominant challenge.

A common misconception in this area is that overlap is already adequately covered by standard layout-guided benchmarks. The benchmark argues otherwise. Existing benchmarks such as COCO-based evaluation, HiCo-7k, and LayoutSAM are described as biased toward easier layouts, with relatively few examples involving severe overlap (Li et al., 23 Sep 2025). Related benchmarks such as 7Bench and the later C-Bench/O-Bench family emphasize joint semantic and spatial evaluation for layout-guided text-to-image generation, including scenarios such as small bounding boxes, overlap, attributes, and relations, but they are not centered on dense overlap as the primary stressor (Izzo et al., 18 Aug 2025, Parolari et al., 28 Apr 2026). This suggests that OverLayBench occupies a narrower but more severe regime within the broader space of layout-faithfulness evaluation.

2. Overlap difficulty and OverLayScore

The benchmark formalizes overlap difficulty through two factors. The first is geometric: larger spatial overlap makes generation harder. The second is semantic: higher semantic similarity between overlapping instances also makes generation harder (Li et al., 23 Sep 2025). The paper’s qualitative examples and quantitative analysis indicate that the combination of large overlap and high semantic similarity is particularly destructive.

To quantify this, the authors propose OverLayScore. For a layout with $K$ objects, where $p_k$ is the instance caption and $B_k$ the normalized bounding box for object $k$, the score is defined as

$\mathtt{OverLayScore} = \sum_{(i, j):\;\text{IoU}(B_i, B_j) > 0} \text{IoU}(B_i, B_j) \cdot \cos \big(\langle p_i, p_j \rangle \big),$

where $\text{IoU}(B_i, B_j)$ is the intersection-over-union between boxes $B_i$ and $B_j$, and $\cos \big(\langle p_i, p_j \rangle \big)$ is the CLIP-based cosine similarity between captions $p_i$ and $p_j$ (Li et al., 23 Sep 2025). Only pairs with positive overlap contribute, and the published formula does not include any additional normalization term.
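The definition can be illustrated with a minimal sketch. It assumes boxes are given as normalized `(x_min, y_min, x_max, y_max)` tuples and that each unordered pair is counted once; both are assumptions, not details confirmed by the paper. The function names `iou`, `overlay_score`, and `token_jaccard` are hypothetical, and `token_jaccard` is only a toy stand-in for the CLIP-based cosine similarity so that the example runs without a model.

```python
from itertools import combinations
from typing import Callable, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def overlay_score(
    boxes: Sequence[Box],
    captions: Sequence[str],
    similarity: Callable[[str, str], float],
) -> float:
    """Sum IoU * caption similarity over all pairs with positive overlap."""
    total = 0.0
    for i, j in combinations(range(len(boxes)), 2):
        overlap = iou(boxes[i], boxes[j])
        if overlap > 0:
            total += overlap * similarity(captions[i], captions[j])
    return total


def token_jaccard(a: str, b: str) -> float:
    """Toy stand-in for CLIP cosine similarity (NOT the paper's similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0


boxes = [(0.0, 0.0, 0.5, 0.5), (0.25, 0.25, 0.75, 0.75), (0.8, 0.8, 1.0, 1.0)]
captions = ["a brown dog", "a black dog", "a red kite"]
score = overlay_score(boxes, captions, token_jaccard)
print(round(score, 4))  # 0.0714: only the first two boxes overlap (IoU 1/7, similarity 0.5)
```

In a real setting the `similarity` argument would be replaced by a function that embeds both captions with a CLIP text encoder and returns their cosine similarity.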

OverLayScore is therefore a pairwise spatial-semantic entanglement score. A layout receives a higher score when many pairs overlap, when the overlaps are large, and when the instance captions are semantically close. The authors use the score both as a difficulty predictor and as a dataset diagnostic. On a subset of COCO with scenes containing 2 to 10 objects, they split layouts into simple, regular, and complex groups based on OverLayScore, sample 100 layouts per category, and evaluate GLIGEN, InstanceDiffusion, and CreatiLayout. Performance for all three consistently declines as OverLayScore increases, which the paper interprets as evidence that the metric is meaningful as a difficulty indicator (Li et al., 23 Sep 2025).

The same score is then used to compare benchmark distributions. COCO, HiCo, and LayoutSAM are reported as strongly concentrated in the low-score regime, whereas OverLayBench is designed to cover the difficulty range more evenly (Li et al., 23 Sep 2025). This is one of the benchmark’s defining claims: difficulty is not inferred post hoc from model performance, but estimated directly from layout geometry and caption similarity.

3. Dataset construction and annotation structure

OverLayBench is built through a three-stage curation pipeline: reference image generation, image grounding and captioning, and scoring and human curation (Li et al., 23 Sep 2025). The starting point is the COCO training set. The authors use Qwen2.5-VL-7B to extract image captions from COCO training images, then feed those captions into Flux.1-dev to generate new reference images. This yields about 86,000 generated images paired with captions (Li et al., 23 Sep 2025).

In the second stage, a refinement and grounding pass is applied. Because generated images may not perfectly match their input captions, another captioning pass is performed using Qwen-2.5-VL-7B to produce refined global image captions. The authors then use Qwen rather than GroundingDINO for instance grounding, extracting foreground object detections, bounding boxes, and local instance descriptions (Li et al., 23 Sep 2025). Images are retained only if they contain one to ten valid overlapping bounding-box pairs. A pair is considered valid if the IoU is greater than 5% and the intersection area exceeds 1% of the total image area. For each overlapping pair, Qwen is also prompted to generate pairwise relationship phrases describing spatial and semantic relations (Li et al., 23 Sep 2025).
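The retention rule in this stage can be sketched as follows. The sketch assumes normalized `(x_min, y_min, x_max, y_max)` boxes, so that the image area equals 1 and the intersection area is already a fraction of it. The function names and the default thresholds passed as parameters are illustrative; only the 5% IoU, 1% area, and one-to-ten pair criteria come from the text above.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


def intersection_area(a: Box, b: Box) -> float:
    """Area of the intersection rectangle (0.0 if the boxes are disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def box_area(a: Box) -> float:
    return (a[2] - a[0]) * (a[3] - a[1])


def valid_overlap_pairs(
    boxes: Sequence[Box], iou_min: float = 0.05, inter_min: float = 0.01
) -> List[Tuple[int, int]]:
    """Index pairs whose IoU exceeds iou_min and whose intersection covers
    more than inter_min of the (unit-area) image."""
    pairs = []
    for i, j in combinations(range(len(boxes)), 2):
        inter = intersection_area(boxes[i], boxes[j])
        if inter == 0.0:
            continue
        union = box_area(boxes[i]) + box_area(boxes[j]) - inter
        if inter / union > iou_min and inter > inter_min:
            pairs.append((i, j))
    return pairs


def keep_image(boxes: Sequence[Box], min_pairs: int = 1, max_pairs: int = 10) -> bool:
    """Retain an image only if it has between min_pairs and max_pairs valid pairs."""
    return min_pairs <= len(valid_overlap_pairs(boxes)) <= max_pairs


layout = [
    (0.0, 0.0, 0.5, 0.5),
    (0.25, 0.25, 0.75, 0.75),  # overlaps the first box substantially
    (0.74, 0.74, 0.9, 0.9),    # touches the second box by only a sliver
]
print(valid_overlap_pairs(layout))  # [(0, 1)]
print(keep_image(layout))           # True
```

The third box shows why both thresholds matter: its overlap with the second box is positive but far too small to count as a valid overlapping pair.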

The third stage is human curation and balancing. The paper states that the authors manually verify bounding box accuracy, alignment between image content and the global caption, alignment between image content and local captions, and validity of relationship descriptions, with the explicit aim of keeping the benchmark free from hallucinations (Li et al., 23 Sep 2025). A custom Web-UI is used to assist auditing by displaying the image, image caption, bounding boxes, instance captions, and relationship captions.

After validation and filtering, the final benchmark contains 4,052 layouts distributed across three difficulty-based subsets (Li et al., 23 Sep 2025).

Split	Count
Simple	2,052
Regular	1,000
Complex	1,000

Each example includes a global image caption, instance-level captions, bounding boxes, and pairwise relationship phrases for overlapping instances (Li et al., 23 Sep 2025). The text provided does not report a train/validation/test partition for OverLayBench itself; the benchmark is presented primarily as an evaluation benchmark with Simple, Regular, and Complex difficulty-based subsets.

4. Evaluation protocol and benchmark metrics

OverLayBench evaluates whether layout-to-image models can faithfully generate images from layouts under overlap-heavy conditions. For each layout and method, the protocol generates 3 images using fixed seeds 20251202, 20251203, and 20251204 (Li et al., 23 Sep 2025). Standard mIoU uses the Hungarian algorithm to match each ground-truth box with predicted boxes. Predicted boxes and VQA-style judgments are obtained using Qwen-2.5-VL-32B (Li et al., 23 Sep 2025).
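A small sketch of matched mIoU may help make the protocol concrete. To stay dependency-free it finds the optimal one-to-one assignment by brute force over permutations, which gives the same optimum as the Hungarian algorithm on small inputs but is not the Hungarian algorithm itself; in practice a solver such as SciPy's `linear_sum_assignment` would be the usual choice. The box format, the function name `matched_miou`, and the convention that an unmatched ground-truth box scores 0 are assumptions.

```python
from itertools import permutations
from typing import List, Optional, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def matched_miou(gt_boxes: Sequence[Box], pred_boxes: Sequence[Box]) -> float:
    """Mean IoU under the best one-to-one matching of ground truth to predictions.

    Brute force over assignments (exponential cost, small layouts only).
    Ground-truth boxes left without a prediction contribute an IoU of 0.
    """
    n = len(gt_boxes)
    if n == 0:
        return 0.0
    # Pad with None so every ground-truth box can be assigned "nothing".
    candidates: List[Optional[Box]] = list(pred_boxes)
    candidates += [None] * max(0, n - len(pred_boxes))
    best = 0.0
    for assignment in permutations(range(len(candidates)), n):
        total = 0.0
        for gt, k in zip(gt_boxes, assignment):
            pred = candidates[k]
            if pred is not None:
                total += iou(gt, pred)
        best = max(best, total)
    return best / n


gt = [(0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0)]
pred = [(0.5, 0.5, 1.0, 1.0), (0.0, 0.0, 0.5, 0.4)]  # detections arrive in arbitrary order
print(round(matched_miou(gt, pred), 2))      # 0.9
print(round(matched_miou(gt, pred[:1]), 2))  # 0.5: one ground-truth box is unmatched
```

Matching before averaging matters because a detector returns boxes in no particular order; comparing boxes by index would penalize a generation that is in fact well aligned.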

The benchmark reports both standard and overlap-specific metrics. Standard mIoU measures mean IoU between matched predicted and ground-truth boxes. CLIP is reported as $\text{CLIP}_{\text{Global}}$ and $\text{CLIP}_{\text{Local}}$ using pretrained CLIP ViT-B/32. FID is used for image quality (Li et al., 23 Sep 2025).

Two metrics are specifically designed to expose overlap failures. The first is O-mIoU, or Overlap-mIoU, which computes mIoU within the ground-truth overlap regions and the corresponding predicted regions. In the appendix it is further described as mIoU over the cropped intersection region between two related instances (Li et al., 23 Sep 2025). The second is $\text{SR}_R$, the Success Rate of Relationship, defined as the percentage of object pairs whose predicted relationship matches the ground truth. The entity-level analogue is $\text{SR}_E$, the percentage of instances whose generated appearance matches the instance-level description (Li et al., 23 Sep 2025).
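One plausible reading of the overlap-specific metrics can be sketched as below. It is an interpretation under stated assumptions, not the paper's reference implementation: predicted boxes are assumed to be already matched to ground-truth boxes by index, the "overlap region" of a pair is taken to be the intersection rectangle of its two boxes, and a pair whose predicted boxes do not overlap scores 0. The helper names `intersection_box`, `overlap_miou`, and `success_rate` are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def intersection_box(a: Box, b: Box) -> Optional[Box]:
    """Intersection rectangle of two boxes, or None if they are disjoint."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    if x1 <= x0 or y1 <= y0:
        return None
    return (x0, y0, x1, y1)


def overlap_miou(
    gt_boxes: Sequence[Box],
    pred_boxes: Sequence[Box],
    pairs: Sequence[Tuple[int, int]],
) -> float:
    """Mean IoU between ground-truth and predicted intersection regions."""
    scores = []
    for i, j in pairs:
        gt_region = intersection_box(gt_boxes[i], gt_boxes[j])
        if gt_region is None:
            continue  # not an overlapping pair in the ground truth
        pred_region = intersection_box(pred_boxes[i], pred_boxes[j])
        scores.append(iou(gt_region, pred_region) if pred_region else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def success_rate(judgments: Sequence[bool]) -> float:
    """Percentage of True judgments (e.g. per-pair or per-instance VQA answers)."""
    return 100.0 * sum(judgments) / len(judgments) if judgments else 0.0


gt = [(0.0, 0.0, 0.6, 0.6), (0.4, 0.4, 1.0, 1.0)]
pred = [(0.0, 0.0, 0.6, 0.6), (0.5, 0.4, 1.0, 1.0)]  # second box shifted right
print(round(overlap_miou(gt, pred, [(0, 1)]), 2))  # 0.5
print(success_rate([True, True, False, True]))     # 75.0
```

In this toy case the second predicted box is shifted only slightly, yet half of the overlap region is lost, which is the kind of local failure a whole-box average can hide.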

This metric suite reflects a specific design position. Global metrics such as mIoU and CLIP can register acceptable average performance even when the most entangled overlap regions are rendered poorly. O-mIoU isolates exactly those regions. A plausible implication is that OverLayBench treats overlap not as a nuisance variable within ordinary grounding evaluation, but as a primary locus of error deserving dedicated measurement. This differs from benchmarks such as 7Bench and C-Bench/O-Bench, whose principal evaluative abstraction is joint semantic and spatial alignment at the image level through TIFA-style semantic scoring and IoU-threshold AUC layout scoring (Izzo et al., 18 Aug 2025, Parolari et al., 28 Apr 2026).

5. Empirical findings and benchmark behavior

The principal empirical result is that all existing methods degrade as overlap complexity increases, especially on spatial overlap metrics (Li et al., 23 Sep 2025). The paper evaluates training-based models including GLIGEN, InstanceDiff, MIGC, HiCo, 3DIS, CreatiLayout-SD3, CreatiLayout-FLUX, EliGen, and DreamRender, and also reports several training-free methods in the appendix (Li et al., 23 Sep 2025).

Across Simple $\rightarrow$ Regular $\rightarrow$ Complex, all models lose performance, particularly on mIoU, O-mIoU, and $p_k$ 7 (Li et al., 23 Sep 2025). The decline is illustrated explicitly for several models. GLIGEN drops from mIoU 60.54 and O-mIoU 36.22 on Simple to 50.79 and 23.85 on Complex. InstanceDiff drops from 71.21 and 49.99 to 53.68 and 25.63. CreatiLayout-FLUX drops from 71.17 and 49.80 to 54.50 and 28.97 (Li et al., 23 Sep 2025). The paper emphasizes O-mIoU as especially informative because it focuses on overlap regions.

No single model dominates every metric. On OverLayBench-Simple, InstanceDiff has the best mIoU at 71.21 and the best O-mIoU at 49.99, DreamRender has the best $p_k$ 8 at 88.80 and the best CLIP $p_k$ 9 at 30.11, while CreatiLayout-FLUX has the best $B_k$ 0 at 90.87 and the best FID at 23.79 (Li et al., 23 Sep 2025). On OverLayBench-Regular, InstanceDiff has the best mIoU at 60.08, CreatiLayout-FLUX has the best O-mIoU at 35.51, DreamRender has the best $B_k$ 1 at 83.52, and CreatiLayout-FLUX again has the best $B_k$ 2 at 86.39 and FID at 41.51 (Li et al., 23 Sep 2025). On OverLayBench-Complex, CreatiLayout-FLUX leads on mIoU at 54.50 and O-mIoU at 28.97, DreamRender leads on $B_k$ 3 at 77.87 and on both CLIP $B_k$ 4 and CLIP $B_k$ 5 at 36.75 and 27.54, and CreatiLayout-FLUX again leads on $B_k$ 6 at 86.45 and FID at 45.66 (Li et al., 23 Sep 2025).

Training-free methods are reported as much weaker overall. On the Complex split, RegionalPrompting reaches mIoU 28.35 and O-mIoU 9.05, compared with CreatiLayout-FLUX at 54.50 and 28.97 (Li et al., 23 Sep 2025). This supports the paper’s claim that overlap-heavy grounded generation remains difficult without dedicated training.

These findings place OverLayBench in a useful complementary relation to broader layout benchmarks. 7Bench showed that overlapping boxes were less problematic than expected in its scenario decomposition, with small boxes often emerging as harder (Izzo et al., 18 Aug 2025). OverLayBench instead argues that dense overlaps, especially when semantically similar, are a distinct regime of difficulty and that prior benchmark distributions under-sample it (Li et al., 23 Sep 2025). The difference is not necessarily contradictory: it suggests that “overlap” is not a monolithic condition, and that severity, density, and semantic similarity materially change its difficulty.

6. CreatiLayout-AM, limitations, and benchmark significance

As an initial step toward improving performance on complex overlaps, the paper proposes CreatiLayout-AM, a variant of CreatiLayout fine-tuned with amodal mask supervision (Li et al., 23 Sep 2025). The motivation is that an amodal mask represents the full shape of an object, including occluded parts, and that supervision on complete object extents may help separate overlapping instances and reduce fusion artifacts.

The training set is constructed from FLUX-generated images. SAM 2 is used to extract amodal object masks, and individual objects are cropped into an object-mask pool. An object drawn from this pool is then pasted into a target image to create synthetic occlusion with an existing object, and Qwen-2.5-VL-32B generates global image captions and local instance descriptions for both original and pasted objects (Li et al., 23 Sep 2025). The final training set contains approximately 67.8k images. CreatiLayout-AM is built on CreatiLayout-SD3 and adds a token-level alignment loss and a pixel-level alignment loss, producing the objective

$B_k$ 9

with fine-tuning for 3,500 steps on 8 NVIDIA RTX A6000 (48GB), learning rate $k$ 0, AdamW, bf16 precision, LoRA rank 32, warm-up steps 500, linear scheduler, $k$ 1, and $k$ 2 (Li et al., 23 Sep 2025).

The gains are strongest on the easier splits. On the Simple split, CreatiLayout-AM improves over its CreatiLayout-SD3 base from mIoU 58.78 to 61.16 and from O-mIoU 32.52 to 37.69, corresponding to reported relative improvements of +4.05% and +15.90% (Li et al., 23 Sep 2025). On the Regular split, O-mIoU improves from 20.67 to 21.79, a reported +5.42% (Li et al., 23 Sep 2025). On the Complex split, the gains are marginal: O-mIoU rises from 18.05 to 18.07, while several other metrics decline slightly (Li et al., 23 Sep 2025). The authors interpret this as a distribution shift problem, indicating that the hardest benchmark cases remain harder than the synthetic amodal-mask training distribution.
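The reported percentages are relative rather than absolute gains, which a short calculation makes explicit. The helper `relative_gain` is a hypothetical name introduced here; the input values are the ones quoted above.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (improved - baseline) / baseline


# (baseline, improved) scores quoted in the text above.
simple_miou = relative_gain(58.78, 61.16)
simple_omiou = relative_gain(32.52, 37.69)
regular_omiou = relative_gain(20.67, 21.79)
complex_omiou = relative_gain(18.05, 18.07)

print(round(simple_miou, 2))    # 4.05
print(round(simple_omiou, 2))   # 15.9
print(round(regular_omiou, 2))  # 5.42
print(round(complex_omiou, 2))  # 0.11
```

The same calculation puts the Complex-split change in O-mIoU at roughly 0.11%, which is why the gain there is described as marginal.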

This limitation is consistent with the benchmark’s broader significance. OverLayBench shows that dense overlap is an underexplored but important failure mode in layout-to-image generation, that existing benchmarks are biased toward easier examples, and that overlap-region metrics such as O-mIoU reveal failures that ordinary global metrics can understate (Li et al., 23 Sep 2025). It also exposes important dependencies: the annotation pipeline relies heavily on Qwen for captioning, grounding, and relationship extraction, and some evaluation components, including predicted boxes and entity or relationship success rates, are evaluator-dependent because they are derived using Qwen-2.5-VL-32B (Li et al., 23 Sep 2025).

Within the benchmark landscape, OverLayBench is best understood as a specialized complement to more general semantic-spatial evaluation suites. 7Bench established a seven-scenario diagnostic benchmark for layout-guided text-to-image generation with paired semantic and spatial metrics (Izzo et al., 18 Aug 2025). C-Bench and O-Bench extended that line with closed-set and open-set evaluation and a unified harmonic-mean ranking metric (Parolari et al., 28 Apr 2026). OverLayBench narrows the focus to dense overlaps, introduces a difficulty score tailored to spatial-semantic entanglement, and supplies overlap-specific metrics and curated annotations for that regime (Li et al., 23 Sep 2025). This suggests a division of labor across benchmarks: broad layout-faithfulness suites diagnose multiple controllability dimensions, whereas OverLayBench isolates the overlap-heavy cases in which instance disentanglement, occlusion reasoning, and local compositional fidelity become the decisive bottlenecks.