When Do Diffusion Models Learn to Generate Multiple Objects?

Published 30 Apr 2026 in cs.CV and cs.AI | (2605.00273v1)

Abstract: Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce MOSAIC (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on MOSAIC, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.

Authors (5)

Summary

The paper introduces the MOSAIC diagnostic framework to analyze and quantify multi-object compositional failures in diffusion models.
It reveals that scene complexity and lack of targeted spatial priors critically hinder accurate counting and spatial relation tasks.
Empirical results demonstrate that data scaling alone is insufficient, highlighting the need for architectural interventions to improve compositional generalization.

Diagnostic Analysis of Multi-Object Generation Failures in Diffusion Models

Problem Setting and Motivation

Despite their strong visual fidelity, text-to-image diffusion models fail persistently at multi-object compositional generation: they cannot reliably render multiple object instances, bind attributes to the right objects, produce a requested count, or respect complex spatial relations. These failures are far more severe than in single-object scenarios, with generation accuracy frequently dropping well below 50% in benchmark evaluations on compositional tasks. Prior mitigation strategies have largely focused on architectural or control-based interventions rather than rigorous causal analysis of the data-driven origins of these limitations.

This work addresses two core research questions using controlled synthetic datasets:

(RQ1) Concept Generalization: When every atomic concept (e.g., object, color, count) appears in training, but potentially with skewed frequencies, can models reliably instantiate each, and how does data imbalance affect learning?
(RQ2) Compositional Generalization: When all concepts are sufficiently covered but a subset of their joint compositions is held out during training, do models generalize compositionally to these unseen combinations? What is the effect of compositional hold-outs and dataset size?

The MOSAIC Diagnostic Framework

To analyze these questions systematically and causally, the authors introduce MOSAIC (Multi-Object Spatial relations, AttrIbution, Counting), a fully controlled data generation framework. MOSAIC parameterizes scene complexity (the number of objects, object identities, spatial layouts, and attribute assignments) while enabling direct manipulation of concept frequencies and of which joint compositions are present. Three primary tasks are isolated:

Attribution: Correct binding of attributes to object instances (e.g., “black sphere and red cube”), probing attribute assignment in the presence of distractors and concept imbalance.
Spatial Relations: Generation of controlled relative arrangements between object pairs (e.g., discrete angular bins separating two objects), designed to specifically test reasoning over spatial relationships.
Counting: Faithful instantiation of an explicit number of objects, where scene complexity increases with the object count, directly testing numeracy under generative constraints.

MOSAIC supports uniform vs. skewed distributions for atomic concepts and the targeted removal of compositional pairs for combinatorial generalization analysis. Scene complexity is systematically adjustable, and spatial priors can be explicitly injected via grid layouts.
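The uniform-versus-skewed sampling described above can be illustrated with a small sketch. This is not the authors' generator: the vocabularies, the Zipf-style `skew` parameter, and the function names are all assumptions chosen for illustration, and objects are placed at free-form positions in the unit square.

```python
import random
from collections import Counter

# Hypothetical vocabularies; the actual MOSAIC concept sets are not listed in this summary.
SHAPES = ["sphere", "cube", "cylinder", "cone"]
COLORS = ["red", "black", "blue", "green"]


def concept_weights(n, skew=0.0):
    """Zipf-like weights: skew=0 is uniform, larger skew is more imbalanced."""
    raw = [1.0 / (rank + 1) ** skew for rank in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_scene(rng, num_objects, skew=0.0):
    """Draw one scene spec as a list of (color, shape, x, y) tuples."""
    color_w = concept_weights(len(COLORS), skew)
    shape_w = concept_weights(len(SHAPES), skew)
    scene = []
    for _ in range(num_objects):
        color = rng.choices(COLORS, weights=color_w, k=1)[0]
        shape = rng.choices(SHAPES, weights=shape_w, k=1)[0]
        scene.append((color, shape, rng.random(), rng.random()))
    return scene


def color_histogram(num_scenes, num_objects, skew, seed=0):
    """Count how often each color appears across a sampled dataset."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_scenes):
        for color, _shape, _x, _y in sample_scene(rng, num_objects, skew):
            counts[color] += 1
    return counts
```

Here `num_objects` is the scene-complexity knob and `skew` is the imbalance knob, so the two factors can be varied independently, which is the kind of separation the framework is described as providing.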

Experimental Design and Evaluation Methodology

Two diffusion backbones are considered: a classical U-Net architecture and a Diffusion Transformer (DiT). Each is trained on MOSAIC datasets of varying size and on MOSAIC variants, with conditioning delivered via one-hot-encoded tokens. Generated images are scored by discriminative classifiers (ResNet or simple CNN backbones, depending on the subtask), and accuracy is reported on both atomic and compositional test conditions, along with memorization rates that detect trivial copying of training samples.
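As a rough illustration of this setup, the sketch below builds a one-hot condition vector and computes classifier-based accuracy. The concatenation of per-concept one-hot vectors and the names `encode_condition` and `conditional_accuracy` are assumptions; the summary does not specify how the tokens are arranged or how the classifiers are applied.

```python
def one_hot(index, size):
    """Return a list of floats with a single 1.0 at `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_condition(color_idx, count_idx, num_colors, num_counts):
    """Assumed layout: one one-hot block per concept, concatenated."""
    return one_hot(color_idx, num_colors) + one_hot(count_idx, num_counts)


def conditional_accuracy(requested, predicted):
    """Fraction of samples whose classifier label equals the requested condition."""
    if len(requested) != len(predicted) or not requested:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for r, p in zip(requested, predicted) if r == p)
    return hits / len(requested)
```

In this reading, `requested` holds the conditions fed to the diffusion model and `predicted` holds the labels a separately trained classifier assigns to the generated images.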

Concept imbalance is modeled after frequency distributions observed in large-scale web data (e.g., LAION-2B).
Compositional generalization is systematically controlled by removing diagonals in the concept × concept combination matrix (e.g., specific color/count or color/angle pairs).
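One way to picture the diagonal hold-out is the sketch below, which removes the first few diagonals of an n × n combination grid. The wrap-around indexing is an assumption made here so that every row and column concept still appears in training; the paper's exact hold-out pattern may differ.

```python
def held_out_pairs(n, num_diagonals):
    """Pairs (i, j) on the first `num_diagonals` wrapped diagonals of an n x n grid."""
    if not 0 <= num_diagonals < n:
        raise ValueError("num_diagonals must be in [0, n)")
    return {(i, (i + d) % n) for i in range(n) for d in range(num_diagonals)}


def split_combinations(n, num_diagonals):
    """Return (train, held_out) sets of (row_concept, column_concept) index pairs."""
    held = held_out_pairs(n, num_diagonals)
    everything = {(i, j) for i in range(n) for j in range(n)}
    return everything - held, held


def every_concept_seen(train, n):
    """Check that each atomic concept on both axes still occurs in training."""
    rows = {i for i, _ in train}
    cols = {j for _, j in train}
    return len(rows) == n and len(cols) == n
```

With five colors and five counts, `split_combinations(5, 2)` holds out 10 of the 25 pairs while leaving each color and each count in 3 training combinations, so any test failure on the held-out pairs reflects missing compositions rather than missing concepts.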

Fine-tuning experiments with state-of-the-art pretrained models (e.g., Stable Diffusion 3 via LoRA) further test external validity on realistic data regimes.

Main Empirical Results

Concept Generalization

Scene Complexity Dominates: For both Attribution and Spatial Relations, high generalization performance is achieved across all dataset sizes and skews, provided scene complexity remains low (i.e., few objects). Counting, however, is uniquely brittle: at small to moderate data scales (2k–50k), neither architecture reliably learns to count. Accuracy shows a pronounced early peak followed by a collapse, and recovers only at the largest data scales.
Imbalance is Secondary: Data skew alone has only a marginal effect on generalization once per-concept coverage is assured and the dataset is large. Scene complexity plays a much stronger role in determining whether reliable counting emerges, especially in low-data regimes.
Spatial Priors Aid Numeracy: Enforcing explicit grid layouts (reducing spatial variance) markedly improves counting performance across all data regimes, suggesting that architectural or data-driven spatial priors constitute an essential inductive bias for compositional multi-object tasks.
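The contrast between free placement and a grid layout can be sketched as follows. The 3 × 3 default, the cell-center snapping, and the function names are assumptions for illustration; the summary only states that grid layouts reduce spatial variance.

```python
import random


def free_positions(rng, count):
    """Unconstrained layout: each object gets an independent position in the unit square."""
    return [(rng.random(), rng.random()) for _ in range(count)]


def grid_positions(rng, count, grid_size=3):
    """Grid layout: objects occupy the centers of distinct cells on a grid_size x grid_size grid."""
    num_cells = grid_size * grid_size
    if count > num_cells:
        raise ValueError("more objects than grid cells")
    cells = rng.sample(range(num_cells), count)  # distinct cells, so objects never overlap
    step = 1.0 / grid_size
    return [((c % grid_size + 0.5) * step, (c // grid_size + 0.5) * step) for c in cells]
```

Under the grid layout each coordinate takes one of only `grid_size` values, so the model sees far fewer distinct arrangements for the same count than under free placement.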

Compositional Generalization

Collapse with Held-Out Combinations: When increasing numbers of atomic concept pairs are deliberately omitted during training, diffusion models exhibit sharp declines in generalization to unseen compositions, even when data scale increases. The degradation aligns with the degree of combinatorial hold-out.
Ordering of Difficulty: There is a consistent empirical hierarchy: Attribution is the easiest (with error patterns local and semantically meaningful), Counting is intermediate (systematic over- or under-counting by one), and Spatial Relations is the hardest (confusion is widespread, with no clear error structures). This reflects the geometric and relational complexity unique to spatial binding.
Minimal Benefit from Data Scale/Condition Encoders Alone: Larger datasets provide some benefit but do not eliminate the performance gap. Attempts to strengthen condition encoding (e.g., frozen, disentangled, supervised encoders) provide only marginal improvements, indicating failures are primarily architectural rather than stemming from encoder entanglement.
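The counting error pattern noted above (over- or under-counting by one) can be quantified with a simple profile of requested versus predicted counts. This is a hypothetical analysis helper, not the paper's evaluation code; `predicted` is assumed to come from a counting classifier applied to generated images.

```python
def count_error_profile(requested, predicted):
    """Summarize counting errors: accuracy, share of errors that are off by one, and direction."""
    if len(requested) != len(predicted) or not requested:
        raise ValueError("need two non-empty sequences of equal length")
    signed = [p - r for r, p in zip(requested, predicted)]
    errors = [e for e in signed if e != 0]
    off_by_one = sum(1 for e in errors if abs(e) == 1)
    return {
        "accuracy": 1.0 - len(errors) / len(signed),
        # Share of all errors that miss the target by exactly one object.
        "off_by_one_share": off_by_one / len(errors) if errors else 0.0,
        "over": sum(1 for e in errors if e > 0),   # generated too many objects
        "under": sum(1 for e in errors if e < 0),  # generated too few objects
    }
```

A high `off_by_one_share` would correspond to the structured errors reported for Counting, whereas a low share would indicate the kind of unstructured confusion reported for Spatial Relations.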

Extension to Realistic Visual Regimes

Parallel Trends Under Fine-Tuning and Visual Complexity: Under LoRA-based fine-tuning on more realistic datasets (e.g., SPEC, Comfort-Car), the key phenomena (fragility of counting, compositional collapse) persist. Scene realism, occlusion, and intra-scene diversity do not resolve the compositional limitations observed in MOSAIC.

Implications and Theoretical Insights

Data-driven Solutions Insufficient: These findings demonstrate that scale, coverage, and distributional control over the data are necessary but not sufficient for robust multi-object compositional generalization in current diffusion architectures. The lack of targeted architectural inductive biases, especially for spatial and numeracy reasoning, constitutes a fundamental limitation.
Critical Need for Inductive Priors: For tasks such as counting and spatial relation reasoning, explicit architectural mechanisms (e.g., spatial priors, attention steering, scene graphs, and object-centric reasoning modules) or strong task-specific priors appear necessary for robustness, especially when compositional coverage is incomplete.
Practical Risk for Deployments: For downstream applications relying on reliable multi-object generation (e.g., simulation, content creation, education), naive reliance on data scaling or coverage is not sufficient—one must anticipate failure modes driven by both scene complexity and lack of compositional bias in the model.

Conclusion

This work provides a thorough causal dissection of the data-driven and architectural sources underpinning failures of current text-to-image diffusion models in the multi-object, compositional regime. Using a controlled diagnostic benchmark, it establishes that limitations in concept and compositional generalization are fundamentally tied to scene complexity, insufficient architectural priors, and the exponential explosion of compositional variants. Current models lack the ability to "bind and recombine" complex concepts reliably under compositional shift, especially for counting and spatial relations. These results motivate a new research direction focused on strong inductive biases and architectural innovations that target the compositional bottlenecks revealed by this analysis.

Future theoretical and practical advances should explicitly address the inductive requirements for compositionality, and benchmarks such as MOSAIC provide the necessary diagnostic foundation for robust evaluation and ablation. Model architectures should consider integration of scene-level structure, explicit relational modules, or object-centric decomposition to achieve scalable, reliable, and interpretable multi-object generative modeling.
