CHIMERA Compositional Image Synthesis Dataset
- The CHIMERA dataset is a large-scale resource defined by 464 semantic atoms, enabling controlled synthesis of images with diverse, compositional object-part configurations.
- It employs a deterministic prompt engineering pipeline that integrates mixed-domain semantic atoms into 37,000 text-to-image pairs rendered with an open-source diffusion transformer.
- The dataset’s evaluation uses the PartEval metric to assess alignment accuracy and visual fidelity, facilitating reproducible comparisons in compositional generative modeling.
The CHIMERA Compositional Image Synthesis Dataset is a large-scale resource for training and systematically evaluating generative models on part-level, compositional image synthesis with explicit, text-based control over object-part assembly. Introduced by Singh et al. in "Chimera: Compositional Image Generation using Part-based Concepting" (Singh et al., 20 Oct 2025), it consists of 37,000 synthetic images with annotated prompts. Each prompt combines object parts sourced from multiple categories, where every ⟨part, subject⟩ pair is referred to as a “semantic atom”. The dataset supports fine-grained compositional generalization and zero-shot hybrid generation workflows.
1. Semantic Atom Taxonomy and Dataset Scope
CHIMERA is constructed around the notion of a "semantic atom," defined as an ordered pair ⟨part, subject⟩ (e.g., ⟨tail, lion⟩, ⟨keyboard, laptop⟩). The taxonomy encompasses six broad domains: creature, vehicle, furniture, plant, electronics, and instrument. Within each domain, eight visually localized, functionally distinct part types are curated: for example, head, tail, and limb for creatures, or wheel, door, and hood for vehicles. Each part is associated with 6–19 categories ("subjects"), such as lion, panda, or motorcycle.
Aggregated across all domains and parts, the dataset enumerates a total of 464 unique semantic atoms, with the combinatorial design enabling the construction of hybrid “slot”-like prompts that mix and match arbitrary atoms within or across domains. This supports both controlled ablation experiments and broad generalization analysis unavailable in conventional text-to-image benchmarks.
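The atom structure described above can be sketched as a small data type plus an enumeration over a domain → part → subject mapping. This is only an illustration: `SemanticAtom`, `TOY_TAXONOMY`, and `enumerate_atoms` are hypothetical names, and the toy taxonomy is a tiny stand-in for the real one (six domains, eight parts each, 464 atoms in total).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticAtom:
    """An ordered <part, subject> pair, e.g. <tail, lion>."""
    part: str
    subject: str


# Hypothetical toy taxonomy (domain -> part -> subjects); the real
# dataset has 6 domains, 8 parts per domain, and 6-19 subjects per part.
TOY_TAXONOMY = {
    "creature": {"head": ["lion", "panda"], "tail": ["lion", "monkey"]},
    "vehicle": {"wheel": ["motorcycle", "van"]},
}


def enumerate_atoms(taxonomy):
    """Return every (domain, SemanticAtom) pair defined by the taxonomy."""
    atoms = []
    for domain, parts in taxonomy.items():
        for part, subjects in parts.items():
            for subject in subjects:
                atoms.append((domain, SemanticAtom(part, subject)))
    return atoms


atoms = enumerate_atoms(TOY_TAXONOMY)
print(len(atoms))  # 6 atoms in this toy taxonomy
```

Making the atom frozen keeps it hashable, so atoms can be deduplicated in sets or used as dictionary keys when counting coverage.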
2. Prompt Engineering and Image Synthesis Pipeline
Each dataset sample is generated by first randomly sampling 2–4 distinct semantic atoms, sometimes mixing domains to promote maximal compositional diversity. These atoms are instantiated in a prompt template with a category-level prefix:
“<Prefix> with the <Part₁> of a <Subject₁>, the <Part₂> of a <Subject₂>[, and the <Part₃> of a <Subject₃>[, and the <Part₄> of a <Subject₄>]].”
An example is: “A creature with the head of a lion, the tail of a monkey, and the fur of a panda.”
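A minimal sketch of how the template could be filled in is shown below. The function name `build_prompt` and the tuple representation of atoms are assumptions made for illustration; the code follows the bracketed template literally, so the third and fourth phrases are each introduced by ", and". It does not handle "a" versus "an" before subjects that start with a vowel.

```python
def build_prompt(prefix, atoms):
    """Fill the prompt template with 2-4 (part, subject) pairs.

    `prefix` is the category-level prefix, e.g. "A creature".
    """
    if not 2 <= len(atoms) <= 4:
        raise ValueError("the template expects between 2 and 4 atoms")
    phrases = [f"the {part} of a {subject}" for part, subject in atoms]
    # The first two phrases are comma-separated; each further phrase is
    # appended with ", and", mirroring the nested optional brackets.
    body = phrases[0] + ", " + phrases[1]
    for phrase in phrases[2:]:
        body += ", and " + phrase
    return f"{prefix} with {body}."


print(build_prompt(
    "A creature",
    [("head", "lion"), ("tail", "monkey"), ("fur", "panda")],
))
# A creature with the head of a lion, the tail of a monkey, and the fur of a panda.
```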
All 37,000 prompts are rendered into 1024×1024 pixel images using HiDream-I1-Full, an open-source text-to-image diffusion transformer model with 17 billion parameters. The authors specify sampling hyperparameters: 50 denoising steps, guidance scale of 5.0, and scheduler shift of 3.0. Deterministic seeding based on the prompt string ensures reproducible outputs for any given prompt. No post hoc curation, editing, or filtering is applied; generated outputs are taken as-is.
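The source does not say how the seed is derived from the prompt string. One plausible scheme, shown here purely as an assumption, is to hash the prompt with SHA-256 and keep the first four bytes as a 32-bit integer. The `SAMPLING_CONFIG` key names are likewise hypothetical and are not taken from any specific library; only the values come from the text above.

```python
import hashlib


def seed_from_prompt(prompt: str) -> int:
    """Map a prompt string to a stable 32-bit seed (assumed scheme)."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


# Hyperparameters reported for rendering; key names are illustrative.
SAMPLING_CONFIG = {
    "height": 1024,
    "width": 1024,
    "num_inference_steps": 50,
    "guidance_scale": 5.0,
    "scheduler_shift": 3.0,
}

prompt = "A creature with the head of a lion, the tail of a monkey."
# The same prompt always yields the same seed, across runs and machines.
assert seed_from_prompt(prompt) == seed_from_prompt(prompt)
```

A cryptographic hash is used instead of Python's built-in `hash()` because the latter is randomized per process for strings and would not be reproducible.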
3. Dataset Composition, Balancing, and Partitioning
The total corpus size is 37,000 image–prompt pairs. The approximate distribution across top-level domains is as follows:
| Domain | Approximate Count | Example Parts/Subtypes |
|---|---|---|
| Creatures | ≈6,000 | head, tail, limb, body, wing, etc. [~12 species] |
| Vehicles | ≈6,000 | wheel, door, window, light, hood, etc. [~10 types] |
| Furniture | 5,000–7,000 | eight types distributed over various objects |
| Plants | 5,000–7,000 | eight types distributed over various species |
| Electronics | 5,000–7,000 | eight types distributed over categories |
| Instruments | 5,000–7,000 | eight types distributed over categories |
The dataset’s combinatorial sampling procedure draws semantic atoms uniformly to prevent over-concentration on any individual part or subject. No data augmentation is employed beyond compositional mixing. Explicit train/validation/test splits are not specified; an 80/10/10 scheme (29,600/3,700/3,700) is a plausible default, in line with common practice for large visual corpora. Because no filtering is applied, the distribution of prompt complexity and domain pairings should reflect the sampling procedure itself.
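As a rough sketch of the two ideas in this section, the snippet below draws 2–4 distinct atoms uniformly from a pool and partitions 37,000 indices into an 80/10/10 split. Both helpers (`sample_atoms`, `split_indices`) are hypothetical, and the split itself is the assumed default mentioned above, not one published with the dataset.

```python
import random


def sample_atoms(atom_pool, rng):
    """Draw 2-4 distinct atoms uniformly at random from a list."""
    k = rng.randint(2, 4)  # inclusive on both ends
    return rng.sample(atom_pool, k)


def split_indices(n, seed=0):
    """Shuffle range(n) reproducibly and cut it 80/10/10."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * 8 // 10  # integer arithmetic avoids float rounding
    n_val = n // 10
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(37_000)
print(len(train), len(val), len(test))  # 29600 3700 3700
```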
4. Data Organization and Annotation Schema
Image data is arranged by usage split, with conventional file system organization (e.g., images/train/000001.png) and a parallel annotation directory. Each image is linked to a JSON metadata entry containing:
- Unique ID
- Full prompt text
- List of atoms: array of {"subject": …, "part": …}
- File path to the synthesized image
- Deterministic seed used for synthesis
- Optional: part-level bounding boxes or masks (“part_masks”), if region-level localization is subsequently computed via external vision models such as Florence or Grounding DINO.
Example JSON entry:
```json
{
  "id": "000042",
  "prompt_text": "A creature with the head of a lion, and the tail of a monkey.",
  "atoms": [
    {"subject": "lion", "part": "head"},
    {"subject": "monkey", "part": "tail"}
  ],
  "image_path": "images/train/000042.png",
  "seed": 123456789,
  "part_masks": [
    {"part": "head", "bbox": [x1, y1, x2, y2]},
    {"part": "tail", "bbox": [x1, y1, x2, y2]}
  ]
}
```
A typical prompt (e.g., "A vehicle with the wheels of a motorcycle, the door of a van, and the hood of a sedan") is thus linked to its compositional atoms and, where they have been computed, to part-level boxes or masks. This structure should make it straightforward to plug the data into compositional evaluation pipelines and part-aware generative training.
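A consumer of these annotations might run a light consistency check on each entry before use. The sketch below assumes the field names from the example entry; `validate_entry` and `REQUIRED_FIELDS` are hypothetical helpers, not part of any released tooling, and `part_masks` is treated as optional.

```python
import json

REQUIRED_FIELDS = {"id", "prompt_text", "atoms", "image_path", "seed"}


def validate_entry(entry):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    missing = REQUIRED_FIELDS - set(entry)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    prompt = entry.get("prompt_text", "")
    for atom in entry.get("atoms", []):
        if set(atom) != {"subject", "part"}:
            problems.append(f"malformed atom: {atom}")
        elif f"the {atom['part']} of a {atom['subject']}" not in prompt:
            # Each atom should appear verbatim in the prompt text.
            problems.append(f"atom not found in prompt: {atom}")
    return problems


raw = """
{
  "id": "000042",
  "prompt_text": "A creature with the head of a lion, and the tail of a monkey.",
  "atoms": [
    {"subject": "lion", "part": "head"},
    {"subject": "monkey", "part": "tail"}
  ],
  "image_path": "images/train/000042.png",
  "seed": 123456789
}
"""
entry = json.loads(raw)
print(validate_entry(entry))  # []
```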
5. Evaluation: The PartEval Metric
For benchmarking compositional image synthesis models, the dataset is paired with the PartEval metric. Formally, PartEval computes a weighted sum of “alignment accuracy” (semantic and spatial correctness of synthesized parts) and “visual fidelity” (overall image realism).
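The exact weights are not given here. One generic way to write such a weighted sum, with all symbols introduced only for illustration, is:

$$
\mathrm{PartEval} = w_{\text{align}}\, S_{\text{align}} + w_{\text{fid}}\, S_{\text{fid}}, \qquad w_{\text{align}},\, w_{\text{fid}} \ge 0,
$$

where $S_{\text{align}}$ and $S_{\text{fid}}$ denote the alignment-accuracy and visual-fidelity scores. If both scores lie in $[0,1]$ and the weights are chosen so that $w_{\text{align}} + w_{\text{fid}} = 1$ (an assumption, not something stated by the source), the combined score also lies in $[0,1]$.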
The operationalization follows a three-stage multimodal LLM pipeline:
- Reference feature extraction: Object, Part, Color, Texture, SpatialRelation.
- Attribute-specific question generation (e.g., “Is the tail of the lion visible and correctly positioned?”).
- Automated grading via Gemini-Flash, scoring 1.0 for correct answers, 0.0 for incorrect; partial scores are averaged and normalized to [0,1].
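The grading and aggregation step can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the per-question grades are taken as already produced by the grader, the function names are hypothetical, and the equal weighting (`w_align=0.5`) is a placeholder since the actual weights are not given.

```python
def mean_grade(grades):
    """Average binary grades (1.0 correct, 0.0 incorrect) into [0, 1]."""
    if not grades:
        raise ValueError("at least one graded question is required")
    if any(g not in (0.0, 1.0) for g in grades):
        raise ValueError("grades must be 0.0 or 1.0")
    return sum(grades) / len(grades)


def combine(alignment, fidelity, w_align=0.5):
    """Weighted sum of alignment accuracy and visual fidelity.

    The weight is an assumed placeholder; both inputs are in [0, 1].
    """
    if not 0.0 <= w_align <= 1.0:
        raise ValueError("w_align must lie in [0, 1]")
    return w_align * alignment + (1.0 - w_align) * fidelity


# Four attribute questions for one image, three answered correctly.
alignment = mean_grade([1.0, 1.0, 0.0, 1.0])
print(alignment)                 # 0.75
print(combine(alignment, 0.5))   # 0.625
```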
The PartEval scalar score is computed over 200 held-out samples per compositional category and supports comparative evaluation of Chimera against alternative models on 2-, 3-, and 4-part hybrid prompts. Results indicate a 14% improvement in part alignment/compositional accuracy and 21% higher visual quality over baselines (Singh et al., 20 Oct 2025).
6. Data Accessibility, Licensing, and Use Cases
All 37,000 CHIMERA images are synthetic, generated exclusively with the open-source HiDream-I1-Full model. No private or copyrighted content is incorporated. The authors intend to release the dataset and accompanying codebase under an academic open license, consistent with ICLR reproducibility standards. A permissive license such as Creative Commons Attribution (CC BY 4.0) or MIT would be suitable for research and development purposes.
Researchers may cite the corpus as:
Singh et al., “Chimera: Compositional Image Generation using Part-based Concepting,” ICLR 2026.
The resource is positioned for immediate downstream use in fine-tuning part-aware diffusion models, compositional prompt engineering, and automated evaluation of complex visual hybrid composition in generative pipelines using the provided PartEval metric.