SpatialT2I: Benchmarking Spatial Intelligence
- SpatialT2I is a systematically curated dataset that enhances spatial intelligence in text-to-image models using information-dense image-text pairs.
- The dataset features 15,400 high-quality samples with detailed annotations across 10 spatial sub-domains, validated through human refinement.
- It enables rigorous fine-tuning and evaluation of T2I models, demonstrating measurable gains in positional, layout, and comparison accuracy.
SpatialT2I is a systematically curated dataset designed to advance the spatial intelligence of text-to-image (T2I) models. Developed as part of the "Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models" framework, SpatialT2I addresses limitations of existing T2I datasets and benchmarks by providing information-dense prompts paired with carefully matched images, enabling rigorous evaluation and fine-tuning of spatial reasoning capabilities across complex real-world scenes (Wang et al., 28 Jan 2026).
1. Motivation and Context
Despite significant advancements in T2I model fidelity, models often fail to capture complex spatial relationships within generated imagery—including object position, layout, occlusion, and causal interaction. Existing benchmarks, such as SR₂D (Gokhale et al., 2022), focus primarily on simple 2D spatial predicates in synthetic scenarios, offering limited diagnostic resolution for higher-order spatial reasoning. SpatialT2I was constructed to fill this evaluation gap by targeting richer spatial domain coverage and providing context-preserving, information-dense image–text pairs specifically for model supervision and assessment.
2. Construction Methodology
SpatialT2I builds on the SpatialGenEval benchmark, which initially contains 1,230 information-rich prompts and 12,300 multiple-choice QA pairs spanning 10 spatial sub-domains: object identity, attributes, position, orientation, layout, comparison, proximity, occlusion, motion, and causality. Prompt and QA synthesis was handled by Gemini 2.5 Pro, followed by human refinement for fluency, logical consistency, and vocabulary simplicity. Prompts exhibiting design or domain ambiguity were filtered, yielding 1,100 high-quality prompts.
Each high-quality prompt was paired with images generated by 14 top-performing T2I models (each screened for an average spatial-QA accuracy above 50%), resulting in 15,400 text–image pairs. Misalignments between prompts and model outputs were detected using a multimodal LLM (MLLM), again Gemini 2.5 Pro, which then rewrote each prompt so that it precisely matched the generated image and its spatial annotations without reducing information density.
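The filtering-and-rewriting loop described above can be sketched as follows. The functions check_alignment and rewrite_prompt are hypothetical placeholders standing in for the MLLM calls (Gemini 2.5 Pro in the paper); the stub logic is illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    prompt: str
    image_path: str

def check_alignment(pair: Pair) -> bool:
    """Placeholder for an MLLM call that answers the spatial QAs against
    the image and reports whether the prompt matches. Stub heuristic."""
    return "misaligned" not in pair.prompt  # illustrative stand-in

def rewrite_prompt(pair: Pair) -> str:
    """Placeholder for an MLLM call that rewrites the prompt to match the
    image while preserving information density."""
    return pair.prompt.replace("misaligned", "aligned")

def curate(pairs):
    curated = []
    for pair in pairs:
        if check_alignment(pair):
            curated.append(pair)  # prompt already matches the image
        else:
            curated.append(Pair(rewrite_prompt(pair), pair.image_path))
    return curated
```

The key design point is that the image is held fixed and the prompt is adjusted toward it, so information density is preserved rather than diluted by regeneration.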
3. Dataset Structure and Annotation Schema
The dataset consists of 15,400 samples distributed across 22 real-world scene categories (excluding 130 "Design" scenes from the base benchmark). Each sample is exhaustively annotated for all 10 spatial sub-domains via 10 multi-choice QA entries. The annotation format employs a standardized JSON schema for every record:
- id: Zero-padded unique integer mapping to image filename (e.g., 000123.png)
- scene: One of 22 scene names, lowercased
- image_path: Relative file path to PNG image (≈1024×1024 resolution)
- original_prompt: LLM-generated prompt pre-rewriting
- rewritten_prompt: Gemini-refined prompt consistent with the image
- qas: List of 10 dictionaries, each with
- subdomain: One value from {"object", "attribute", ..., "causal"}
- question: String
- choices: List containing options A–E (“E: None” supports model refusal)
- answer: Correct choice as letter {A,B,C,D} (E marks refusal and is never the ground truth)
QA guidelines prohibit answer leakage and enforce precise one-to-one sub-domain targeting.
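A record conforming to this schema might look as follows; all field values here are illustrative, not taken from the released file. A minimal structural check is included to make the constraints concrete.

```python
# Illustrative record matching the SpatialT2I annotation schema.
record = {
    "id": "000123",
    "scene": "kitchen",                       # one of the 22 scene names, lowercased
    "image_path": "images/000123.png",
    "original_prompt": "A copper kettle sits to the left of a toaster on a marble counter.",
    "rewritten_prompt": "A copper kettle sits to the left of a silver toaster on a marble counter.",
    "qas": [
        {
            "subdomain": "position",
            "question": "Where is the kettle relative to the toaster?",
            "choices": ["A: Left", "B: Right", "C: Above", "D: Below", "E: None"],
            "answer": "A",
        },
        # ...nine more entries, one per remaining sub-domain
    ],
}

def validate(rec: dict) -> bool:
    """Minimal structural check against the schema described above."""
    ok = rec["id"].isdigit() and rec["image_path"].endswith(".png")
    ok = ok and rec["scene"] == rec["scene"].lower()
    for qa in rec["qas"]:
        ok = ok and len(qa["choices"]) == 5 and qa["answer"] in {"A", "B", "C", "D"}
    return ok
```

Note the one-to-one sub-domain targeting: a full record carries exactly ten QA entries, one per sub-domain.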
4. Data Storage, Access, and Usage
SpatialT2I is structured for scalable usage in machine learning workflows:
- Images: Stored in a flat directory with zero-padded sequential numeric naming.
- Metadata: Distributed as a single JSONL file (metadata.jsonl) matching the image IDs.
- Distribution: Released publicly via GitHub (https://github.com/AMAP-ML/SpatialGenEval).
- Loading: Usable in Python via jsonlines or pandas.read_json(..., lines=True).
No explicit train/val/test split is provided. All 15,400 pairs are intended for supervised fine-tuning.
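Loading the metadata with the standard library alone might look like this; the file name matches the distribution above, while the scene filter is an illustrative usage pattern.

```python
import json

def load_metadata(path):
    """Read one JSON record per line from a JSONL metadata file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                records.append(json.loads(line))
    return records

def by_scene(records, scene):
    """Example filter: keep only samples from one scene category."""
    return [r for r in records if r.get("scene") == scene]
```

Since no split is provided, any held-out evaluation subset must be carved out by the user, e.g. by scene or by id range.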
5. Fine-Tuning Protocol and Experimental Results
SpatialT2I's primary utility is as a fine-tuning corpus for T2I foundation models. The canonical protocol involves:
- Models: Stable Diffusion-XL (SD-XL), UniWorld-V1, OmniGen2
- Training regime:
- All 15,400 (prompt, image) pairs used for supervised learning
- Optimizer: AdamW, learning rate
- Batch size: 16, 3 epochs, 8 × A100 GPUs
- Loss: $\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda\,\mathcal{L}_{\text{QA}}$, where
- $\mathcal{L}_{\text{rec}}$: Pixel or latent reconstruction loss
- $\mathcal{L}_{\text{QA}}$: Cross-entropy between model’s predicted QA answers and ground-truth
- Prompt conditioning and classifier-free guidance mirror baseline configurations
- Early stopping based on held-out SpatialGenEval accuracy
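The combined objective can be sketched in plain Python. The toy loss definitions and the weight lam are assumptions for illustration, not the paper's exact formulation or value.

```python
import math

def reconstruction_loss(pred, target):
    """Mean squared error as a stand-in for the pixel/latent term L_rec."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def qa_cross_entropy(probs, answer_idx):
    """Cross-entropy L_QA between the predicted answer distribution
    and the ground-truth answer index."""
    return -math.log(max(probs[answer_idx], 1e-12))

def total_loss(pred, target, probs, answer_idx, lam=0.1):
    # L = L_rec + lam * L_QA   (lam = 0.1 is an illustrative value)
    return reconstruction_loss(pred, target) + lam * qa_cross_entropy(probs, answer_idx)
```

The QA term is what distinguishes this regime from plain reconstruction fine-tuning: it back-propagates a spatial-correctness signal alongside image fidelity.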
Evaluation is performed using Qwen2.5-VL-72B in a zero-shot multi-choice VQA setup, with five-round self-consistency voting (answer agreement in at least 4/5 rounds). Performance is reported as accuracy (%) averaged over all 10 spatial sub-domains.
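The acceptance rule for the voting scheme is simple to state in code. This sketch assumes the five per-round answers have already been collected from the evaluator model.

```python
from collections import Counter

def self_consistent_answer(answers, threshold=4):
    """Accept the majority answer only if it appears in at least
    `threshold` of the rounds; otherwise return None (no consensus)."""
    letter, count = Counter(answers).most_common(1)[0]
    return letter if count >= threshold else None
```

Requiring 4/5 agreement trades a small amount of coverage for markedly more stable accuracy estimates from a stochastic judge.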
| Model | Baseline Accuracy | Fine-Tuned Accuracy | Absolute Gain (pp) |
|---|---|---|---|
| SD-XL | 41.2% | 45.4% | +4.2 |
| UniWorld-V1 | 54.2% | 59.9% | +5.7 |
| OmniGen2 | 56.4% | 60.8% | +4.4 |
Improvements were largest in the position, layout, and comparison domains; smaller but still positive gains were observed for occlusion and causal queries, indicating persistent bottlenecks in higher-order visual reasoning (Wang et al., 28 Jan 2026).
6. Applications, Limitations, and Comparative Positioning
Potential applications span creative design tools with strict spatial constraints, spatially aware robotics and embodied agents, and the development of enhanced prompt engineering guides for downstream T2I tasks. The data-centric protocol—systematic refinement of both prompts and images—demonstrates tangible spatial fidelity gains over prior approaches.
SpatialT2I covers only 22 scene types and is limited to static single-frame images; scaling to more varied domains (e.g., medical, industrial) or to temporal reasoning remains unaddressed. The reliance on humans in the loop for prompt and QA refinement is also a bottleneck for scalability.
In comparison, datasets like SR₂D (Gokhale et al., 2022) focus on 2D spatial predicates with fully synthetic prompts and automated annotation, while SpatialT2I provides human-validated, information-rich spatial content across a broader and more semantically varied set of scenes. SpatialT2I's QA-guided fine-tuning yields improvements in compositional spatial reasoning not observed with previous datasets.
7. Prospective Directions and Extensions
Future directions for SpatialT2I include:
- Extension to text-to-video to address spatio-temporal reasoning
- Broadening sub-domain coverage (e.g., explicit modeling of fluid or lighting causality)
- Incorporating curriculum-style fine-tuning or reinforcement learning from MLLM-derived feedback
- Establishing a public leaderboard to benchmark incremental model improvements on both SpatialGenEval and SpatialT2I frameworks
A plausible implication is that scaling up human-in-the-loop information-dense annotation and leveraging MLLM-based feedback can push the next frontier in grounded and compositional image generation.