SpatialReward-Dataset for Spatial-Aware Rewards
- SpatialReward-Dataset is a series of spatially grounded datasets used for reward modeling in image editing, text-to-image generation, and layout synthesis.
- It encompasses distinct collections like SpatialReward-260k, an 80k T2I-focused set, and DesignSense-10k, each featuring rigorous annotations and diverse spatial schemas.
- The datasets employ multi-stage annotation protocols and precise evaluation metrics to enhance fine-grained spatial reasoning in generative models.
SpatialReward-Dataset refers to several spatially grounded datasets used for reward modeling in image editing, text-to-image (T2I) generation, and graphic layout synthesis. These datasets underpin new reward models based on explicit or preference-driven spatial reasoning, addressing challenges in fine-grained alignment and perceptual evaluation. Notably, the term encompasses: (1) the SpatialReward-260k dataset for online RL in image editing (Long et al., 7 Feb 2026); (2) the T2I-oriented SpatialReward-Dataset (80k pairs) for SpatialScore (Tang et al., 27 Feb 2026); (3) the DesignSense-10k dataset for layout preference modeling (Gopal et al., 26 Feb 2026). Each dataset is characterized by rigorous annotation protocols, spatially detailed schemas, and direct usage for training/evaluating spatial-aware reward models.
1. Dataset Composition and Construction
SpatialReward-260k (Image Editing, (Long et al., 7 Feb 2026))
- Scale and Structure: 260,000 image–edit pairs in the supervised fine-tuning (SFT) pool; partitioned as 100,000 Refined EditScore, 100,000 Re-purposed EditReward (with regenerated fine-grained reasoning), and 60,000 custom Multi-Edit samples, the latter organized according to a 15-subtask taxonomy.
- Task Distribution: In the custom Multi-Edit subset, the distribution is 2:1:1 (General Editing : Human-Centric Basics : Human-Centric Fine Details). Legacy sets inherit a wide task mix (style transfer, additions, removals, and compositional edits).
- Source Corpus: Legacy subsets are built from public instruction–edit corpora (InstructPix2Pix, EditReward). Multi-Edit samples draw from laion2B-en-aesthetic, CC12M, and a curated portrait set, with a focus on high-resolution and high-clarity editable regions.
- Selection Criteria: Images must be high-resolution and clearly editable; abstract or text-only images are excluded; human-centric subtasks require visible, front-facing faces.
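The stated subset sizes and the 2:1:1 Multi-Edit ratio can be sanity-checked with a few lines of Python. This is only an arithmetic sketch: the dictionary keys are hypothetical labels, and the per-category counts assume the ratio holds exactly, which the source does not confirm.

```python
# Subset sizes as stated for the SFT pool (keys are illustrative labels).
SUBSETS = {
    "refined_editscore": 100_000,
    "repurposed_editreward": 100_000,
    "multi_edit": 60_000,
}

# General : Human-Centric Basics : Human-Centric Fine Details = 2 : 1 : 1
MULTI_EDIT_RATIO = {"general": 2, "human_basic": 1, "human_fine": 1}


def split_by_ratio(total, ratio):
    """Divide `total` into integer parts proportional to the ratio weights."""
    parts = sum(ratio.values())
    return {name: total * weight // parts for name, weight in ratio.items()}


total = sum(SUBSETS.values())
multi_edit_counts = split_by_ratio(SUBSETS["multi_edit"], MULTI_EDIT_RATIO)
```

Under that assumption the three subsets sum to 260,000 and the Multi-Edit subset divides into 30,000 general, 15,000 human-basic, and 15,000 human-fine samples.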
SpatialReward-Dataset (T2I Preference, (Tang et al., 27 Feb 2026))
- Scale and Structure: 80,000 adversarial preference pairs plus a 365-pair held-out evaluation set.
- Sample Format: Each record includes a prompt, two generated images (“perfect” and “perturbed”), a binary preference (winner), a list of instantiated spatial relations, and model provenance.
- Spatial Relation Taxonomy: 12–15 types (e.g., LeftOf, RightOf, Above, Below, InFrontOf, Behind, Between, Inside, OnTopOf, SurroundedBy), with each prompt averaging 2–4 predicates.
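One way to picture the relation predicates is as subject/predicate/object triples, with a "perturbed" sample differing from the "perfect" one by a changed triple. The sketch below is illustrative only: the `Relation` class, the `INVERSE` map (covering just a few of the listed types), and the `flip`/`swap` helpers are assumptions, not the dataset's actual schema or generation code.

```python
from dataclasses import dataclass

# Hypothetical inverse map for a subset of the relation types named above.
INVERSE = {
    "LeftOf": "RightOf", "RightOf": "LeftOf",
    "Above": "Below", "Below": "Above",
    "InFrontOf": "Behind", "Behind": "InFrontOf",
}


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str
    obj: str


def flip(rel):
    """Replace the predicate with its opposite, keeping subject and object."""
    if rel.predicate not in INVERSE:
        raise ValueError(f"no inverse defined for {rel.predicate!r}")
    return Relation(rel.subject, INVERSE[rel.predicate], rel.obj)


def swap(rel):
    """Exchange subject and object, keeping the predicate."""
    return Relation(rel.obj, rel.predicate, rel.subject)
```

For example, `flip(Relation("cat", "LeftOf", "dog"))` gives a triple stating that the cat is to the right of the dog, which contradicts the original prompt while keeping the same objects.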
DesignSense-10k (Graphic Layouts, (Gopal et al., 26 Feb 2026))
- Scale and Splits: 10,235 human-annotated preference pairs; split as 8,735 train, 500 val, 1,000 test.
- Annotation Diversity: Four-class scheme (left better, right better, both good, both bad). Aspect ratios are varied systematically; layouts span 4–20 elements, with semantic grouping yielding 2–8 clusters.
2. Annotation Protocols and Quality Control
Image Editing (Long et al., 7 Feb 2026)
- Three-Stage Pipeline:
- Spatial Grounding: Qwen-3-VL-235B predicts edit-object bounding boxes, normalized to a [0, 1000] grid.
- Expert Routing: Human-centric edits are routed to Gemini-2.5-Pro and general edits to GPT-5; both use visual overlays and generate chain-of-thought reasoning (T_raw) together with semantic and perceptual quality scores.
- Alignment and Verification: Qwen-3-VL refines reasoning, interleaves <|bbox_id|>, <|global|> tokens; samples with rationale/box misalignment flagged for removal.
- QC Measures: Box count checked against the number of edits; stringent hallucination filter; for MER-Bench, 5-expert consensus with agreement measured by Cohen’s κ.
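The normalization of bounding boxes to a [0, 1000] grid can be sketched as follows. This is a minimal illustration, assuming boxes are given as pixel-space `(x1, y1, x2, y2)` corners; the source does not state the coordinate order, and `normalize_bbox` is a hypothetical helper rather than part of the released pipeline.

```python
def normalize_bbox(bbox, width, height, grid=1000):
    """Map a pixel-space (x1, y1, x2, y2) box onto an integer [0, grid] grid.

    The corner order is an assumption; adjust it if the data uses another layout.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    x1, y1, x2, y2 = bbox

    def scale(value, size):
        # Rescale, round to the nearest grid cell, and clamp to the grid.
        return min(grid, max(0, round(value * grid / size)))

    return [scale(x1, width), scale(y1, height), scale(x2, width), scale(y2, height)]
```

A resolution-independent grid like this lets the same box description apply to images of different sizes, which is convenient when boxes are emitted as text tokens by a vision-language model.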
T2I Preference (Tang et al., 27 Feb 2026)
- Prompt Engineering: GPT-5 generates a “clean” prompt, then perturbed variants via relation flipping/swapping.
- Image Generation: SOTA T2I backends (Qwen-Image, HunyuanImage-2.1, Seedream 4.0).
- Human Verification: Dual annotators per pair; discordant cases resolved by a third; ambiguous/misaligned samples discarded.
- Agreement: Cohen’s κ computed over 2,000 pairs.
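For readers unfamiliar with the agreement statistic, Cohen's κ for two annotators can be computed as in the sketch below, together with a simple third-annotator adjudication rule. Both functions are illustrative assumptions about how such a check might be implemented, not the authors' tooling.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent labelling with each rater's marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def adjudicate(first, second, third=None):
    """Keep an agreed label; otherwise defer to a third vote (None = discard)."""
    if first == second:
        return first
    return third if third in (first, second) else None
```

κ corrects raw agreement for the agreement expected by chance, so it is more informative than plain accuracy when one label dominates.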
Graphic Layouts (Gopal et al., 26 Feb 2026)
- Pipeline Stages: Semantic grouping (GPT-4o); layout prediction (AesthetiQ with sampling); filtering (overlap/outlier removal); IoU-based clustering; VLM refinement (“minimize overlaps”, “align edges”).
- Annotation: MTurk plus author verification; four-way judgment; gold-standard pairs for QC; class distribution controlled for imbalance.
3. Spatially-Aware Representations and File Formats
| Dataset | Representation | Main Format/Fields |
|---|---|---|
| SpatialReward-260k | Bounding-boxes | JSONL: image_id, instruction, orig_path, edit_path, edit_region, reasoning, score_sc, score_pq |
| SpatialReward-Dataset | Paired preference | JSONL: pair_id, prompt, perfect_image, perturbed_image, winner, relations, model |
| DesignSense-10k | Layout preference | Structure: paired layout images, 4-class annotation, semantic grouping metadata, bounding boxes |
In all cases, spatial relations are explicit: via bounding boxes (editing/layouts) or relation predicate lists (T2I). File layouts support efficient batch parsing. The editing annotations follow a COCO-inspired convention (bounding boxes only, no polygon masks), while the T2I and layout data are structured for pairwise ranking.
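Because the JSONL files hold one record per line, they can be streamed in batches without loading a whole file into memory. The following is a minimal sketch using only the standard library; `iter_jsonl` and `batched` are hypothetical helper names, and any file path passed to them is up to the user.

```python
import json
from itertools import islice


def iter_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def batched(records, batch_size):
    """Group an iterable of records into lists of at most `batch_size` items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

A training loop could then iterate with `for batch in batched(iter_jsonl(path), 256): ...`, reading fields such as `prompt` or `edit_region` from each record.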
4. Dataset Statistics and Quantitative Properties
SpatialReward-260k (Long et al., 7 Feb 2026)
- Mean Boxes/Image: 1.15
- Mean Region Area: 8%
- Multi-Edit Task Distribution: 50% General, 25% Human-Basic, 25% Human-Fine
SpatialReward-Dataset (Tang et al., 27 Feb 2026)
- Relation Distribution: LeftOf/RightOf (30%), Above/Below (25%), InFrontOf/Behind (15%), Between (10%), Inside/OnTopOf/Under (10%), Near/Far/SurroundedBy (10%)
- Prompt Complexity: measured by the number of objects and the number of relations per prompt
- Prompt Length: Mean 27 tokens (range 10–60)
- Shannon Entropy (Relation Type): reported in nats
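As a worked illustration of the entropy statistic, the sketch below computes Shannon entropy in nats over the six grouped relation shares listed above. Note the assumption: this treats each listed group as a single category, so the result (about 1.68 nats, against an upper bound of ln 6 ≈ 1.79 for six equally likely groups) is not necessarily the per-type figure reported for the dataset.

```python
import math

# Grouped shares taken from the relation distribution listed above.
RELATION_GROUP_SHARE = {
    "LeftOf/RightOf": 0.30,
    "Above/Below": 0.25,
    "InFrontOf/Behind": 0.15,
    "Between": 0.10,
    "Inside/OnTopOf/Under": 0.10,
    "Near/Far/SurroundedBy": 0.10,
}


def shannon_entropy_nats(probs):
    """H = -sum(p * ln p) over a probability distribution."""
    probs = list(probs)
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p) for p in probs if p > 0)


grouped_entropy = shannon_entropy_nats(RELATION_GROUP_SHARE.values())
```

An entropy close to the upper bound indicates that no single relation family dominates the prompts.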
DesignSense-10k (Gopal et al., 26 Feb 2026)
- Annotation Distribution: Left (28%), Right (30%), Both Good (25%), Both Bad (17%)
- Aspect Ratios: Varied systematically over a range of log(width/height)
- Element Count: 4–20 (peak 8–12); 2–8 groups after semantic grouping
- ARI vs Human Grouping (one-shot GPT-4o): 0.69
5. Benchmarks, Evaluation Protocols, and Metrics
Image Editing (Long et al., 7 Feb 2026)
- Primary Benchmarks: EditReward-Bench (pointwise), MMRB2 (ImgEdit), MER-Bench (MultiEditReward-Bench, 600 eval groups/1,800 edits; accuracy and Kendall’s τ)
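- Statistical Metrics: Intersection-over-Union (IoU) for box alignment, $\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$ for a predicted box $A$ and a reference box $B$; Shannon entropy of attention maps; entropy gap; Pearson stability.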
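Two of the metrics named in this list, box IoU and Kendall's τ, are easy to state in code. The sketch below assumes `(x1, y1, x2, y2)` boxes and uses the simple tau-a variant without tie correction; the source does not say which τ variant the benchmark uses, so treat both choices as assumptions.

```python
from itertools import combinations


def box_iou(a, b):
    """Intersection-over-Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    score = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        score += (sign > 0) - (sign < 0)  # +1 concordant, -1 discordant, 0 tie
    return score / (len(x) * (len(x) - 1) / 2)
```

IoU scores how well a predicted edit region matches a reference box, while τ compares the ranking a reward model assigns to a group of edits with the reference ranking.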
T2I Preference (Tang et al., 27 Feb 2026)
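- Training Objective: Bradley–Terry pairwise ranking loss: $\mathcal{L}_{\mathrm{BT}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]$, where $r_\theta$ is the reward model, $x$ the prompt, and $y_w$, $y_l$ the preferred and rejected images.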
- Reward Model Evaluation: 95.8% pairwise accuracy on the 365-pair benchmark, exceeding GPT-5 and Gemini-2.5 Pro. RL-trained models show gains in spatial accuracy across DPG-Bench, TIIF-Bench, and UniGenBench++; for example, the relation-spatial subtask score rises from 0.871 to 0.932.
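The Bradley–Terry objective and the pairwise-accuracy metric can be sketched in plain Python. This is a generic illustration of the standard formulation, assuming each pair is given as `(reward_of_preferred, reward_of_rejected)`; it is not the authors' training code, and the function names are hypothetical.

```python
import math


def bt_pair_loss(r_win, r_lose):
    """-log sigmoid(r_win - r_lose), computed stably as softplus(-margin)."""
    margin = r_win - r_lose
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def bt_loss(pairs):
    """Mean Bradley-Terry loss over (r_win, r_lose) reward pairs."""
    return sum(bt_pair_loss(w, l) for w, l in pairs) / len(pairs)


def pairwise_accuracy(pairs):
    """Fraction of pairs where the preferred image gets the higher reward."""
    return sum(w > l for w, l in pairs) / len(pairs)
```

The loss shrinks as the reward margin between preferred and rejected images grows, and equals ln 2 when the model scores both images identically.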
Graphic Layouts (Gopal et al., 26 Feb 2026)
- Evaluation Metrics: Macro F1 (mean per-class F1), Weighted F1, Binary Accuracy (left/right, 81% pilot), and Cohen’s κ.
- Results: DesignSense increases Macro F1 by 54.6% over the best few-shot baseline; test Macro F1 is 0.456 and weighted F1 is 0.520. Using the reward model directly in RL improves layout generator win rates by 3–4 pp; inference-time scaling yields +3.6 pp against the AesthetiQ baseline.
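The difference between Macro F1 and Weighted F1 on a four-class judgment task can be made concrete with a short sketch. The class label strings below are hypothetical stand-ins for the four annotation classes, and the functions follow the standard metric definitions rather than any released evaluation script.

```python
from collections import Counter

# Hypothetical label names for the four-way judgment.
CLASSES = ("left", "right", "both_good", "both_bad")


def per_class_f1(y_true, y_pred, classes=CLASSES):
    """F1 per class; a class with no true or predicted items scores 0."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, classes=CLASSES):
    """Unweighted mean of per-class F1."""
    scores = per_class_f1(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)


def weighted_f1(y_true, y_pred, classes=CLASSES):
    """Per-class F1 averaged with weights equal to each class's true count."""
    scores = per_class_f1(y_true, y_pred, classes)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in classes) / len(y_true)
```

Macro F1 weights every class equally, so a model that neglects a rarer class such as "both bad" is penalized more heavily than under Weighted F1.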
6. Access Methods and File Organization
SpatialReward-260k (Long et al., 7 Feb 2026)
- Repository: https://github.com/Kwai-Keye/SpatialReward
- Structure:
SpatialReward/
images/
train/orig/
train/edit/
rl/
val/
annotations/
train.jsonl
rl.jsonl
val.jsonl
MER-Bench/
groups_2.jsonl
groups_3.jsonl
groups_4.jsonl
README.md
- Annotation Example:
{
"image_id":"SR_000123",
"orig_path":"images/train/orig/SR_000123.png",
"edit_path":"images/train/edit/SR_000123.png",
"instruction":"Change the fabric to silk.",
"edit_region":[
{"id":0,"label":"fabric","bbox_2d":[120,300,480,720]}
],
"reasoning":"<|bbox_0|> The fabric was changed to a smooth silk texture. <|global|> No other regions were modified.",
"score_sc":[23,21],
"score_pq":[24,25]
}
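A record shaped like the annotation example can be checked for consistency between its boxes and the `<|bbox_N|>` tokens in its reasoning. The sketch below is a hypothetical validator, not part of the repository; it assumes `bbox_2d` is ordered `(x1, y1, x2, y2)` on the [0, 1000] grid and that the number in each token refers to an `edit_region` id.

```python
import re

BBOX_TOKEN = re.compile(r"<\|bbox_(\d+)\|>")


def check_record(record, grid=1000):
    """Return a list of problems found in one annotation record."""
    problems = []
    ids = {region["id"] for region in record["edit_region"]}
    for region in record["edit_region"]:
        x1, y1, x2, y2 = region["bbox_2d"]  # assumed corner order
        if not (0 <= x1 < x2 <= grid and 0 <= y1 < y2 <= grid):
            problems.append(f"region {region['id']}: box out of range or degenerate")
    # Ids mentioned in the reasoning text via <|bbox_N|> tokens.
    referenced = {int(m) for m in BBOX_TOKEN.findall(record["reasoning"])}
    for missing in sorted(referenced - ids):
        problems.append(f"reasoning cites unknown box {missing}")
    for unused in sorted(ids - referenced):
        problems.append(f"box {unused} is never cited in the reasoning")
    return problems
```

An empty list means the boxes are in range and every box is cited exactly where the reasoning claims, mirroring the rationale/box alignment check described in the annotation protocol.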
SpatialReward-Dataset (Tang et al., 27 Feb 2026)
- Repository: https://dagroup-pku.github.io/SpatialT2I/
- Core files: JSONL (one pair/line), companion CSV, image folders.
DesignSense-10k (Gopal et al., 26 Feb 2026)
- Data format: Paired layouts, bounding boxes, association with semantic groups, CSV/JSONL as appropriate.
7. Impact and Related Research Directions
SpatialReward-Dataset variants establish spatial-aware reward modeling as a central paradigm for pixel-level and layout-level alignment in generative workflows. SpatialReward-260k’s explicit grounding supports state-of-the-art RL editing agents, resolving “Attention Collapse” by tethering semantic judgments to edit-localized evidence (Long et al., 7 Feb 2026). The T2I preference dataset enables precise evaluation of spatial-relation satisfaction in text-to-image systems; reward-driven RL leveraging this data yields measurable gains in multi-relation spatial fidelity (Tang et al., 27 Feb 2026). DesignSense-10k fills the gap in layout preference modeling, enabling robust, user-aligned evaluation and improvement in high-variance visual domains (Gopal et al., 26 Feb 2026).
A plausible implication is that such datasets, when tightly aligned with pixel/object-level evidence and subject to rigorous multi-stage validation, are indispensable for advancing the spatial fidelity of generative models and downstream alignment in complex image-based RL settings. Interoperability with existing format conventions (e.g., COCO-like layouts) facilitates extension to new domains, including UI design and robotics where explicit spatial reasoning is required.