SpatialReward-Dataset for Spatial-Aware Rewards

Updated 3 March 2026

SpatialReward-Dataset is a series of spatially grounded datasets used for reward modeling in image editing, text-to-image generation, and layout synthesis.
It encompasses distinct collections like SpatialReward-260k, an 80k T2I-focused set, and DesignSense-10k, each featuring rigorous annotations and diverse spatial schemas.
The datasets employ multi-stage annotation protocols and precise evaluation metrics to enhance fine-grained spatial reasoning in generative models.

SpatialReward-Dataset refers to several spatially grounded datasets used for reward modeling in image editing, text-to-image (T2I) generation, and graphic layout synthesis. These datasets underpin new reward models based on explicit or preference-driven spatial reasoning, addressing challenges in fine-grained alignment and perceptual evaluation. Notably, the term encompasses: (1) the SpatialReward-260k dataset for online RL in image editing (Long et al., 7 Feb 2026); (2) the T2I-oriented SpatialReward-Dataset (80k pairs) for SpatialScore (Tang et al., 27 Feb 2026); (3) the DesignSense-10k dataset for layout preference modeling (Gopal et al., 26 Feb 2026). Each dataset is characterized by rigorous annotation protocols, spatially detailed schemas, and direct usage for training/evaluating spatial-aware reward models.

1. Dataset Composition and Construction

SpatialReward-260k (Image Editing; Long et al., 7 Feb 2026)

Scale and Structure: 260,000 image–edit pairs in the supervised fine-tuning (SFT) pool, partitioned into 100,000 Refined EditScore samples, 100,000 Re-purposed EditReward samples (with regenerated fine-grained reasoning), and 60,000 custom Multi-Edit samples organized under a 15-subtask taxonomy.
Task Distribution: In the custom Multi-Edit subset, the distribution is 2:1:1 (General Editing : Human-Centric Basics : Human-Centric Fine Details). Legacy sets inherit a wide task mix (style transfer, additions, removals, and compositional edits).
Source Corpus: Legacy subsets are built from public instruction–edit corpora (InstructPix2Pix, EditReward). Multi-Edit samples draw from laion2B-en-aesthetic, CC12M, and a curated portrait set, with a focus on high-resolution and high-clarity editable regions.
Selection Criteria: Images must be high-resolution and clearly editable; abstract and text-only images are excluded. Human-centric subtasks require visible, front-facing faces.

SpatialReward-Dataset (T2I Preference; Tang et al., 27 Feb 2026)

Scale and Structure: 80,000 adversarial preference pairs plus a 365-pair held-out evaluation set.
Sample Format: Each record includes a prompt $c$, two generated images ($y_1$ “perfect”, $y_2$ “perturbed”), a binary preference (winner), a list of instantiated spatial relations, and model provenance.
Spatial Relation Taxonomy: 12–15 types (e.g., LeftOf, RightOf, Above, Below, InFrontOf, Behind, Between, Inside, OnTopOf, SurroundedBy), with each prompt averaging 2–4 predicates.
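
The record layout and the relation-flipping perturbation described for the T2I set can be sketched as follows. This is a minimal illustration, not the authors' code: the `Relation` triple, the `INVERSE` map, the `flip` helper, and the integer encoding of `winner` are all assumptions; only the record field names come from the documented JSONL format.

```python
from dataclasses import dataclass, field

# Hypothetical inverse map for directional predicates. The source names the
# relation types but does not specify how flips are implemented.
INVERSE = {
    "LeftOf": "RightOf", "RightOf": "LeftOf",
    "Above": "Below", "Below": "Above",
    "InFrontOf": "Behind", "Behind": "InFrontOf",
}


@dataclass(frozen=True)
class Relation:
    """One instantiated spatial predicate, e.g. (cat, LeftOf, dog)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class PreferencePair:
    """Field names follow the documented JSONL record; types are assumed."""
    pair_id: str
    prompt: str
    perfect_image: str
    perturbed_image: str
    winner: int  # assumed encoding: 0 means perfect_image is preferred
    relations: list = field(default_factory=list)
    model: str = ""


def flip(rel: Relation) -> Relation:
    """Replace the predicate with its inverse to build a perturbed relation."""
    if rel.predicate not in INVERSE:
        raise ValueError(f"no inverse defined for {rel.predicate}")
    return Relation(rel.subject, INVERSE[rel.predicate], rel.obj)


clean = Relation("cat", "LeftOf", "dog")
perturbed = flip(clean)  # Relation("cat", "RightOf", "dog")
```

Predicates without a natural inverse (for example Between) would need a different perturbation, such as swapping the participating objects.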

DesignSense-10k (Graphic Layouts; Gopal et al., 26 Feb 2026)

Scale and Splits: 10,235 human-annotated preference pairs, split into 8,735 train, 500 validation, and 1,000 test.
Annotation Diversity: Four-class scheme (left better, right better, both good, both bad). Aspect ratios are varied systematically; layouts span 4–20 elements, which semantic grouping reduces to 2–8 clusters.

2. Annotation Protocols and Quality Control

Image Editing (Long et al., 7 Feb 2026)

Three-Stage Pipeline:

Spatial Grounding: Qwen-3-VL-235B predicts edit-object bounding boxes, normalized to a [0, 1000] grid.
Expert Routing: Human-centric edits are routed to Gemini-2.5-Pro and general edits to GPT-5. Both receive visual overlays and generate a chain-of-thought rationale ($T_{raw}$) and semantic/perceptual quality scores ($s_{if}, s_{con}, s_{nat}, s_{art}$).
Alignment and Verification: Qwen-3-VL refines the reasoning and interleaves <|bbox_id|> and <|global|> tokens; samples whose rationale and boxes are misaligned are flagged for removal.

QC Measures: The box count is checked against the number of edits; a stringent hallucination filter is applied; for MER-Bench, 5-expert consensus is required (Cohen’s $\kappa>0.8$).
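
The normalization to a [0, 1000] grid in the Spatial Grounding stage can be illustrated with a short sketch. The helper names and the (x1, y1, x2, y2) coordinate order are assumptions made for illustration; the source states only that boxes are normalized to this grid.

```python
def normalize_bbox(box, width, height, grid=1000):
    """Map a pixel-space (x1, y1, x2, y2) box onto an integer [0, grid] grid.

    The coordinate order is an assumption; the source does not state it.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    x1, y1, x2, y2 = box
    scaled = (x1 / width * grid, y1 / height * grid,
              x2 / width * grid, y2 / height * grid)
    # Clamp to the grid so boxes slightly outside the image stay valid.
    return [min(grid, max(0, round(v))) for v in scaled]


def denormalize_bbox(box, width, height, grid=1000):
    """Inverse mapping: grid coordinates back to (fractional) pixels."""
    x1, y1, x2, y2 = box
    return [x1 * width / grid, y1 * height / grid,
            x2 * width / grid, y2 * height / grid]


grid_box = normalize_bbox((64, 128, 256, 384), width=512, height=512)
# grid_box == [125, 250, 500, 750]
```

A resolution-independent grid lets the same box annotation be reused when an image is resized for different models.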

T2I Preference (Tang et al., 27 Feb 2026)

Prompt Engineering: GPT-5 generates a “clean” prompt, then perturbed variants via relation flipping/swapping.
Image Generation: SOTA T2I backends (Qwen-Image, HunyuanImage-2.1, Seedream 4.0).
Human Verification: Dual annotators per pair; discordant cases resolved by a third; ambiguous/misaligned samples discarded.
Agreement: Cohen’s $\kappa=0.82$ over 2,000 pairs.
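
For reference, Cohen's $\kappa$ for two annotators can be computed as in the sketch below. It is a generic implementation of the standard definition, $\kappa = (p_o - p_e)/(1 - p_e)$, and the toy labels are invented for illustration; it is not the authors' evaluation script.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy example (invented labels): 3 of 4 items agree.
kappa = cohens_kappa([0, 0, 1, 1], [0, 0, 1, 0])  # 0.5
```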

Graphic Layouts (Gopal et al., 26 Feb 2026)

Pipeline Stages: Semantic grouping (GPT-4o); layout prediction (AesthetiQ with sampling); filtering (overlap/outlier removal); IoU-based clustering; VLM refinement (“minimize overlaps”, “align edges”).
Annotation: MTurk plus author verification; four-way judgment; gold-standard pairs for QC; class distribution controlled for imbalance.

3. Spatially-Aware Representations and File Formats

Dataset	Representation	Main Format/Fields
SpatialReward-260k	Bounding-boxes	JSONL: image_id, instruction, orig_path, edit_path, edit_region, reasoning, score_sc, score_pq
SpatialReward-Dataset	Paired preference	JSONL: pair_id, prompt, perfect_image, perturbed_image, winner, relations, model
DesignSense-10k	Layout preference	Structure: paired layout images, 4-class annotation, semantic grouping metadata, bounding boxes

In all cases, spatial relations are explicit, expressed either as bounding boxes (editing and layouts) or as relation predicate lists (T2I). File layouts support efficient batch parsing. The editing format is inspired by COCO but omits polygon masks, while the T2I and layout formats are structured for pairwise ranking.

4. Dataset Statistics and Quantitative Properties

SpatialReward-260k (Long et al., 7 Feb 2026)

Mean Boxes/Image: $\approx 1.15$ ($\sigma\approx 0.42$)
Mean Region Area: $\approx$ 8% ( $\sigma\approx 5\%$ )
Multi-Edit Task Distribution: 50% General, 25% Human-Basic, 25% Human-Fine

SpatialReward-Dataset (Tang et al., 27 Feb 2026)

Relation Distribution: LeftOf/RightOf (30%), Above/Below (25%), InFrontOf/Behind (15%), Between (10%), Inside/OnTopOf/Under (10%), Near/Far/SurroundedBy (10%)
Prompt Complexity: $\mu_{\text{obj}}=4.8$ , $\sigma=1.2$ (objects); $\mu_{\text{rel}}=2.4$ , $\sigma=1.1$ (relations)
Prompt Length: Mean 27 tokens (range 10–60)
Shannon Entropy (Relation Type): $H\approx2.20$ nats
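
The entropy figure is stated in nats, that is, with the natural logarithm. The sketch below shows the computation on the six grouped shares listed above. Because those shares merge several relation types into one bucket, the result (about 1.68 nats) is lower than the reported per-type value; merging categories cannot increase entropy. The per-type shares are not given in the source, so the reported figure itself is not reproduced here.

```python
import math


def shannon_entropy(weights):
    """Shannon entropy in nats of a discrete distribution (weights are renormalized)."""
    weights = list(weights)
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return -sum((w / total) * math.log(w / total) for w in weights if w > 0)


# Grouped relation shares as listed in the statistics above.
grouped_shares = {
    "LeftOf/RightOf": 0.30,
    "Above/Below": 0.25,
    "InFrontOf/Behind": 0.15,
    "Between": 0.10,
    "Inside/OnTopOf/Under": 0.10,
    "Near/Far/SurroundedBy": 0.10,
}

h_grouped = shannon_entropy(grouped_shares.values())  # about 1.68 nats
```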

DesignSense-10k (Gopal et al., 26 Feb 2026)

Annotation Distribution: Left (28%), Right (30%), Both Good (25%), Both Bad (17%)
Aspect Ratios: Cover $\log_2(\text{width}/\text{height}) \in [-2, 2]$
Element Count: 4–20 (peak 8–12), post-group 2–8
ARI vs Human Grouping (one-shot GPT-4o): 0.69
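
The adjusted Rand index (ARI) compares two groupings of the same elements while correcting for chance agreement. The following is a generic sketch of the standard pair-counting formula, with invented toy labels; it is not taken from the DesignSense code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two clusterings given as per-element cluster labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have equal length")
    n = len(labels_a)
    # Pairs placed together in both clusterings (contingency-table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each clustering separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total_pairs = comb(n, 2)
    expected = sum_a * sum_b / total_pairs if total_pairs else 0.0
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings are trivial
    return (sum_cells - expected) / (max_index - expected)


# Toy example: identical groupings up to a relabeling give ARI = 1.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

An ARI of 1 means the groupings match exactly, and values near 0 indicate agreement no better than chance.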

5. Benchmarks, Evaluation Protocols, and Metrics

Image Editing (Long et al., 7 Feb 2026)

Primary Benchmarks: EditReward-Bench (pointwise), MMRB2 (ImgEdit), and MER-Bench (MultiEditReward-Bench; 600 evaluation groups, 1,800 edits; reported with accuracy and Kendall’s $\tau$)
Statistical Metrics: Intersection-over-Union (IoU) for box alignment:

$\operatorname{IoU}(B_{pred},B_{gt}) = \frac{|B_{pred}\cap B_{gt}|}{|B_{pred}\cup B_{gt}|}$

Shannon entropy of attention maps, $H(A) = -\sum_{i,j}A_{ij}\log A_{ij}$; the entropy gap, $|\Delta H| = |H(A_{src})-H(A_{edit})|$; and Pearson stability.
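
The IoU and entropy-gap definitions above translate directly into code. The sketch below is illustrative only: the (x1, y1, x2, y2) box order, the nested-list attention maps, and the renormalization step are assumptions, since the formula for $H(A)$ presumes that $A$ already sums to 1.

```python
import math


def iou(box_pred, box_gt):
    """Intersection-over-Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_pred
    bx1, by1, bx2, by2 = box_gt
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def attention_entropy(attn):
    """H(A) in nats for a 2-D attention map given as nested lists."""
    flat = [v for row in attn for v in row]
    total = sum(flat)
    if total <= 0:
        raise ValueError("attention map must have positive mass")
    # Renormalize so the map is a probability distribution (assumption).
    return -sum((v / total) * math.log(v / total) for v in flat if v > 0)


def entropy_gap(attn_src, attn_edit):
    """|Delta H| between the source-image and edited-image attention maps."""
    return abs(attention_entropy(attn_src) - attention_entropy(attn_edit))


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # 1/7
gap = entropy_gap([[1, 1], [1, 1]], [[1, 0], [0, 0]])  # ln(4)
```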

T2I Preference (Tang et al., 27 Feb 2026)

Training Objective: Bradley–Terry pairwise ranking loss:

$\mathcal{L} = -\mathbb{E}_{(c,y_w,y_l)} [ \log\sigma(R(H(c,y_w)) - R(H(c,y_l))) ]$
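
A minimal sketch of this loss on precomputed scalar rewards follows. It assumes the inputs are the values $R(H(c,y_w))$ and $R(H(c,y_l))$ for each pair and replaces the expectation with a batch mean; the function names are illustrative.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(rewards_win, rewards_lose):
    """Mean of -log sigmoid(r_w - r_l) over a batch of preference pairs."""
    if len(rewards_win) != len(rewards_lose) or not rewards_win:
        raise ValueError("need two non-empty reward lists of equal length")
    total = sum(log_sigmoid(w - l) for w, l in zip(rewards_win, rewards_lose))
    return -total / len(rewards_win)


# When winner and loser receive equal rewards, the loss is ln(2).
tied = bradley_terry_loss([0.0], [0.0])
```

The loss decreases as the reward margin between the preferred and rejected image grows, which is what drives the model to rank the spatially correct image higher.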

Reward Model Evaluation: The reward model reaches 95.8% pairwise accuracy on the 365-pair benchmark, outperforming GPT-5 and Gemini-2.5-Pro. RL-trained models show gains in spatial accuracy across DPG-Bench, TIIF-Bench, and UniGenBench++; for example, the relation-spatial subtask score rises from 0.871 to 0.932.

Graphic Layouts (Gopal et al., 26 Feb 2026)

Evaluation Metrics: Macro F1 (mean per-class F1), Weighted F1, Binary Accuracy (left/right; 81% in the pilot), and Cohen’s $\kappa$.
Results: DesignSense increases Macro F1 by 54.6% over the best few-shot baseline; the test Macro F1 is 0.456 and the weighted F1 is 0.520. Direct use in RL improves layout-generator win rates by 3–4 pp; inference-time scaling yields +3.6 pp against the AesthetiQ baseline.
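
The difference between Macro F1 and Weighted F1 matters for an imbalanced four-class task like this one: the former averages per-class F1 equally, the latter weights each class by its support. A generic sketch with invented toy labels (not the DesignSense evaluation code):

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 for every label that appears in the ground truth or the predictions."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with ground-truth class frequencies as weights."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


truth = ["left", "left", "right", "both_good"]
preds = ["left", "right", "right", "both_good"]
macro = macro_f1(truth, preds)        # 7/9
weighted = weighted_f1(truth, preds)  # 0.75
```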

6. Access Methods and File Organization

SpatialReward-260k (Long et al., 7 Feb 2026)

Repository: https://github.com/Kwai-Keye/SpatialReward
Structure:

SpatialReward/
    images/
        train/orig/
        train/edit/
        rl/
        val/
    annotations/
        train.jsonl
        rl.jsonl
        val.jsonl
    MER-Bench/
        groups_2.jsonl
        groups_3.jsonl
        groups_4.jsonl
    README.md

Annotation Example:

{
  "image_id":"SR_000123",
  "orig_path":"images/train/orig/SR_000123.png",
  "edit_path":"images/train/edit/SR_000123.png",
  "instruction":"Change the fabric to silk.",
  "edit_region":[
    {"id":0,"label":"fabric","bbox_2d":[120,300,480,720]}
  ],
  "reasoning":"<|bbox_0|> The fabric was changed to a smooth silk texture. <|global|> No other regions were modified.",
  "score_sc":[23,21],
  "score_pq":[24,25]
}
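
A record in this format can be sanity-checked with a few lines of Python. The sketch below mirrors the misalignment filter described in the annotation pipeline by comparing the `<|bbox_k|>` tokens in `reasoning` against the ids in `edit_region`. The helper `check_record`, the assumed (x1, y1, x2, y2) order of `bbox_2d`, and the grid-bounds check are illustrative assumptions, not part of the released tooling.

```python
import json
import re

# Matches grounding tokens such as <|bbox_0|> and captures the region id.
BBOX_TOKEN = re.compile(r"<\|bbox_(\d+)\|>")


def check_record(line, grid=1000):
    """Parse one JSONL line and report bbox tokens that do not match a region."""
    record = json.loads(line)
    regions = record["edit_region"]
    for region in regions:
        x1, y1, x2, y2 = region["bbox_2d"]  # assumed coordinate order
        if not (0 <= x1 < x2 <= grid and 0 <= y1 < y2 <= grid):
            raise ValueError(f"box out of grid bounds: {region['bbox_2d']}")
    region_ids = {region["id"] for region in regions}
    cited_ids = {int(m) for m in BBOX_TOKEN.findall(record["reasoning"])}
    return {
        "unknown": sorted(cited_ids - region_ids),  # cited but not annotated
        "uncited": sorted(region_ids - cited_ids),  # annotated but never cited
    }


sample_line = json.dumps({
    "image_id": "SR_000123",
    "instruction": "Change the fabric to silk.",
    "edit_region": [{"id": 0, "label": "fabric", "bbox_2d": [120, 300, 480, 720]}],
    "reasoning": "<|bbox_0|> The fabric was changed to a smooth silk texture. "
                 "<|global|> No other regions were modified.",
})
report = check_record(sample_line)  # both lists are empty for this record
```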

SpatialReward-Dataset (Tang et al., 27 Feb 2026)

Repository: https://dagroup-pku.github.io/SpatialT2I/
Core files: JSONL (one pair/line), companion CSV, image folders.

DesignSense-10k (Gopal et al., 26 Feb 2026)

Data format: Paired layouts, bounding boxes, association with semantic groups, CSV/JSONL as appropriate.

7. Impact and Related Research Directions

SpatialReward-Dataset variants establish spatial-aware reward modeling as a central paradigm for pixel-level and layout-level alignment in generative workflows. SpatialReward-260k’s explicit grounding supports state-of-the-art RL editing agents, resolving “Attention Collapse” by tethering semantic judgments to edit-localized evidence (Long et al., 7 Feb 2026). The T2I preference dataset enables precise evaluation of spatial-relation satisfaction in text-to-image systems; reward-driven RL leveraging this data yields measurable gains in multi-relation spatial fidelity (Tang et al., 27 Feb 2026). DesignSense-10k fills the gap in layout preference modeling, enabling robust, user-aligned evaluation and improvement in high-variance visual domains (Gopal et al., 26 Feb 2026).

A plausible implication is that such datasets, when tightly aligned with pixel/object-level evidence and subject to rigorous multi-stage validation, are indispensable for advancing the spatial fidelity of generative models and downstream alignment in complex image-based RL settings. Interoperability with existing format conventions (e.g., COCO-like layouts) facilitates extension to new domains, including UI design and robotics where explicit spatial reasoning is required.


References

Long et al. (2026). SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning.

Tang et al. (2026). Enhancing Spatial Understanding in Image Generation via Reward Modeling.

Gopal et al. (2026). DesignSense: A Human Preference Dataset and Reward Modeling Framework for Graphic Layout Generation.


