SpatialReward-Dataset for Spatial-Aware Rewards

Updated 3 March 2026
  • SpatialReward-Dataset is a series of spatially grounded datasets used for reward modeling in image editing, text-to-image generation, and layout synthesis.
  • It encompasses distinct collections like SpatialReward-260k, an 80k T2I-focused set, and DesignSense-10k, each featuring rigorous annotations and diverse spatial schemas.
  • The datasets employ multi-stage annotation protocols and precise evaluation metrics to enhance fine-grained spatial reasoning in generative models.

SpatialReward-Dataset refers to several spatially grounded datasets used for reward modeling in image editing, text-to-image (T2I) generation, and graphic layout synthesis. These datasets underpin new reward models based on explicit or preference-driven spatial reasoning, addressing challenges in fine-grained alignment and perceptual evaluation. Notably, the term encompasses: (1) the SpatialReward-260k dataset for online RL in image editing (Long et al., 7 Feb 2026); (2) the T2I-oriented SpatialReward-Dataset (80k pairs) for SpatialScore (Tang et al., 27 Feb 2026); (3) the DesignSense-10k dataset for layout preference modeling (Gopal et al., 26 Feb 2026). Each dataset is characterized by rigorous annotation protocols, spatially detailed schemas, and direct usage for training/evaluating spatial-aware reward models.

1. Dataset Composition and Construction

SpatialReward-260k (image editing)

  • Scale and Structure: 260,000 image–edit pairs in the supervised fine-tuning (SFT) pool, partitioned into 100,000 Refined EditScore samples, 100,000 Re-purposed EditReward samples (with regenerated fine-grained reasoning), and 60,000 custom Multi-Edit samples. The Multi-Edit samples are organized according to a 15-subtask taxonomy.
  • Task Distribution: In the custom Multi-Edit subset, the distribution is 2:1:1 (General Editing : Human-Centric Basics : Human-Centric Fine Details). Legacy sets inherit a wide task mix (style transfer, additions, removals, and compositional edits).
  • Source Corpus: Legacy subsets are built from public instruction–edit corpora (InstructPix2Pix, EditReward). Multi-Edit samples draw from laion2B-en-aesthetic, CC12M, and a curated portrait set, with a focus on high-resolution images and clearly editable regions.
  • Selection Criteria: Images must be high-resolution and clearly editable; abstract and text-only images are excluded; human-centric subtasks require visible, front-facing faces.

SpatialReward-Dataset (T2I)

  • Scale and Structure: 80,000 adversarial preference pairs plus a 365-pair held-out evaluation set.
  • Sample Format: Each record includes a prompt $c$, two generated images ($y_1$ “perfect”, $y_2$ “perturbed”), a binary preference (winner), a list of instantiated spatial relations, and model provenance.
  • Spatial Relation Taxonomy: 12–15 types (e.g., LeftOf, RightOf, Above, Below, InFrontOf, Behind, Between, Inside, OnTopOf, SurroundedBy), with each prompt averaging 2–4 predicates.

DesignSense-10k (layout preference)

  • Scale and Splits: 10,235 human-annotated preference pairs, split into 8,735 train, 500 validation, and 1,000 test.
  • Annotation Diversity: Four-class scheme (left better, right better, both good, both bad). Aspect ratios are varied systematically; layouts span 4–20 elements, which semantic grouping reduces to 2–8 clusters.

2. Annotation Protocols and Quality Control

SpatialReward-260k (image editing)

  • Three-Stage Pipeline:
  1. Spatial Grounding: Qwen-3-VL-235B predicts edit-object bounding boxes, normalized to a [0, 1000] grid.
  2. Expert Routing: Human-centric edits are routed to Gemini-2.5-Pro and general edits to GPT-5. Both models see visual overlays and generate a chain-of-thought rationale ($T_{raw}$) along with semantic and perceptual quality scores ($s_{if}$, $s_{con}$, $s_{nat}$, $s_{art}$).
  3. Alignment and Verification: Qwen-3-VL refines the reasoning and interleaves <|bbox_id|> and <|global|> tokens; samples whose rationale and boxes are misaligned are flagged for removal.
  • QC Measures: Box count checked against the number of edits; stringent hallucination filter; for MER-Bench, 5-expert consensus (Cohen’s $\kappa > 0.8$).

SpatialReward-Dataset (T2I)

  • Prompt Engineering: GPT-5 generates a “clean” prompt, then perturbed variants via relation flipping/swapping.
  • Image Generation: SOTA T2I backends (Qwen-Image, HunyuanImage-2.1, Seedream 4.0).
  • Human Verification: Two annotators per pair; disagreements are resolved by a third; ambiguous or misaligned samples are discarded.
  • Agreement: Cohen’s $\kappa = 0.82$ over 2,000 pairs.

DesignSense-10k (layout preference)

  • Pipeline Stages: Semantic grouping (GPT-4o); layout prediction (AesthetiQ with sampling); filtering (overlap/outlier removal); IoU-based clustering; VLM refinement (“minimize overlaps”, “align edges”).
  • Annotation: MTurk plus author verification; four-way judgment; gold-standard pairs for QC; class distribution controlled for imbalance.
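
The relation flipping/swapping step used to build perturbed T2I prompts can be illustrated with a minimal sketch. The source does not publish its perturbation code, so the triple format `(predicate, subject, object)`, the `INVERSE` table, and the function names below are assumptions made for illustration only.

```python
import random

# Hypothetical inverse table: the source names these relation types but does
# not publish an inverse mapping, so this pairing is an assumption.
INVERSE = {
    "LeftOf": "RightOf", "RightOf": "LeftOf",
    "Above": "Below", "Below": "Above",
    "InFrontOf": "Behind", "Behind": "InFrontOf",
}


def flip_relation(rel):
    """Replace the predicate with its inverse, keeping the arguments."""
    pred, subj, obj = rel
    return (INVERSE[pred], subj, obj)


def swap_arguments(rel):
    """Keep the predicate but exchange subject and object."""
    pred, subj, obj = rel
    return (pred, obj, subj)


def perturb(relations, rng):
    """Return a copy of `relations` in which exactly one relation is violated."""
    candidates = [i for i, (pred, _, _) in enumerate(relations) if pred in INVERSE]
    if not candidates:
        raise ValueError("no invertible relation to perturb")
    i = rng.choice(candidates)
    op = rng.choice([flip_relation, swap_arguments])
    perturbed = list(relations)
    perturbed[i] = op(relations[i])
    return perturbed


clean = [("LeftOf", "cat", "dog"), ("Above", "lamp", "table"), ("Between", "cup", "plates")]
perturbed = perturb(clean, random.Random(0))
```

For these asymmetric relations, flipping the predicate and swapping the arguments describe the same violated scene (RightOf(cat, dog) is LeftOf(dog, cat)); they differ only in how the perturbed prompt would be worded.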

3. Spatially-Aware Representations and File Formats

| Dataset | Representation | Main Format/Fields |
|---|---|---|
| SpatialReward-260k | Bounding boxes | JSONL: image_id, instruction, orig_path, edit_path, edit_region, reasoning, score_sc, score_pq |
| SpatialReward-Dataset | Paired preference | JSONL: pair_id, prompt, perfect_image, perturbed_image, winner, relations, model |
| DesignSense-10k | Layout preference | Structure: paired layout images, 4-class annotation, semantic grouping metadata, bounding boxes |

In all cases, spatial relations are explicit: via bounding boxes (editing/layouts) or relation predicate lists (T2I). File layouts support efficient batch parsing. The editing annotations follow a COCO-inspired layout (bounding boxes, no polygon masks), while the T2I and layout data are structured for pairwise ranking.
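
As a rough sketch of batch parsing, the T2I preference records could be read line by line into a typed structure. The field names come from the format summary above; the field types, the sample values, and the names `PreferencePair` and `parse_pairs` are assumptions, since the source does not specify a schema beyond the field list.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class PreferencePair:
    # Field names follow the listed JSONL fields; the types are assumed.
    pair_id: str
    prompt: str
    perfect_image: str
    perturbed_image: str
    winner: int
    relations: list
    model: str


REQUIRED = tuple(f.name for f in fields(PreferencePair))


def parse_pairs(lines):
    """Parse JSONL lines into PreferencePair objects, skipping blank lines."""
    pairs = []
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        missing = [k for k in REQUIRED if k not in rec]
        if missing:
            raise ValueError(f"line {n}: missing fields {missing}")
        pairs.append(PreferencePair(**{k: rec[k] for k in REQUIRED}))
    return pairs


# Illustrative record only; the values are invented for this example.
SAMPLE = [json.dumps({
    "pair_id": "p0001",
    "prompt": "a cat to the left of a dog",
    "perfect_image": "img/p0001_a.png",
    "perturbed_image": "img/p0001_b.png",
    "winner": 0,
    "relations": [["LeftOf", "cat", "dog"]],
    "model": "example-t2i",
})]
pairs = parse_pairs(SAMPLE)
```

Failing loudly on a missing field, rather than defaulting it, keeps malformed records out of a ranking loss where a silently wrong `winner` would invert the preference.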

4. Dataset Statistics and Quantitative Properties

SpatialReward-260k (image editing)

  • Mean Boxes/Image: $\approx 1.15$ ($\sigma \approx 0.42$)
  • Mean Region Area: $\approx 8\%$ ($\sigma \approx 5\%$)
  • Multi-Edit Task Distribution: 50% General, 25% Human-Basic, 25% Human-Fine

SpatialReward-Dataset (T2I)

  • Relation Distribution: LeftOf/RightOf (30%), Above/Below (25%), InFrontOf/Behind (15%), Between (10%), Inside/OnTopOf/Under (10%), Near/Far/SurroundedBy (10%)
  • Prompt Complexity: $\mu_{\text{obj}} = 4.8$, $\sigma = 1.2$ (objects); $\mu_{\text{rel}} = 2.4$, $\sigma = 1.1$ (relations)
  • Prompt Length: Mean 27 tokens (range 10–60)
  • Shannon Entropy (Relation Type): $H \approx 2.20$ nats

DesignSense-10k (layout preference)

  • Annotation Distribution: Left (28%), Right (30%), Both Good (25%), Both Bad (17%)
  • Aspect Ratios: $\log_2(\text{width}/\text{height})$ covers $[-2, 2]$
  • Element Count: 4–20 (peak 8–12); 2–8 after grouping
  • ARI vs Human Grouping (one-shot GPT-4o): 0.69
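
The reported relation-type entropy can be sanity-checked against the grouped relation shares listed above. The sketch below is a consistency check, not the authors' computation: the per-type shares inside each group are not given, so it brackets the entropy between the entropy of the six groups (a lower bound) and the entropy obtained if each group's share were split evenly among its members (an upper bound for this grouping).

```python
import math


def entropy_nats(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Grouped shares as listed in the relation distribution.
GROUPS = {
    ("LeftOf", "RightOf"): 0.30,
    ("Above", "Below"): 0.25,
    ("InFrontOf", "Behind"): 0.15,
    ("Between",): 0.10,
    ("Inside", "OnTopOf", "Under"): 0.10,
    ("Near", "Far", "SurroundedBy"): 0.10,
}

# Lower bound: treat each group as a single outcome (about 1.68 nats).
lower = entropy_nats(GROUPS.values())

# Upper bound: assume each group's share is split evenly across its members
# (about 2.39 nats). The even split is an assumption, not a reported figure.
upper = entropy_nats(
    share / len(names) for names, share in GROUPS.items() for _ in names
)

reported = 2.20
consistent = lower <= reported <= upper
```

The reported 2.20 nats falls inside this bracket, which is what one would expect if the types within each group are used somewhat unevenly.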

5. Benchmarks, Evaluation Protocols, and Metrics

SpatialReward-260k (image editing)

  • Primary Benchmarks: EditReward-Bench (pointwise), MMRB2 (ImgEdit), MER-Bench (MultiEditReward-Bench, 600 eval groups/1,800 edits; accuracy and Kendall’s $\tau$)
  • Statistical Metrics: Intersection-over-Union (IoU) for box alignment:

$$\operatorname{IoU}(B_{pred}, B_{gt}) = \frac{|B_{pred} \cap B_{gt}|}{|B_{pred} \cup B_{gt}|}$$

Shannon entropy of attention maps, $H(A) = -\sum_{i,j} A_{ij} \log A_{ij}$; entropy gap $|\Delta H| = |H(A_{src}) - H(A_{edit})|$; Pearson stability.
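
The IoU and attention-entropy formulas above can be written out directly. This is a minimal sketch under two assumptions not stated in the source: boxes are axis-aligned `[x1, y1, x2, y2]` lists, and attention maps are renormalized to sum to 1 before the entropy is taken.

```python
import math


def iou(a, b):
    """Intersection-over-Union of two axis-aligned [x1, y1, x2, y2] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def attention_entropy(attn):
    """H(A) = -sum A_ij log A_ij over a 2-D map, normalized to sum to 1."""
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must have positive mass")
    return -sum(
        (v / total) * math.log(v / total) for row in attn for v in row if v > 0
    )


def entropy_gap(attn_src, attn_edit):
    """|dH| = |H(A_src) - H(A_edit)|."""
    return abs(attention_entropy(attn_src) - attention_entropy(attn_edit))


overlap = iou([0, 0, 2, 2], [1, 1, 3, 3])          # 1 / 7
gap = entropy_gap([[1, 1], [1, 1]], [[1, 0], [0, 0]])  # log(4): uniform vs. one-hot
```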

SpatialReward-Dataset (T2I)

  • Training Objective: Bradley–Terry pairwise ranking loss:

$$\mathcal{L} = -\mathbb{E}_{(c, y_w, y_l)} \left[ \log \sigma\big( R(H(c, y_w)) - R(H(c, y_l)) \big) \right]$$
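
To make the objective concrete, the sketch below evaluates the Bradley–Terry loss on already-computed scalar rewards. It assumes each pair is given as `(reward_winner, reward_loser)`, standing in for $R(H(c, y_w))$ and $R(H(c, y_l))$; the reward model itself is out of scope, and the function names are illustrative.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(score_pairs):
    """Mean of -log sigmoid(r_w - r_l) over (winner, loser) reward pairs."""
    if not score_pairs:
        raise ValueError("need at least one pair")
    return -sum(log_sigmoid(rw - rl) for rw, rl in score_pairs) / len(score_pairs)


def pairwise_accuracy(score_pairs):
    """Fraction of pairs in which the winner receives the higher reward."""
    return sum(rw > rl for rw, rl in score_pairs) / len(score_pairs)


tied = bradley_terry_loss([(0.0, 0.0)])       # log(2): the model is indifferent
separated = bradley_terry_loss([(2.0, 0.0)])  # smaller: the winner is ranked higher
```

The loss depends only on the reward difference, so adding a constant to every reward leaves it unchanged; pairwise accuracy is the thresholded version of the same comparison.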

  • Reward Model Evaluation: 95.8% pairwise accuracy on the 365-pair benchmark, exceeding GPT-5 and Gemini-2.5 Pro. RL-trained models show gains in spatial accuracy across DPG-Bench, TIIF-Bench, and UniGenBench++; for example, the relation-spatial subtask score rises from 0.871 to 0.932.

DesignSense-10k (layout preference)

  • Evaluation Metrics: Macro F1 (mean per-class F1), Weighted F1, Binary Accuracy (left/right, 81% pilot), and Cohen’s $\kappa$.
  • Results: DesignSense increases Macro F1 by 54.6% (relative) over the best few-shot baseline; test Macro F1 is 0.456 and Weighted F1 is 0.520. Using the model directly as an RL reward improves layout-generator win rates by 3–4 pp; inference-time scaling yields +3.6 pp against the AesthetiQ baseline.
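
For reference, the four-class metrics named above can be computed from scratch as follows. The label strings and the toy predictions are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

LABELS = ["left", "right", "both_good", "both_bad"]  # label names are assumed


def per_class_f1(y_true, y_pred, labels):
    """F1 per class, using 2*TP / (2*TP + FP + FN); 0.0 when undefined."""
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


def weighted_f1(y_true, y_pred, labels):
    """Per-class F1 averaged with weights equal to each class's true count."""
    scores = per_class_f1(y_true, y_pred, labels)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in labels) / len(y_true)


def cohens_kappa(a, b):
    """Cohen's kappa between two label sequences of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0


y_true = ["left", "left", "right", "right", "both_good", "both_bad"]
y_pred = ["left", "right", "right", "right", "both_good", "left"]
macro = macro_f1(y_true, y_pred, LABELS)        # 0.575
weighted = weighted_f1(y_true, y_pred, LABELS)  # 0.6
kappa = cohens_kappa(y_true, y_pred)            # 0.52
```

Macro F1 weights the rare "both bad" class as heavily as the common ones, which is why it sits below Weighted F1 in this toy example, the same ordering as the reported 0.456 and 0.520.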

6. Access Methods and File Organization

SpatialReward/
    images/
        train/orig/
        train/edit/
        rl/
        val/
    annotations/
        train.jsonl
        rl.jsonl
        val.jsonl
    MER-Bench/
        groups_2.jsonl
        groups_3.jsonl
        groups_4.jsonl
    README.md

  • Annotation Example:

{
  "image_id":"SR_000123",
  "orig_path":"images/train/orig/SR_000123.png",
  "edit_path":"images/train/edit/SR_000123.png",
  "instruction":"Change the fabric to silk.",
  "edit_region":[
    {"id":0,"label":"fabric","bbox_2d":[120,300,480,720]}
  ],
  "reasoning":"<|bbox_0|> The fabric was changed to a smooth silk texture. <|global|> No other regions were modified.",
  "score_sc":[23,21],
  "score_pq":[24,25]
}
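
A consumer of these records might want to re-check the rationale/box alignment that the annotation pipeline enforces. The sketch below is one possible check, not the authors' verifier: it assumes `bbox_2d` is `[x1, y1, x2, y2]` on the [0, 1000] grid and that region tokens in `reasoning` have the form `<|bbox_k|>`, as in the example above. The helper names and the abbreviated record are illustrative.

```python
import re

BBOX_TOKEN = re.compile(r"<\|bbox_(\d+)\|>")


def check_alignment(record, grid=1000):
    """Return a list of problems; an empty list means the record looks consistent."""
    region_ids = {r["id"] for r in record["edit_region"]}
    cited = {int(k) for k in BBOX_TOKEN.findall(record["reasoning"])}
    problems = []
    for k in sorted(cited - region_ids):
        problems.append(f"reasoning cites unknown region {k}")
    for k in sorted(region_ids - cited):
        problems.append(f"region {k} is never cited")
    for r in record["edit_region"]:
        x1, y1, x2, y2 = r["bbox_2d"]  # coordinate order is an assumption
        if not (0 <= x1 < x2 <= grid and 0 <= y1 < y2 <= grid):
            problems.append(f"region {r['id']} has an invalid box")
    return problems


def to_pixels(bbox, width, height, grid=1000):
    """Rescale a grid-normalized box to pixel coordinates for a given image size."""
    x1, y1, x2, y2 = bbox
    return [x1 * width / grid, y1 * height / grid,
            x2 * width / grid, y2 * height / grid]


# Abbreviated copy of the annotation example above.
record = {
    "image_id": "SR_000123",
    "edit_region": [{"id": 0, "label": "fabric", "bbox_2d": [120, 300, 480, 720]}],
    "reasoning": "<|bbox_0|> The fabric was changed to a smooth silk texture. "
                 "<|global|> No other regions were modified.",
}
issues = check_alignment(record)  # no problems for this record
pixel_box = to_pixels(record["edit_region"][0]["bbox_2d"], width=1000, height=500)
```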

  • Data format (DesignSense-10k): Paired layouts with bounding boxes and semantic-group associations, stored as CSV or JSONL as appropriate.

SpatialReward-Dataset variants establish spatial-aware reward modeling as a central paradigm for pixel-level and layout-level alignment in generative workflows. SpatialReward-260k’s explicit grounding supports state-of-the-art RL editing agents, resolving “Attention Collapse” by tethering semantic judgments to edit-localized evidence (Long et al., 7 Feb 2026). The T2I preference dataset enables precise evaluation of spatial-relation satisfaction in text-to-image systems; reward-driven RL leveraging this data yields measurable gains in multi-relation spatial fidelity (Tang et al., 27 Feb 2026). DesignSense-10k fills the gap in layout preference modeling, enabling robust, user-aligned evaluation and improvement in high-variance visual domains (Gopal et al., 26 Feb 2026).

A plausible implication is that such datasets, when tightly aligned with pixel/object-level evidence and subject to rigorous multi-stage validation, are indispensable for advancing the spatial fidelity of generative models and downstream alignment in complex image-based RL settings. Interoperability with existing format conventions (e.g., COCO-like layouts) facilitates extension to new domains, including UI design and robotics where explicit spatial reasoning is required.
