SpatialEdit-Bench: Spatial Editing Evaluation

Updated 13 April 2026

SpatialEdit-Bench is a geometry-aware benchmark that evaluates image editing by testing both object-centric and camera-centric spatial transformations.
The benchmark pairs two kinds of metrics to assess precise 3D edits: LPIPS for perceptual plausibility, and geometric fidelity measures such as Viewpoint Error and Framing Error.
It leverages a large synthetic dataset (SpatialEdit-500k) and detailed evaluation protocols to differentiate true spatial competence from superficial inpainting.

SpatialEdit-Bench is a geometry-aware benchmark designed to evaluate the fine-grained spatial editing capabilities of image generative models. It assesses not only the perceptual plausibility of edited outputs but also, critically, their adherence to prescribed geometric transformations, operationalized as object-centric and camera-centric manipulations. The benchmark is structured to diagnose whether an editor understands both where and how an edit should be applied, by verifying photorealism together with the realization of the intended 3D change. This addresses a major shortcoming of previous benchmarks, which lack sensitivity to spatial precision (Xiao et al., 6 Apr 2026).

1. Task Definition and Axis Structure

SpatialEdit-Bench formalizes spatial image editing along two orthogonal axes: object-centric manipulation and camera-centric view control. Each axis encompasses specific, parameterizable tasks:

Object-centric manipulation:
- Translation: Relocating a designated object into a target bounding-box region within the image.
- Scaling: Resizing the object to fit a given rectangle, effecting changes in both scale and placement.
- Rotation: Reorienting the object to one of eight canonical views (e.g., front, front-right, right).
Camera-centric view control:
- Yaw: Applying horizontal rotation (orbiting) of the virtual camera in discretized 45° steps.
- Pitch: Adjusting vertical tilt in discretized 15° increments.
- Zoom: Modifying camera distance to perform zoom-in/zoom-out operations toward a focus object, altering both scale and perspective.

This structure enables isolated and compositional evaluation of an editor's capabilities in reproducing precise spatial and viewpoint changes.
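
The discretization described above can be sketched as a small data structure. This is an illustrative sketch only: the class and function names are hypothetical, the text names just three of the eight canonical views (the other five labels and their order are assumptions), and the allowed pitch range is not specified in the text, so only the step size is checked.

```python
from dataclasses import dataclass
from typing import Optional

YAW_STEP_DEG = 45    # yaw is discretized in 45 degree steps (from the text)
PITCH_STEP_DEG = 15  # pitch is discretized in 15 degree increments (from the text)

# Only "front", "front-right" and "right" appear in the text; the remaining
# labels and their ordering are assumptions made for this sketch.
CANONICAL_VIEWS = [
    "front", "front-right", "right", "back-right",
    "back", "back-left", "left", "front-left",
]


@dataclass(frozen=True)
class CameraEdit:
    """Hypothetical container for one camera-centric edit instruction."""
    yaw_deg: int = 0
    pitch_deg: int = 0
    zoom_in: Optional[bool] = None  # None means "no zoom"

    def __post_init__(self):
        # Reject angles that do not fall on the discretized grid.
        if self.yaw_deg % YAW_STEP_DEG != 0:
            raise ValueError("yaw must be a multiple of 45 degrees")
        if self.pitch_deg % PITCH_STEP_DEG != 0:
            raise ValueError("pitch must be a multiple of 15 degrees")


def view_after_rotation(start_view: str, steps: int) -> str:
    """Canonical view reached after `steps` discrete in-place rotations."""
    i = CANONICAL_VIEWS.index(start_view)
    return CANONICAL_VIEWS[(i + steps) % len(CANONICAL_VIEWS)]
```

Keeping the edit parameters on an explicit grid makes it straightforward to reject instructions that the benchmark does not define, such as a 30° yaw.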

2. Evaluation Metrics

SpatialEdit-Bench introduces a dual-metric suite to quantify both semantic plausibility and geometric accuracy:

Perceptual Plausibility (LPIPS):
- Adopted from learned perceptual similarity metrics, the LPIPS distance is computed as
$d_{\mathrm{LPIPS}}(x,y) = \left\lVert \phi(x) - \phi(y) \right\rVert_2,$

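where $\phi(\cdot)$ denotes a pre-trained deep feature extractor, typically VGG. This is a simplified single-feature form; the full LPIPS metric aggregates channel-weighted distances over several network layers. Lower values indicate greater perceptual similarity, which serves as a baseline for plausibility.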
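
As a minimal sketch of the simplified distance above, assuming the features $\phi(x)$ and $\phi(y)$ have already been extracted into arrays (the function name is hypothetical, and this is not the LPIPS reference implementation):

```python
import numpy as np


def feature_distance(feat_x, feat_y):
    """L2 distance between two precomputed feature arrays phi(x) and phi(y)."""
    fx = np.asarray(feat_x, dtype=float).ravel()
    fy = np.asarray(feat_y, dtype=float).ravel()
    if fx.shape != fy.shape:
        raise ValueError("feature arrays must have the same number of elements")
    return float(np.linalg.norm(fx - fy))
```

Identical features give a distance of zero, and the distance grows as the feature vectors diverge.
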
Geometric Fidelity:
- Viewpoint Error (VE): Integrates translation and rotation errors in the predicted camera pose, measuring deviation from the ground truth in $\mathrm{SE}(3)$. It is defined as:
$\epsilon_{xyz} = \frac{\left\|\mathbf{C}_{\mathrm{pred}} - \mathbf{C}_{\mathrm{gt}}\right\|_2}{\left\|\mathbf{C}_{\mathrm{gt}} - \mathbf{C}_{\mathrm{src}}\right\|_2 + \varepsilon},$

$\epsilon_{\mathrm{rot}} = \frac{1}{90} d_{\mathrm{geo}}(R_{\mathrm{pred}}, R_{\mathrm{gt}}),$

$\mathrm{VE} = \frac{1}{2}(\epsilon_{xyz} + \epsilon_{\mathrm{rot}}),$

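with the rotational distance $d_{\mathrm{geo}}$ given by the matrix trace formula, $d_{\mathrm{geo}}(R_1, R_2) = \arccos\left(\frac{\operatorname{tr}(R_1^{\top} R_2) - 1}{2}\right)$, taken here in degrees, consistent with the $\frac{1}{90}$ normalization.
- Framing Error (FE): Uses object detectors to verify that salient elements maintain the correct image-plane layout after the edit. This metric incorporates angular misalignment and zoom-direction correctness. FE is computed as: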

$\mathrm{FE} = \frac{1}{2}(\epsilon_{\mathrm{rag}} + \epsilon_{\mathrm{zde}}),$

where $\epsilon_{\mathrm{rag}}$ measures angular deviation between detected boxes and $\epsilon_{\mathrm{zde}}$ encodes correct zooming directionality.
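
The Viewpoint Error formulas above can be sketched in NumPy as follows. This is a sketch under stated assumptions: rotations are 3x3 matrices, the geodesic distance uses the standard trace formula and is measured in degrees (inferred from the $\frac{1}{90}$ factor), and the function names and the default `eps` are hypothetical.

```python
import numpy as np


def geodesic_deg(R_a, R_b):
    """Geodesic distance between two rotation matrices, in degrees."""
    R_rel = np.asarray(R_a, dtype=float).T @ np.asarray(R_b, dtype=float)
    cos_theta = (np.trace(R_rel) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))


def viewpoint_error(C_pred, C_gt, C_src, R_pred, R_gt, eps=1e-8):
    """VE = 0.5 * (eps_xyz + eps_rot), following the definitions in the text."""
    C_pred, C_gt, C_src = (np.asarray(c, dtype=float) for c in (C_pred, C_gt, C_src))
    # Translation error, normalized by how far the camera was supposed to move.
    e_xyz = np.linalg.norm(C_pred - C_gt) / (np.linalg.norm(C_gt - C_src) + eps)
    # Rotation error, normalized so that a 90 degree error maps to 1.
    e_rot = geodesic_deg(R_pred, R_gt) / 90.0
    return 0.5 * (e_xyz + e_rot)


def rot_z(deg):
    """Helper for examples: rotation about the z axis by `deg` degrees."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])
```

Under this normalization, an editor that leaves the camera at the source position gets a translation error close to 1, regardless of how large the requested move was.
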
Object Sub-task Scores:
- Moving Score (MS) and Rotation Score (RS) leverage both IoU of bounding boxes and VLM-derived object/view consistency. The overall object score takes their mean.
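
The IoU component of these scores is the standard box overlap ratio, sketched below together with the averaging step. The text does not specify how IoU and VLM-derived consistency are combined inside MS and RS, so that part is left out; the function names are hypothetical and boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region, clamped at zero when disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def object_overall(ms, rs):
    """Overall object score as the mean of the moving and rotation scores."""
    return 0.5 * (ms + rs)
```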

A summary of the principal metrics is given below:

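Metric	Purpose	Definition / Notes
LPIPS	Perceptual similarity	$\left\lVert \phi(x) - \phi(y) \right\rVert_2$; lower is more similar
Viewpoint Error (VE)	Camera pose accuracy	$\frac{1}{2}(\epsilon_{xyz} + \epsilon_{\mathrm{rot}})$; combines translation and rotation errors
Framing Error (FE)	Image-plane layout correctness	$\frac{1}{2}(\epsilon_{\mathrm{rag}} + \epsilon_{\mathrm{zde}})$; considers bounding-box alignment and zoom direction
MS, RS, Object Overall Score	Object manipulation accuracy	VLM- and detector-based; MS and RS are averaged for the overall score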

These metrics are designed to reveal not only visual believability but the edit's fidelity to the instructed geometric intent.

3. Ground-Truth Transformations and Representation

All edits in SpatialEdit-Bench are grounded in explicit 3D transformations:

Object-centric edits are formalized as rigid-body transforms in $\mathrm{SE}(3)$:

$T = \begin{bmatrix} R & \mathbf{t} \\ \mathbf{0}^{\top} & 1 \end{bmatrix},$

where $R \in \mathrm{SO}(3)$ (rotation) and $\mathbf{t} \in \mathbb{R}^{3}$ (translation). Scaling is applied as a uniform scale factor during rendering.
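
A rigid-body transform with a rotation and a translation can be packed into a 4x4 homogeneous matrix and applied to object points. The sketch below assumes the usual convention $\mathbf{x}' = R\mathbf{x} + \mathbf{t}$ with points stored as rows; the function names are hypothetical.

```python
import numpy as np


def make_se3(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = np.asarray(R, dtype=float)
    T[:3, 3] = np.asarray(t, dtype=float)
    return T


def apply_se3(T, points):
    """Apply x' = R x + t to an (N, 3) array of points."""
    pts = np.asarray(points, dtype=float)
    return pts @ T[:3, :3].T + T[:3, 3]
```

For example, a 90° rotation about the z axis followed by a unit shift along x sends the point (1, 0, 0) to (1, 1, 0).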

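Camera-centric edits are parameterized by camera extrinsics $(R, \mathbf{C})$, i.e. the camera rotation and camera center. Ground-truth, source, and predicted cameras are all encoded, enabling rigorous geometric comparison using both the direct pose errors

$\left\|\mathbf{C}_{\mathrm{pred}} - \mathbf{C}_{\mathrm{gt}}\right\|_2, \qquad d_{\mathrm{geo}}(R_{\mathrm{pred}}, R_{\mathrm{gt}}),$

and the normalized errors described in VE.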

2D Image-Plane Projections: For evaluating layout-specific metrics like FE, precise 2D bounding box projections are calculated from known 3D geometry.

This level of control and annotation enables unambiguous evaluation of spatial edits in a physically grounded manner.
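
One common way to obtain a 2D box from known 3D geometry is to project the eight corners of a 3D bounding box through a pinhole camera and take the extremes. The sketch below assumes a world-to-camera convention $X_{\mathrm{cam}} = R X_{\mathrm{world}} + \mathbf{t}$ and an intrinsics matrix `K`; both conventions, and all function names, are assumptions rather than details given in the text.

```python
from itertools import product

import numpy as np


def box_corners(center, size):
    """Eight corners of an axis-aligned 3D box given its center and edge lengths."""
    c = np.asarray(center, dtype=float)
    half = np.asarray(size, dtype=float) / 2.0
    signs = np.array(list(product([-1.0, 1.0], repeat=3)))
    return c + signs * half


def project_box(corners_world, K, R, t):
    """Project 3D corners with a pinhole camera and return (x1, y1, x2, y2)."""
    X_cam = np.asarray(corners_world, dtype=float) @ np.asarray(R, dtype=float).T + t
    if np.any(X_cam[:, 2] <= 0):
        raise ValueError("all corners must lie in front of the camera")
    uvw = X_cam @ np.asarray(K, dtype=float).T
    uv = uvw[:, :2] / uvw[:, 2:3]  # perspective divide
    x1, y1 = uv.min(axis=0)
    x2, y2 = uv.max(axis=0)
    return float(x1), float(y1), float(x2), float(y2)
```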

4. The SpatialEdit-500k Synthetic Dataset

Benchmarking with precise spatial supervision necessitates large, annotated datasets. SpatialEdit-500k addresses this via Blender-powered synthetic generation:

Object-centric pipeline:
- About 10K GLB assets from TexVerse form the object pool.
- Each asset is rendered in a canonical front view; VLM filtering ensures correct pose.
- Eight canonical viewpoints (via discrete in-place rotation), random 2D translations and 3D uniform scaling are applied.
- Semantic segmentation is performed using SAM3.
- Foregrounds are composited with backgrounds generated by a text-to-image model conditioned on object class.
- 3D bounding box projections produce exact box labels.
Camera-centric pipeline:
- About 200 curated 3D scenes, each annotated with salient focus objects.
- Cameras are sampled on a grid over yaw, pitch, and distance; views with object or truncation errors are filtered out using YOLO and a VLM.
- Camera parameters $(R, \mathbf{C})$ are recorded, and source-target pairs are formed with known transformation deltas.
- Templated and natural-language instructions are auto-generated for each edit.
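
The camera-centric sampling step can be sketched as a grid of camera positions orbiting a focus object, from which source-target pairs with known deltas are formed. This is a sketch only: the y-up axis convention, the specific pitch and distance values in the usage note, and all function names are assumptions, and the YOLO/VLM filtering stage is omitted.

```python
import numpy as np


def camera_center(focus, yaw_deg, pitch_deg, distance):
    """Camera position on a sphere around `focus` (y-up convention assumed)."""
    yaw, pitch = np.radians([yaw_deg, pitch_deg])
    offset = distance * np.array([
        np.cos(pitch) * np.sin(yaw),
        np.sin(pitch),
        np.cos(pitch) * np.cos(yaw),
    ])
    return np.asarray(focus, dtype=float) + offset


def sample_grid(focus, yaws, pitches, distances):
    """Enumerate every (yaw, pitch, distance) combination as a camera record."""
    return [
        {"yaw": y, "pitch": p, "distance": d,
         "center": camera_center(focus, y, p, d)}
        for y in yaws for p in pitches for d in distances
    ]


def pair_delta(src, tgt):
    """Known transformation delta between a source and a target camera."""
    if tgt["distance"] < src["distance"]:
        zoom = "in"
    elif tgt["distance"] > src["distance"]:
        zoom = "out"
    else:
        zoom = "none"
    return {
        "d_yaw": (tgt["yaw"] - src["yaw"]) % 360,
        "d_pitch": tgt["pitch"] - src["pitch"],
        "zoom": zoom,
    }
```

For instance, `sample_grid((0, 0, 0), range(0, 360, 45), (0, 15, 30), (2.0, 4.0))` enumerates 8 x 3 x 2 candidate cameras; any two of them define an edit whose yaw, pitch, and zoom deltas are known exactly.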

The dataset ultimately comprises approximately 500,000 image pairs, systematically balanced across all six benchmark sub-tasks.

5. Benchmark Evaluation Protocol

SpatialEdit-Bench defines a comprehensive evaluation protocol to ensure objectivity and rigor:

Hold-out splits: Specific objects and scenes are reserved for testing to rule out memorization and overfitting.
Per sub-task evaluation:
- Editors receive a source image together with an edit instruction and must render the edited image.
- Object-centric tasks are scored with MS and RS, averaged for the object overall score (higher is better).
- Camera-centric tasks are scored with VE and FE, averaged for camera overall error (lower is better).
Model ranking: By default, models are ranked by descending object overall and ascending camera error; a composite normalized score is also supported.

This protocol facilitates fair, reproducible comparison of spatial editing across systems.
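
The default ranking rule can be sketched as a sort over per-model summaries. The text does not say how the two criteria are combined, so this sketch assumes a lexicographic order (object overall first, camera error as tie-breaker); the function name and the scores in the example are hypothetical placeholders, not benchmark results.

```python
def rank_models(results):
    """Rank models by descending object overall, then ascending camera error."""
    rows = []
    for name, s in results.items():
        object_overall = 0.5 * (s["MS"] + s["RS"])  # higher is better
        camera_error = 0.5 * (s["VE"] + s["FE"])    # lower is better
        rows.append((name, object_overall, camera_error))
    # Negate the object score so a single ascending sort handles both keys.
    return sorted(rows, key=lambda r: (-r[1], r[2]))


# Placeholder scores for illustration only.
example = {
    "editor_a": {"MS": 0.50, "RS": 0.50, "VE": 0.50, "FE": 0.50},
    "editor_b": {"MS": 0.75, "RS": 0.50, "VE": 0.25, "FE": 0.50},
    "editor_c": {"MS": 0.25, "RS": 0.75, "VE": 0.25, "FE": 0.25},
}
ranking = [name for name, _, _ in rank_models(example)]
```

Here `editor_b` ranks first on object overall, while `editor_a` and `editor_c` tie on object overall and are separated by camera error.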

6. Empirical Benchmarks and Baseline Performance

SpatialEdit-16B serves as a reference model, trained on the SpatialEdit-500k dataset. Comparative performance with existing methods is as follows:

Task	SpatialEdit-16B	LongCat	Difference
Object moving score (MS)	0.673	0.373	+0.300
Object rotation score (RS)	0.632	0.505	+0.127
Viewpoint error (VE)	0.243	0.802	–0.559
Framing error (FE)	0.527	0.684	–0.157
Object overall	0.653	—	—
Camera overall error	0.385	—	—

Relative to LongCat, SpatialEdit-16B substantially improves geometric fidelity, most visibly on object moving (MS 0.673 vs. 0.373) and viewpoint control (VE 0.243 vs. 0.802), while maintaining strong semantic plausibility (Xiao et al., 6 Apr 2026). This suggests that SpatialEdit-Bench can differentiate general attention-based inpainting from true spatial understanding.
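
The arithmetic behind the table can be checked directly from the four sub-task scores. The sketch below uses only the reported values; the overall figures for LongCat are derived here by the same averaging rule and are not reported in the table.

```python
scores = {
    "SpatialEdit-16B": {"MS": 0.673, "RS": 0.632, "VE": 0.243, "FE": 0.527},
    "LongCat":         {"MS": 0.373, "RS": 0.505, "VE": 0.802, "FE": 0.684},
}


def overall(s):
    """Object overall (higher is better) and camera overall error (lower is better)."""
    return {
        "object": 0.5 * (s["MS"] + s["RS"]),
        "camera": 0.5 * (s["VE"] + s["FE"]),
    }


# Per-metric difference, SpatialEdit-16B minus LongCat, rounded as in the table.
difference = {
    k: round(scores["SpatialEdit-16B"][k] - scores["LongCat"][k], 3)
    for k in ("MS", "RS", "VE", "FE")
}

summary = {name: overall(s) for name, s in scores.items()}
```

For SpatialEdit-16B the object average is about 0.6525, which the table reports as 0.653, and the camera average is 0.385.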

7. Significance and Diagnostic Scope

SpatialEdit-Bench advances the evaluation of image editing models by enabling joint assessment of semantic believability and spatial manipulation fidelity. Its geometry-aware supervision, task granularity, and rigorous metric design allow researchers to distinguish superficial plausibility from genuine spatial competence. A plausible implication is that, as editors trained and evaluated via SpatialEdit-500k approach lower Viewpoint/Framing Errors and higher object scores, real-world editing pipelines can attain confidence not only in visual plausibility but also in faithful execution of precise geometric instructions (Xiao et al., 6 Apr 2026).

References

Xiao et al. (6 Apr 2026). SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing.
