SpatialEdit-Bench: Spatial Editing Evaluation
- SpatialEdit-Bench is a geometry-aware benchmark that evaluates image editing by testing both object-centric and camera-centric spatial transformations.
- The benchmark employs dual metrics—LPIPS for perceptual plausibility and geometric fidelity measures like Viewpoint and Framing Errors—to assess precise 3D edits.
- It leverages a large synthetic dataset (SpatialEdit-500k) and detailed evaluation protocols to differentiate true spatial competence from superficial inpainting.
SpatialEdit-Bench is a geometry-aware benchmark explicitly designed for evaluating the fine-grained spatial editing capabilities of image generative models. It assesses not only the perceptual plausibility of editing outputs but, critically, their adherence to prescribed geometric transformations, operationalized via object-centric and camera-centric manipulations. The benchmark is structured to diagnose an editor's grasp of "where" and "how" to edit by verifying both photorealism and the implementation of the intended 3D changes, addressing a major shortcoming of previous benchmarks, which lack sensitivity to spatial precision (Xiao et al., 6 Apr 2026).
1. Task Definition and Axis Structure
SpatialEdit-Bench formalizes spatial image editing along two orthogonal axes: object-centric manipulation and camera-centric view control. Each axis encompasses specific, parameterizable tasks:
- Object-centric manipulation:
  - Translation: Relocating a designated object into a target bounding-box region within the image.
  - Scaling: Resizing the object to fit a given rectangle, effecting changes in both scale and placement.
  - Rotation: Reorienting the object to one of eight canonical views (e.g., front, front-right, right, etc.).
- Camera-centric view control:
  - Yaw: Applying horizontal rotation (orbiting) of the virtual camera in discretized 45° steps.
  - Pitch: Adjusting vertical tilt in discretized 15° increments.
  - Zoom: Modifying camera distance to perform zoom-in/zoom-out operations toward a focus object, altering both scale and perspective.
This structure enables isolated and compositional evaluation of an editor's capabilities in reproducing precise spatial and viewpoint changes.
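The parameterization above can be illustrated with a small sketch. The class and function names are hypothetical, the view names beyond the first three are assumed, and treating the eight canonical views as 45° steps of in-place rotation is an assumption rather than something the benchmark specifies.

```python
from dataclasses import dataclass

# Assumed names for the eight canonical views; the text lists only the first three.
CANONICAL_VIEWS = ("front", "front-right", "right", "back-right",
                   "back", "back-left", "left", "front-left")

YAW_STEP_DEG = 45    # yaw is discretized in 45-degree steps
PITCH_STEP_DEG = 15  # pitch is discretized in 15-degree increments


@dataclass(frozen=True)
class CameraEdit:
    """Hypothetical container for one camera-centric edit."""
    yaw_deg: int = 0
    pitch_deg: int = 0
    zoom: str = "none"  # "in", "out", or "none"

    def __post_init__(self):
        # Reject parameters that fall outside the discretized task grid.
        if self.yaw_deg % YAW_STEP_DEG != 0:
            raise ValueError("yaw must be a multiple of 45 degrees")
        if self.pitch_deg % PITCH_STEP_DEG != 0:
            raise ValueError("pitch must be a multiple of 15 degrees")
        if self.zoom not in ("in", "out", "none"):
            raise ValueError("zoom must be 'in', 'out', or 'none'")


def view_after_rotation(start_view: str, steps: int) -> str:
    """Return the canonical view reached after `steps` turns of 45 degrees."""
    i = CANONICAL_VIEWS.index(start_view)
    return CANONICAL_VIEWS[(i + steps) % len(CANONICAL_VIEWS)]
```

Keeping the two axes in separate structures mirrors the benchmark's split: an edit can be validated on one axis in isolation, or several such records can be combined for compositional tests.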
2. Evaluation Metrics
SpatialEdit-Bench introduces a dual-metric suite to quantify both semantic plausibility and geometric accuracy:
- Perceptual Plausibility (LPIPS):
  - Adopted from learned perceptual similarity metrics, the LPIPS distance between two images $x$ and $\hat{x}$ is computed, in its standard form, as
    $$\mathrm{LPIPS}(x, \hat{x}) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \phi_l(x)_{hw} - \phi_l(\hat{x})_{hw} \right) \right\|_2^2$$
    where $\phi_l$ denotes the channel-normalized layer-$l$ activations of a pre-trained deep feature extractor, typically VGG, and $w_l$ are learned per-channel weights. Lower values reflect greater perceptual similarity, serving as a baseline for plausibility.
- Geometric Fidelity:
  - Viewpoint Error (VE): Integrates the translation and rotation errors of the predicted camera pose, measuring its deviation from the ground-truth pose. The rotational distance is given by the matrix trace formula, $\theta = \arccos\left(\frac{\operatorname{tr}(R_{\mathrm{gt}}^{\top} R_{\mathrm{pred}}) - 1}{2}\right)$.
  - Framing Error (FE): Uses object detectors to verify that salient elements maintain the correct image-plane layout after the edit. It combines a term measuring the angular deviation between detected boxes with a term encoding whether the zoom direction is correct.
- Object Sub-task Scores:
  - Moving Score (MS) and Rotation Score (RS) leverage both the IoU of bounding boxes and VLM-derived object/view consistency. The overall object score is their mean.
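As an illustration of the rotational distance used inside VE, the following minimal sketch evaluates the standard trace formula for two rotation matrices. It covers only the rotation term; how VE normalizes and combines it with the translation error is not reproduced here. The function names and the y-up yaw convention are assumptions made for the example.

```python
import math


def rotation_angle_deg(R_gt, R_pred):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_gt^T R_pred) equals the element-wise sum of products of the two matrices.
    trace = sum(R_gt[i][j] * R_pred[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] so floating-point round-off cannot break acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def yaw_matrix(deg):
    """Rotation about the vertical (y) axis; the axis convention is an assumption."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
```

For example, `rotation_angle_deg(yaw_matrix(0), yaw_matrix(45))` recovers an angle of about 45 degrees, which is the size of one yaw step in the benchmark.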
A summary of the principal metrics is given below:
| Metric | Purpose | Mathematical Definition/Notes |
|---|---|---|
| LPIPS | Perceptual similarity | Distance between deep features (typically VGG); lower is better |
| Viewpoint Error (VE) | Camera pose accuracy | Combines translation/rotation errors |
| Framing Error (FE) | Image-plane layout correctness | Considers bounding box alignment, zoom |
| MS, RS, Object Overall Score | Object manipulation accuracy | VLM- and detector-based; MS and RS are averaged for the overall score |
These metrics are designed to reveal not only visual believability but the edit's fidelity to the instructed geometric intent.
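The bounding-box IoU that feeds the Moving Score can be sketched as follows. This shows only the IoU component; how MS combines it with the VLM-derived consistency signal is not specified in this summary, so the sketch makes no attempt to reproduce the full score. The function name and box format are assumptions.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap region, clipped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0
```

An edit that places the object exactly in the target region yields an IoU of 1, while an object left in its original, non-overlapping position yields 0.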
3. Ground-Truth Transformations and Representation
All edits in SpatialEdit-Bench are grounded in explicit 3D transformations:
- Object-centric edits are formalized as rigid-body transforms in $SE(3)$:
$$T = \begin{bmatrix} R & t \\ \mathbf{0}^{\top} & 1 \end{bmatrix}$$
where $R \in SO(3)$ (rotation) and $t \in \mathbb{R}^{3}$ (translation). Scaling is applied as a uniform scale in the rendering process.
- Camera-centric edits are parameterized by camera extrinsics. Ground-truth, source, and predicted cameras are encoded, enabling rigorous geometric comparison using both the direct pose error between predicted and ground-truth extrinsics and the normalized errors described in VE.
- 2D Image-Plane Projections: For evaluating layout-specific metrics like FE, precise 2D bounding box projections are calculated from known 3D geometry.
This level of control and annotation enables unambiguous evaluation of spatial edits in a physically grounded manner.
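To make the 2D image-plane projection concrete, the sketch below applies a rigid transform to the corners of a 3D box and projects them through a pinhole camera to obtain the enclosing 2D box. The pinhole model, the +z viewing direction, the world-to-camera convention for (R, t), and all function and parameter names are assumptions chosen for illustration, not the benchmark's rendering code.

```python
from itertools import product


def apply_rigid(R, t, p):
    """Apply x' = R x + t to a 3D point (R: 3x3 nested list; t, p: length 3)."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def project_box(center, size, R, t, focal, cx, cy):
    """Project the 8 corners of an axis-aligned 3D box into the image and
    return the enclosing 2D box (x_min, y_min, x_max, y_max) in pixels.

    Assumes a pinhole camera looking down +z with world-to-camera extrinsics (R, t).
    """
    us, vs = [], []
    for sx, sy, sz in product((-0.5, 0.5), repeat=3):
        corner = [center[0] + sx * size[0],
                  center[1] + sy * size[1],
                  center[2] + sz * size[2]]
        x, y, z = apply_rigid(R, t, corner)
        if z <= 0:
            raise ValueError("corner is behind the camera")
        # Perspective division followed by a shift to the principal point.
        us.append(focal * x / z + cx)
        vs.append(focal * y / z + cy)
    return min(us), min(vs), max(us), max(vs)
```

Because the box label is derived from known geometry rather than from a detector, it can serve as an exact reference for layout-sensitive metrics.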
4. The SpatialEdit-500k Synthetic Dataset
Benchmarking with precise spatial supervision necessitates large, annotated datasets. SpatialEdit-500k addresses this via Blender-powered synthetic generation:
- Object-centric pipeline:
  - 710K GLB assets from TexVerse form the object pool.
  - Each asset is rendered in a canonical front view; VLM filtering ensures correct pose.
  - Eight canonical viewpoints (via discrete in-place rotation), random 2D translations and 3D uniform scaling are applied.
  - Semantic segmentation is performed using SAM3.
  - Foregrounds are composited with backgrounds generated by a text-to-image model conditioned on object class.
  - 3D bounding box projections produce exact box labels.
- Camera-centric pipeline:
  - 8200 curated 3D scenes, with 9 salient focus objects per scene.
  - Cameras are sampled on a grid over yaw, pitch, and distance; views with missing or truncated objects are filtered out using YOLO and a VLM.
  - Camera parameters are recorded for every view, and source-target pairs are formed with known transformation deltas.
  - Templated and natural-language instructions are auto-generated for each edit.
The dataset ultimately comprises approximately 500,000 image pairs, systematically balanced across all six benchmark sub-tasks.
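The camera-centric pairing step can be sketched as a grid enumeration followed by pair formation with explicit deltas. The grid values, dictionary keys, left/right convention, and instruction template below are all hypothetical; the actual pipeline runs in Blender and its exact ranges and wording are not given here.

```python
from itertools import permutations, product


def camera_grid(yaws_deg, pitches_deg, distances):
    """Enumerate camera poses on a yaw x pitch x distance grid."""
    return [{"yaw": y, "pitch": p, "dist": d}
            for y, p, d in product(yaws_deg, pitches_deg, distances)]


def wrap_deg(angle):
    """Wrap an angle difference into the interval [-180, 180)."""
    return (angle + 180) % 360 - 180


def make_pairs(cameras):
    """Form ordered source-target pairs annotated with their pose delta."""
    pairs = []
    for src, tgt in permutations(cameras, 2):
        delta = {
            "d_yaw": wrap_deg(tgt["yaw"] - src["yaw"]),
            "d_pitch": tgt["pitch"] - src["pitch"],
            "d_dist": tgt["dist"] - src["dist"],
        }
        pairs.append({"source": src, "target": tgt, "delta": delta})
    return pairs


def yaw_instruction(delta):
    """Hypothetical template for a yaw-only edit; not the dataset's wording."""
    direction = "right" if delta["d_yaw"] > 0 else "left"
    return f"Orbit the camera {abs(delta['d_yaw'])} degrees to the {direction}."


# Illustrative grid values only; the exact ranges are not listed in the text.
cameras = camera_grid(range(0, 360, 45), (0, 15, 30), (2.0, 3.0))
pairs = make_pairs(cameras)
```

Storing the delta alongside each pair is what allows an instruction and its ground-truth target pose to be generated from the same record.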
5. Benchmark Evaluation Protocol
SpatialEdit-Bench defines a comprehensive evaluation protocol to ensure objectivity and rigor:
- Hold-out splits: Specific objects and scenes are reserved for testing to eliminate memorization or overfitting.
- Per sub-task evaluation:
  - Editors receive the source image and the edit instruction and must render the edited image.
  - Object-centric tasks are scored with MS and RS, averaged for the object overall score (higher is better).
  - Camera-centric tasks are scored with VE and FE, averaged for camera overall error (lower is better).
- Model ranking: By default, models are ranked by descending object overall and ascending camera error; a composite normalized score is also supported.
This protocol facilitates fair, reproducible comparison of spatial editing across systems.
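The aggregation and default ranking can be sketched in a few lines. The model names and scores below are made up for illustration, and reading "descending object overall and ascending camera error" as a lexicographic sort (object score first, camera error as tie-breaker) is an assumption; the composite normalized score is not defined in this summary and is therefore omitted.

```python
from statistics import mean


def summarize(scores):
    """Collapse one model's four sub-scores into the two overall numbers."""
    return {
        "object_overall": mean([scores["MS"], scores["RS"]]),  # higher is better
        "camera_error": mean([scores["VE"], scores["FE"]]),    # lower is better
    }


def rank_models(results):
    """Sort model names by descending object overall, then ascending camera error."""
    summaries = {name: summarize(s) for name, s in results.items()}
    return sorted(
        summaries,
        key=lambda n: (-summaries[n]["object_overall"], summaries[n]["camera_error"]),
    )


# Hypothetical models and scores, for illustration only.
results = {
    "editor_a": {"MS": 0.50, "RS": 0.40, "VE": 0.60, "FE": 0.70},
    "editor_b": {"MS": 0.70, "RS": 0.60, "VE": 0.30, "FE": 0.50},
}
ranking = rank_models(results)
```

Here `editor_b` ranks first because it has the higher object overall score; its lower camera error would only matter as a tie-breaker under this reading.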
6. Empirical Benchmarks and Baseline Performance
SpatialEdit-16B serves as a reference model, trained on the SpatialEdit-500k dataset. Comparative performance with existing methods is as follows:
| Task | SpatialEdit-16B | LongCat | Difference |
|---|---|---|---|
| Object moving score (MS) | 0.673 | 0.373 | +0.300 |
| Object rotation score (RS) | 0.632 | 0.505 | +0.127 |
| Viewpoint error (VE) | 0.243 | 0.802 | –0.559 |
| Framing error (FE) | 0.527 | 0.684 | –0.157 |
| Object overall | 0.653 | — | — |
| Camera overall error | 0.385 | — | — |
SpatialEdit-16B achieves strong semantic plausibility while substantially improving on geometric fidelity, especially on object manipulation and precise viewpoint control (Xiao et al., 6 Apr 2026). This suggests that SpatialEdit-Bench is capable of differentiating between general attention-based inpainting and true spatial understanding.
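The derived entries of the table can be checked against the four reported sub-scores. The sketch below uses exact decimal arithmetic and assumes the overall values are plain means rounded half-up to three decimals, which is an assumption about how the table was produced; the dictionary and helper names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Sub-scores copied from the table above, kept as strings for exact arithmetic.
REPORTED = {
    "SpatialEdit-16B": {"MS": "0.673", "RS": "0.632", "VE": "0.243", "FE": "0.527"},
    "LongCat": {"MS": "0.373", "RS": "0.505", "VE": "0.802", "FE": "0.684"},
}


def round3(value):
    """Round a Decimal to three places, halves rounding up."""
    return value.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)


def overall(model, metric_a, metric_b):
    """Mean of two reported sub-scores for one model."""
    scores = REPORTED[model]
    return round3((Decimal(scores[metric_a]) + Decimal(scores[metric_b])) / 2)


def difference(metric):
    """SpatialEdit-16B minus LongCat for one metric."""
    return round3(Decimal(REPORTED["SpatialEdit-16B"][metric])
                  - Decimal(REPORTED["LongCat"][metric]))


object_overall = overall("SpatialEdit-16B", "MS", "RS")   # Decimal("0.653")
camera_overall = overall("SpatialEdit-16B", "VE", "FE")   # Decimal("0.385")
```

Under these assumptions the computed values agree with the reported object overall (0.653), camera overall error (0.385), and the Difference column.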
7. Significance and Diagnostic Scope
SpatialEdit-Bench advances the evaluation of image editing models by enabling joint assessment of semantic believability and spatial manipulation fidelity. Its geometry-aware supervision, task granularity, and rigorous metric design allow researchers to distinguish superficial plausibility from genuine spatial competence. A plausible implication is that, as editors trained on SpatialEdit-500k and evaluated on SpatialEdit-Bench reach lower Viewpoint and Framing Errors and higher object scores, real-world editing pipelines can gain confidence not only in visual plausibility but also in the faithful execution of precise geometric instructions (Xiao et al., 6 Apr 2026).