SpatialEdit-Bench: Spatial Editing Evaluation

Updated 13 April 2026
  • SpatialEdit-Bench is a geometry-aware benchmark that evaluates image editing by testing both object-centric and camera-centric spatial transformations.
  • The benchmark employs dual metrics—LPIPS for perceptual plausibility and geometric fidelity measures like Viewpoint and Framing Errors—to assess precise 3D edits.
  • It leverages a large synthetic dataset (SpatialEdit-500k) and detailed evaluation protocols to differentiate true spatial competence from superficial inpainting.

SpatialEdit-Bench is a geometry-aware benchmark explicitly designed for evaluating the fine-grained spatial editing capabilities of image generative models. It assesses not only the perceptual plausibility of editing outputs but, critically, their adherence to prescribed geometric transformations, operationalized via object-centric and camera-centric manipulations. The benchmark is structured to diagnose an editor's grasp of "where" and "how" an edit is applied by verifying both photorealism and the implementation of the intended 3D changes, addressing a major shortcoming of previous benchmarks, which lack sensitivity to spatial precision (Xiao et al., 6 Apr 2026).

1. Task Definition and Axis Structure

SpatialEdit-Bench formalizes spatial image editing along two orthogonal axes: object-centric manipulation and camera-centric view control. Each axis encompasses specific, parameterizable tasks:

  • Object-centric manipulation:
    • Translation: Relocating a designated object into a target bounding-box region within the image.
    • Scaling: Resizing the object to fit a given rectangle, effecting changes in both scale and placement.
    • Rotation: Reorienting the object to one of eight canonical views (e.g., front, front-right, right, etc.).
  • Camera-centric view control:
    • Yaw: Applying horizontal rotation (orbiting) of the virtual camera in discretized 45° steps.
    • Pitch: Adjusting vertical tilt in discretized 15° increments.
    • Zoom: Modifying camera distance to perform zoom-in/zoom-out operations toward a focus object, altering both scale and perspective.

This structure enables isolated and compositional evaluation of an editor's capabilities in reproducing precise spatial and viewpoint changes.
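
The two axes and six sub-tasks can be pictured as a small task-specification type. The following is a minimal sketch, not the benchmark's actual data format: `Axis`, `EditTask`, the helper functions, and the parameter keys are hypothetical names, and the view labels beyond "front", "front-right", and "right" are assumed by continuing the pattern.

```python
from dataclasses import dataclass, field
from enum import Enum


class Axis(Enum):
    OBJECT = "object-centric"
    CAMERA = "camera-centric"


# Step sizes stated in the task definition.
YAW_STEP_DEG = 45
PITCH_STEP_DEG = 15

# Eight canonical views. Only the first three labels appear in the text;
# the remaining ones are assumed for illustration.
CANONICAL_VIEWS = (
    "front", "front-right", "right", "back-right",
    "back", "back-left", "left", "front-left",
)


@dataclass
class EditTask:
    axis: Axis
    kind: str                                   # one of the six sub-task names
    params: dict = field(default_factory=dict)  # hypothetical parameter payload


def rotation_task(view: str) -> EditTask:
    if view not in CANONICAL_VIEWS:
        raise ValueError(f"unknown canonical view: {view}")
    return EditTask(Axis.OBJECT, "rotation", {"view": view})


def translation_task(box: tuple) -> EditTask:
    # box = (x_min, y_min, x_max, y_max) target region (assumed layout)
    return EditTask(Axis.OBJECT, "translation", {"target_box": box})


def yaw_task(steps: int) -> EditTask:
    # Yaw is discretized in 45-degree steps.
    return EditTask(Axis.CAMERA, "yaw", {"degrees": steps * YAW_STEP_DEG})


def pitch_task(steps: int) -> EditTask:
    # Pitch is discretized in 15-degree steps.
    return EditTask(Axis.CAMERA, "pitch", {"degrees": steps * PITCH_STEP_DEG})
```

Keeping the axis explicit on each task makes it straightforward to route object-centric and camera-centric edits to their respective metrics.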

2. Evaluation Metrics

SpatialEdit-Bench introduces a dual-metric suite to quantify both semantic plausibility and geometric accuracy:

  • Perceptual Plausibility (LPIPS):
    • Adopted from learned perceptual similarity metrics, the LPIPS distance is computed, in simplified form, as

    $$d_{\mathrm{LPIPS}}(x,y) = \left\lVert \phi(x) - \phi(y) \right\rVert_2,$$

    where $\phi(\cdot)$ denotes a pre-trained deep feature extractor, typically VGG. Lower values reflect increased perceptual similarity, serving as a baseline for plausibility.

  • Geometric Fidelity:

    • Viewpoint Error (VE): Integrates translation and rotation errors in the predicted camera pose, measuring deviation from the ground truth in $\mathrm{SE}(3)$. It is defined as:

    $$\epsilon_{xyz} = \frac{\left\|\mathbf{C}_{\mathrm{pred}} - \mathbf{C}_{\mathrm{gt}}\right\|_2}{\left\|\mathbf{C}_{\mathrm{gt}} - \mathbf{C}_{\mathrm{src}}\right\|_2 + \varepsilon},$$

    $$\epsilon_{\mathrm{rot}} = \frac{1}{90}\, d_{\mathrm{geo}}(R_{\mathrm{pred}}, R_{\mathrm{gt}}),$$

    $$\mathrm{VE} = \frac{1}{2}(\epsilon_{xyz} + \epsilon_{\mathrm{rot}}),$$

    with the rotational distance $d_{\mathrm{geo}}$ given by the matrix trace formula.

    • Framing Error (FE): Uses object detectors to verify that salient elements maintain the correct image-plane layout after the edit. This metric incorporates angular misalignment and zoom-direction correctness. FE is computed as:

    $$\mathrm{FE} = \frac{1}{2}(\epsilon_{\mathrm{rag}} + \epsilon_{\mathrm{zde}}),$$

    where $\epsilon_{\mathrm{rag}}$ measures angular deviation between detected boxes and $\epsilon_{\mathrm{zde}}$ encodes whether the zoom direction is correct.

  • Object Sub-task Scores:

    • Moving Score (MS) and Rotation Score (RS) leverage both IoU of bounding boxes and VLM-derived object/view consistency. The overall object score takes their mean.
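
The Viewpoint Error formulas above can be sketched in plain Python. This is an illustrative reading rather than the benchmark's reference implementation: it assumes that the C terms are camera centers, that the "matrix trace formula" is the standard geodesic angle arccos((tr(RᵀR') − 1)/2) expressed in degrees (consistent with the 1/90 normalization), and that ε is a small constant such as 1e-8. The function names and the `rot_z` helper are hypothetical.

```python
import math


def geodesic_deg(R_a, R_b):
    """Angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_a^T R_b) equals the sum of element-wise products.
    tr = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (tr - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def viewpoint_error(C_pred, C_gt, C_src, R_pred, R_gt, eps=1e-8):
    """VE = 0.5 * (eps_xyz + eps_rot), following the definitions above."""
    # Position error, normalized by how far the camera was asked to move.
    eps_xyz = math.dist(C_pred, C_gt) / (math.dist(C_gt, C_src) + eps)
    # Rotation error, normalized so that 90 degrees maps to 1.
    eps_rot = geodesic_deg(R_pred, R_gt) / 90.0
    return 0.5 * (eps_xyz + eps_rot)


def rot_z(deg):
    """Rotation about the z-axis, used only to build example poses."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


IDENTITY = rot_z(0.0)

# An editor that leaves the camera untouched when a 90-degree orbit was requested.
ve_no_edit = viewpoint_error(
    C_pred=(1.0, 0.0, 0.0), C_gt=(0.0, 1.0, 0.0), C_src=(1.0, 0.0, 0.0),
    R_pred=IDENTITY, R_gt=rot_z(90.0),
)
```

Under these assumptions, an output that simply reproduces the source view scores close to 1 on both terms, which is what makes VE sensitive to edits that look plausible but ignore the instruction.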

A summary of the principal metrics is given below:

| Metric | Purpose | Definition / notes |
|---|---|---|
| LPIPS | Perceptual similarity | $\lVert \phi(x) - \phi(y) \rVert_2$ |
| Viewpoint Error (VE) | Camera pose accuracy | Combines translation and rotation errors |
| Framing Error (FE) | Image-plane layout correctness | Considers bounding-box alignment and zoom direction |
| MS, RS, object overall score | Object manipulation accuracy | VLM- and detector-based; MS and RS are averaged for the overall score |

These metrics are designed to reveal not only visual believability but the edit's fidelity to the instructed geometric intent.
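
The object scores rely in part on the IoU of bounding boxes. How IoU is combined with the VLM-derived consistency judgments is not specified here, so the sketch below covers only the IoU component, assuming axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; `box_iou` is a hypothetical helper name.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Width and height of the overlap region, zero if the boxes are disjoint.
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0
```

For a translation task, the first box would be the detected object location in the edited image and the second the instructed target region.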

3. Ground-Truth Transformations and Representation

All edits in SpatialEdit-Bench are grounded in explicit 3D transformations:

  • Object-centric edits are formalized as rigid-body transforms in $\mathrm{SE}(3)$:

$$T = \begin{bmatrix} R & \mathbf{t} \\ \mathbf{0}^\top & 1 \end{bmatrix},$$

where $R \in \mathrm{SO}(3)$ (rotation) and $\mathbf{t} \in \mathbb{R}^3$ (translation). Scaling is handled by applying a uniform scale factor in the rendering process.

  • Camera-centric edits are parameterized by camera extrinsics. Ground-truth, source, and predicted cameras are all encoded, enabling rigorous geometric comparison using both a direct pose error between the predicted and ground-truth cameras and the normalized errors described in VE.

  • 2D Image-Plane Projections: For evaluating layout-specific metrics like FE, precise 2D bounding box projections are calculated from known 3D geometry.

This level of control and annotation enables unambiguous evaluation of spatial edits in a physically grounded manner.
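
To make the link between 3D ground truth and 2D box labels concrete, the sketch below applies a rigid transform to the corners of a 3D box and projects them to obtain an image-plane bounding box. It assumes a simple pinhole camera looking along +z with focal length `f` in pixels and principal point `(cx, cy)`, and that all points lie in front of the camera. These conventions and the function names are illustrative assumptions, not the benchmark's rendering setup.

```python
from itertools import product


def apply_rigid(R, t, p):
    """Return R @ p + t for a 3x3 rotation R (nested lists) and 3-vectors t, p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def project(p, f, cx, cy):
    """Pinhole projection of a camera-space point with z > 0."""
    x, y, z = p
    return (f * x / z + cx, f * y / z + cy)


def projected_bbox(corners, R, t, f, cx, cy):
    """2D box (x_min, y_min, x_max, y_max) enclosing the projected 3D corners."""
    pts = [project(apply_rigid(R, t, c), f, cx, cy) for c in corners]
    xs = [u for u, _ in pts]
    ys = [v for _, v in pts]
    return (min(xs), min(ys), max(xs), max(ys))


# Example: a cube of side 2 centered at the origin, placed 5 units in front of the camera.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cube = list(product((-1.0, 1.0), repeat=3))
bbox = projected_bbox(cube, IDENTITY, (0.0, 0.0, 5.0), f=100.0, cx=0.0, cy=0.0)
```

Because the box is derived from known geometry rather than from a detector, it can serve as an exact label against which detected boxes are compared.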

4. The SpatialEdit-500k Synthetic Dataset

Benchmarking with precise spatial supervision necessitates large, annotated datasets. SpatialEdit-500k addresses this via Blender-powered synthetic generation:

  • Object-centric pipeline:
    • ~10K GLB assets from TexVerse form the object pool.
    • Each asset is rendered in a canonical front view; VLM filtering ensures correct pose.
    • Eight canonical viewpoints (via discrete in-place rotation), random 2D translations and 3D uniform scaling are applied.
    • Semantic segmentation is performed using SAM3.
    • Foregrounds are composited with backgrounds generated by a text-to-image model conditioned on object class.
    • 3D bounding box projections produce exact box labels.
  • Camera-centric pipeline:
    • ~200 curated 3D scenes, each with salient focus objects.
    • Cameras are sampled over a grid of yaw, pitch, and distance; views with object-detection or truncation errors are filtered out using YOLO and a VLM.
    • Camera parameters are recorded, and source-target pairs are formed with known transformation deltas.
    • Templated and natural-language instructions are auto-generated for each edit.

The dataset ultimately comprises approximately 500,000 image pairs, systematically balanced across all six benchmark sub-tasks.
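
The camera-centric pairing step can be illustrated with a small grid enumeration. This is a sketch under stated assumptions: the grid values are made up, the YOLO/VLM filtering stage is omitted, and pairs are restricted to those that differ in exactly one parameter, a choice made here for clarity rather than one confirmed by the dataset description. `camera_grid` and `single_axis_pairs` are hypothetical names.

```python
from itertools import permutations, product


def camera_grid(yaws, pitches, distances):
    """Enumerate every camera on the yaw x pitch x distance grid."""
    return [
        {"yaw": y, "pitch": p, "distance": d}
        for y, p, d in product(yaws, pitches, distances)
    ]


def single_axis_pairs(cameras):
    """Ordered (source, target, axis, delta) tuples that change one parameter."""
    pairs = []
    for src, tgt in permutations(cameras, 2):
        changed = [k for k in src if src[k] != tgt[k]]
        if len(changed) != 1:
            continue
        axis = changed[0]
        delta = tgt[axis] - src[axis]
        if axis == "yaw":
            # Wrap yaw differences into [-180, 180) so orbits take the short way.
            delta = (delta + 180) % 360 - 180
        pairs.append((src, tgt, axis, delta))
    return pairs


cams = camera_grid(yaws=[0, 45], pitches=[0, 15], distances=[2.0])
pairs = single_axis_pairs(cams)
```

Since each pair carries its exact delta, a templated instruction such as "rotate the camera 45° to the right" could be generated directly from the `axis` and `delta` fields.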

5. Benchmark Evaluation Protocol

SpatialEdit-Bench defines a comprehensive evaluation protocol to ensure objectivity and rigor:

  • Hold-out splits: Specific objects and scenes are reserved for testing so that results do not reflect memorization or overfitting.
  • Per sub-task evaluation:
    • Editors receive the source image and the edit instruction and must render the edited image.
    • Object-centric tasks are scored with MS and RS, averaged for the object overall score (higher is better).
    • Camera-centric tasks are scored with VE and FE, averaged for camera overall error (lower is better).
  • Model ranking: By default, models are ranked by descending object overall and ascending camera error; a composite normalized score is also supported.

This protocol facilitates fair, reproducible comparison of spatial editing across systems.
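
The aggregation and default ranking can be sketched as follows. The averaging rules come from the protocol above; treating the ranking as a single sort by descending object overall score with ascending camera error as the tie-breaker is one possible reading and is an assumption. The model names and scores are hypothetical, and the composite normalized score is not shown because its definition is not given.

```python
def object_overall(ms, rs):
    """Mean of Moving Score and Rotation Score (higher is better)."""
    return (ms + rs) / 2.0


def camera_overall(ve, fe):
    """Mean of Viewpoint Error and Framing Error (lower is better)."""
    return (ve + fe) / 2.0


def rank_models(results):
    """Sort model names: best object score first, camera error breaks ties."""
    def key(name):
        r = results[name]
        return (-object_overall(r["ms"], r["rs"]), camera_overall(r["ve"], r["fe"]))
    return sorted(results, key=key)


# Hypothetical scores, for illustration only.
hypothetical = {
    "editor_a": {"ms": 0.60, "rs": 0.50, "ve": 0.30, "fe": 0.50},
    "editor_b": {"ms": 0.40, "rs": 0.45, "ve": 0.70, "fe": 0.65},
    "editor_c": {"ms": 0.60, "rs": 0.50, "ve": 0.25, "fe": 0.45},
}
ranking = rank_models(hypothetical)
```

Here `editor_a` and `editor_c` tie on the object score, so the lower camera error places `editor_c` first.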

6. Empirical Benchmarks and Baseline Performance

SpatialEdit-16B serves as a reference model, trained on the SpatialEdit-500k dataset. Comparative performance with existing methods is as follows:

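| Metric | SpatialEdit-16B | LongCat | Difference (SpatialEdit-16B − LongCat) |
|---|---|---|---|
| Object moving score (MS, higher is better) | 0.673 | 0.373 | +0.300 |
| Object rotation score (RS, higher is better) | 0.632 | 0.505 | +0.127 |
| Viewpoint error (VE, lower is better) | 0.243 | 0.802 | −0.559 |
| Framing error (FE, lower is better) | 0.527 | 0.684 | −0.157 |
| Object overall | 0.653 | n/a | n/a |
| Camera overall error | 0.385 | n/a | n/a |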

SpatialEdit-16B achieves strong semantic plausibility while substantially improving geometric fidelity over LongCat, most visibly on object moving (MS 0.673 vs. 0.373) and viewpoint control (VE 0.243 vs. 0.802) (Xiao et al., 6 Apr 2026). This suggests that SpatialEdit-Bench can differentiate between general attention-based inpainting and true spatial understanding.
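
As a consistency check, the two overall figures reported for SpatialEdit-16B follow from the averaging rule of the evaluation protocol (mean of MS and RS; mean of VE and FE):

$$\text{Object overall} = \tfrac{1}{2}(0.673 + 0.632) = 0.6525 \approx 0.653, \qquad \text{Camera overall error} = \tfrac{1}{2}(0.243 + 0.527) = 0.385.$$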

7. Significance and Diagnostic Scope

SpatialEdit-Bench advances the evaluation of image editing models by enabling joint assessment of semantic believability and spatial manipulation fidelity. Its geometry-aware supervision, task granularity, and rigorous metric design allow researchers to distinguish superficial plausibility from genuine spatial competence. A plausible implication is that, as editors trained and evaluated via SpatialEdit-500k approach lower Viewpoint/Framing Errors and higher object scores, real-world editing pipelines can attain confidence not only in visual plausibility but also in faithful execution of precise geometric instructions (Xiao et al., 6 Apr 2026).
