SpaRRTa: Benchmark for Spatial Relation Recognition

Updated 4 July 2026

SpaRRTa is a synthetic benchmark that evaluates visual models on 4-way directional classification to determine spatial relations between objects.
It uses a procedural scene generation pipeline with Unreal Engine 5.5 to create photorealistic scenarios with precise ground-truth 3D geometric annotations.
The benchmark distinguishes egocentric and allocentric tasks, highlighting how probe design affects the extraction of spatial intelligence from frozen visual features.

Searching arXiv for the specified SpaRRTa paper and closely related spatial-relation benchmarks to ground the article. Spatial Relation Recognition Task (SpaRRTa) is a photorealistic synthetic benchmark for probing whether visual foundation models encode object-to-object spatial relations rather than only semantic categories or narrowly targeted geometric signals such as monocular depth or surface normals (Kargin et al., 16 Jan 2026). It formulates single-image spatial understanding as a 4-way directional classification problem over a source object, a target object, and a viewpoint reference, and evaluates frozen visual representations through lightweight probes rather than end-to-end multimodal prompting. In the paper’s framing, SpaRRTa is intended as a diagnostic benchmark for “spatial intelligence” in visual foundation models, with particular emphasis on whether spatial structure is transferable across environments, reference frames, and model families (Kargin et al., 16 Jan 2026).

1. Conceptual scope and motivation

SpaRRTa is motivated by the claim that modern visual foundation models are strong at semantic visual understanding but weaker and more inconsistent at spatial reasoning, especially when the target is an abstract relational judgment rather than a metric 3D prediction (Kargin et al., 16 Jan 2026). The benchmark is therefore designed to separate two questions that are often conflated in prior evaluation: whether a model can recover a quantitative geometric signal such as depth, and whether it can determine where one visible object is relative to another from a specified perspective. The paper repeatedly presents the latter as a more direct probe of human-like spatial understanding.

This distinction matters because prior evaluation protocols often concentrate on monocular depth estimation, surface normal estimation, camera pose estimation, or other regression-style 3D tasks. SpaRRTa instead asks whether a frozen representation supports directional relation recognition. The paper’s stated target is not generic scene description, image-text retrieval, or instruction following, but the recoverability of relative position between visible objects.

A second conceptual contribution is the explicit use of viewpoint-conditioned relation recognition. SpaRRTa does not assume that directional relations are frame-free. It introduces an easier egocentric setting and a harder allocentric setting, thereby making perspective-taking itself part of the benchmark. This suggests that the benchmark is aimed not only at local geometric readout, but also at the ability to transform between observed and required frames of reference.

2. Synthetic scene generation and annotation pipeline

SpaRRTa is fully synthetic, but the paper emphasizes that it is designed to remain photorealistic and in-distribution for common visual foundation models trained on natural images (Kargin et al., 16 Jan 2026). The rendering stack uses Unreal Engine 5.5, the Unreal Engine Python API for automation, and UnrealCV for image and mask capture. The paper highlights UE5’s real-time ray tracing, dynamic global illumination, reflections, level-of-detail handling, access to high-fidelity marketplace assets, and ability to run on consumer GPUs.

The generation process is automatic and consists of four stages: Set stage, Set camera, Render view, and Get ground truth. In the first stage, the system selects an environment and assets, then places source, target, and viewpoint objects. In the second, it samples a camera pose and verifies geometric validity. In the third, it renders the RGB image. In the fourth, it assigns the ground-truth label from exact 3D object positions.

Object placement and camera placement are procedural. The generator selects task objects from a map-specific asset list, places them on the ground, samples their positions from a Gaussian distribution, samples the camera from a uniform distribution over an area around the map center, and orients the camera toward the placed objects. Appendix details state that object coordinates are sampled around a randomly chosen center point, with distance limits used to keep objects distinguishable. To ensure physical plausibility, the system uses raycasting / line traces so that objects lie flush with terrain in environments such as forest or desert scenes.

Camera geometry is standardized. The camera uses a simulated 50 mm lens and 50 mm sensor width, yielding a fixed horizontal field of view. Camera height is randomized relative to object elevation, and the camera is rotated toward the geometric centroid of the task-critical objects. Visibility constraints are enforced before rendering: candidate placements are checked against the camera frustum, objects must fall within the camera field of view, configurations with excessive clustering or excessive distance are rejected, and AABB overlap checks are used to avoid inter-object collisions.

Ground truth is generated automatically from exact 3D coordinates, segmentation masks, and known scene state. The released metadata include serialized 6-DoF state values of the form

$(x,y,z,\phi,\theta,\psi).$

The benchmark therefore couples RGB images to exact geometric scene state, while the evaluation itself remains a 2D image-based probing task.

3. Task formulation, benchmark variants, and label semantics

SpaRRTa is a single-image spatial relation classification task in which the system receives a source object, a target object, and a viewpoint reference, and must predict the direction from the source to the target relative to that viewpoint (Kargin et al., 16 Jan 2026). The label space is exactly

$\{\text{left}, \text{right}, \text{front}, \text{back}\}.$

This is a discrete 4-way classification problem, not a metric regression task.

The benchmark defines two standardized variants.

Variant	Viewpoint reference	Images per environment
SpaRRTa-ego	Camera	5,000
SpaRRTa-allo	Human in scene	10,000

In SpaRRTa-ego, the camera is the viewpoint reference, so the question is where the target lies relative to the source from the camera’s perspective. In SpaRRTa-allo, a human placed in the scene is the viewpoint reference, so the model must infer where the target lies relative to the source from the human’s perspective. The paper identifies the allocentric variant as harder because the observed camera view and the required reference frame differ.

The benchmark uses five environments: Forest, Desert, Winter Town, Bridge, and City. For each environment, the authors use 3 distinct object triples and 2 random seeds. The generated data for each object triple are split 80% / 10% / 10% into train, validation, and test. The appendix states that 5,000 images per environment were sufficient for the egocentric task, whereas allocentric probing showed overfitting signs at that scale, leading to 10,000 images per environment for the allocentric setting.

Ground-truth labeling is viewpoint-defined and angular. For egocentric evaluation, the “front” direction is defined by the vector from the camera to the source object. For allocentric evaluation, the “front” direction is defined by the vector from the human viewpoint object to the source object. The target’s angular position around the source is then discretized into one of four quadrants. A central design choice is the explicit removal of ambiguous cases near decision boundaries. The benchmark rejects samples whose angle falls within conical exclusion zones centered on the diagonal directions

$45^\circ,\ 135^\circ,\ 225^\circ,\ 315^\circ$

with margin

$\pm 15^\circ.$

Equivalently, if $\theta$ denotes the target angle in the viewpoint-defined frame, the sample is rejected when

$\theta \in \bigcup_{k \in \{45^\circ,135^\circ,225^\circ,315^\circ\}} [k-15^\circ,\ k+15^\circ].$

The paper presents this ambiguity filtering as one of the benchmark’s most important formal decisions, because it reduces label noise around quadrant boundaries.

The paper also notes that many source objects are chosen to be isotropic, such as rock, tree, and traffic cone, to reduce ambiguity from source-object orientation. This suggests that SpaRRTa is deliberately structured to isolate directional reasoning from nuisance variability wherever possible.

4. Probing methodology and evaluated visual foundation models

SpaRRTa is evaluated as a frozen-feature probing benchmark, not as a prompting benchmark and not as full model fine-tuning (Kargin et al., 16 Jan 2026). At inference time, the frozen visual foundation model receives only the RGB image. The protocol is: render image, extract frozen features, train a lightweight probe, and predict the relation class. The paper also states that probes are trained separately for particular source/target/viewpoint object triples.

The evaluated backbone set spans several training paradigms. The paper benchmarks joint-embedding self-supervised models (DINO, DINOv2 (B/14), DINOv2 + registers (B/14), DINOv2 + registers (L/14), DINOv3), masked image modeling and geometry-aware self-supervision (MAE, MaskFeat, SPA, CroCo, CroCo v2), and supervised or vision-LLMs (VGGT, DeiT, CLIP). Input resolution is standardized to

$224 \times 224.$

Three probe types are compared.

Probe	Mechanism	Key detail
Linear probing with GAP	Average patch tokens, linear classifier	1 linear layer
AbMILP	Attention-based multiple instance learning pooling	2-layer MLP
Efficient probing	Multi-query cross-attention aggregation	$N_q = 4$

The probe comparison is scientifically central rather than incidental. The paper argues that spatial information is often stored in local patch tokens and can be washed out by global average pooling. Efficient probing is described as the strongest probe overall. In the reported setup, its output dimension is

$D_o = D_i / 8.$

Training uses frozen backbones and typically extracts features from the final transformer block, although later analyses show that late-intermediate layers are often better. Across probes, the common optimizer is AdamW with weight decay $= 0.001$ , cosine decay scheduling, learning-rate search over

$\{\text{left}, \text{right}, \text{front}, \text{back}\}.$ 0

dropout search over

$\{\text{left}, \text{right}, \text{front}, \text{back}\}.$ 1

and batch size

$\{\text{left}, \text{right}, \text{front}, \text{back}\}.$ 2

Warmup is 200 steps for linear probing and 100 for both AbMILP and efficient probing. Epoch counts are 1000 for linear probing and 500 for both AbMILP and efficient probing. Validation performance selects the best hyperparameters.

The paper’s evaluation metric is classification accuracy on the test split, with per-environment accuracy tables and mean rank across models. It also compares SpaRRTa results to camera pose estimation, monocular depth estimation, FGVC Aircraft, and Flowers102 in separate correlation studies. This suggests that the benchmark is positioned not merely as a leaderboard task, but as a representational diagnostic.

5. Empirical findings and mechanistic interpretation

The benchmark reveals substantial differences between model families and probe types (Kargin et al., 16 Jan 2026). Under strong probing, VGGT is the best overall model, the DINO family is the strongest among broadly self-supervised models, MAE remains competitive, and CLIP and DeiT consistently underperform. A major result is that allocentric evaluation is much harder than egocentric evaluation for every model and every probe. The paper states that egocentric performance can approach saturation for top models, whereas allocentric accuracy drops substantially, often toward near-chance territory under simple linear probing. Since the task is 4-way classification, random chance is

$\{\text{left}, \text{right}, \text{front}, \text{back}\}.$ 3

A second dominant result is the consistent probe ordering

$\{\text{left}, \text{right}, \text{front}, \text{back}\}.$ 4

This is interpreted as evidence that spatial information is present in the representation but is not always accessible from naively pooled global features. The paper’s strongest mechanistic claim is therefore not simply that one backbone is best, but that how one reads out spatial information matters materially.

Environment complexity also matters. Desert and Forest are reported as easier, while cluttered environments such as City and Winter Town are harder. The paper attributes this to background complexity and the greater difficulty of isolating task-relevant patches among clutter. This suggests that SpaRRTa is not merely a test of abstract geometry in isolation; it also measures whether relational evidence survives in realistic visual contexts.

The paper’s mechanistic analysis emphasizes patch-local structure. Comparing DINOv2 and VGGT, it argues that 3D fine-tuning produces a reorganization of attention flow, with less self-focused attention inside an object and more cross-object attention between entities. It also reports that the global $\{\text{left}, \text{right}, \text{front}, \text{back}\}.$ 5-like token is often not the best place to read out spatial reasoning, particularly in VGGT, where relational geometry appears to remain distributed across local patches. Late-intermediate layers are frequently best: for ViT-L backbones, the best layers are often around

$\{\text{left}, \text{right}, \text{front}, \text{back}\}.$ 6

This supports the paper’s broader claim that final layers may be more specialized for semantic abstraction and invariance than for fine-grained spatial readout.

The 3D-supervision story is deliberately nuanced. The paper states that explicit 3D supervision in VGGT improves spatial structure and yields the best results under patch-aware probes, but under simple global linear probing VGGT does not consistently beat DINOv2 and can be slightly worse. This suggests that 3D-aware pretraining helps mainly when the probing method can access distributed local geometry. A plausible implication is that high SpaRRTa performance should not be interpreted as a property of a single global embedding alone.

6. Relation to adjacent benchmarks and stated limitations

SpaRRTa occupies a distinct position among spatial-reasoning benchmarks because it is a synthetic, photorealistic, single-image, frozen-feature probing benchmark focused specifically on viewpoint-conditioned object-to-object directional relations (Kargin et al., 16 Jan 2026). This differs from Visual Spatial Reasoning (VSR), which uses more than 10k natural text-image pairs with 66 types of spatial relations and binary caption verification (Liu et al., 2022); from SpatialSense, which uses adversarial crowdsourcing, 11,569 images, 17,498 relations, and 9 predicates in a binary per-predicate formulation (Yang et al., 2019); and from RoboSpatial, which is explicitly robotics-oriented and organizes spatial understanding into spatial configuration, spatial context, and spatial compatibility over 1M images, 5K 3D scans, and 3M annotated spatial relationships (Song et al., 2024). This comparison suggests that SpaRRTa is narrower in relation vocabulary but cleaner as a representational probe.

The benchmark is also narrower than multimodal VQA-oriented relation suites such as MIRAGE, whose Relation task uses the six labels ["up", "under", "back", "front", "left", "right"] in real images and combines them with counting-oriented tasks (Liu et al., 15 May 2025). SpaRRTa’s difference is that it does not formulate evaluation as JSON-constrained question answering and does not use language prompting at inference; instead, it isolates what can be extracted from frozen vision features alone.

The paper is explicit about limitations. SpaRRTa is synthetic, even if photorealistic. It evaluates only the narrow label set

$\{\text{left}, \text{right}, \text{front}, \text{back}\}.$ 7

Its results are probe-dependent, because the benchmark measures both representation quality and probe accessibility. The allocentric task relies on a visible human viewpoint object rather than an arbitrary language-specified reference frame. The paper does not claim real-world annotation coverage, direct policy evaluation, full compositional spatial language reasoning, object permanence, or dynamics. It also does not report explicit viewpoint-angle sweep experiments, systematic object-appearance transfer studies, distractor-count scaling curves, synthetic-to-real transfer experiments, or corruption robustness tests.

These limitations delimit the interpretation of the benchmark. SpaRRTa is not a complete embodied-intelligence suite, nor a general-purpose spatial language benchmark. Its stated role is more focused: to test whether visual foundation models encode a basic but foundational capability—where one object is relative to another from a specified perspective—and to expose how strongly that capability depends on local patch structure, probe design, environment clutter, and viewpoint transformation. Within that scope, the paper presents SpaRRTa as a diagnostic tool for future development of more spatially aware visual models (Kargin et al., 16 Jan 2026).