SpatialEvo: Self-Evolving 3D Spatial Intelligence

Updated 4 July 2026

SpatialEvo is a self-evolving framework for 3D spatial reasoning that turns unannotated 3D scenes into exact geometric oracles.
It relies on a deterministic geometric environment, built from point clouds and calibrated camera poses, that supplies zero-noise, objective ground truth for spatial tasks.
The framework co-evolves questioner and solver roles under a task-adaptive scheduler and reports the highest average score across nine benchmarks at both 3B and 7B scales.

SpatialEvo denotes, in its most specific current usage, a self-evolving framework for 3D spatial reasoning centered on the Deterministic Geometric Environment (DGE). The framework is motivated by the observation that, for 3D spatial questions, ground truth is a deterministic consequence of the underlying geometry and can be computed exactly from point clouds and camera poses without model involvement. DGE therefore converts unannotated 3D scenes into zero-noise interactive oracles, replacing pseudo-labels derived from model consensus with objective physical feedback. A single shared-parameter policy co-evolves across questioner and solver roles, while a task-adaptive scheduler concentrates training on the model’s weakest categories. Across nine benchmarks, SpatialEvo achieves the highest average score at both 3B and 7B scales, with consistent gains on spatial reasoning benchmarks and no degradation on general visual understanding (Li et al., 15 Apr 2026).

1. Concept and problem setting

SpatialEvo is designed for 3D spatial reasoning over multi-view indoor scenes. In the formulation used by the framework, a scene $\mathcal{S}$ is represented by multi-view RGB images $\mathcal{I}=\{I_1,\dots,I_T\}$ together with dense point clouds, semantic labels, and calibrated camera poses. The policy model $\pi_\theta$ receives only the RGB images, whereas the geometric assets remain available to the environment for validation and exact supervision (Li et al., 15 Apr 2026).

The motivating bottleneck is geometric annotation. The framework states that writing high-quality spatial questions and verifying their answers in 3D is expensive, and that static supervised corpora cannot adapt their question distribution to the model’s current weaknesses. Existing self-evolving paradigms are described as inadequate for this setting because they rely on model-generated pseudo-labels such as majority voting and self-consistency, which can reinforce systematic geometric errors rather than correct them. SpatialEvo addresses this by shifting supervision from model consensus to deterministic geometry (Li et al., 15 Apr 2026).

The central claim is therefore methodological rather than merely architectural: the true answer to a spatial question is treated as $f(\mathcal{S},q)=a^*$, where $a^*$ is computed algorithmically from the scene geometry. This makes each scene an interactive oracle and allows continual improvement without manual annotation or consensus-derived labels (Li et al., 15 Apr 2026).

2. Deterministic Geometric Environment

DGE is a deterministic, programmatic environment that encodes 3D scenes, validates natural-language questions, and computes exact answers for 16 spatial tasks. Its validation rule is written as

$\mathrm{Valid}(Q,t,x) = \mathbb{I}\!\left[ \mathcal{C}_{\mathrm{mode}} \wedge \mathcal{C}_{\mathrm{extract}} \wedge \mathcal{C}_{\mathrm{pool}} \wedge \mathcal{C}_{\mathrm{schema}} \wedge \mathcal{C}_{\mathrm{solver}} \right],$

where the constraints enforce modality compatibility, successful entity extraction, valid grounded pools, structural consistency, and executability of the geometric solver. DGE is constructed from raw 3D datasets, specifically ScanNet, ScanNet++, and ARKitScenes, by building scene summaries $\Sigma_s$, grounded entity pools, and geometric toolkits for distances, projections, bounding boxes, and camera transforms (Li et al., 15 Apr 2026).
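The conjunction in the validation rule can be illustrated with a minimal Python sketch. Only the five constraint labels come from the text; the function names, the layout of the `context` dictionary, and the toy logic inside each check are hypothetical stand-ins, not the framework's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A check takes (question, task, context) and reports whether its constraint holds.
Check = Callable[[str, str, dict], bool]

@dataclass
class ValidationResult:
    valid: bool
    failed: List[str]  # labels of the constraints that failed, in evaluation order

def validate(question: str, task: str, context: dict,
             checks: List[Tuple[str, Check]]) -> ValidationResult:
    """Valid(Q, t, x) is 1 only if every constraint holds (logical AND)."""
    failed = [name for name, check in checks if not check(question, task, context)]
    return ValidationResult(valid=not failed, failed=failed)

# Toy stand-ins for the five constraints (illustrative assumptions only).
def c_mode(q, t, x):     # modality compatibility: image count fits the task
    return x["num_images"] in x["allowed_image_counts"][t]

def c_extract(q, t, x):  # entity extraction produced at least one entity
    return len(x["entities"]) > 0

def c_pool(q, t, x):     # every extracted entity is grounded in the scene pool
    return all(e in x["entity_pool"] for e in x["entities"])

def c_schema(q, t, x):   # structural consistency: entity count matches the task
    return len(x["entities"]) == x["schema_arity"][t]

def c_solver(q, t, x):   # a geometric solver is registered for the task
    return t in x["solvers"]

CHECKS = [("mode", c_mode), ("extract", c_extract), ("pool", c_pool),
          ("schema", c_schema), ("solver", c_solver)]
```

Returning the list of failed constraints, rather than a bare boolean, is one way the environment could hand back an invalidation reason alongside the validity bit.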

The framework groups its 16 tasks into scene-level multi-image, single-image, and two-image categories.

Group	Count	Tasks
Scene-level multi-image	6	Object Counting; Object Size; Absolute Distance; Relative Distance; Relative Direction; Room Size Estimation
Single-image	3	Single-View Relative Direction; Camera–Object Distance; Depth Ordering
Two-image	7	Inter-Camera Relative Position; Inter-Camera Elevation; Visibility Comparison; Camera–Object Position; Camera–Region Position; Camera Motion Estimation; Attribute Measurement

The task definitions are geometric rather than semantic. Object size is computed from a tight 3D bounding box, absolute distance from object point sets or bounding boxes, camera relative pose from calibrated extrinsics, and visibility or depth order from camera-frame projection and depth comparison. The framework also defines feasibility sets

$\mathcal{T}^{\mathrm{feasible}}_s = \{k\in\mathcal{T}\mid \phi_k(\Sigma_s)=1\},$

so task availability is inferred directly from scene content and pose availability rather than imposed externally (Li et al., 15 Apr 2026).
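As a rough illustration of how such geometric quantities and feasibility sets could be computed, the sketch below uses plain Python. It makes several assumptions that are not in the source: the bounding box is axis-aligned (the framework's "tight" box may be oriented), distance is taken as the minimum point-to-point distance between two point sets, and the task keys and predicate thresholds are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]

def aabb_size(points: Sequence[Point]) -> Tuple[float, float, float]:
    """Extents (dx, dy, dz) of the axis-aligned box enclosing the points."""
    xs, ys, zs = zip(*points)
    return (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))

def min_point_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Smallest Euclidean distance between two point sets (brute force)."""
    return min(math.dist(p, q) for p in a for q in b)

def feasible_tasks(summary: Dict[str, object],
                   predicates: Dict[str, Callable[[Dict[str, object]], bool]]) -> List[str]:
    """Tasks k whose predicate phi_k(summary) holds, i.e. the feasibility set."""
    return [task for task, phi in predicates.items() if phi(summary)]

# Hypothetical predicates over a toy scene summary.
PREDICATES = {
    "object_counting": lambda s: s["num_objects"] >= 1,
    "relative_distance": lambda s: s["num_objects"] >= 3,
    "camera_motion_estimation": lambda s: s["has_poses"] and s["num_frames"] >= 2,
}

chair = [(0.0, 0.0, 0.0), (1.0, 0.5, 1.0)]
table = [(3.0, 0.5, 1.0), (4.0, 1.0, 0.8)]
summary = {"num_objects": 2, "has_poses": True, "num_frames": 5}

print(aabb_size(chair))                     # extents of the chair's box
print(min_point_distance(chair, table))     # closest approach of the two objects
print(feasible_tasks(summary, PREDICATES))  # tasks this scene can support
```

Because every quantity is a pure function of the stored geometry, the same scene and question always yield the same answer, which is the property the environment depends on.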

3. Co-evolving questioner and solver

SpatialEvo uses a single VLM $\pi_\theta$ in two roles. The questioner receives a task assignment, scene context, and a task-specific validity guide, and outputs a natural-language question $Q$. DGE then parses the question, extracts structured entities using a small text-only LLM, evaluates $\mathrm{Valid}(Q,t,x)$, and either returns exact ground truth or an invalidation reason. The solver receives the same images plus the question and produces an answer or, for an invalid question, an explanation of its invalidity (Li et al., 15 Apr 2026).

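Optimization is performed with GRPO. For a rollout group of $G$ responses with rewards $r_1,\dots,r_G$, the normalized advantage takes the standard group-relative form

$\hat{A}_i = \dfrac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}.$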
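A minimal sketch of group-relative advantage normalization, as commonly used in GRPO, is given below: each reward in a rollout group is centered on the group mean and scaled by the group standard deviation. The function name, the choice of population standard deviation, and the small `eps` stabilizer are assumptions made for illustration and are not taken from the source.

```python
import statistics
from typing import List

def group_normalized_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Center rewards on the group mean and scale by the group standard deviation.

    eps (an assumption) avoids division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of the same prompt, two rewarded and two not.
print(group_normalized_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason exact, discriminating rewards matter in this setting.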

The same parameter vector $\theta$ is updated from both questioner and solver trajectories. Each role has its own reward: the questioner reward assigns a dedicated value to severe structural failure, and the solver reward covers both answers to valid questions and explanations of why a question is invalid. The explicit reward expressions are not reproduced here.

This design makes invalid questions part of the learning signal rather than discarded noise (Li et al., 15 Apr 2026).

Curriculum is produced by a task-adaptive scheduler. Historical per-task performance is smoothed over training, the smoothed values are converted into task weights, and the weights are mapped to task sampling probabilities, so that training concentrates on the model’s weakest categories. Scheduler accuracy is calibrated separately for numeric tasks, with one setting for numeric tasks and another otherwise. The explicit scheduler formulas are not reproduced here. A semantic-signature deduplication step prevents repeated equivalent questions from inflating solver supervision (Li et al., 15 Apr 2026).
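One plausible shape for such a scheduler is sketched below. Every specific here is an assumption rather than the framework's formula: smoothing is an exponential moving average with rate `beta`, weights are one minus the smoothed accuracy plus a `floor` that keeps mastered tasks from vanishing, and probabilities are the weights normalized to sum to one. The task names are hypothetical keys.

```python
import random
from typing import Dict

def ema_update(prev: float, observed: float, beta: float = 0.9) -> float:
    """Exponential moving average of per-task accuracy (assumed smoothing rule)."""
    return beta * prev + (1.0 - beta) * observed

def task_probabilities(acc: Dict[str, float], floor: float = 0.05) -> Dict[str, float]:
    """Weight each task by (1 - smoothed accuracy) + floor, then normalize."""
    weights = {task: (1.0 - a) + floor for task, a in acc.items()}
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}

def sample_task(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw the next training task according to the scheduler probabilities."""
    tasks = list(probs)
    return rng.choices(tasks, weights=[probs[t] for t in tasks], k=1)[0]

smoothed = {"object_counting": 0.9, "relative_direction": 0.4}
# A poor batch on relative_direction lowers its smoothed accuracy further.
smoothed["relative_direction"] = ema_update(smoothed["relative_direction"], 0.2)
probs = task_probabilities(smoothed)
print(probs)
print(sample_task(probs, random.Random(0)))
```

Under these assumptions the weaker task receives most of the sampling mass while the stronger task is still revisited occasionally, which is the qualitative behavior the text attributes to the scheduler.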

4. Empirical performance and ablation structure

SpatialEvo is evaluated on VSI-Bench, EmbSpatial, ViewSpatial, RealWorldQA, V-STAR, SpatialViz, STARE, CoreCognition, and MMStar. The principal comparison is against the corresponding Qwen2.5-VL baseline at 3B and 7B scales.

Scale	Baseline average	SpatialEvo average
3B	47.5	51.1
7B	52.1	54.7

Averaged over the nine benchmarks, these figures correspond to gains of 3.6 points at 3B and 2.6 points at 7B. At 3B, gains are reported on VSI-Bench, EmbSpatial, ViewSpatial, RealWorldQA, STARE, SpatialViz, V-STAR, CoreCognition, and MMStar. At 7B, changes are reported on VSI-Bench, EmbSpatial, ViewSpatial, SpatialViz, CoreCognition, and MMStar, with RealWorldQA, V-STAR, and STARE remaining near baseline. The per-benchmark values are not reproduced here. The paper characterizes the overall pattern as strong gains on spatial reasoning benchmarks with no degradation, on average, on general visual understanding (Li et al., 15 Apr 2026).

The VSI-Bench paradigm comparison is particularly diagnostic. Under a restricted setting using only ScanNet and six tasks, the average score of SpatialEvo Online RL is compared with SpatialLadder RL, SFT on SpatialLadder data, SFT on SpaceR data, SFT on SpatialSSRL data, and SFT on SpatialEvo’s own offline data; the individual scores are not reproduced here. This comparison is the main empirical argument for dynamic online supervision through DGE rather than static data generation alone (Li et al., 15 Apr 2026).

Ablations identify physical grounding as the decisive component. Removing DGE ground truth and replacing it with majority-vote pseudo-ground truth reduces both the 7B average and the VSI-Bench score. Results are also reported for removing the solver, the questioner, the adaptive scheduler, the validity reward, and the explanation reward; the individual scores are not reproduced here. The largest degradation comes from abandoning deterministic geometric supervision (Li et al., 15 Apr 2026).

5. Assumptions, limitations, and extensions

SpatialEvo depends on high-fidelity 3D assets: dense indoor point clouds, reliable calibrated poses, and sufficiently complete scene coverage. The framework is therefore described as currently limited to static indoor environments such as ScanNet, ScanNet++, and ARKitScenes. Outdoor scenes, moving objects, and settings with unreliable geometry are not treated as straightforward extensions because deterministic geometric ground truth becomes unreliable in those regimes (Li et al., 15 Apr 2026).

Two practical sensitivities are emphasized. First, entity extraction is delegated to a text-only LLM, so ambiguous or underspecified questions can be mis-parsed even when the geometry is exact. Second, point-cloud sparsity, reconstruction noise, and occlusion directly affect bounding boxes, distances, and depth estimates. The framework mitigates some of this through validation rules and relative-error tolerance bands, but it does not remove data-induced uncertainty (Li et al., 15 Apr 2026).

The stated future directions are to reduce dependence on explicit point clouds, extend DGE toward richer physical reasoning and dynamics, scale to larger models and more diverse environments, and apply the same principle—environmental determinism rather than model consensus—to other embodied tasks. The conceptual claim is that whenever ground truth is algorithmically computable from the environment, self-improvement should use that determinism directly (Li et al., 15 Apr 2026).

6. Broader research uses of “SpatialEvo”

The label “SpatialEvo” is also used descriptively for a broader family of spatially explicit evolutionary programs. In spatial evolutionary games, one line of work formulates a mean-field, measure-valued dynamics in which agent positions and mixed strategies evolve jointly, the strategies following a replicator dynamics, and the population state satisfies a nonlinear continuity equation. That framework proves equivalence of Lagrangian and Eulerian formulations, existence, uniqueness, stability, and the mean-field limit as the number of agents tends to infinity (Ambrosio et al., 2018). A related derivation takes stochastic spatial evolutionary games on lattices to deterministic integro-differential equations, recovering mean-field replicator ODEs when interaction is spatially uniform and identifying traveling waves, standing waves, and pattern formation in the spatial case (Hwang et al., 2010). In a deterministic spatial Prisoner’s Dilemma on a 2D lattice, simulations show a sharp change in cooperator density near a critical parameter value, together with cluster boundaries whose Minkowski dimension tends to $2$, making them asymptotically space filling (Kolotev et al., 2017). At a more abstract level, incentive, adaptive, and time-scale dynamics on products of simplices provide a multipopulation geometric framework in which KL divergence, escort divergences, and more general metric divergences act as Lyapunov functions for broad classes of evolutionary dynamics (Harper et al., 2012).
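For orientation, the well-mixed replicator ODE that appears as the spatially uniform limit, $\dot{x}_i = x_i\big((Ax)_i - x^{\top}Ax\big)$, can be integrated with a simple forward-Euler sketch. The payoff matrix below is an illustrative Prisoner's Dilemma and the step size is arbitrary; neither is taken from the cited papers, and the sketch omits all spatial structure.

```python
from typing import List

def replicator_step(x: List[float], payoff: List[List[float]], dt: float = 0.01) -> List[float]:
    """One forward-Euler step of dx_i/dt = x_i * ((A x)_i - x^T A x)."""
    fitness = [sum(a * xj for a, xj in zip(row, x)) for row in payoff]
    average = sum(xi * fi for xi, fi in zip(x, fitness))
    new = [xi + dt * xi * (fi - average) for xi, fi in zip(x, fitness)]
    total = sum(new)
    return [v / total for v in new]  # renormalize to absorb floating-point drift

# Rows and columns are ordered (cooperate, defect); values are illustrative.
PD = [[3.0, 0.0],
      [5.0, 1.0]]

x = [0.9, 0.1]
for _ in range(2000):
    x = replicator_step(x, PD)
print(x)  # cooperator share shrinks because defection strictly dominates
```

In this well-mixed setting cooperation decays, which is the baseline against which spatial models study whether local interaction can sustain cooperator clusters.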

In population genetics and contagion, the same descriptive label covers models that tie space to genealogy and spread. A phylogenetically modulated spatiotemporal Hawkes process couples case-specific contagion factors to Brownian motion on a viral phylogeny, yielding a joint model of spatial contagion and viral evolution; in the 2014–2016 West Africa Ebola analysis, it is fit to 23,422 cases and identifies 177 viruses whose credible intervals for the contagion factor lie entirely above 1 and 6 whose intervals lie entirely below 1 (Holbrook et al., 2021). A review of spatial population genetics places the spatial pedigree at center stage, emphasizing population density, effective dispersal, neighborhood size, and the temporal layering of relatedness in recent and deep ancestry (Bradburd et al., 2019). In the spatial $\Lambda$-Fleming–Viot setting, the tree-generating process in a small-radius, high-rate regime is shown by simulation to be well approximated by a birth–death model, whereas Brownian motions along birth–death trees fail to reproduce long-run habitat-boundary effects in lineage locations (Wirtz et al., 2023).

Other uses of the label cover pre-biotic evolution, cancer, ecology, statistics, and numerical dynamics. A spatial population of $\epsilon$-machines self-organizes into spacetime-invariant autocatalytic domains separated by membrane replicators that translate between domains (Piantadosi et al., 2010). A geometric model of solid tumor growth with driver mutations yields analytic formulas for clonal boundaries, the capture time, and the finite asymptotic volume of the original clone once it is enveloped by faster mutants (Antal et al., 2013). A spatial ecological birth–death process in $\mathbb{R}^d$ with dispersal and competition is analyzed through correlation hierarchies in Banach scales, proving global evolution of sub-Poissonian states under a stability-like condition (Kondratiev et al., 2015). OG-SPACE provides an optimized Gillespie framework for spatial cancer evolution on arbitrary lattices, returning spatial snapshots, sampled-cell phylogeny, mutational tree, VAF spectra, and single-cell genotypes (Angaroni et al., 2021). In global spatio-temporal statistics, an evolutionary spectrum model on the sphere uses land–ocean descriptors to produce nonstationary covariance structure and fast surrogate generation for climate ensembles larger than 20 million points (Castruccio et al., 2015). In Hamiltonian beam dynamics, a symplectic, symmetric, second-order scheme advances particles with space rather than time as the independent variable, preserving canonical structure while operating on a fixed spatial mesh (Ruzzon et al., 2010).

Taken together, these usages suggest that “SpatialEvo” now names both a specific self-evolving 3D reasoning framework and a wider research idiom: the treatment of evolution, adaptation, inference, or dynamics as intrinsically spatial, with geometry, locality, or habitat structure entering the definition of state, the update rule, and the observable macroscopic behavior (Li et al., 15 Apr 2026).